Pandas - BKJackson/BKJackson_Wiki GitHub Wiki

Fast looping through Pandas

Use df.itertuples() not df.iterrows()

%%time

df = pd.DataFrame(data={'Values': range(1,100000)}  
sum = 0
for row in df.itertuples(index=False):
    sum += row.Values

print(sum)   

Wall time: 99.8 ms

Merge two dataframes

Source

df = pd.merge(df, geoloc, how='inner', on=HYDRO_CODE_COL, sort=False, copy=True, indicator=False, validate=None)  

Calculate number of nans in each column

df.isna().sum()  

View data types of each column

df.dtypes  

Flatten json file into a Pandas dataframe

From Flatten Nested JSON in Pandas

import json 
import pandas as pd 
from pandas.io.json import json_normalize  

with open('../input/raw_nyc_phil.json') as f:
    d = json.load(f)

nycphil = json_normalize(d['programs'])  

Set a dataframe column to categorical type and then get summary stats for other columns

iris['species'] = iris['species'].astype('category')  

values = ['count', 'min', 'max', 'mean', 'std']
iris.groupby(by='species').agg(values)  

Pandas String Methods

Methods similar to Python string methods

Nearly all Python's built-in string methods are mirrored by a Pandas vectorized string method. Here is a list of Pandas str methods that mirror Python string methods:

len() lower() translate() islower()
ljust() upper() startswith() isupper()
rjust() find() endswith() isnumeric()
center() rfind() isalnum() isdecimal()
zfill() index() isalpha() split()
strip() rindex() isdigit() rsplit()
rstrip() capitalize() isspace() partition()
lstrip() swapcase() istitle() rpartition()

Apply using str attribute:

names = pd.Series(data)  
names.str.capitalize()

Get recipe name with the longest ingredient list using np.argmax()

From the recipes dataframe:

recipes.name[np.argmax(recipes.ingredients.str.len())]

Count how many of the recipes are for breakfast food

recipes.description.str.contains('[Bb]reakfast').sum()

Map words to numbers in a dict using map

# Inspect payment type
df_train.payment_type.str.lower().value_counts(progress=True)  

map_payment_type = {'crd':1, 'cash':2, '1':1, 'cas':2, '2':2, 'no charge':3, 'no':3, '3':3}  

df['payment_type_num'] = df_train.payment_type.str.lower().map(map_payment_type, default_value=-1, allow_missing=True)

Working with Time Series in Pandas

Pandas date and time data comes in a few flavors:

  • Time stamps reference particular moments in time (e.g., July 4th, 2015 at 7:00am). Pandas provides the Timestamp type. The associated Index structure is DatetimeIndex.
  • Time intervals and periods reference a length of time between a particular beginning and end point; for example, the year 2015. Periods usually reference a special case of time intervals in which each interval is of uniform length and does not overlap (e.g., 24 hour-long periods comprising days). This encodes a fixed-frequency interval based on numpy.datetime64. Pandas provides the Period type. The associated index structure is PeriodIndex.
  • Time deltas or durations reference an exact length of time (e.g., a duration of 22.56 seconds). Pandas provides the Timedelta type. The associated index structure is TimedeltaIndex.

Native Python datetime and dateutil

Manually build a date using the datetime type

from datetime import datetime
datetime(year=2015, month=7, day=4)

Output: datetime.datetime(2015, 7, 4, 0, 0)

Using the dateutil module, you can parse dates from a variety of string formats

from dateutil import parser
date = parser.parse("4th of July, 2015")
date  

Output: datetime.datetime(2015, 7, 4, 0, 0)

Once you have a datetime object, you can do things like printing the day of the week

date.strftime('%A') 

Output: 'Saturday'
Datetime docs for strftime() and strptime() Format codes: Link

Typed arrays of times: NumPy's datetime64

The weaknesses of Python's datetime format inspired the NumPy team to add a set of native time series data type to NumPy. The datetime64 dtype encodes dates as 64-bit integers, and thus allows arrays of dates to be represented very compactly. The datetime64 requires a very specific input format:

import numpy as np
date = np.array('2015-07-04', dtype=np.datetime64)
date

Output: array(datetime.date(2015, 7, 4), dtype='datetime64[D]')

Once we have this date formatted, however, we can quickly do vectorized operations on it:

date + np.arange(12)

Output: array(['2015-07-04', '2015-07-05', '2015-07-06', '2015-07-07', '2015-07-08', '2015-07-09', '2015-07-10', '2015-07-11', '2015-07-12', '2015-07-13', '2015-07-14', '2015-07-15'], dtype='datetime64[D]')

You can force any desired fundamental unit using one of many format codes; for example, here we'll force a nanosecond-based time

np.datetime64('2015-07-04 12:59:59.50', 'ns')  

Output: numpy.datetime64('2015-07-04T12:59:59.500000000')

The following table, drawn from the NumPy datetime64 documentation, lists the available format codes along with the relative and absolute timespans that they can encode:

Code Meaning Time span (relative) Time span (absolute)
Y Year ± 9.2e18 years [9.2e18 BC, 9.2e18 AD]
M Month ± 7.6e17 years [7.6e17 BC, 7.6e17 AD]
W Week ± 1.7e17 years [1.7e17 BC, 1.7e17 AD]
D Day ± 2.5e16 years [2.5e16 BC, 2.5e16 AD]
h Hour ± 1.0e15 years [1.0e15 BC, 1.0e15 AD]
m Minute ± 1.7e13 years [1.7e13 BC, 1.7e13 AD]
s Second ± 2.9e12 years [ 2.9e9 BC, 2.9e9 AD]
ms Millisecond ± 2.9e9 years [ 2.9e6 BC, 2.9e6 AD]
us Microsecond ± 2.9e6 years [290301 BC, 294241 AD]
ns Nanosecond ± 292 years [ 1678 AD, 2262 AD]
ps Picosecond ± 106 days [ 1969 AD, 1970 AD]
fs Femtosecond ± 2.6 hours [ 1969 AD, 1970 AD]
as Attosecond ± 9.2 seconds [ 1969 AD, 1970 AD]

Working with time zones

pytz - contains tools for working with the most migrane-inducing piece of time series data: time zones.

Setting time-based index

index = pd.DatetimeIndex(['2014-07-04', '2014-08-04',
                          '2015-07-04', '2015-08-04'])
data = pd.Series([0, 1, 2, 3], index=index)
data

Output:
2014-07-04 0
2014-08-04 1
2015-07-04 2
2015-08-04 3
dtype: int64

Create a regular date range sequence with pd.date_range()

pd.date_range() accepts a start date, an end date, and an optional frequency code to create a regular sequence of dates. By default, the frequency is one day.

pd.date_range('2015-07-03', '2015-07-10')  

Output: DatetimeIndex(['2015-07-03', '2015-07-04', '2015-07-05', '2015-07-06', '2015-07-07', '2015-07-08', '2015-07-09', '2015-07-10'], dtype='datetime64[ns]', freq='D')

Here we will construct a range of hourly timestamps.

pd.date_range('2015-07-03', periods=8, freq='H')  

Output:
DatetimeIndex(['2015-07-03 00:00:00', '2015-07-03 01:00:00', '2015-07-03 02:00:00', '2015-07-03 03:00:00', '2015-07-03 04:00:00', '2015-07-03 05:00:00', '2015-07-03 06:00:00', '2015-07-03 07:00:00'], dtype='datetime64[ns]', freq='H')

Regular sequences of Period or Timedelta values

To create regular sequences of Period or Timedelta values, the very similar pd.period_range() and pd.timedelta_range() functions are useful.

Monthly periods

pd.period_range('2015-07', periods=8, freq='M')

Output:
PeriodIndex(['2015-07', '2015-08', '2015-09', '2015-10', '2015-11', '2015-12', '2016-01', '2016-02'], dtype='int64', freq='M')

A sequence of durations increasing by an hour

pd.timedelta_range(0, periods=10, freq='H')

Output:
TimedeltaIndex(['00:00:00', '01:00:00', '02:00:00', '03:00:00', '04:00:00', '05:00:00', '06:00:00', '07:00:00', '08:00:00', '09:00:00'], dtype='timedelta64[ns]', freq='H')

for a frequency of 2 hours 30 minutes, we can combine the hour (H) and minute (T) codes as follows

pd.timedelta_range(0, periods=9, freq="2H30T")  

TimedeltaIndex(['00:00:00', '02:30:00', '05:00:00', '07:30:00', '10:00:00', '12:30:00', '15:00:00', '17:30:00', '20:00:00'], dtype='timedelta64[ns]', freq='150T')

All of these short codes refer to specific instances of Pandas time series offsets, which can be found in the pd.tseries.offsets module.

H = hours T = minutes
S = seconds

Another way:

data['date'] = data.date.astype('datetime64[ns]')
data = data.set_index('date')

Another way:

df.index = pd.to_datetime(df.index)  

Get index type

type(df.index)  

Convert any DatetimeIndex to a PeriodIndex with to_period()

dates = pd.to_datetime([datetime(2015, 7, 3), '4th of July, 2015',
                       '2015-Jul-6', '07-07-2015', '20150708'])

dates.to_period('D')    

Output: PeriodIndex(['2015-07-03', '2015-07-04', '2015-07-06', '2015-07-07', '2015-07-08'], dtype='int64', freq='D')

Create a TimedeltaIndex when a date is subtracted from another

dates - dates[0]  

Output: TimedeltaIndex(['0 days', '1 days', '3 days', '4 days', '5 days'], dtype='timedelta64[ns]', freq=None)

Reorganize a dataframe where time is the index, column names correspond to a categorical column, and mean values are a column of their own

df = df.pivot(index='Date', columns='Source', values='Mean')  

Scatter plot to see how to measurements correlate over time

plt.scatter(df['1880':'1900']['GCAG'](/BKJackson/BKJackson_Wiki/wiki/'GCAG'), df['1880':'1900']['GISTEMP'](/BKJackson/BKJackson_Wiki/wiki/'GISTEMP'))  

Pandas data I/O

Reading multiple files in one line using Pandas and glob

import glob
import pandas as pd
df = pd.concat([pd.read_csv(f, encoding='latin1') for f in glob.glob('data*.csv'), ignore_index=True])

Pandas Visualization

Quick and easy plot of first 100 samples of a column

df.Mean[:100].plot()  

Plot two columns within a time index range

plt.plot(df['1880':'1950']['GCAG', 'GISTEMP'](/BKJackson/BKJackson_Wiki/wiki/'GCAG',-'GISTEMP'))  

Plotting categories with pandas Categorical

from pandas import Categorical

def categorify(df, keys):
    if isinstance(keys, str):
        keys = (keys, )
    
    for key in keys:
        cat_map = {cat: i for i, cat in enumerate(set(df[key]))}
        df[key] = df[key].apply(lambda x: cat_map[x])

Example usage:

df = pd.read_csv('../data/iris.csv')
lib.categorify(df, 'species')
df.plot.scatter('sepal_width', 'sepal_length', c='species', colormap='viridis')  

Pandas plot with mask, groupby, sum, and sum

nba[nba["fran_id"] == "Knicks"].groupby("year_id")["pts"].sum().plot()

Index as an ordered set

Pandas objects are designed to facilitate operations such as joins across datasets, which depend on many aspects of set arithmetic. The Index object follows many of the conventions used by Python's built-in set data structure, so that unions, intersections, differences, and other combinations can be computed in a familiar way:

indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])  

indA & indB  # intersection  
output: Int64Index([3, 5, 7], dtype='int64')  

indA | indB  # union
output: Int64Index([1, 2, 3, 5, 7, 9, 11], dtype='int64')  

indA ^ indB  # symmetric difference  
Int64Index([1, 2, 9, 11], dtype='int64')  

Rolling windows

Rolling mean and standard deviation

rolling = goog.rolling(365, center=True)

data = pd.DataFrame({'input': goog,
                     'one-year rolling_mean': rolling.mean(),
                     'one-year rolling_std': rolling.std()})
data['mean_plus'] = data['one-year rolling_mean'] + rolling.std()
data['mean_minus'] = data['one-year rolling_mean'] - rolling.std()
ax = data.plot(style=['-', '--', ':', ':', ':'])
ax.lines[0].set_alpha(0.3)

Pandas tips and articles

Useful Pandas Snippets
Elegantly Reading Multiple CSVs Into Pandas
Automate Boring (ETL) Tasks With Your Own Functions
Build pipelines with Pandas using “pdpipe” - We show how to build intuitive and useful pipelines with Pandas DataFrame using a wonderful little library called pdpipe. Notebook link

Pandas memory usage

df.memory_useage() - Return the memory usage of each column in bytes.

Combining data frames

Tutorial: Concatenation (Combining Data Tables) with Pandas and Python