Pandas - BKJackson/BKJackson_Wiki GitHub Wiki
Fast looping through Pandas
Use df.itertuples() not df.iterrows()
%%time
df = pd.DataFrame(data={'Values': range(1,100000)}
sum = 0
for row in df.itertuples(index=False):
sum += row.Values
print(sum)
Wall time: 99.8 ms
Merge two dataframes
df = pd.merge(df, geoloc, how='inner', on=HYDRO_CODE_COL, sort=False, copy=True, indicator=False, validate=None)
Calculate number of nans in each column
df.isna().sum()
View data types of each column
df.dtypes
Flatten json file into a Pandas dataframe
From Flatten Nested JSON in Pandas
import json
import pandas as pd
from pandas.io.json import json_normalize
with open('../input/raw_nyc_phil.json') as f:
d = json.load(f)
nycphil = json_normalize(d['programs'])
Set a dataframe column to categorical type and then get summary stats for other columns
iris['species'] = iris['species'].astype('category')
values = ['count', 'min', 'max', 'mean', 'std']
iris.groupby(by='species').agg(values)
Pandas String Methods
Methods similar to Python string methods
Nearly all Python's built-in string methods are mirrored by a Pandas vectorized string method. Here is a list of Pandas str methods that mirror Python string methods:
len() |
lower() |
translate() |
islower() |
ljust() |
upper() |
startswith() |
isupper() |
rjust() |
find() |
endswith() |
isnumeric() |
center() |
rfind() |
isalnum() |
isdecimal() |
zfill() |
index() |
isalpha() |
split() |
strip() |
rindex() |
isdigit() |
rsplit() |
rstrip() |
capitalize() |
isspace() |
partition() |
lstrip() |
swapcase() |
istitle() |
rpartition() |
Apply using str attribute:
names = pd.Series(data)
names.str.capitalize()
Get recipe name with the longest ingredient list using np.argmax()
From the recipes dataframe:
recipes.name[np.argmax(recipes.ingredients.str.len())]
Count how many of the recipes are for breakfast food
recipes.description.str.contains('[Bb]reakfast').sum()
Map words to numbers in a dict using map
# Inspect payment type
df_train.payment_type.str.lower().value_counts(progress=True)
map_payment_type = {'crd':1, 'cash':2, '1':1, 'cas':2, '2':2, 'no charge':3, 'no':3, '3':3}
df['payment_type_num'] = df_train.payment_type.str.lower().map(map_payment_type, default_value=-1, allow_missing=True)
Working with Time Series in Pandas
Pandas date and time data comes in a few flavors:
- Time stamps reference particular moments in time (e.g., July 4th, 2015 at 7:00am). Pandas provides the Timestamp type. The associated Index structure is DatetimeIndex.
- Time intervals and periods reference a length of time between a particular beginning and end point; for example, the year 2015. Periods usually reference a special case of time intervals in which each interval is of uniform length and does not overlap (e.g., 24 hour-long periods comprising days). This encodes a fixed-frequency interval based on numpy.datetime64. Pandas provides the Period type. The associated index structure is PeriodIndex.
- Time deltas or durations reference an exact length of time (e.g., a duration of 22.56 seconds). Pandas provides the Timedelta type. The associated index structure is TimedeltaIndex.
Native Python datetime and dateutil
Manually build a date using the datetime type
from datetime import datetime
datetime(year=2015, month=7, day=4)
Output: datetime.datetime(2015, 7, 4, 0, 0)
Using the dateutil module, you can parse dates from a variety of string formats
from dateutil import parser
date = parser.parse("4th of July, 2015")
date
Output: datetime.datetime(2015, 7, 4, 0, 0)
Once you have a datetime object, you can do things like printing the day of the week
date.strftime('%A')
Output: 'Saturday'
Datetime docs for strftime() and strptime() Format codes: Link
Typed arrays of times: NumPy's datetime64
The weaknesses of Python's datetime format inspired the NumPy team to add a set of native time series data type to NumPy.
The datetime64 dtype encodes dates as 64-bit integers, and thus allows arrays of dates to be represented very compactly.
The datetime64 requires a very specific input format:
import numpy as np
date = np.array('2015-07-04', dtype=np.datetime64)
date
Output: array(datetime.date(2015, 7, 4), dtype='datetime64[D]')
Once we have this date formatted, however, we can quickly do vectorized operations on it:
date + np.arange(12)
Output: array(['2015-07-04', '2015-07-05', '2015-07-06', '2015-07-07', '2015-07-08', '2015-07-09', '2015-07-10', '2015-07-11', '2015-07-12', '2015-07-13', '2015-07-14', '2015-07-15'], dtype='datetime64[D]')
You can force any desired fundamental unit using one of many format codes; for example, here we'll force a nanosecond-based time
np.datetime64('2015-07-04 12:59:59.50', 'ns')
Output: numpy.datetime64('2015-07-04T12:59:59.500000000')
The following table, drawn from the NumPy datetime64 documentation, lists the available format codes along with the relative and absolute timespans that they can encode:
| Code | Meaning | Time span (relative) | Time span (absolute) |
|---|---|---|---|
Y |
Year | ± 9.2e18 years | [9.2e18 BC, 9.2e18 AD] |
M |
Month | ± 7.6e17 years | [7.6e17 BC, 7.6e17 AD] |
W |
Week | ± 1.7e17 years | [1.7e17 BC, 1.7e17 AD] |
D |
Day | ± 2.5e16 years | [2.5e16 BC, 2.5e16 AD] |
h |
Hour | ± 1.0e15 years | [1.0e15 BC, 1.0e15 AD] |
m |
Minute | ± 1.7e13 years | [1.7e13 BC, 1.7e13 AD] |
s |
Second | ± 2.9e12 years | [ 2.9e9 BC, 2.9e9 AD] |
ms |
Millisecond | ± 2.9e9 years | [ 2.9e6 BC, 2.9e6 AD] |
us |
Microsecond | ± 2.9e6 years | [290301 BC, 294241 AD] |
ns |
Nanosecond | ± 292 years | [ 1678 AD, 2262 AD] |
ps |
Picosecond | ± 106 days | [ 1969 AD, 1970 AD] |
fs |
Femtosecond | ± 2.6 hours | [ 1969 AD, 1970 AD] |
as |
Attosecond | ± 9.2 seconds | [ 1969 AD, 1970 AD] |
Working with time zones
pytz - contains tools for working with the most migrane-inducing piece of time series data: time zones.
Setting time-based index
index = pd.DatetimeIndex(['2014-07-04', '2014-08-04',
'2015-07-04', '2015-08-04'])
data = pd.Series([0, 1, 2, 3], index=index)
data
Output:
2014-07-04 0
2014-08-04 1
2015-07-04 2
2015-08-04 3
dtype: int64
Create a regular date range sequence with pd.date_range()
pd.date_range() accepts a start date, an end date, and an optional frequency code to create a regular sequence of dates. By default, the frequency is one day.
pd.date_range('2015-07-03', '2015-07-10')
Output: DatetimeIndex(['2015-07-03', '2015-07-04', '2015-07-05', '2015-07-06', '2015-07-07', '2015-07-08', '2015-07-09', '2015-07-10'], dtype='datetime64[ns]', freq='D')
Here we will construct a range of hourly timestamps.
pd.date_range('2015-07-03', periods=8, freq='H')
Output:
DatetimeIndex(['2015-07-03 00:00:00', '2015-07-03 01:00:00',
'2015-07-03 02:00:00', '2015-07-03 03:00:00',
'2015-07-03 04:00:00', '2015-07-03 05:00:00',
'2015-07-03 06:00:00', '2015-07-03 07:00:00'],
dtype='datetime64[ns]', freq='H')
Regular sequences of Period or Timedelta values
To create regular sequences of Period or Timedelta values, the very similar pd.period_range() and pd.timedelta_range() functions are useful.
Monthly periods
pd.period_range('2015-07', periods=8, freq='M')
Output:
PeriodIndex(['2015-07', '2015-08', '2015-09', '2015-10', '2015-11', '2015-12',
'2016-01', '2016-02'],
dtype='int64', freq='M')
A sequence of durations increasing by an hour
pd.timedelta_range(0, periods=10, freq='H')
Output:
TimedeltaIndex(['00:00:00', '01:00:00', '02:00:00', '03:00:00', '04:00:00',
'05:00:00', '06:00:00', '07:00:00', '08:00:00', '09:00:00'],
dtype='timedelta64[ns]', freq='H')
for a frequency of 2 hours 30 minutes, we can combine the hour (H) and minute (T) codes as follows
pd.timedelta_range(0, periods=9, freq="2H30T")
TimedeltaIndex(['00:00:00', '02:30:00', '05:00:00', '07:30:00', '10:00:00', '12:30:00', '15:00:00', '17:30:00', '20:00:00'], dtype='timedelta64[ns]', freq='150T')
All of these short codes refer to specific instances of Pandas time series offsets, which can be found in the pd.tseries.offsets module.
H = hours
T = minutes
S = seconds
Another way:
data['date'] = data.date.astype('datetime64[ns]')
data = data.set_index('date')
Another way:
df.index = pd.to_datetime(df.index)
Get index type
type(df.index)
Convert any DatetimeIndex to a PeriodIndex with to_period()
dates = pd.to_datetime([datetime(2015, 7, 3), '4th of July, 2015',
'2015-Jul-6', '07-07-2015', '20150708'])
dates.to_period('D')
Output: PeriodIndex(['2015-07-03', '2015-07-04', '2015-07-06', '2015-07-07', '2015-07-08'], dtype='int64', freq='D')
Create a TimedeltaIndex when a date is subtracted from another
dates - dates[0]
Output: TimedeltaIndex(['0 days', '1 days', '3 days', '4 days', '5 days'], dtype='timedelta64[ns]', freq=None)
Reorganize a dataframe where time is the index, column names correspond to a categorical column, and mean values are a column of their own
df = df.pivot(index='Date', columns='Source', values='Mean')
Scatter plot to see how to measurements correlate over time
plt.scatter(df['1880':'1900']['GCAG'](/BKJackson/BKJackson_Wiki/wiki/'GCAG'), df['1880':'1900']['GISTEMP'](/BKJackson/BKJackson_Wiki/wiki/'GISTEMP'))
Pandas data I/O
Reading multiple files in one line using Pandas and glob
import glob
import pandas as pd
df = pd.concat([pd.read_csv(f, encoding='latin1') for f in glob.glob('data*.csv'), ignore_index=True])
Pandas Visualization
Quick and easy plot of first 100 samples of a column
df.Mean[:100].plot()
Plot two columns within a time index range
plt.plot(df['1880':'1950']['GCAG', 'GISTEMP'](/BKJackson/BKJackson_Wiki/wiki/'GCAG',-'GISTEMP'))
Plotting categories with pandas Categorical
from pandas import Categorical
def categorify(df, keys):
if isinstance(keys, str):
keys = (keys, )
for key in keys:
cat_map = {cat: i for i, cat in enumerate(set(df[key]))}
df[key] = df[key].apply(lambda x: cat_map[x])
Example usage:
df = pd.read_csv('../data/iris.csv')
lib.categorify(df, 'species')
df.plot.scatter('sepal_width', 'sepal_length', c='species', colormap='viridis')
Pandas plot with mask, groupby, sum, and sum
nba[nba["fran_id"] == "Knicks"].groupby("year_id")["pts"].sum().plot()
Index as an ordered set
Pandas objects are designed to facilitate operations such as joins across datasets, which depend on many aspects of set arithmetic.
The Index object follows many of the conventions used by Python's built-in set data structure, so that unions, intersections, differences, and other combinations can be computed in a familiar way:
indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])
indA & indB # intersection
output: Int64Index([3, 5, 7], dtype='int64')
indA | indB # union
output: Int64Index([1, 2, 3, 5, 7, 9, 11], dtype='int64')
indA ^ indB # symmetric difference
Int64Index([1, 2, 9, 11], dtype='int64')
Rolling windows
Rolling mean and standard deviation
rolling = goog.rolling(365, center=True)
data = pd.DataFrame({'input': goog,
'one-year rolling_mean': rolling.mean(),
'one-year rolling_std': rolling.std()})
data['mean_plus'] = data['one-year rolling_mean'] + rolling.std()
data['mean_minus'] = data['one-year rolling_mean'] - rolling.std()
ax = data.plot(style=['-', '--', ':', ':', ':'])
ax.lines[0].set_alpha(0.3)
Pandas tips and articles
Useful Pandas Snippets
Elegantly Reading Multiple CSVs Into Pandas
Automate Boring (ETL) Tasks With Your Own Functions
Build pipelines with Pandas using “pdpipe” - We show how to build intuitive and useful pipelines with Pandas DataFrame using a wonderful little library called pdpipe. Notebook link
Pandas memory usage
df.memory_useage() - Return the memory usage of each column in bytes.
Combining data frames
Tutorial: Concatenation (Combining Data Tables) with Pandas and Python