Pandas & NumPy Tips

NumPy advanced

pre-allocated storage!!!  
ndim (array rank), shape, size, dtype, itemsize, strides, data, flat, base, flags, __array_interface__; for understanding an N-dimensional array, a tree view works better than grids & cubes
slicing returns a view; fancy indexing returns a copy

import numpy as np

a = np.arange(24).reshape(3, 8)
b = a.reshape(4, 6)  # a view sharing the same buffer; only shape/strides differ
c = a.T              # transpose is also a view, with reversed strides
a.flags              # OWNDATA, C_CONTIGUOUS, WRITEABLE, ...
a.__array_interface__
np.info(a)

# Striding: step through the data with a fixed-size window, i.e. a window W and a step S.
# Given a 1-D array Arr, a window W and a step S:
#   new_shape = ((Arr.size - W) // S + 1, W)
#   new_strides = (S * Arr.strides[-1], Arr.strides[-1])
np.lib.stride_tricks.as_strided(Arr, new_shape, new_strides)
# check: https://medium.com/analytics-vidhya/a-thorough-understanding-of-numpy-strides-and-its-application-in-data-processing-e40eab1c82fe
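
A minimal runnable sketch of the formula above; the array contents, W=4 and S=2 are made-up example values:

import numpy as np
from numpy.lib.stride_tricks import as_strided

arr = np.arange(10)
W, S = 4, 2
new_shape = ((arr.size - W) // S + 1, W)
new_strides = (S * arr.strides[-1], arr.strides[-1])
windows = as_strided(arr, new_shape, new_strides)
# windows -> [[0 1 2 3], [2 3 4 5], [4 5 6 7], [6 7 8 9]]
windows.mean(axis=1)  # strided moving average: [1.5, 3.5, 5.5, 7.5]

Note that as_strided does no bounds checking, so a wrong shape/strides pair silently reads out-of-bounds memory; on NumPy >= 1.20, np.lib.stride_tricks.sliding_window_view is the safer choice when the step is 1.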

Misc

# Compute maximum drawdown
(df['close'].cummax() - df['close']).max()      # absolute drawdown, in price units
(1 - df['close'] / df['close'].cummax()).max()  # relative drawdown, as a fraction of the running peak
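
A quick check on made-up prices:

import pandas as pd

close = pd.Series([100, 120, 90, 110, 80])
(close.cummax() - close).max()      # 40: peak 120 -> trough 80
(1 - close / close.cummax()).max()  # 0.333...: a 33% drop from the peak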


# first index where both A and B are nonzero and their mean exceeds 15
df[(df.A.ne(0) & df.B.ne(0)) & ((df.A + df.B) / 2).gt(15)].first_valid_index()

Simple/cumulative returns and log/cumulative log returns

# simple returns -> cumulative returns
df['Returns'] = df['close'].pct_change()
df['cumulative_returns'] = (1 + df['Returns']).cumprod() - 1

# cumulative returns -> simple returns
df['cumulative_returns'] = df['close'] / df['close'].iloc[0] - 1
df['Returns'] = (1 + df['cumulative_returns']).pct_change()

# log returns accumulate by summation; expm1 converts back to the simple cumulative return
df['log_return'] = np.log(1 + df['Returns'])
df['log_cum_return'] = np.cumsum(df['log_return'])
df['cumulative_returns'] = np.expm1(df['log_cum_return'])
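
A sanity check on a toy price series (values invented) that the three routes agree:

import numpy as np
import pandas as pd

close = pd.Series([100.0, 110.0, 99.0, 108.9])
r = close.pct_change()
via_cumprod = (1 + r).cumprod() - 1
via_price = close / close.iloc[0] - 1
via_logs = np.expm1(np.log1p(r).cumsum())
# all three give [NaN, 0.10, -0.01, 0.089] up to float rounding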

Data transformation

# one file per quarter; dfs collects the per-quarter frames in a loop (not shown)
df = pd.read_excel(f'select_pool_{label}_weight.xlsx', usecols=['证券代码', 'Subpool', 'weight'], dtype={'Subpool': SUBPOOL_DTYPE}, engine='openpyxl')
df.insert(0, 'quarter', quarter)
df = pd.concat(dfs, ignore_index=True)
df.rename(columns={'证券代码': 'symbol', 'Subpool': 'subpool'}, inplace=True)
df.set_index(['quarter', 'subpool', 'symbol'], inplace=True)
start_dt = start_quarter.asfreq('D', 'start').to_timestamp().date()
end_dt = end_quarter.asfreq('D', 'end').to_timestamp().date()
pool.index.levels[0]  # includes values that may be unused; drop them with pandas.MultiIndex.remove_unused_levels
pool.loc[quarter].index.get_level_values(1)  # values taken from the column, so all of them are actually in use
pool.loc[quarter].index.unique(level=1)  # values taken from the column, so all of them are actually in use
df['eob'] = df.eob.dt.to_period('D')
df.sort_values(by='eob', inplace=True)
df.reset_index(drop=True, inplace=True)
df['volume'] = df.volume.astype('int')
df.to_feather(path, compression='lz4')
df = pd.read_feather(get_idx_path())
df = df.pivot(index='eob', columns='symbol').swaplevel(0, 1, axis=1).sort_index(axis=1, level=0)
returns = pd.concat([idx_ohlcv.open.iloc[[0]], idx_ohlcv.close]).pct_change().dropna()
stock_returns = stock_ohlcv.swaplevel(0, 1, axis=1).close.pct_change().fillna(0)
quarterly_subpool = pool.loc[quarter].loc[label]
subpool_weight = quarterly_subpool.weight.sum()
quarterly_active_subpool = quarterly_subpool[quarterly_subpool.index.isin(stock_returns.columns)]
stock_returns[quarterly_active_subpool.index].dot(quarterly_active_subpool.weight) / subpool_weight
stock_returns.dot(pool.loc[quarter].droplevel(0).loc[stock_returns.columns].weight)
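
A toy illustration of the .dot pattern above (symbols and weights invented): weighting per-stock returns by portfolio weights yields the pool's return series.

import pandas as pd

stock_returns = pd.DataFrame({'AAA': [0.01, -0.02], 'BBB': [0.03, 0.00]})
weights = pd.Series({'AAA': 0.4, 'BBB': 0.6})
pool_returns = stock_returns.dot(weights) / weights.sum()
# day 1: 0.4*0.01 + 0.6*0.03 = 0.022; day 2: 0.4*-0.02 = -0.008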

Assign quarter-level values to a daily dataframe

df_daily = pd.DataFrame({'val': range(1, 201)}, index=pd.date_range('2009-06-20', periods=200))
quarter_end_dts = df_daily.index.to_period('Q').asfreq('D').to_timestamp()  # quarter-end date of each day
quarter_index = quarter_end_dts.unique()
df_quarter = pd.DataFrame({'qr_val': range(101, 101 + len(quarter_index))}, index=quarter_index)
df_daily['qr_val'] = df_quarter.loc[quarter_end_dts, 'qr_val'].to_numpy()  # look up each day's quarter-end value
# df_daily['qr_val'] = df_quarter['qr_val'].asof(quarter_end_dts).to_numpy(dtype='int')
# df_daily['qr_val'] = df_daily.apply(lambda x: df_quarter.at[x.name.to_period('Q').asfreq('D').to_timestamp(), 'qr_val'], axis=1)
df3 = pd.DataFrame({'val3': [f'#{i}' for i in range(1, len(df_quarter) + 1)]}, index=[quarter_end_dt - pd.Timedelta(np.random.randint(0, 5), 'days') for quarter_end_dt in df_quarter.index])
df_daily['val3_backward'] = df3['val3'].asof(df_daily.index)  # backward: most recent value at or before each date
df_daily['val3_forward'] = pd.merge_asof(df_daily, df3, left_index=True, right_index=True, direction='forward')['val3']  # forward: next value at or after each date
df_daily.loc['2009-06-20':'2009-07-01', 'x'] = 'b'  # setting with enlargement creates the new column 'x'
df_quarter
df3
df_daily['2009-06-24':'2009-07-04']
df_daily['2009-09-24':'2009-10-04']

Time-handling APIs

numpy: where, select, argpartition, sort, diff, count_nonzero  
pandas: Timestamp, Period, date_range, period_range, to_datetime, to_timedelta, merge_asof  
DataFrame: resample, idxmax, idxmin, asof, at_time, between_time, first, last, first_valid_index, last_valid_index, to_period, to_timestamp, asfreq, mask, where   
Index: get_loc, asof, asof_locs, argmin, argmax, argsort, searchsorted  
PeriodIndex: asfreq  
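
A few of these in action on a made-up minute-bar series (times and values are illustrative only):

import numpy as np
import pandas as pd

idx = pd.date_range('2020-01-01 09:30', periods=5, freq='min')
s = pd.Series([1, 3, 2, 5, 4], index=idx)
s.idxmax()                    # Timestamp('2020-01-01 09:33:00')
s.between_time('09:31', '09:33')
s.resample('2min').last()
np.where(s > 2, 'hi', 'lo')   # elementwise branch, from the numpy list above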

Get all the details of the student with the maximum/minimum score, as shown below

df.loc[df['Score'].idxmax()]
df.loc[df['Score'].idxmin()]

Missing-data discovery and handling

total = df_train.isnull().sum().sort_values(ascending=False)
percent = (df_train.isnull().sum() / len(df_train)).sort_values(ascending=False)  # fraction of all rows, not of non-null rows
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data.head(20)

df_train = df_train.drop(columns=missing_data[missing_data['Total'] > 1].index)
df_train = df_train.drop(df_train.loc[df_train['Electrical'].isnull()].index)
df_train.isnull().sum().max()  # just checking that there's no missing data left

Set up formatting so large numbers aren't displayed in scientific notation

pd.set_option('display.float_format', lambda x: '%.3f' % x)

split pandas dataframe into two random subsets

from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size=0.2)
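
The same 80/20 split without scikit-learn, using plain pandas (random_state is an arbitrary seed):

train = df.sample(frac=0.8, random_state=42)
test = df.drop(train.index)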

Get the unique values of ‘B’ column
df.B.unique()

Get number of unique values in column 'C'
df.C.nunique(dropna=True)

Get frequency count of column A

count = df['A'].value_counts()         # frequency of each value in A, sorted descending
count = df.groupby(['A']).count()      # non-null count per column within each group
count = df.groupby(['A', 'B']).size()  # number of rows per (A, B) group

Get five largest/smallest values in column age

df.nlargest(5, ['Age'])
df.nsmallest(5, ['Age'])

Get duplicate rows

duplicate_except_first = df[df.duplicated()]
duplicate_except_last = df[df.duplicated(keep='last')]
duplicate_with_all = df[df.duplicated(keep=False)]

Drop duplicate rows

df.drop_duplicates()
df.drop_duplicates(keep='last')
df.drop_duplicates(keep=False)  # drop every row that has a duplicate

Output a dataframe without NaN values for a particular column
df[df['id'].notnull()]

Output the rows whose values in a particular column cannot be converted to a number
df[pd.to_numeric(df.profit, errors='coerce').isnull()]

Output the rows whose values in a particular column can be converted to a number
df[pd.to_numeric(df.profit, errors='coerce').notnull()]

Change dtypes
When a dataset gets large, downcast the dtypes to save memory

df[col_int] = df[col_int].astype('int32')
df[col_float] = df[col_float].astype('float32')
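
The same idea can be automated with pd.to_numeric's downcast option, which picks the smallest dtype that holds the values; a small sketch with invented columns:

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [0.5, 1.5, 2.5]})
df['a'] = pd.to_numeric(df['a'], downcast='integer')  # int64 -> int8 here
df['b'] = pd.to_numeric(df['b'], downcast='float')    # float64 -> float32
df.memory_usage(deep=True)  # confirm the savings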

Check the number of missing values in each column
df.isnull().sum().sort_values(ascending=False)

Remove white space in columns

df[col] = df[col].str.lstrip()  # leading whitespace only
df[col] = df[col].str.strip()   # both leading and trailing

Convert from string to datetime format
df.insert(loc=2, column='timestamp', value=pd.to_datetime(df.transdate, format='%Y-%m-%d %H:%M:%S.%f'))

Concat 2 columns with strings if the last 3 letters of the first column are 'pil'

mask = df['col_1'].str.endswith('pil', na=False)
col_new = df[mask]['col_1'] + df[mask]['col_2']
col_new.replace('pil', ' ', regex=True, inplace=True)  # replace 'pil' with a space

Convert numerical values in a column to discrete intervals based on a specified range
Use pd.cut for fixed bin edges and pd.qcut for quantile-based bins when converting a continuous variable to a categorical one, as sketched below
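
A minimal sketch; the ages, bin edges, and labels are made up:

import pandas as pd

ages = pd.Series([5, 22, 37, 59, 80])
pd.cut(ages, bins=[0, 18, 65, 120], labels=['minor', 'adult', 'senior'])  # fixed edges
pd.qcut(ages, q=2, labels=['younger half', 'older half'])  # two quantile buckets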