Pandas - mwicat/personal GitHub Wiki
Create DataFrame
https://www.geeksforgeeks.org/creating-a-pandas-dataframe-using-list-of-tuples/
From CVS
This infers headers from first line:
import pandas as pd
data = pd.read_csv("filename.csv")
From CSV string
data = '''
1 5
2 25
3 50
'''
import pandas as pd
data = pd.read_csv(pd.compat.StringIO(data))
From list of tuples/list
import pandas as pd
data = [('Peter', 18, 7),
('Riff', 15, 6),
('John', 17, 8),
('Michel', 18, 7),
('Sheli', 17, 5) ]
df = pd.DataFrame(data, columns =['Name', 'Age', 'Score'])
From list of namedtuple
from collections import namedtuple
import pandas as pd
User = namedtuple('User', 'user_id data_type count size')
users = [User('user', 10, 20, 30) for _ in xrange(10)]
df = pd.DataFrame(users, columns=User._fields)
df.to_pickle('storage.pkl')
df = pd.read_pickle('storage.pkl')
Get nth row
df.iloc[n]
Add calculated column to dataframe
df['newcol'] = df.apply(lambda row: row.col1, axis=1)
sqlite to DataFrame
from sqlalchemy import create_engine
engine = create_engine('sqlite:///%s' % args.file)
df = pd.read_sql('storage', engine)
Operations
Join
- how = left|right|outer|inner
df.set_index('key').join(other.set_index('key'), how='left')
Groupby and top 10
data.groupby('Country').Country.count().sort_values(ascending=False).head(10)
Show multiple time datas
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib
matplotlib.style.use('ggplot')
dfs = []
for num, statfile in enumerate(args.statfiles):
stats = pickle.load(open(statfile))
ts = stats['timestamps']
s = pd.Series(index=[x-ts[0] for x in ts], data=1)
s2 = s.cumsum()
dfs.append(s2.to_frame(statfile))
df = pd.concat(dfs).sort_index().fillna(method='ffill')
ax = df.plot(grid=True)
ax.set_xlabel("time [s]")
ax.set_ylabel("# requests")
plt.show()
Iterate over dataframe
for index, row in df.iterrows():
print(row['c1'], row['c2'])
Filter dataframe
df[(df.A == 1) & (df.D == 6)]
df.query('a>0 and 0<b<2')