Pandas - mwicat/personal GitHub Wiki

Create DataFrame

https://www.geeksforgeeks.org/creating-a-pandas-dataframe-using-list-of-tuples/

From CVS

This infers headers from first line:

import pandas as pd 
data = pd.read_csv("filename.csv") 

From CSV string

data = '''
1 5
2 25
3 50
'''

import pandas as pd 
data = pd.read_csv(pd.compat.StringIO(data)) 

From list of tuples/list

import pandas as pd 
data = [('Peter', 18, 7), 
        ('Riff', 15, 6), 
        ('John', 17, 8), 
        ('Michel', 18, 7), 
        ('Sheli', 17, 5) ] 
df = pd.DataFrame(data, columns =['Name', 'Age', 'Score']) 

From list of namedtuple

from collections import namedtuple
import pandas as pd

User = namedtuple('User', 'user_id data_type count size')
users = [User('user', 10, 20, 30) for _ in xrange(10)]

df = pd.DataFrame(users, columns=User._fields)
df.to_pickle('storage.pkl')

df = pd.read_pickle('storage.pkl')

Get nth row

df.iloc[n]

Add calculated column to dataframe

df['newcol'] = df.apply(lambda row: row.col1, axis=1)

sqlite to DataFrame

from sqlalchemy import create_engine
engine = create_engine('sqlite:///%s' % args.file)
df = pd.read_sql('storage', engine)

Operations

Join

  • how = left|right|outer|inner
df.set_index('key').join(other.set_index('key'), how='left')

Groupby and top 10

data.groupby('Country').Country.count().sort_values(ascending=False).head(10)

Show multiple time datas

import pandas as pd
import matplotlib.pyplot as plt
import matplotlib
matplotlib.style.use('ggplot')

dfs = []

for num, statfile in enumerate(args.statfiles):
    stats = pickle.load(open(statfile))
    ts = stats['timestamps']
    s = pd.Series(index=[x-ts[0] for x in ts], data=1)
    s2 = s.cumsum()
    dfs.append(s2.to_frame(statfile))

df = pd.concat(dfs).sort_index().fillna(method='ffill')

ax = df.plot(grid=True)
ax.set_xlabel("time [s]")
ax.set_ylabel("# requests")

plt.show()

Iterate over dataframe

for index, row in df.iterrows():
    print(row['c1'], row['c2'])

Filter dataframe

df[(df.A == 1) & (df.D == 6)]
df.query('a>0 and 0<b<2')