pandasDataAggregationGroup - juedaiyuer/researchNote GitHub Wiki
#Data Aggregation and Group Operations#
##Imports##
>>> from pandas import Series,DataFrame
>>> import pandas as pd
>>> import numpy as np
>>> df=DataFrame({'key1':['a','a','b','b','a'],
... 'key2':['one','two','one','two','one'],
... 'data1':np.random.randn(5),
... 'data2':np.random.randn(5)})
>>> df
      data1     data2 key1 key2
0 -0.456158  0.783958    a  one
1 -1.437943  1.378467    a  two
2  1.046638  0.759526    b  one
3  0.771500  0.012541    b  two
4  0.522649 -2.933776    a  one
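Group by key1 and compute the mean of the data1 column. The groupby call only returns a GroupBy object; nothing is computed yet.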
>>> grouped=df['data1'].groupby(df['key1'])
>>> grouped
<pandas.core.groupby.SeriesGroupBy object at 0x7f83a4118d50>
Call the mean method to compute the average of each group
>>> grouped.mean()
key1
a   -0.457151
b    0.909069
Name: data1, dtype: float64
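Because the session above uses np.random.randn, its numbers change on every run. The sketch below repeats the same grouping with fixed, made-up values (an assumption for illustration, not the data from the session) and checks the grouped means against a manual boolean-mask calculation.

```python
import pandas as pd

# Hypothetical fixed data, not the random values from the session above
df = pd.DataFrame({
    'key1': ['a', 'a', 'b', 'b', 'a'],
    'key2': ['one', 'two', 'one', 'two', 'one'],
    'data1': [1.0, 2.0, 3.0, 4.0, 6.0],
    'data2': [10.0, 20.0, 30.0, 40.0, 60.0],
})

# Split data1 by the values of key1, then average each group
grouped = df['data1'].groupby(df['key1'])
group_means = grouped.mean()

# The same averages computed by hand with boolean masks
manual_a = df.loc[df['key1'] == 'a', 'data1'].mean()  # (1 + 2 + 6) / 3
manual_b = df.loc[df['key1'] == 'b', 'data1'].mean()  # (3 + 4) / 2

print(group_means)
```

The result is a Series indexed by the distinct values of key1, which is why its index carries the name key1.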
Passing several arrays as group keys
>>> means=df['data1'].groupby([df['key1'],df['key2']]).mean()
>>> means
key1  key2
a     one     0.033245
      two    -1.437943
b     one     1.046638
      two     0.771500
Name: data1, dtype: float64
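Grouping the data by two keys yields a Series with a hierarchical index (a MultiIndex) built from the key pairs that occur in the data. Calling unstack moves the inner level (key2) into the columns.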
>>> means.unstack()
key2       one       two
key1
a     0.033245 -1.437943
b     1.046638  0.771500
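A minimal sketch of the two-key grouping and the unstack step, again using made-up fixed values rather than the random data above, so the hierarchical index and the reshaped table can be inspected directly.

```python
import pandas as pd

# Hypothetical fixed data, chosen so the group means are easy to verify
df = pd.DataFrame({
    'key1': ['a', 'a', 'b', 'b', 'a'],
    'key2': ['one', 'two', 'one', 'two', 'one'],
    'data1': [1.0, 2.0, 3.0, 4.0, 6.0],
})

# Two keys produce a Series indexed by (key1, key2) pairs
means = df['data1'].groupby([df['key1'], df['key2']]).mean()

# unstack pivots the inner index level (key2) into columns
wide = means.unstack()

print(means)
print(wide)
```

Individual entries can be looked up with a tuple on the Series, for example `means.loc[('a', 'one')]`, or with row and column labels on the unstacked frame, for example `wide.loc['a', 'one']`.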
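The group keys can be any arrays of the appropriate length, that is, one entry per row of df. Note that the values shown below do not correspond to the df printed above; np.random.randn generates new values on every run.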
>>> states=np.array(['Ohio','California','California','Ohio','Ohio'])
>>> years=np.array([2005,2005,2006,2005,2006])
>>> df['data1'].groupby([states,years]).mean()
California  2005    1.596126
            2006   -0.828750
Ohio        2005    0.373009
            2006    1.012563
Name: data1, dtype: float64
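The keys do not have to live in the DataFrame at all. The following sketch groups a Series by two standalone NumPy arrays; the data values are made up for illustration, while the states and years arrays are the ones used above.

```python
import numpy as np
import pandas as pd

# Hypothetical fixed values standing in for df['data1']
data1 = pd.Series([1.0, 2.0, 3.0, 4.0, 6.0])

# External key arrays: one entry per row of data1
states = np.array(['Ohio', 'California', 'California', 'Ohio', 'Ohio'])
years = np.array([2005, 2005, 2006, 2005, 2006])

# Rows are matched to keys by position, then averaged per (state, year) pair
by_state_year = data1.groupby([states, years]).mean()

print(by_state_year)
```

Only rows 0 and 3 share the pair (Ohio, 2005), so that group averages two values; every other pair holds a single row.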
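Column names (which may be strings, numbers, or other Python objects) can also be used as group keys. The non-numeric key2 column is absent from the result; the pandas version used here dropped it silently, whereas pandas 2.0 and later raise an error unless mean(numeric_only=True) is used. The values shown below do not correspond to the df printed at the top, since np.random.randn generates new values on every run.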
>>> df.groupby('key1').mean()
         data1     data2
key1
a     0.821883  0.274269
b     0.030154 -0.010252
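A short sketch of grouping by column name, with made-up fixed values. It assumes a reasonably recent pandas in which `numeric_only` must be passed explicitly to skip the string column key2. It also checks that selecting a column from the grouped DataFrame gives the same result as grouping the Series directly.

```python
import pandas as pd

# Hypothetical fixed data, not the random values from the session above
df = pd.DataFrame({
    'key1': ['a', 'a', 'b', 'b', 'a'],
    'key2': ['one', 'two', 'one', 'two', 'one'],
    'data1': [1.0, 2.0, 3.0, 4.0, 6.0],
    'data2': [10.0, 20.0, 30.0, 40.0, 60.0],
})

# Group by a column name; numeric_only=True leaves out the string column key2
by_key1 = df.groupby('key1').mean(numeric_only=True)

# Selecting one column after grouping by name ...
col_means = df.groupby('key1')['data1'].mean()
# ... matches grouping that Series by the key column directly
series_means = df['data1'].groupby(df['key1']).mean()

print(by_key1)
```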