DataFrame - ChannelCMT/OFO GitHub Wiki

DataFrame

series是处理一维数据,dataframe是处理二维数据。

1、 创建DataFrame:

字典转DataFrames

import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

dict_data = {
'a' : [1, 2, 3, 4, 5],
'b' : ['L', 'K', 'J', 'M', 'Z'],
'c' : np.random.randn(5) }
print (dict_data)
frame_data = pd.DataFrame(dict_data, index=pd.date_range('20160101',periods=5))
print (frame_data)
{'a': [1, 2, 3, 4, 5], 'b': ['L', 'K', 'J', 'M', 'Z'], 'c': array([ 0.47581939, -1.47322633, -0.59169248, -1.02384063, -0.09795421])}
            a  b         c
2016-01-01  1  L  0.475819
2016-01-02  2  K -1.473226
2016-01-03  3  J -0.591692
2016-01-04  4  M -1.023841
2016-01-05  5  Z -0.097954
dic = {}
dic['name'] = frame_data

#Series组合成DataFrame的例子:

s_1 = pd.Series([2, 4, 6, 8, 10], name='APPL')
s_2 = pd.Series([1, 3, 5, 7, 9], name="GOOG")
numbers = pd.concat([s_1, s_2], axis=1)
print (numbers)
print (type(numbers))
   APPL  GOOG
0     2     1
1     4     3
2     6     5
3     8     7
4    10     9
<class 'pandas.core.frame.DataFrame'>

直接创建

df = pd.DataFrame(np.random.randn(5,2),index=range(0,10,2),columns=list('AB'))

2、 修改dataframes的列名、索引

print (numbers.columns)
# 修改列名
numbers.columns = ['MSFT', 'YHOO']
print (numbers)
Index(['APPL', 'GOOG'], dtype='object')
   MSFT  YHOO
0     2     1
1     4     3
2     6     5
3     8     7
4    10     9
print (numbers.index)
# 修改索引
numbers.index = pd.date_range("20160101",periods=len(numbers))
print (numbers)
# 读取Dataframe的值
print (numbers.values)
RangeIndex(start=0, stop=5, step=1)
            MSFT  YHOO
2016-01-01     2     1
2016-01-02     4     3
2016-01-03     6     5
2016-01-04     8     7
2016-01-05    10     9
[[ 2  1]
 [ 4  3]
 [ 6  5]
 [ 8  7]
 [10  9]]
numbers.iloc[2,1]
5

3、 访问序列元素

loc works on labels in the index.

iloc works on the positions in the index (so it only takes integers).

df = pd.DataFrame(np.random.randn(5,2),index=range(0,10,2),columns=list('AB'))
print(df)
          A         B
0 -0.825461  0.272779
2 -0.649611 -0.140999
4  0.507637  1.655755
6  0.568669  0.545818
8 -1.532769  0.505135
print(df.iloc[2](/ChannelCMT/OFO/wiki/2))
          A         B
4  0.507637  1.655755
print(df.loc[2](/ChannelCMT/OFO/wiki/2))
          A         B
2 -0.649611 -0.140999

ix

data = pd.DataFrame({'A':[1,2,3],'B':[4,5,6],'C':[7,8,9]},index=["a","b","c"])
print(data)
   A  B  C
a  1  4  7
b  2  5  8
c  3  6  9
data.loc["b","B"]
5
data.iloc[1,1]
5
print(data.loc['b':'c','B':'C'])
   B  C
b  5  8
c  6  9
print(data.iloc[1:3,1:3])
   B  C
b  5  8
c  6  9
print(data.ix[1,1])
print(data.ix["b","B"])
5
5
print(data.ix[1:3,1:3])
print(data.ix['b':'c','B':'C'])
   B  C
b  5  8
c  6  9
   B  C
b  5  8
c  6  9

4、 布尔索引

与Series一样,有时候我们想过滤DataFrame根据一组标准。我们通过索引DataFrame布尔值。

print(data.loc[data['600036.XSHG'].pct_change() > data['601318.XSHG'].pct_change()].head())
                     600036.XSHG  600050.XSHG  601318.XSHG
datetime                                                  
2017-01-04 15:00:00        69.42         8.98        79.87
2017-01-05 15:00:00        69.85         9.48        80.02
2017-01-06 15:00:00        69.35         9.27        79.38
2017-01-10 15:00:00        69.23         9.13        78.90
2017-01-11 15:00:00        69.27         8.34        78.90

5、 添加、删除列,结合DataFrames /Series

当你已经有一个DataFrame的数据,这很好,但同样重要的是能够增加你的数据

添加新数据

new = pd.read_excel('sz50.xlsx',sheetname='600519.XSHG', index_col='datetime')
data['600519.XSHG'] = new.close
print(data.head(5))
                     600036.XSHG  600050.XSHG  601318.XSHG  600519.XSHG
datetime                                                               
2017-01-03 15:00:00        69.31         8.99        79.89      2078.80
2017-01-04 15:00:00        69.42         8.98        79.87      2186.35
2017-01-05 15:00:00        69.85         9.48        80.02      2155.78
2017-01-06 15:00:00        69.35         9.27        79.38      2180.14
2017-01-09 15:00:00        69.23         9.31        79.38      2165.23

删除某一列

data = data.drop('600050.XSHG', axis=1)
print(data.head(5))
                     600036.XSHG  601318.XSHG  600519.XSHG
datetime                                                  
2017-01-03 15:00:00        69.31        79.89      2078.80
2017-01-04 15:00:00        69.42        79.87      2186.35
2017-01-05 15:00:00        69.85        80.02      2155.78
2017-01-06 15:00:00        69.35        79.38      2180.14
2017-01-09 15:00:00        69.23        79.38      2165.23

合并某一列

gold_stock = pd.read_excel('sz50.xlsx',sheetname='600547.XSHG', index_col='datetime')

df=pd.concat([data,gold_stock['close']], axis=1)
print(df.head(5))
                     600036.XSHG  601318.XSHG  600519.XSHG   close
datetime                                                          
2017-01-03 15:00:00        69.31        79.89      2078.80  323.38
2017-01-04 15:00:00        69.42        79.87      2186.35  324.61
2017-01-05 15:00:00        69.85        80.02      2155.78  330.94
2017-01-06 15:00:00        69.35        79.38      2180.14  327.69
2017-01-09 15:00:00        69.23        79.38      2165.23  323.46
df.rename(columns={'close':'600547.XSHG'}, inplace = True)
print(df.head(5))
                     600036.XSHG  601318.XSHG  600519.XSHG  600547.XSHG
datetime                                                               
2017-01-03 15:00:00        69.31        79.89      2078.80       323.38
2017-01-04 15:00:00        69.42        79.87      2186.35       324.61
2017-01-05 15:00:00        69.85        80.02      2155.78       330.94
2017-01-06 15:00:00        69.35        79.38      2180.14       327.69
2017-01-09 15:00:00        69.23        79.38      2165.23       323.46

缺失的数据(再一次)

把一个真实数据输入DataFrame带给我们与在系列中同样的问题,只是这次更多的维度。我们有和系列相同的方法来访问,如下显示。

print(df[df.isnull().values==True])
                     600036.XSHG  601318.XSHG  600519.XSHG  600547.XSHG
datetime                                                               
2017-04-06 15:00:00        73.44        82.76      2435.12          NaN
2017-04-07 15:00:00        72.97        81.97      2441.39          NaN
2017-04-10 15:00:00        73.01        81.61      2417.22          NaN
2017-04-11 15:00:00        73.36        81.03      2394.98          NaN
2017-04-12 15:00:00        73.51        80.63      2424.62          NaN
2017-04-13 15:00:00        73.01        80.43      2476.43          NaN
2017-04-14 15:00:00        72.63        80.38      2456.92          NaN
2017-04-17 15:00:00        72.90        80.67      2459.84          NaN
2017-04-18 15:00:00        71.78        79.96      2511.41          NaN
2017-04-19 15:00:00        71.31        79.51      2521.23          NaN
2017-04-20 15:00:00        70.93        80.18      2580.63          NaN
2017-04-21 15:00:00        72.51        80.79      2492.65          NaN
2017-04-24 15:00:00        73.05        81.01      2478.98          NaN
2017-04-25 15:00:00        73.17        81.66      2603.24          NaN
2017-04-26 15:00:00        73.71        84.01      2597.03          NaN
2017-04-27 15:00:00        73.24        84.71      2621.69          NaN
2017-04-28 15:00:00        73.78        85.00      2565.84          NaN
2017-05-02 15:00:00        73.36        84.93      2559.81          NaN
2017-05-03 15:00:00        73.05        85.11      2594.92          NaN
2017-05-04 15:00:00        72.36        84.68      2585.10          NaN
2017-05-05 15:00:00        72.51        83.41      2586.47          NaN
2017-05-08 15:00:00        72.51        84.59      2516.57          NaN
2017-05-09 15:00:00        72.47        85.11      2541.11          NaN
2017-05-10 15:00:00        73.40        89.36      2547.88          NaN
2017-05-11 15:00:00        74.33        89.07      2560.74          NaN
2017-05-12 15:00:00        78.49        91.72      2568.88          NaN
2017-05-15 15:00:00        79.80        91.65      2606.84          NaN
2017-05-16 15:00:00        79.11        92.05      2673.14          NaN
2017-05-17 15:00:00        77.91        90.53      2650.46          NaN
df_na = df.fillna(method='ffill')
print(df_na.loc['2017-04-26':'2017-05-17'])
                     600036.XSHG  601318.XSHG  600519.XSHG  600547.XSHG
datetime                                                               
2017-04-26 15:00:00        73.71        84.01      2597.03       317.66
2017-04-27 15:00:00        73.24        84.71      2621.69       317.66
2017-04-28 15:00:00        73.78        85.00      2565.84       317.66
2017-05-02 15:00:00        73.36        84.93      2559.81       317.66
2017-05-03 15:00:00        73.05        85.11      2594.92       317.66
2017-05-04 15:00:00        72.36        84.68      2585.10       317.66
2017-05-05 15:00:00        72.51        83.41      2586.47       317.66
2017-05-08 15:00:00        72.51        84.59      2516.57       317.66
2017-05-09 15:00:00        72.47        85.11      2541.11       317.66
2017-05-10 15:00:00        73.40        89.36      2547.88       317.66
2017-05-11 15:00:00        74.33        89.07      2560.74       317.66
2017-05-12 15:00:00        78.49        91.72      2568.88       317.66
2017-05-15 15:00:00        79.80        91.65      2606.84       317.66
2017-05-16 15:00:00        79.11        92.05      2673.14       317.66
2017-05-17 15:00:00        77.91        90.53      2650.46       317.66

6、 dataframe行列计算

print('mean:','\n',data.mean(axis=0))
print('std:','\n',data.std(axis=0))
print(data.describe())
print(onebegin.head(5))
mean: 
 600036.XSHG      88.889860
601318.XSHG     104.962372
600519.XSHG    2823.996047
dtype: float64
std: 
 600036.XSHG     15.039725
601318.XSHG     22.705049
600519.XSHG    532.922368
dtype: float64
       600036.XSHG  601318.XSHG  600519.XSHG
count   215.000000   215.000000   215.000000
mean     88.889860   104.962372  2823.996047
std      15.039725    22.705049   532.922368
min      69.230000    78.820000  2078.800000
25%      73.380000    81.390000  2397.590000
50%      87.370000   107.730000  2834.120000
75%     103.450000   120.980000  3053.000000
max     121.820000   169.570000  4467.820000
                     600036.XSHG  601318.XSHG  600519.XSHG
datetime                                                  
2017-01-03 15:00:00     1.000000     1.000000     1.000000
2017-01-04 15:00:00     1.001587     0.999750     1.051737
2017-01-05 15:00:00     1.007791     1.001627     1.037031
2017-01-06 15:00:00     1.000577     0.993616     1.048749
2017-01-09 15:00:00     0.998846     0.993616     1.041577