04a 04 Merging Ordered and Time Series Data - HannaAA17/Data-Scientist-With-Python-datacamp GitHub Wiki

Using merge_ordered()

  • ordered data/time series
  • Filling in missing values
  • calling : pd.merge_ordered(df1, df2)
  • forward fill: fills missing with previous value fill_method='ffill'
# Use merge_ordered() to merge gdp and sp500 on year and date
gdp_sp500 = pd.merge_ordered(gdp, sp500, left_on='year', right_on='date', 
                             how='left')

# Use merge_ordered() to merge gdp and sp500, interpolate missing value
gdp_sp500 = pd.merge_ordered(
    gdp, sp500,
    left_on='year', right_on='date', how='left',
    fill_method='ffill' # forward_fill
    )

# Use merge_ordered() to merge gdp and sp500, interpolate missing value
gdp_sp500 = pd.merge_ordered(gdp, sp500, left_on='year', right_on='date', 
                             how='left',  fill_method='ffill')

# Subset the gdp and returns columns
gdp_returns = gdp_sp500['gdp', 'returns'](/HannaAA17/Data-Scientist-With-Python-datacamp/wiki/'gdp',-'returns')

# Print gdp_returns correlation
print (gdp_returns.corr())

# Use merge_ordered() to merge inflation, unemployment with inner join
inflation_unemploy = pd.merge_ordered(
    inflation, unemployment,
    on='date', how='inner'
)

# Print inflation_unemploy 
print(inflation_unemploy)

# Plot a scatter plot of unemployment_rate vs cpi of inflation_unemploy
inflation_unemploy.plot(kind='scatter', x='unemployment_rate', y='cpi')
plt.show()



merge_ordered() caution, multiple columns

  • When using merge_ordered() to merge on multiple columns, the order is important when you combine it with the forward fill feature. The function sorts the merge on columns in the order provided

date -> country

# Merge gdp and pop on country and date with fill
date_ctry = pd.merge_ordered(gdp, pop, on=['country','date'], fill_method='ffill')

output

         date country          gdp  series_code_x         pop series_code_y
0  2015-01-01   China  16700619.84  NYGDPMKTPSAKN  1371220000   SP.POP.TOTL
1  2015-01-01      US   4319400.00  NYGDPMKTPSAKN   320742673   SP.POP.TOTL
2  2015-04-01   China  16993555.30  NYGDPMKTPSAKN   320742673   SP.POP.TOTL
3  2015-04-01      US   4351400.00  NYGDPMKTPSAKN   320742673   SP.POP.TOTL
4  2015-07-01   China  17276271.67  NYGDPMKTPSAKN   320742673   SP.POP.TOTL

country -> date

# Merge gdp and pop on country and date with fill
date_ctry = pd.merge_ordered(gdp, pop, on=['country','date'], fill_method='ffill')

output

         date country          gdp  series_code_x         pop series_code_y
0  2015-01-01   China  16700619.84  NYGDPMKTPSAKN  1371220000   SP.POP.TOTL
1  2015-04-01   China  16993555.30  NYGDPMKTPSAKN  1371220000   SP.POP.TOTL
2  2015-07-01   China  17276271.67  NYGDPMKTPSAKN  1371220000   SP.POP.TOTL
3  2015-09-01   China  17547024.54  NYGDPMKTPSAKN  1371220000   SP.POP.TOTL
4  2016-01-01   China  17820789.83  NYGDPMKTPSAKN  1378665000   SP.POP.TOTL

Using merge_asof()

  • Similar to a merge_ordered() left_join
    • similar features as merge_ordered()
  • Match on the nearest key column and not exact matches.
    • Merge 'on' colomns must be sorted.
  • direction='forward', direction='nearest'
# Use merge_asof() to merge jpm and wells
jpm_wells = pd.merge_asof(
    jpm, wells, on='date_time', suffixes=('', '_wells'), direction='nearest'
)


# Use merge_asof() to merge jpm_wells and bac
jpm_wells_bac = pd.merge_asof(
    jpm_wells, bac, on='date_time', suffixes=('_jpm', '_bac'), direction='nearest'
)

# Compute price diff
price_diffs = jpm_wells_bac.diff()

# Plot the price diff of the close of jpm, wells and bac only
price_diffs.plot(y=['close_jpm', 'close_wells', 'close_bac'])
plt.show()

# Merge gdp and recession on date using merge_asof()
gdp_recession = pd.merge_asof(gdp, recession, on='date')

# Create a list based on the row value of gdp_recession['econ_status']
is_recession = ['r' if s=='recession' else 'g' for s in gdp_recession['econ_status']]

# Plot a bar chart of gdp_recession
gdp_recession.plot(kind='bar', y='gdp', x='date', color=is_recession, rot=90)
plt.show()


Selecting data with .query()

  • .query('some selection statement')
  • stocks.query('nike >= 90')
  • stocks.query('nike > 90 and disney < 140')
  • stocks_long.query('stock=="disney" or (stock=="nike" and close<90)')

Reshaping data with .melt()

  • The melt method allow us to unpivot our dataset to long format
In [2]: ten_yr.head()
Out[2]: 
  metric  2007-02-01  2007-03-01  2007-04-01  2007-05-01  ...  2009-08-01  2009-09-01  2009-10-01  2009-11-01  2009-12-01
0   open    0.033491   -0.060449    0.025426   -0.004312  ...   -0.006687   -0.046564   -0.032068    0.034347   -0.050544
1   high   -0.007338   -0.040657    0.022046    0.030576  ...    0.031864   -0.090324    0.012447   -0.004191    0.099327
2    low   -0.016147   -0.007984    0.031075   -0.002168  ...    0.039510   -0.035946   -0.050733    0.030264    0.007188
3  close   -0.057190    0.021538   -0.003873    0.056156  ...   -0.028563   -0.027639    0.025703   -0.056309    0.200562

[4 rows x 36 columns]
# Use melt on ten_yr, unpivot everything besides the metric column
bond_perc = ten_yr.melt(id_vars='metric', var_name='date', value_name='close')
In [3]: bond_perc.head()
Out[3]: 
  metric        date     close
0   open  2007-02-01  0.033491
1   high  2007-02-01 -0.007338
2    low  2007-02-01 -0.016147
3  close  2007-02-01 -0.057190
4   open  2007-03-01 -0.060449

In [4]: dji.head()
Out[4]: 
         date     close
0  2007-02-01  0.005094
1  2007-03-01 -0.026139
2  2007-04-01  0.048525
3  2007-05-01  0.052007
4  2007-06-01 -0.016070
# Use query on bond_perc to select only the rows where metric=close
bond_perc_close = bond_perc.query('metric=="close"')

# Merge (ordered) dji and bond_perc_close on date with an inner join
dow_bond = pd.merge_ordered(dji, bond_perc_close, on='date', how='inner',suffixes=['_dow','_bond'])

In [5]: dow_bond.head()
Out[5]: 
         date  close_dow metric  close_bond
0  2007-02-01   0.005094  close   -0.057190
1  2007-03-01  -0.026139  close    0.021538
2  2007-04-01   0.048525  close   -0.003873
3  2007-05-01   0.052007  close    0.056156
4  2007-06-01  -0.016070  close    0.029243

# Plot only the close_dow and close_bond columns
dow_bond.plot(y=['close_dow','close_bond'], x='date', rot=90)
plt.show()