04 03 Merging Data - ellen567/Data-Science-Notebook GitHub Wiki

pd.merge() extend pd.concat()'s ability to align rows using multiple columns rather than indexes

Merging DataFrames

  • pd.merge(population,cities): merge on all columns appearing both dataframes

Merging on specific columns

merge_by_city = pd.merge(revenue,managers,on='city')

Merging on columns with non-matching labels

# Merge revenue & managers on 'city' & 'branch': combined
combined = pd.merge(revenue, managers, left_on ='city', right_on = 'branch')

          city  branch_id state  revenue
0       Austin         10    TX      100
1       Denver         20    CO       83
2  Springfield         30    IL        4
3    Mendocino         47    CA      200

        branch  branch_id state   manager
0       Austin         10    TX  Charlers
1       Denver         20    CO      Joel
2    Mendocino         47    CA     Brett
3  Springfield         31    MO     Sally

# output:
         city  branch_id_x state_x  revenue       branch  branch_id_y state_y   manager
0       Austin           10      TX      100       Austin           10      TX  Charlers
1       Denver           20      CO       83       Denver           20      CO      Joel
2  Springfield           30      IL        4  Springfield           31      MO     Sally
3    Mendocino           47      CA      200    Mendocino           47      CA     Brett

Using suffixes

pd.merge(bronze, gold, on=['NOC', 'Country'], suffixes['_bronze', '_gold']

Joining DataFrames

  • df1.append(df2): stacking vertically
  • pd.contact([df1, df2]):
    • stacking vertically or horizontally
    • simple inner/outer joins on indexes
  • df1.join(df2): inner/outer/left/right joins on indexes
  • pd.merge([df1,df2]): many joins on multiple columns, default=inner

Left & right merging on multiple columns

# Merge revenue and sales: revenue_and_sales
revenue_and_sales = pd.merge(revenue,sales,how='right',on=['city','state'])

# Print revenue_and_sales
print(revenue_and_sales)

# Merge sales and managers: sales_and_managers
sales_and_managers = pd.merge(sales,managers,how="left",left_on=['city','state'],right_on=['branch','state'])

# Print sales_and_managers
print(sales_and_managers)

         city  branch_id state  revenue  units
0       Austin       10.0    TX    100.0      2
1       Denver       20.0    CO     83.0      4
2  Springfield       30.0    IL      4.0      1
3    Mendocino       47.0    CA    200.0      1
4  Springfield        NaN    MO      NaN      5
          city state  units       branch  branch_id   manager
0    Mendocino    CA      1    Mendocino       47.0     Brett
1       Denver    CO      4       Denver       20.0      Joel
2       Austin    TX      2       Austin       10.0  Charlers
3  Springfield    MO      5  Springfield       31.0     Sally
4  Springfield    IL      1          NaN        NaN       NaN

Merging DataFrames with outer join

# Perform the first merge: merge_default
merge_default = pd.merge(sales_and_managers,revenue_and_sales)

# Print merge_default
print(merge_default)

# Perform the second merge: merge_outer
merge_outer = pd.merge(sales_and_managers,revenue_and_sales,how='outer')

# Print merge_outer
print(merge_outer)

# Perform the third merge: merge_outer_on
merge_outer_on = pd.merge(sales_and_managers,revenue_and_sales,how='outer',on=['city','state'])

# Print merge_outer_on
print(merge_outer_on)

       city state  units     branch  branch_id   manager  revenue
0  Mendocino    CA      1  Mendocino       47.0     Brett    200.0
1     Denver    CO      4     Denver       20.0      Joel     83.0
2     Austin    TX      2     Austin       10.0  Charlers    100.0
          city state  units       branch  branch_id   manager  revenue
0    Mendocino    CA      1    Mendocino       47.0     Brett    200.0
1       Denver    CO      4       Denver       20.0      Joel     83.0
2       Austin    TX      2       Austin       10.0  Charlers    100.0
3  Springfield    MO      5  Springfield       31.0     Sally      NaN
4  Springfield    IL      1          NaN        NaN       NaN      NaN
5  Springfield    IL      1          NaN       30.0       NaN      4.0
6  Springfield    MO      5          NaN        NaN       NaN      NaN
          city state  units_x       branch  branch_id_x   manager  branch_id_y  revenue  units_y
0    Mendocino    CA        1    Mendocino         47.0     Brett         47.0    200.0        1
1       Denver    CO        4       Denver         20.0      Joel         20.0     83.0        4
2       Austin    TX        2       Austin         10.0  Charlers         10.0    100.0        2
3  Springfield    MO        5  Springfield         31.0     Sally          NaN      NaN        5
4  Springfield    IL        1          NaN          NaN       NaN         30.0      4.0        1

Ordered merges

pd.merge(hardware, software, how='outer').sorted_values('Date') = pd.merge_ordered(hardware,software)

# Perform the first ordered merge: tx_weather
tx_weather = pd.merge_ordered(austin, houston)

# Print tx_weather
print(tx_weather)

         date ratings
    0 2016-01-01  Cloudy
    1 2016-01-04   Rainy
    2 2016-01-17   Sunny
    3 2016-02-08  Cloudy
    4 2016-03-01   Sunny

# Perform the second ordered merge: tx_weather_suff
tx_weather_suff = pd.merge_ordered(austin, houston, on='date',suffixes=['_aus','_hus'])

# Print tx_weather_suff
print(tx_weather_suff)

            date ratings_aus ratings_hus
    0 2016-01-01      Cloudy      Cloudy
    1 2016-01-04         NaN       Rainy
    2 2016-01-17       Sunny         NaN
    3 2016-02-08      Cloudy         NaN
    4 2016-03-01         NaN       Sunny

# Perform the third ordered merge: tx_weather_ffill
tx_weather_ffill = pd.merge_ordered(austin, houston, on='date',suffixes=['_aus','_hus'],fill_method='ffill')

# Print tx_weather_ffill
print(tx_weather_ffill)

           date ratings_aus ratings_hus
    0 2016-01-01      Cloudy      Cloudy
    1 2016-01-04      Cloudy       Rainy
    2 2016-01-17       Sunny       Rainy
    3 2016-02-08      Cloudy       Rainy
    4 2016-03-01      Cloudy       Sunny

Using merge_asof()

  • Similar to pd.merge_ordered(), the pd.merge_asof() function will also merge values in order using the on column, but for each row in the left DataFrame, only rows from the right DataFrame whose 'on' column values are less than the left value will be kept.
oil
        Date  Price
0 1970-01-01   3.35
1 1970-02-01   3.35
2 1970-03-01   3.35
3 1970-04-01   3.35
4 1970-05-01   3.35

auto
    mpg  cyl  displ   hp  weight  accel         yr origin                       name
0  18.0    8  307.0  130    3504   12.0 1970-01-01     US  chevrolet chevelle malibu
1  15.0    8  350.0  165    3693   11.5 1970-01-01     US          buick skylark 320
2  18.0    8  318.0  150    3436   11.0 1970-01-01     US         plymouth satellite
3  16.0    8  304.0  150    3433   12.0 1970-01-01     US              amc rebel sst
4  17.0    8  302.0  140    3449   10.5 1970-01-01     US                ford torino

# Merge auto and oil: merged
merged = pd.merge_asof(auto,oil,left_on='yr',right_on='Date')

# Print the tail of merged
print(merged.tail())

      mpg  cyl  displ  hp  weight  ...         yr  origin             name       Date  Price
387  27.0    4  140.0  86    2790  ... 1982-01-01      US  ford mustang gl 1982-01-01  33.85
388  44.0    4   97.0  52    2130  ... 1982-01-01  Europe        vw pickup 1982-01-01  33.85
389  32.0    4  135.0  84    2295  ... 1982-01-01      US    dodge rampage 1982-01-01  33.85
390  28.0    4  120.0  79    2625  ... 1982-01-01      US      ford ranger 1982-01-01  33.85
391  31.0    4  119.0  82    2720  ... 1982-01-01      US       chevy s-10 1982-01-01  33.85

[5 rows x 11 columns]

# Resample merged: yearly
yearly = merged.resample('A',on='Date')['mpg','Price'](/ellen567/Data-Science-Notebook/wiki/'mpg','Price').mean()

# Print yearly
print(yearly)

                 mpg  Price
Date                        
1970-12-31  17.689655   3.35
1971-12-31  21.111111   3.56
1972-12-31  18.714286   3.56
1973-12-31  17.100000   3.56
1974-12-31  22.769231  10.11
1975-12-31  20.266667  11.16
1976-12-31  21.573529  11.16
1977-12-31  23.375000  13.90
1978-12-31  24.061111  14.85
1979-12-31  25.093103  14.85
1980-12-31  33.803704  32.50
1981-12-31  30.185714  38.00
1982-12-31  32.000000  33.85

# print yearly.corr()
print(yearly.corr())

            mpg     Price
mpg    1.000000  0.948677
Price  0.948677  1.000000