04 03 Merging Data - ellen567/Data-Science-Notebook GitHub Wiki
pd.merge()
extend pd.concat()
's ability to align rows using multiple columns rather than indexes
Merging DataFrames
pd.merge(population,cities)
: merge on all columns appearing both dataframes
Merging on specific columns
merge_by_city = pd.merge(revenue,managers,on='city')
Merging on columns with non-matching labels
# Merge revenue & managers on 'city' & 'branch': combined
combined = pd.merge(revenue, managers, left_on ='city', right_on = 'branch')
city branch_id state revenue
0 Austin 10 TX 100
1 Denver 20 CO 83
2 Springfield 30 IL 4
3 Mendocino 47 CA 200
branch branch_id state manager
0 Austin 10 TX Charlers
1 Denver 20 CO Joel
2 Mendocino 47 CA Brett
3 Springfield 31 MO Sally
# output:
city branch_id_x state_x revenue branch branch_id_y state_y manager
0 Austin 10 TX 100 Austin 10 TX Charlers
1 Denver 20 CO 83 Denver 20 CO Joel
2 Springfield 30 IL 4 Springfield 31 MO Sally
3 Mendocino 47 CA 200 Mendocino 47 CA Brett
Using suffixes
pd.merge(bronze, gold, on=['NOC', 'Country'], suffixes['_bronze', '_gold']
Joining DataFrames
df1.append(df2)
: stacking verticallypd.contact([df1, df2])
:- stacking vertically or horizontally
- simple inner/outer joins on indexes
df1.join(df2)
: inner/outer/left/right joins on indexespd.merge([df1,df2])
: many joins on multiple columns, default=inner
Left & right merging on multiple columns
# Merge revenue and sales: revenue_and_sales
revenue_and_sales = pd.merge(revenue,sales,how='right',on=['city','state'])
# Print revenue_and_sales
print(revenue_and_sales)
# Merge sales and managers: sales_and_managers
sales_and_managers = pd.merge(sales,managers,how="left",left_on=['city','state'],right_on=['branch','state'])
# Print sales_and_managers
print(sales_and_managers)
city branch_id state revenue units
0 Austin 10.0 TX 100.0 2
1 Denver 20.0 CO 83.0 4
2 Springfield 30.0 IL 4.0 1
3 Mendocino 47.0 CA 200.0 1
4 Springfield NaN MO NaN 5
city state units branch branch_id manager
0 Mendocino CA 1 Mendocino 47.0 Brett
1 Denver CO 4 Denver 20.0 Joel
2 Austin TX 2 Austin 10.0 Charlers
3 Springfield MO 5 Springfield 31.0 Sally
4 Springfield IL 1 NaN NaN NaN
Merging DataFrames with outer join
# Perform the first merge: merge_default
merge_default = pd.merge(sales_and_managers,revenue_and_sales)
# Print merge_default
print(merge_default)
# Perform the second merge: merge_outer
merge_outer = pd.merge(sales_and_managers,revenue_and_sales,how='outer')
# Print merge_outer
print(merge_outer)
# Perform the third merge: merge_outer_on
merge_outer_on = pd.merge(sales_and_managers,revenue_and_sales,how='outer',on=['city','state'])
# Print merge_outer_on
print(merge_outer_on)
city state units branch branch_id manager revenue
0 Mendocino CA 1 Mendocino 47.0 Brett 200.0
1 Denver CO 4 Denver 20.0 Joel 83.0
2 Austin TX 2 Austin 10.0 Charlers 100.0
city state units branch branch_id manager revenue
0 Mendocino CA 1 Mendocino 47.0 Brett 200.0
1 Denver CO 4 Denver 20.0 Joel 83.0
2 Austin TX 2 Austin 10.0 Charlers 100.0
3 Springfield MO 5 Springfield 31.0 Sally NaN
4 Springfield IL 1 NaN NaN NaN NaN
5 Springfield IL 1 NaN 30.0 NaN 4.0
6 Springfield MO 5 NaN NaN NaN NaN
city state units_x branch branch_id_x manager branch_id_y revenue units_y
0 Mendocino CA 1 Mendocino 47.0 Brett 47.0 200.0 1
1 Denver CO 4 Denver 20.0 Joel 20.0 83.0 4
2 Austin TX 2 Austin 10.0 Charlers 10.0 100.0 2
3 Springfield MO 5 Springfield 31.0 Sally NaN NaN 5
4 Springfield IL 1 NaN NaN NaN 30.0 4.0 1
Ordered merges
pd.merge(hardware, software, how='outer').sorted_values('Date')
= pd.merge_ordered(hardware,software)
# Perform the first ordered merge: tx_weather
tx_weather = pd.merge_ordered(austin, houston)
# Print tx_weather
print(tx_weather)
date ratings
0 2016-01-01 Cloudy
1 2016-01-04 Rainy
2 2016-01-17 Sunny
3 2016-02-08 Cloudy
4 2016-03-01 Sunny
# Perform the second ordered merge: tx_weather_suff
tx_weather_suff = pd.merge_ordered(austin, houston, on='date',suffixes=['_aus','_hus'])
# Print tx_weather_suff
print(tx_weather_suff)
date ratings_aus ratings_hus
0 2016-01-01 Cloudy Cloudy
1 2016-01-04 NaN Rainy
2 2016-01-17 Sunny NaN
3 2016-02-08 Cloudy NaN
4 2016-03-01 NaN Sunny
# Perform the third ordered merge: tx_weather_ffill
tx_weather_ffill = pd.merge_ordered(austin, houston, on='date',suffixes=['_aus','_hus'],fill_method='ffill')
# Print tx_weather_ffill
print(tx_weather_ffill)
date ratings_aus ratings_hus
0 2016-01-01 Cloudy Cloudy
1 2016-01-04 Cloudy Rainy
2 2016-01-17 Sunny Rainy
3 2016-02-08 Cloudy Rainy
4 2016-03-01 Cloudy Sunny
Using merge_asof()
- Similar to
pd.merge_ordered()
, thepd.merge_asof()
function will also merge values in order using the on column, but for each row in the left DataFrame, only rows from the right DataFrame whose 'on' column values are less than the left value will be kept.
oil
Date Price
0 1970-01-01 3.35
1 1970-02-01 3.35
2 1970-03-01 3.35
3 1970-04-01 3.35
4 1970-05-01 3.35
auto
mpg cyl displ hp weight accel yr origin name
0 18.0 8 307.0 130 3504 12.0 1970-01-01 US chevrolet chevelle malibu
1 15.0 8 350.0 165 3693 11.5 1970-01-01 US buick skylark 320
2 18.0 8 318.0 150 3436 11.0 1970-01-01 US plymouth satellite
3 16.0 8 304.0 150 3433 12.0 1970-01-01 US amc rebel sst
4 17.0 8 302.0 140 3449 10.5 1970-01-01 US ford torino
# Merge auto and oil: merged
merged = pd.merge_asof(auto,oil,left_on='yr',right_on='Date')
# Print the tail of merged
print(merged.tail())
mpg cyl displ hp weight ... yr origin name Date Price
387 27.0 4 140.0 86 2790 ... 1982-01-01 US ford mustang gl 1982-01-01 33.85
388 44.0 4 97.0 52 2130 ... 1982-01-01 Europe vw pickup 1982-01-01 33.85
389 32.0 4 135.0 84 2295 ... 1982-01-01 US dodge rampage 1982-01-01 33.85
390 28.0 4 120.0 79 2625 ... 1982-01-01 US ford ranger 1982-01-01 33.85
391 31.0 4 119.0 82 2720 ... 1982-01-01 US chevy s-10 1982-01-01 33.85
[5 rows x 11 columns]
# Resample merged: yearly
yearly = merged.resample('A',on='Date')['mpg','Price'](/ellen567/Data-Science-Notebook/wiki/'mpg','Price').mean()
# Print yearly
print(yearly)
mpg Price
Date
1970-12-31 17.689655 3.35
1971-12-31 21.111111 3.56
1972-12-31 18.714286 3.56
1973-12-31 17.100000 3.56
1974-12-31 22.769231 10.11
1975-12-31 20.266667 11.16
1976-12-31 21.573529 11.16
1977-12-31 23.375000 13.90
1978-12-31 24.061111 14.85
1979-12-31 25.093103 14.85
1980-12-31 33.803704 32.50
1981-12-31 30.185714 38.00
1982-12-31 32.000000 33.85
# print yearly.corr()
print(yearly.corr())
mpg Price
mpg 1.000000 0.948677
Price 0.948677 1.000000