04a 02 Merging Tables With Different Join Types - HannaAA17/Data-Scientist-With-Python-datacamp GitHub Wiki
df1.merge(df2, on='col', how='left')
- used when you do not want lose any information on the left table
- Right Join
- Outer Join
- Merging a table to itself can be useful when you want to compare values in a column to other values in the same column.
# Merge the crews table to itself
crews_self_merged = crews.merge(crews, on='id', how='inner',
suffixes=('_dir','_crew'))
# Create a boolean index to select the appropriate rows
boolean_filter = ((crews_self_merged['job_dir'] == 'Director') &
(crews_self_merged['job_crew'] != 'Director'))
direct_crews = crews_self_merged[boolean_filter]
# Print the first few rows of direct_crews
print(direct_crews.head())
output
<script.py> output:
id job_dir name_dir job_crew name_crew
156 19995 Director James Cameron Editor Stephen E. Rivkin
157 19995 Director James Cameron Sound Designer Christopher Boyes
158 19995 Director James Cameron Casting Mali Finn
160 19995 Director James Cameron Writer James Cameron
161 19995 Director James Cameron Set Designer Richard F. Mays
- if merge on index, set
left_index=True
or 'right_index=True`
# Merge sequels and financials on index id
sequels_fin = sequels.merge(financials, on='id', how='left')
# Self merge with suffixes as inner join with left on sequel and right on id
orig_seq = sequels_fin.merge(sequels_fin, how='inner', left_on='sequel',
right_on='id', right_index=True,
suffixes=('_org','_seq'))
# Add calculation to subtract revenue_org from revenue_seq
orig_seq['diff'] = orig_seq['revenue_seq'] - orig_seq['revenue_org']
# Select the title_org, title_seq, and diff
titles_diff = orig_seq[['title_org','title_seq','diff']]
# Print the first rows of the sorted titles_diff
print(titles_diff.sort_values('diff',ascending=False).head())