04a 03 Advanced Merging and Concatenating - ellen567/Data-Science-Notebook GitHub Wiki
Filtering joins
semi-join
- Returns the intersection, similar to an inner join
- Returns only columns from the left table and not the right.
- No duplicates
# Merge the non_mus_tck and top_invoices tables on tid
tracks_invoices = non_mus_tcks.merge(top_invoices, on='tid')
# Use .isin() to subset non_mus_tcks to rows with tid in tracks_invoices
top_tracks = non_mus_tcks[non_mus_tcks['tid'].isin(tracks_invoices['tid'])]
# Group the top_tracks by gid and count the tid rows
cnt_by_gid = top_tracks.groupby(['gid'], as_index=False).agg({'tid':'count'})
# Merge the genres table to cnt_by_gid on gid and print
print(cnt_by_gid.merge(genres, on='gid'))
anti-join
- Returns the left table, excluding the intersection
- Returns only columns from the left table and not the right
step 1 : setindicator=True
, generate a new column_merge
with value equals both or left_only
# Merge employees and top_cust
empl_cust = employees.merge(top_cust, on='srid',
how='left', indicator=True)
step 2 : select columns where _merge==left_only
# Select the srid column where _merge is left_only
srid_list = empl_cust.loc[empl_cust['_merge']=='left_only', 'srid']
step 3 :
# Get employees not working with top customers
print(employees[employees['srid'].isin(srid_list)])
Concatenate dataframes vertically
pd.concat([inv_jan, inv_feb, inv_mar])
- ignore the index:
pd.concat([inv_jan, inv_feb, inv_mar]), ignore_index=True)
, index in this case: 0 ~ (n-1) - setting labels to original tables:
pd.concat([inv_jan, inv_feb, inv_mar]), ignore_index=False, keys=['jan', 'feb', 'mar'])
- sort columns :
sort=True
- default:
join='outer'
, setjoint='inner'
Verifying integrity
Validating merges
validate='one_to_one'
one_to_many
many_to_one
many_to_many
Verifying concatenations
.concat(verify_integrity=False)
:- check whether the new concatenated index contains duplicates
- default value is
false