04a 03 Advanced Merging and Concatenating - HannaAA17/Data-Scientist-With-Python-datacamp GitHub Wiki
Filtering joins
semi-join
- Returns the intersection, similar to an inner join
- Returns only columns from the left table and not the right.
- No duplicates
# Merge the non_mus_tck and top_invoices tables on tid
tracks_invoices = non_mus_tcks.merge(top_invoices, on='tid')
# Use .isin() to subset non_mus_tcks to rows with tid in tracks_invoices
top_tracks = non_mus_tcks[non_mus_tcks['tid'].isin(tracks_invoices['tid'])]
# Group the top_tracks by gid and count the tid rows
cnt_by_gid = top_tracks.groupby(['gid'], as_index=False).agg({'tid':'count'})
# Merge the genres table to cnt_by_gid on gid and print
print(cnt_by_gid.merge(genres, on='gid'))
anti-join
- Returns the left table, excluding the intersection
- Returns only columns from the left table and not the right
step 1 : setindicator=True
, generate a new column_merge
with value equals both or left_only
# Merge employees and top_cust
empl_cust = employees.merge(top_cust, on='srid',
how='left', indicator=True)
step 2 : select columns where _merge==left_only
# Select the srid column where _merge is left_only
srid_list = empl_cust.loc[empl_cust['_merge']=='left_only', 'srid']
step 3 :
# Get employees not working with top customers
print(employees[employees['srid'].isin(srid_list)])
Concatenate dataframes vertically
pd.concat([inv_jan, inv_feb, inv_mar])
- ignore the index:
pd.concat([inv_jan, inv_feb, inv_mar]), ignore_index=True)
, index in this case: 0 ~ (n-1) - setting labels to original tables:
pd.concat([inv_jan, inv_feb, inv_mar]), ignore_index=False, keys=['jan', 'feb', 'mar'])
- sort columns :
sort=True
- default:
join='outer'
, setjoin='inner'
# Concatenate the tables and add keys
inv_jul_thr_sep = pd.concat(
[inv_jul, inv_aug, inv_sep],
keys=['7Jul', '8Aug', '9Sep']
)
# Group the invoices by the index keys and find avg of the total column
avg_inv_by_month = inv_jul_thr_sep.groupby(level=0).agg(
{'total':lambda x: sum(x)/len(x)}
)
# Bar plot of avg_inv_by_month
avg_inv_by_month.plot(kind='bar')
plt.show()
# Use the .append() method to combine the tracks tables
metallica_tracks = tracks_ride.append(
[tracks_master, tracks_st],
sort=False
)
# Merge metallica_tracks and invoice_items
tracks_invoices = metallica_tracks.merge(
invoice_items,
on='tid', how='inner'
)
# For each tid and name sum the quantity sold
tracks_sold = tracks_invoices.groupby(['tid', 'name']).agg({'quantity':'sum'})
# Sort in decending order by quantity and print the results
print(tracks_sold.sort_values('quantity', ascending=False))
Verifying integrity
Validating merges
validate='one_to_one'
one_to_many
many_to_one
many_to_many
# Concatenate the classic tables vertically
classic_18_19 = pd.concat([classic_18, classic_19], ignore_index=True)
# Concatenate the pop tables vertically
pop_18_19 = pd.concat([pop_18, pop_19], ignore_index=True)
# Merge classic_18_19 with pop_18_19
classic_pop = classic_18_19.merge(pop_18_19, on='tid', how='inner')
# Using .isin(), filter classic_18_19 rows where tid is in classic_pop
popular_classic = classic_18_19[classic_18_19['tid'].isin(classic_pop['tid'])]
# Print popular chart
print(popular_classic)
Verifying concatenations
.concat(verify_integrity=False)
:- check whether the new concatenated index contains duplicates
- default value is
false