04a 03 Advanced Merging and Concatenating - HannaAA17/Data-Scientist-With-Python-datacamp GitHub Wiki

Filtering joins

semi-join

  • Returns the intersection, similar to an inner join
  • Returns only columns from the left table and not the right.
  • No duplicates
# Merge the non_mus_tck and top_invoices tables on tid
tracks_invoices = non_mus_tcks.merge(top_invoices, on='tid')

# Use .isin() to subset non_mus_tcks to rows with tid in tracks_invoices
top_tracks = non_mus_tcks[non_mus_tcks['tid'].isin(tracks_invoices['tid'])]

# Group the top_tracks by gid and count the tid rows
cnt_by_gid = top_tracks.groupby(['gid'], as_index=False).agg({'tid':'count'})

# Merge the genres table to cnt_by_gid on gid and print
print(cnt_by_gid.merge(genres, on='gid'))

anti-join

  • Returns the left table, excluding the intersection
  • Returns only columns from the left table and not the right
    step 1 : set indicator=True, generate a new column _merge with value equals both or left_only
# Merge employees and top_cust
empl_cust = employees.merge(top_cust, on='srid', 
                            how='left', indicator=True)

step 2 : select columns where _merge==left_only

# Select the srid column where _merge is left_only
srid_list = empl_cust.loc[empl_cust['_merge']=='left_only', 'srid']

step 3 :

# Get employees not working with top customers
print(employees[employees['srid'].isin(srid_list)])

Concatenate dataframes vertically

  • pd.concat([inv_jan, inv_feb, inv_mar])
  • ignore the index: pd.concat([inv_jan, inv_feb, inv_mar]), ignore_index=True), index in this case: 0 ~ (n-1)
  • setting labels to original tables: pd.concat([inv_jan, inv_feb, inv_mar]), ignore_index=False, keys=['jan', 'feb', 'mar'])
  • sort columns : sort=True
  • default: join='outer', set join='inner'
# Concatenate the tables and add keys
inv_jul_thr_sep = pd.concat(
    [inv_jul, inv_aug, inv_sep],
    keys=['7Jul', '8Aug', '9Sep']
)

# Group the invoices by the index keys and find avg of the total column
avg_inv_by_month = inv_jul_thr_sep.groupby(level=0).agg(
    {'total':lambda x: sum(x)/len(x)}
)

# Bar plot of avg_inv_by_month
avg_inv_by_month.plot(kind='bar')
plt.show()
# Use the .append() method to combine the tracks tables
metallica_tracks = tracks_ride.append(
    [tracks_master, tracks_st],
    sort=False
)

# Merge metallica_tracks and invoice_items
tracks_invoices = metallica_tracks.merge(
    invoice_items,
    on='tid', how='inner'
)

# For each tid and name sum the quantity sold
tracks_sold = tracks_invoices.groupby(['tid', 'name']).agg({'quantity':'sum'})

# Sort in decending order by quantity and print the results
print(tracks_sold.sort_values('quantity', ascending=False))

Verifying integrity

Validating merges

  • validate='one_to_one'
  • one_to_many
  • many_to_one
  • many_to_many
# Concatenate the classic tables vertically
classic_18_19 = pd.concat([classic_18, classic_19], ignore_index=True)

# Concatenate the pop tables vertically
pop_18_19 = pd.concat([pop_18, pop_19], ignore_index=True)

# Merge classic_18_19 with pop_18_19
classic_pop = classic_18_19.merge(pop_18_19, on='tid', how='inner')

# Using .isin(), filter classic_18_19 rows where tid is in classic_pop
popular_classic = classic_18_19[classic_18_19['tid'].isin(classic_pop['tid'])]

# Print popular chart
print(popular_classic)

Verifying concatenations

  • .concat(verify_integrity=False):
    • check whether the new concatenated index contains duplicates
    • default value is false