04a 03 Advanced Merging and Concatenating - ellen567/Data-Science-Notebook GitHub Wiki

Filtering joins

semi-join

  • Returns the intersection, similar to an inner join
  • Returns only columns from the left table and not the right.
  • No duplicates
# Merge the non_mus_tck and top_invoices tables on tid
tracks_invoices = non_mus_tcks.merge(top_invoices, on='tid')

# Use .isin() to subset non_mus_tcks to rows with tid in tracks_invoices
top_tracks = non_mus_tcks[non_mus_tcks['tid'].isin(tracks_invoices['tid'])]

# Group the top_tracks by gid and count the tid rows
cnt_by_gid = top_tracks.groupby(['gid'], as_index=False).agg({'tid':'count'})

# Merge the genres table to cnt_by_gid on gid and print
print(cnt_by_gid.merge(genres, on='gid'))

anti-join

  • Returns the left table, excluding the intersection
  • Returns only columns from the left table and not the right
    step 1 : set indicator=True, generate a new column _merge with value equals both or left_only
# Merge employees and top_cust
empl_cust = employees.merge(top_cust, on='srid', 
                            how='left', indicator=True)

step 2 : select columns where _merge==left_only

# Select the srid column where _merge is left_only
srid_list = empl_cust.loc[empl_cust['_merge']=='left_only', 'srid']

step 3 :

# Get employees not working with top customers
print(employees[employees['srid'].isin(srid_list)])

Concatenate dataframes vertically

  • pd.concat([inv_jan, inv_feb, inv_mar])
  • ignore the index: pd.concat([inv_jan, inv_feb, inv_mar]), ignore_index=True), index in this case: 0 ~ (n-1)
  • setting labels to original tables: pd.concat([inv_jan, inv_feb, inv_mar]), ignore_index=False, keys=['jan', 'feb', 'mar'])
  • sort columns : sort=True
  • default: join='outer', set joint='inner'

Verifying integrity

Validating merges

  • validate='one_to_one'
  • one_to_many
  • many_to_one
  • many_to_many

Verifying concatenations

  • .concat(verify_integrity=False):
    • check whether the new concatenated index contains duplicates
    • default value is false