Python: Keep Unique Genotypes - SeanBeagle/DataScienceJournal GitHub Wiki

import pandas as pd

CREATE A PANDAS DATAFRAME (df) FROM CSV

file_in = 'MAV_H87_matrix806.cfonly.nodupdist.20200608.csv'
df = pd.read_csv(file_in)
print(f'{len(df)} records in DataFrame')
6990 records in DataFrame

IDENTIFY ROWS WHERE...

  • condition1 : Both samples are from the same patient.
  • condition2 : Distance between samples is 20 or less.

... sort DataFrame in ascending order by distance

condition1 = df['patient1'] == df['patient2']
condition2 = df['Dist'] <= 20
duplicates = df[condition1 & condition2].sort_values(by=['Dist'])
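
To see what the two conditions select, here is a minimal sketch on a made-up table. The values and the `toy` names are hypothetical; only the column names (`Species1`, `Species2`, `patient1`, `patient2`, `Dist`) are taken from the code above.

```python
import pandas as pd

# Hypothetical toy data; only the column names come from the real matrix file.
toy = pd.DataFrame({
    'Species1': ['A1', 'A1', 'A2', 'B1'],
    'Species2': ['A2', 'B1', 'A3', 'B2'],
    'patient1': ['P1', 'P1', 'P1', 'P2'],
    'patient2': ['P1', 'P2', 'P1', 'P2'],
    'Dist':     [12, 5, 3, 40],
})

same_patient = toy['patient1'] == toy['patient2']   # condition1
close = toy['Dist'] <= 20                           # condition2

# Rows must satisfy both conditions; closest pairs come first after sorting.
toy_duplicates = toy[same_patient & close].sort_values(by=['Dist'])
print(toy_duplicates[['Species1', 'Species2', 'Dist']])
```

The A1/B1 pair is excluded because the patients differ, and the B1/B2 pair is excluded because its distance exceeds 20.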

CATEGORIZE ISOLATES AS keep OR drop

  • IF one isolate is already in keep THEN drop the other
  • ELSE IF one isolate is already in drop THEN drop the other as well
  • ELSE (neither isolate is in keep or drop) keep the first and drop the second

keep = set()
drop = set()

for i, row in duplicates.iterrows():
    if row['Species1'] in keep:
        drop.add(row['Species2'])
    elif row['Species2'] in keep:
        drop.add(row['Species1'])
    elif row['Species1'] in drop:
        drop.add(row['Species2'])
    elif row['Species2'] in drop:
        drop.add(row['Species1'])
    else:
        keep.add(row['Species1'])
        drop.add(row['Species2'])
        
# Series.append was removed in pandas 2.0; pd.concat works across versions
unique_isolates = pd.concat([duplicates['Species1'], duplicates['Species2']]).unique()
print(f"Found    {len(unique_isolates)} unique isolates")
print(f'Keeping  {len(keep)} isolates')
print(f'Dropping {len(drop)} isolates')
Found    74 unique isolates
Keeping  30 isolates
Dropping 44 isolates
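
The categorization rules can be traced by hand on a few pairs. The sketch below wraps the same if/elif chain in a helper; the `categorize` name and the example pairs are hypothetical and not part of the original workflow.

```python
def categorize(pairs):
    """Assign isolates to keep/drop sets.

    pairs: iterable of (isolate1, isolate2) tuples, assumed to be
    already sorted by ascending distance.
    """
    keep, drop = set(), set()
    for a, b in pairs:
        if a in keep:
            drop.add(b)
        elif b in keep:
            drop.add(a)
        elif a in drop:
            drop.add(b)
        elif b in drop:
            drop.add(a)
        else:
            # Neither isolate has been seen: keep the first, drop the second.
            keep.add(a)
            drop.add(b)
    return keep, drop


# Hypothetical pairs, closest first.
toy_pairs = [('A2', 'A3'), ('A1', 'A2'), ('C1', 'C2')]
toy_keep, toy_drop = categorize(toy_pairs)
print(sorted(toy_keep), sorted(toy_drop))

# Sanity check: no isolate should end up in both sets.
assert toy_keep.isdisjoint(toy_drop)
```

The disjointness check is worth running on real data as well: with these rules, a pair whose isolates were both kept earlier would place the second isolate in drop too. In the run above, 30 kept plus 44 dropped equals the 74 unique isolates, which is consistent with no overlap.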

KEEP ROWS IN ORIGINAL DATAFRAME WHERE...

  • condition1 : Species1 is not in the drop set.
  • condition2 : Species2 is not in the drop set.

condition1 = ~df['Species1'].isin(drop)
condition2 = ~df['Species2'].isin(drop)
df2 = df[condition1 & condition2]
print(f'{len(df2)} records in new DataFrame.')
4108 records in new DataFrame.
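
As a small illustration of the negated `isin` filter, the following sketch uses a made-up table and drop set. All values are hypothetical; the column names match the code above.

```python
import pandas as pd

# Hypothetical pairwise rows and drop set.
toy = pd.DataFrame({
    'Species1': ['A1', 'A1', 'A2', 'B1'],
    'Species2': ['A2', 'B1', 'A3', 'B2'],
    'Dist':     [12, 5, 3, 40],
})
toy_drop = {'A1', 'A3'}

# A row survives only if neither of its isolates is in the drop set.
mask = ~toy['Species1'].isin(toy_drop) & ~toy['Species2'].isin(toy_drop)
toy_filtered = toy[mask]
print(f'{len(toy_filtered)} of {len(toy)} rows kept')
```

Because each dropped isolate removes every row it appears in, on either side of the pair, the row count can fall much faster than the isolate count.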

SAVE DATAFRAME TO CSV

file_out = file_in.replace('.csv', '.FILTERED.csv')
# Write the filtered DataFrame; index=False omits the row-index column
df2.to_csv(file_out, index=False)