Python: Keep Unique Genotypes - SeanBeagle/DataScienceJournal GitHub Wiki

import pandas as pd

CREATE A PANDAS DATAFRAME (`df`) FROM CSV

file_in = 'MAV_H87_matrix806.cfonly.nodupdist.20200608.csv'
df = pd.read_csv(file_in)
print(f'{len(df)} records in DataFrame')

6990 records in DataFrame

IDENTIFY ROWS WHERE...

condition1 : Both samples are from the same patient.
condition2 : Distance between samples is 20 or less.

... sort DataFrame in ascending order by distance

condition1 = df['patient1'] == df['patient2']
condition2 = df['Dist'] <= 20
duplicates = df[condition1 & condition2].sort_values(by=['Dist'])
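The same filter can be sketched on a toy table. The isolate names, patient IDs, and distances below are made up for illustration; only the column names are taken from the steps above.

```python
import pandas as pd

# Hypothetical toy data; column names mirror the real distance matrix.
toy = pd.DataFrame({
    'Species1': ['A', 'A', 'B', 'C'],
    'Species2': ['B', 'C', 'C', 'D'],
    'patient1': ['p1', 'p1', 'p1', 'p1'],
    'patient2': ['p1', 'p1', 'p1', 'p2'],
    'Dist':     [12, 3, 45, 5],
})

same_patient = toy['patient1'] == toy['patient2']   # condition1
close = toy['Dist'] <= 20                           # condition2

# Both masks must hold; closest pairs come first after sorting.
toy_duplicates = toy[same_patient & close].sort_values(by=['Dist'])
print(toy_duplicates[['Species1', 'Species2', 'Dist']])
```

Only the A-C and A-B pairs survive: B-C is too distant, and C-D is close but comes from two different patients.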

CATEGORIZE ISOLATES AS `keep` OR `drop`

IF one isolate is in keep THEN drop the other
ELSE IF one isolate is in drop THEN also drop the other
ELSE neither isolate is in keep or drop SO keep the first and drop the other

keep = set()
drop = set()

for i, row in duplicates.iterrows():
    if row['Species1'] in keep:
        drop.add(row['Species2'])
    elif row['Species2'] in keep:
        drop.add(row['Species1'])
    elif row['Species1'] in drop:
        drop.add(row['Species2'])
    elif row['Species2'] in drop:
        drop.add(row['Species1'])
    else:
        keep.add(row['Species1'])
        drop.add(row['Species2'])
        
# Series.append was removed in pandas 2.0; pd.concat does the same job
unique_isolates = pd.concat([duplicates['Species1'], duplicates['Species2']]).unique()
print(f"Found    {len(unique_isolates)} unique isolates")
print(f'Keeping  {len(keep)} isolates')
print(f'Dropping {len(drop)} isolates')

Found    74 unique isolates
Keeping  30 isolates
Dropping 44 isolates
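The categorization rules can also be written as a small standalone function, which makes them easy to try on hand-made pairs. This is only a sketch: the function name `categorize` and the example pairs are hypothetical, and the pairs are assumed to be already sorted by ascending distance.

```python
def categorize(pairs):
    """Split isolates into keep/drop sets from (isolate1, isolate2) pairs.

    `pairs` is assumed to be sorted by ascending distance already.
    """
    keep, drop = set(), set()
    for first, second in pairs:
        if first in keep:
            drop.add(second)
        elif second in keep:
            drop.add(first)
        elif first in drop:
            drop.add(second)
        elif second in drop:
            drop.add(first)
        else:
            keep.add(first)
            drop.add(second)
    return keep, drop


# Hypothetical pairs: A-B and C-D are seen first, then A-C links the two groups.
pairs = [('A', 'B'), ('C', 'D'), ('A', 'C')]
keep, drop = categorize(pairs)

overlap = keep & drop   # isolates that ended up in both sets
kept = keep - drop      # isolates that would survive a filter on `drop`
print(sorted(keep), sorted(drop), sorted(overlap), sorted(kept))
```

With the real data, the function could be called as `categorize(zip(duplicates['Species1'], duplicates['Species2']))`. The toy pairs show an edge case: an isolate that was kept early (C) can later be added to `drop` as well, so `len(keep)` may overcount. In the run above, 30 + 44 equals the 74 unique isolates, so the two sets did not overlap there.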

KEEP ROWS IN ORIGINAL DATAFRAME WHERE...

condition1 : Species1 is not in drop set.
condition2 : Species2 is not in drop set.

condition1 = ~df['Species1'].isin(drop)
condition2 = ~df['Species2'].isin(drop)
df2 = df[condition1 & condition2]
print(f'{len(df2)} records in new DataFrame.')

4108 records in new DataFrame.
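A minimal sketch of this `isin` filter on toy data follows. The isolate names and the contents of the drop set are hypothetical, and the variables are prefixed with `toy_` so they do not shadow the real ones.

```python
import pandas as pd

# Hypothetical pairs and drop set, for illustration only.
toy = pd.DataFrame({
    'Species1': ['A', 'A', 'B', 'E'],
    'Species2': ['B', 'C', 'C', 'F'],
    'Dist':     [12, 3, 45, 80],
})
toy_drop = {'B', 'C'}

# ~ inverts the boolean mask: True where the isolate is NOT in the drop set.
not_dropped1 = ~toy['Species1'].isin(toy_drop)
not_dropped2 = ~toy['Species2'].isin(toy_drop)

# A row is kept only if neither isolate of the pair was dropped.
toy_filtered = toy[not_dropped1 & not_dropped2]
print(toy_filtered)
```

Any row that mentions a dropped isolate on either side is removed, which is why the record count can fall by far more than the number of dropped isolates.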

SAVE DATAFRAME TO CSV

file_out = file_in.replace('.csv', '.FILTERED.csv')
# Save the filtered DataFrame
df2.to_csv(file_out)
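`str.replace` substitutes every occurrence of `.csv` in the name, not just the extension. As a hedged alternative, the output name can be derived with `pathlib`, which only touches the final suffix. The second file name below is hypothetical and is there only to show the difference.

```python
from pathlib import Path

file_in = 'MAV_H87_matrix806.cfonly.nodupdist.20200608.csv'

# with_suffix swaps only the last suffix ('.csv') for the new one.
file_out = str(Path(file_in).with_suffix('.FILTERED.csv'))
print(file_out)

# Hypothetical name containing '.csv' twice.
tricky = 'matrix.csv.backup.csv'
via_replace = tricky.replace('.csv', '.FILTERED.csv')
via_pathlib = str(Path(tricky).with_suffix('.FILTERED.csv'))
print(via_replace)
print(via_pathlib)
```

For the file name used here both approaches give the same result, so this only matters for names that contain `.csv` more than once.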
