Python: Keep Unique Genotypes - SeanBeagle/DataScienceJournal GitHub Wiki
import pandas as pd
CREATE A PANDAS DATAFRAME (df) FROM CSV
file_in = 'MAV_H87_matrix806.cfonly.nodupdist.20200608.csv'
df = pd.read_csv(file_in)
print(f'{len(df)} records in DataFrame')
6990 records in DataFrame
IDENTIFY ROWS WHERE...
- condition1: Both samples are from the same patient.
- condition2: Distance between samples is 20 or less.
... then sort the matching rows in ascending order by distance.
condition1 = df['patient1'] == df['patient2']
condition2 = df['Dist'] <= 20
duplicates = df[condition1 & condition2].sort_values(by=['Dist'])
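To see what the two masks select, here is a minimal sketch on a small made-up table. The column names match those used above; the isolate names, patient IDs, and distances are invented for illustration and do not come from the real matrix.

```python
import pandas as pd

# Toy pairwise-distance table; every value here is invented for illustration.
toy = pd.DataFrame({
    'Species1': ['A1', 'A1', 'A2', 'B1', 'C1'],
    'Species2': ['A2', 'B1', 'A3', 'B2', 'C2'],
    'patient1': ['P1', 'P1', 'P1', 'P2', 'P3'],
    'patient2': ['P1', 'P2', 'P1', 'P2', 'P3'],
    'Dist':     [12,   3,    20,   45,   0],
})

same_patient = toy['patient1'] == toy['patient2']   # condition1
close_enough = toy['Dist'] <= 20                    # condition2

# Both conditions must hold; sort so the closest pairs come first.
toy_duplicates = toy[same_patient & close_enough].sort_values(by=['Dist'])
print(toy_duplicates[['Species1', 'Species2', 'Dist']])
```

The A1/B1 pair is very close (distance 3) but spans two patients, so it is not treated as a duplicate; the B1/B2 pair is from one patient but too distant.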
CATEGORIZE ISOLATES AS keep OR drop
- IF one isolate is in keep THEN drop the other
- ELSE IF one isolate is in drop THEN also drop the other
- ELSE neither isolate is in keep or drop, SO keep the first and drop the other
keep = set()
drop = set()
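# Walk the duplicate pairs from closest to most distant.
for i, row in duplicates.iterrows():
    # One isolate is already kept: drop its partner.
    if row['Species1'] in keep:
        drop.add(row['Species2'])
    elif row['Species2'] in keep:
        drop.add(row['Species1'])
    # One isolate is already dropped: drop its partner too.
    elif row['Species1'] in drop:
        drop.add(row['Species2'])
    elif row['Species2'] in drop:
        drop.add(row['Species1'])
    # Neither isolate has been seen: keep the first, drop the second.
    else:
        keep.add(row['Species1'])
        drop.add(row['Species2'])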
unique_isolates = pd.concat([duplicates['Species1'], duplicates['Species2']]).unique()
print(f"Found {len(unique_isolates)} unique isolates")
print(f'Keeping {len(keep)} isolates')
print(f'Dropping {len(drop)} isolates')
Found 74 unique isolates
Keeping 30 isolates
Dropping 44 isolates
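The categorization rules can be traced by hand on a few pairs. The sketch below wraps the same if/elif chain in a helper named `categorize_pairs`, which is a hypothetical name used only for this illustration, and runs it on invented pairs that are assumed to be already sorted by distance.

```python
def categorize_pairs(pairs):
    """Apply the keep/drop rules to (isolate1, isolate2) pairs, in order.

    Hypothetical helper for illustration; it mirrors the loop above.
    """
    keep, drop = set(), set()
    for first, second in pairs:
        if first in keep:
            drop.add(second)
        elif second in keep:
            drop.add(first)
        elif first in drop:
            drop.add(second)
        elif second in drop:
            drop.add(first)
        else:
            keep.add(first)
            drop.add(second)
    return keep, drop


# Invented pairs, closest first.
pairs = [('C1', 'C2'), ('A1', 'A2'), ('A2', 'A3'), ('A4', 'A3')]
keep_demo, drop_demo = categorize_pairs(pairs)
print(sorted(keep_demo))       # ['A1', 'C1']
print(sorted(drop_demo))       # ['A2', 'A3', 'A4', 'C2']

# Edge case: two isolates that were each kept earlier later appear as a pair.
keep_edge, drop_edge = categorize_pairs([('A', 'B'), ('C', 'D'), ('A', 'C')])
print(keep_edge & drop_edge)   # {'C'}
```

The edge case shows that the rules never check whether the partner is already in keep, so an isolate can end up in both sets. Because the later filter only consults drop, such an isolate would be removed. In the run above, 30 + 44 = 74 matches the number of unique isolates, so the two sets did not overlap for this dataset.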
KEEP ROWS IN ORIGINAL DATAFRAME WHERE...
- condition1: Species1 is not in the drop set.
- condition2: Species2 is not in the drop set.
condition1 = ~df['Species1'].isin(drop)
condition2 = ~df['Species2'].isin(drop)
df2 = df[condition1 & condition2]
print(f'{len(df2)} records in new DataFrame.')
4108 records in new DataFrame.
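As a small illustration of this filter, the sketch below applies the same `isin` negation to a made-up table and a made-up drop set (`toy_drop`). A row survives only if neither of its two isolates was dropped.

```python
import pandas as pd

# Invented pairwise rows and an invented drop set, for illustration only.
toy = pd.DataFrame({
    'Species1': ['A1', 'A1', 'B1', 'C1'],
    'Species2': ['A2', 'B1', 'C1', 'A2'],
    'Dist':     [12,   300,  450,  280],
})
toy_drop = {'A2'}

# ~ negates each membership test; & requires both columns to be clean.
mask = ~toy['Species1'].isin(toy_drop) & ~toy['Species2'].isin(toy_drop)
toy_filtered = toy[mask]
print(toy_filtered)
```

Both rows that mention A2 disappear, including the distant C1/A2 pair, because the filter removes every comparison involving a dropped isolate, not just the near-duplicate ones.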
SAVE DATAFRAME TO CSV
file_out = file_in.replace('.csv', '.FILTERED.csv')
df2.to_csv(file_out)
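One detail worth knowing when saving: by default `to_csv` also writes the row index as an unnamed first column, which `read_csv` brings back as `Unnamed: 0`. The sketch below shows the difference on an invented two-row table, using in-memory buffers instead of real files. Whether to pass `index=False` depends on how the output will be consumed, so treat it as an option rather than a requirement.

```python
import io
import pandas as pd

# Invented rows, for illustration only.
toy = pd.DataFrame({
    'Species1': ['A1', 'B1'],
    'Species2': ['B1', 'C1'],
    'Dist':     [300, 450],
})

# Default behaviour: the index is written as an extra, unnamed column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)

# index=False writes only the data columns.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)      # ['Unnamed: 0', 'Species1', 'Species2', 'Dist']
print(cols_without_index)   # ['Species1', 'Species2', 'Dist']
```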