Python: Validate Sample Sheet - SeanBeagle/DataScienceJournal GitHub Wiki

VALIDATING AN ILLUMINA SAMPLE SHEET

import pandas as pd

csv_in = '2020-05-14_NovaSeq.csv'

CREATE A DataFrame CALLED `df` FROM THE CSV

# Create a pandas DataFrame from the CSV.
# Read every column as a string, since all of the data is qualitative.
df = pd.read_csv(csv_in, dtype=str)

# Get rid of the empty "NA" rows that Excel loves to include.
# Note: dropna() drops any row with at least one missing value.
df.dropna(inplace=True)

# Print some summary statistics using "f-strings"
print(f'Rows:   {len(df)}')
print(f'Plates: {df.plate.nunique()}\n')

# Preview the first 5 rows
print(df[:5])

Rows:   384
Plates: 4

  sample-number plate well rawSample-name         sample-name
0             1     1  A01    15-4-SW-2-1         15-4-SW-2-1
1             2     1  A02    15-4-SW-3-2         15-4-SW-3-2
2             3     1  A03    17-100-SW-A  17-100-SW-A-1-1-37
3             4     1  A04    17-100-SW-A  17-100-SW-A-1-3-30
4             5     1  A05    17-100-SW-B  17-100-SW-B-1-1-30

CHECKING SAMPLE COUNT PER PLATE

Each plate should have 96 rows

plates = df['plate'].value_counts()
bad_plates = plates[plates != 96]
try:
    assert len(bad_plates) == 0
    print(f'[PASS] All {len(plates)} plates have 96 samples!')
except AssertionError:
    print(f'[FAIL] Found {len(bad_plates)} of {len(plates)} plates',
          'that do not have 96 samples...')
    print(bad_plates)

[PASS] All 4 plates have 96 samples!

CHECKING FOR DUPLICATE SAMPLE NAMES

sample-name should be unique

duplicates = df[df['sample-name'].duplicated()]
try:
    assert len(duplicates) == 0
    print('[PASS] All sample names are unique!')
except AssertionError:
    print(f'[FAIL] Found {len(duplicates)} duplicate sample names...')
    print(duplicates)

[PASS] All sample names are unique!
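
With its default `keep='first'`, `duplicated()` marks only the second and later occurrences of a name, so a failure report leaves out the first row that carries it. The sketch below uses made-up data (the `toy` frame and its sample names are hypothetical) to show how `keep=False` lists every row that shares a name:

```python
import pandas as pd

# Hypothetical toy data, not taken from the real sample sheet
toy = pd.DataFrame({'sample-name': ['S-1', 'S-2', 'S-1', 'S-3']})

# Default keep='first': only the later occurrence is flagged
later_copies = toy[toy['sample-name'].duplicated()]

# keep=False: every row that shares its name with another row is flagged
all_copies = toy[toy['sample-name'].duplicated(keep=False)]

print(f'Later copies: {len(later_copies)}')
print(f'All copies:   {len(all_copies)}')
```

Showing all copies can make it easier to decide which row to rename.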

COMPARING THE SAMPLE NAME WITH THE RAW SAMPLE NAME

sample-name should start with rawSample-name

# Pair each sample-name with its rawSample-name and flag rows missing the prefix
mismatch = [not name.startswith(raw)
            for name, raw in zip(df['sample-name'], df['rawSample-name'])]
bad_rows = df[mismatch]
try:
    assert len(bad_rows) == 0
    print('[PASS] Every sample-name starts with its rawSample-name!')
except AssertionError:
    print(f'[FAIL] Found {len(bad_rows)} rows',
          'where sample-name does not start with rawSample-name...')
    print(bad_rows)

[PASS] Every sample-name starts with its rawSample-name!
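
To see what a failure of this check looks like, here is a minimal sketch on a two-row frame. The `toy` frame is hypothetical and its second row is deliberately mismatched. Because `Series.str.startswith` takes a fixed prefix rather than a second column, the sketch compares the two columns value by value:

```python
import pandas as pd

# Hypothetical toy data: the second row pairs a name with the wrong raw name
toy = pd.DataFrame({
    'rawSample-name': ['17-100-SW-A', '17-100-SW-B'],
    'sample-name': ['17-100-SW-A-1-1-37', '15-4-SW-2-1'],
})

# True where sample-name does NOT start with the rawSample-name of the same row
mismatch = [not name.startswith(raw)
            for name, raw in zip(toy['sample-name'], toy['rawSample-name'])]
bad = toy[mismatch]

print(bad)
```

Only the second row should be reported, since `15-4-SW-2-1` does not begin with `17-100-SW-B`.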

CHECKING THE LENGTH OF SAMPLE NAMES

sample-name should be 100 characters or less

bad_rows = df[df['sample-name'].str.len() > 100]
try:
    assert len(bad_rows) == 0
    print('[PASS] All sample names are 100 characters or less!')
except AssertionError:
    print(f'[FAIL] Found {len(bad_rows)} bad rows...\n')
    print(bad_rows)

[PASS] All sample names are 100 characters or less!

CHECKING THE FORMATTING OF SAMPLE NAMES

sample-name should only contain: [a-z], [A-Z], [0-9] or '-'
sample-name should not start or end with '-'

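# First and last characters must be alphanumeric; hyphens are allowed in between.
# The optional group lets one- and two-character names pass.
pattern = r'^[a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z0-9])?$'
bad_rows = df[~df['sample-name'].str.match(pattern)]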
try:
    assert len(bad_rows) == 0
    print('[PASS] All sample names are properly formatted!')
except AssertionError:
    print(f'[FAIL] Found {len(bad_rows)} bad rows...')
    print(bad_rows)

[PASS] All sample names are properly formatted!
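
The two formatting rules can be tried on a handful of names without a DataFrame. This is a minimal sketch using only the standard `re` module. The helper `is_valid_name` and the test names are hypothetical, and the pattern is one possible encoding of the rules rather than the only one:

```python
import re

# Assumed encoding of the rules: alphanumeric first character, then an
# optional run of alphanumerics/hyphens that must end in an alphanumeric
PATTERN = re.compile(r'[a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z0-9])?')

def is_valid_name(name):
    """Return True if the whole name satisfies both formatting rules."""
    return PATTERN.fullmatch(name) is not None

for name in ['15-4-SW-2-1', 'A1', '-15-4', '15-4-', '15_4']:
    print(f'{name!r}: {is_valid_name(name)}')
```

The first two names pass. The remaining three fail because of a leading hyphen, a trailing hyphen, and an underscore, respectively.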

Everything appears to be in order...
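
For reuse on other sheets, the five checks could be gathered into a single function. The following is only a sketch: the function name `validate_sample_sheet`, its parameters, the report keys, and the toy data are all assumptions made for illustration, and it expects a frame with no missing values in the `plate`, `rawSample-name`, and `sample-name` columns.

```python
import pandas as pd

# Assumed pattern: alphanumeric at both ends, hyphens allowed in between
NAME_PATTERN = r'[a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z0-9])?$'

def validate_sample_sheet(df, plate_size=96, max_len=100):
    """Return {check name: number of offending plates or rows}."""
    names = df['sample-name']
    raws = df['rawSample-name']
    plates = df['plate'].value_counts()
    return {
        'plate size': int((plates != plate_size).sum()),
        'duplicate names': int(names.duplicated().sum()),
        'prefix mismatch': sum(not n.startswith(r)
                               for n, r in zip(names, raws)),
        'name too long': int((names.str.len() > max_len).sum()),
        'bad format': int((~names.str.match(NAME_PATTERN)).sum()),
    }

# Hypothetical toy sheet with one problem of each kind except length
toy = pd.DataFrame({
    'plate': ['1', '1', '2'],
    'rawSample-name': ['A', 'A', 'B'],
    'sample-name': ['A-1', 'A-1', 'C-1-'],
})
report = validate_sample_sheet(toy, plate_size=2)
for check, count in report.items():
    status = 'PASS' if count == 0 else 'FAIL'
    print(f'[{status}] {check}: {count}')
```

Returning counts instead of printing inside the function keeps the checks usable from other scripts. A sheet is in order when every count is zero.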
