Sample sheet - core-unit-bioinformatics/knowledge-base GitHub Wiki

author date tags
PE 2023-03-05 cubi, internal, convention, rule, policy, standard

Sample sheets

Every workflow of the CUBI must accept its input data in form of a sample sheet. In a Snakemake workflow, the sample sheet must be supplied via the config parameter --config samples=SAMPLE-SHEET.tsv.

Format requirements

A sample sheet file must adhere to the following rules:

  1. plain (text) table file
    • the charset should be as limited as possible; ideally ASCII
  2. tab-separated values
  3. file extension: .tsv
  4. first row must be the header
    • the header must contain the column name sample
    • an EAV sample sheet (see below) must only contain three columns: sample, key and value
  5. column names must not contain whitespace
  6. values that contain whitespace must be double-quoted: ""
  7. empty fields are strongly discouraged and should be avoided by explicitly setting n/a

Manually creating samples sheets

A sample sheet is usually created manually or is provided by the client (if so, it should be sanity-checked). There are two ways of layouting a sample sheet.

Row-oriented sample sheet

The common case that lists all sample-related data and information in a single row per sample. The column names in the header row indicate what information is given in the respective column. Example:

sample  sample_age  sample_sex  hifi_reads  illumina_reads
sample1 40-50   male    path-to-reads   path-to-reads
sample2 50-60   female  path-to-reads   path-to-reads

Reading a row-oriented sample sheet should not require any postprocessing. Example in Python pandas:

import pandas as pd

samples = pd.read_csv(
    "path-to-sheet/samples.tsv",
    sep="\t",
    header=0
)

Entity-attribute-value sample sheet

If the sample information is extensive, i.e., requiring many columns to be complete, or contains lengthy values (e.g., many absolute paths to input files), an EAV-based (triplestore) sample sheet can be easier to read for humans. An EAV sample sheet only contains three columns: sample<TAB>key<TAB>value

sample  key value
sample1 sample_age  40-50
sample1 sample_sex  male
sample1 hifi_reads  lengthy-absolute-path-to-reads
sample1 illumina_reads  lengthy-absolute-path-to-reads
sample2 sample_age  50-60
sample2 sample_sex  female
sample2 hifi_reads  super-long-absolute-path-to-reads
sample2 illumina_reads  ultra-long-absolute-path-to-reads

This layout requires a simple transformation to turn it into a standard row-oriented sample table. Example in Python pandas:

import pandas as pd

samples = pd.read_csv(
    "path-to-sheet/samples.tsv",
    sep="\t",
    header=0
)
samples = samples.pivot_table(
    index="sample",
    columns="key",
    values="value",
    aggfunc=lambda x: x  # "identity" function
)
# the above creates a table with
# 'sample' as row index; if that
# is undesired, one can reset the index
samples.reset_index(drop=False, inplace=True)

Specifying input paths

Fields that specify file system paths to read input data from must be given in any of the following form:

  1. a folder/directory with multiple input files residing underneath that top-level folder
    • collecting files from a folder/directory is an implementation detail of the respective workflow (possibly generalized in a template utility function)
    • recursively collecting input files from a folder/directory must be possible and set as default in the workflow
  2. a file path for single-file input
  3. a FOFN file (file of filenames) listing any number of input file paths
    • the file extension of a FOFN file must be .fofn
    • a FOFN file must only contain paths to individual files
    • a FOFN file must only contain one file path per line
    • using FOFN files as input is the recommended way to accept arbitrarily many input files for scripts. This is an easy way to prevent hitting the character limit for command lines.
  4. any combination of the above as a comma-separated list
⚠️ **GitHub.com Fallback** ⚠️