Sample sheet - core-unit-bioinformatics/knowledge-base GitHub Wiki
author | date | tags |
---|---|---|
PE | 2023-03-05 | cubi, internal, convention, rule, policy, standard |
Every workflow of the CUBI must accept its input data in the form of a
sample sheet. In a Snakemake workflow, the sample sheet must be supplied
via the config parameter `--config samples=SAMPLE-SHEET.tsv`.
A sample sheet file must adhere to the following rules:
- plain (text) table file
- the charset should be as limited as possible; ideally ASCII
- tab-separated values
- file extension: `.tsv`
- first row must be the header
  - the header must contain the column name `sample`
  - an EAV sample sheet (see below) must only contain three columns: `sample`, `key` and `value`
- column names must not contain whitespace
- values that contain whitespace must be double-quoted: `""`
- empty fields are strongly discouraged and should be avoided by explicitly setting `n/a`
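Several of these rules can be checked mechanically. The following is a minimal sketch, not an official CUBI tool: the function name `check_sample_sheet` and its `eav` switch are hypothetical, and only the header, whitespace, column-count and empty-field rules are covered.

```python
import io

import pandas as pd


def check_sample_sheet(handle, eav=False):
    """Return a list of rule violations (an empty list means no problem found).

    Hypothetical helper for illustration; 'handle' is a path or file object.
    """
    # Read all fields as text and disable NA detection, so that empty
    # fields stay empty strings and "n/a" stays the literal string "n/a".
    sheet = pd.read_csv(
        handle, sep="\t", header=0, dtype=str, keep_default_na=False
    )
    columns = list(sheet.columns)
    problems = []
    if "sample" not in columns:
        problems.append("header lacks the column 'sample'")
    for name in columns:
        if any(char.isspace() for char in name):
            problems.append(f"column name contains whitespace: {name!r}")
    if eav and sorted(columns) != ["key", "sample", "value"]:
        problems.append("EAV sheet must contain exactly: sample, key, value")
    # Count fields that are empty or missing altogether.
    empty = int((sheet.isna() | (sheet == "")).sum().sum())
    if empty > 0:
        problems.append(f"{empty} empty field(s); set 'n/a' explicitly")
    return problems
```

A call such as `check_sample_sheet("path-to-sheet/samples.tsv")` returns the list of detected violations; how a workflow reacts to them (warning or abort) is left open here.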
A sample sheet is usually created manually or is provided by the client (if so, it should be sanity-checked). There are two ways to lay out a sample sheet.
The common case is a row-oriented sheet that lists all sample-related data and information in a single row per sample. The column names in the header row indicate what information is given in the respective column. Example:
```
sample   sample_age  sample_sex  hifi_reads     illumina_reads
sample1  40-50       male        path-to-reads  path-to-reads
sample2  50-60       female      path-to-reads  path-to-reads
```
Reading a row-oriented sample sheet should not require
any postprocessing. Example in Python with `pandas`:

```python
import pandas as pd

samples = pd.read_csv(
    "path-to-sheet/samples.tsv",
    sep="\t",
    header=0
)
```
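One caveat regarding the `n/a` convention for empty fields: with default settings, pandas treats the string `n/a` as a missing-value marker and converts it to `NaN` while reading. If the workflow should see the literal text instead, the default NA detection can be switched off. A minimal sketch, using a made-up in-memory sheet in place of a real file:

```python
import io

import pandas as pd

# in-memory stand-in for a sample sheet file (hypothetical content)
sheet_text = "sample\tsample_age\tsample_sex\nsample1\t40-50\tn/a\n"

# default behaviour: "n/a" is parsed as a missing value (NaN)
default = pd.read_csv(io.StringIO(sheet_text), sep="\t", header=0)

# keep every field as the literal text written in the sheet
literal = pd.read_csv(
    io.StringIO(sheet_text),
    sep="\t",
    header=0,
    dtype=str,
    keep_default_na=False,
)
```

Which of the two variants is appropriate depends on whether downstream code tests for missing values with `pd.isna` or compares against the string `n/a`.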
If the sample information is extensive, i.e., requires many
columns to be complete, or contains lengthy values (e.g.,
many absolute paths to input files), an EAV-based
(entity-attribute-value, triplestore) sample sheet can be easier
for humans to read. An EAV sample sheet contains only three
columns, `sample<TAB>key<TAB>value`. Example:
```
sample   key             value
sample1  sample_age      40-50
sample1  sample_sex      male
sample1  hifi_reads      lengthy-absolute-path-to-reads
sample1  illumina_reads  lengthy-absolute-path-to-reads
sample2  sample_age      50-60
sample2  sample_sex      female
sample2  hifi_reads      super-long-absolute-path-to-reads
sample2  illumina_reads  ultra-long-absolute-path-to-reads
```
This layout requires a simple transformation to turn it
into a standard row-oriented sample table.
Example in Python with `pandas`:

```python
import pandas as pd

samples = pd.read_csv(
    "path-to-sheet/samples.tsv",
    sep="\t",
    header=0
)
samples = samples.pivot_table(
    index="sample",
    columns="key",
    values="value",
    aggfunc=lambda x: x  # "identity" function
)
# the above creates a table with
# 'sample' as row index; if that
# is undesired, one can reset the index
samples.reset_index(drop=False, inplace=True)
```
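As a possible alternative, `DataFrame.pivot` performs the same reshaping without an aggregation function and raises a `ValueError` if a sample/key combination occurs more than once, which can double as a sanity check. A minimal sketch with an in-memory EAV sheet (the content is made up for illustration):

```python
import io

import pandas as pd

# in-memory stand-in for an EAV sample sheet file (hypothetical content)
eav_text = (
    "sample\tkey\tvalue\n"
    "sample1\tsample_age\t40-50\n"
    "sample1\tsample_sex\tmale\n"
    "sample2\tsample_age\t50-60\n"
    "sample2\tsample_sex\tfemale\n"
)
eav = pd.read_csv(io.StringIO(eav_text), sep="\t", header=0, dtype=str)

# one row per sample, one column per key;
# duplicated sample/key pairs raise a ValueError
samples = eav.pivot(index="sample", columns="key", values="value")

# turn the 'sample' index back into a regular column and drop
# the leftover name ("key") of the column axis
samples = samples.reset_index()
samples.columns.name = None
```

Whether failing on duplicated sample/key pairs is desirable depends on the workflow; if one key may legitimately occur several times per sample, a different strategy is needed.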
Fields that specify file system paths to read input data from must be given in one of the following forms:
- a folder/directory with multiple input files residing underneath that top-level folder
  - collecting files from a folder/directory is an implementation detail of the respective workflow (possibly generalized in a template utility function)
  - recursively collecting input files from a folder/directory must be possible and set as default in the workflow
- a file path for single-file input
- a FOFN file (file of filenames) listing any number of input file paths
  - the file extension of a FOFN file must be `.fofn`
  - a FOFN file must only contain paths to individual files
  - a FOFN file must only contain one file path per line
  - using FOFN files as input is the recommended way to accept arbitrarily many input files for scripts. This is an easy way to prevent hitting the character limit for command lines.
- any combination of the above as a comma-separated list
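How a workflow expands such a field is an implementation detail, as noted above. The following is only a sketch of one possible approach, not the CUBI template function: the name `collect_input_files` is hypothetical, and it assumes that individual paths do not themselves contain commas.

```python
from pathlib import Path


def collect_input_files(field):
    """Expand one path field of a sample sheet into a list of file paths.

    Hypothetical helper for illustration only.
    """
    files = []
    for entry in field.split(","):
        path = Path(entry.strip())
        if path.is_dir():
            # folder: collect all files underneath it, recursively
            files.extend(sorted(p for p in path.rglob("*") if p.is_file()))
        elif path.suffix == ".fofn":
            # FOFN: one file path per line, blank lines are skipped
            with open(path, "r") as fofn:
                files.extend(
                    Path(line.strip()) for line in fofn if line.strip()
                )
        elif path.is_file():
            # single-file input
            files.append(path)
        else:
            raise ValueError(f"cannot resolve input path: {entry!r}")
    return files
```

The sketch does not filter by file type and does not check that the paths listed in a FOFN file exist; a real workflow would likely add both.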