2. Sample file - sneuensc/mapache GitHub Wiki
Preface
Here we introduce some of the nomenclature used in mapache. If you are familiar with the concepts of sample, library, read group, mapping quality and sequencing platform, you may go straight to the description of the sample file for mapache.
Sample
In a BAM file, this corresponds to the SM tag found in the header.
As the goal of mapache is to map and filter sequencing reads, the final product includes a BAM file per sample.
Although from a technical point of view a DNA sample can have many definitions, in the documentation of mapache, a sample usually refers to a single individual.
However, users are free to define their own samples according to their research questions.
Examples of different definitions of samples
samples: ind1, ind2, ind3
Here, you sequenced the genomes of three individuals, and you want a BAM for each of them. Each sample would correspond to the name or ID of the individual.
samples: tooth1, tooth2
Imagine that two teeth 🦷🦷 were excavated from the same archaeological site, very close to each other. They may or may not belong to the same individual. In this case, you might want to have two different BAM files (tooth1.bam and tooth2.bam) for downstream analyses, representing one tooth each.
samples: Summer2012, Winter2021
Let's say that you took a water sample from a lake in Summer 2012, and then again in Winter 2021. Now, you are wondering whether one or more specific microbes that were present in 2012 are still there in 2021. In this case, you would like to get a BAM file for each time point, probably called Summer2012.bam and Winter2021.bam.
Library
In a BAM file, this corresponds to the LB tag found in the header.
The best way to know about the library building process and how many libraries were built is to ask your lab manager.
Once the biological sample is taken and the DNA is extracted and purified, it is time to build sequencing libraries. Usually one library is built and sequenced per sample. However, in many ancient DNA labs, it is common practice to build more than one library for different reasons (a protocol was updated, the researcher needed to sequence more DNA, the quality of the initial library was not good enough, etc.).
When is this information relevant/critical?
This depends on the project type and research questions.
If you are interested in assessing the quality of different libraries, then it is important to know which FASTQ files correspond to which libraries.
For example, in libraries built from ancient samples, one might need to have a closer look at the yield, duplication rate, and adapter content per library. More importantly, mapache is capable of identifying, marking or removing duplicates (via picardtools) per library specified in the sample input file.
On the other hand, while working with fresh DNA samples (like saliva), as the quality of this material differs from that of degraded samples, some researchers might be willing to accept a few duplicated reads in their BAM files, considering that identifying duplicates is a time-consuming step.
ID
In a BAM file from mapache, this corresponds to the RG tag found in the header to specify read groups.
Finally, we describe the ID label.
Once a library has been built, it can be sequenced once or more times. Sometimes, even if it was sequenced only once, you might receive multiple FASTQ files for a single sequencing run.
In mapache, the ID refers to an identifier (defined by the user) that will be used to track a single FASTQ file (or a pair of files, for paired-end data).
Examples
Assume that DNA was extracted from a museum specimen, labelled as museum_139. Two libraries were prepared from this sample (lib1 and lib2), and they were sequenced on Illumina platforms. The library lib1 is a single-end library, and lib2 is a paired-end library. Each library was sequenced twice, and the sequencing center delivered the following files:
museum_139_lib1_S1_L001_R1.fastq.gz
museum_139_lib1_S1_L002_R1.fastq.gz
museum_139_lib2_S1_L001_R1.fastq.gz, museum_139_lib2_S1_L001_R2.fastq.gz
museum_139_lib2_S1_L002_R1.fastq.gz, museum_139_lib2_S1_L002_R2.fastq.gz
The idea of assigning an ID to the (pairs of) FASTQ files in mapache is to easily keep track of them during their processing. Thus, we recommend setting meaningful IDs for the files.
In the example above, the user could define different types of IDs; for example, labelling the files by sequencing round:
SM LB ID Data1 Data2
museum_139 lib1 round1 museum_139_lib1_S1_L001_R1.fastq.gz NULL
museum_139 lib1 round2 museum_139_lib1_S1_L002_R1.fastq.gz NULL
museum_139 lib2 round1 museum_139_lib2_S1_L001_R1.fastq.gz museum_139_lib2_S1_L001_R2.fastq.gz
museum_139 lib2 round2 museum_139_lib2_S1_L002_R1.fastq.gz museum_139_lib2_S1_L002_R2.fastq.gz
They could also be labelled with a simple suffix:
SM LB ID Data1 Data2
museum_139 lib1 lib1_1 museum_139_lib1_S1_L001_R1.fastq.gz NULL
museum_139 lib1 lib1_2 museum_139_lib1_S1_L002_R1.fastq.gz NULL
museum_139 lib2 lib2_1 museum_139_lib2_S1_L001_R1.fastq.gz museum_139_lib2_S1_L001_R2.fastq.gz
museum_139 lib2 lib2_2 museum_139_lib2_S1_L002_R1.fastq.gz museum_139_lib2_S1_L002_R2.fastq.gz
In this sense, the ID can take many values as long as they are meaningful to the user. The only condition is that the IDs must be unique within a specific library of a sample. In the example above, assigning the ID lib1_1 to both files belonging to lib1 would not be allowed.
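The uniqueness rule above can be checked before running the pipeline. The following is a minimal sketch, not part of mapache: the function name `find_duplicate_ids` is hypothetical, and the rows are assumed to be already available as dictionaries with the keys SM, LB and ID.

```python
from collections import Counter


def find_duplicate_ids(rows):
    """Return the (SM, LB, ID) triples that occur more than once.

    `rows` is an iterable of dicts with the keys SM, LB and ID
    (hypothetical in-memory representation of the sample file).
    """
    counts = Counter((r["SM"], r["LB"], r["ID"]) for r in rows)
    return sorted(key for key, n in counts.items() if n > 1)


rows = [
    {"SM": "museum_139", "LB": "lib1", "ID": "lib1_1"},
    {"SM": "museum_139", "LB": "lib1", "ID": "lib1_1"},  # not allowed: repeated ID in lib1
    {"SM": "museum_139", "LB": "lib2", "ID": "round1"},
]
print(find_duplicate_ids(rows))  # prints [('museum_139', 'lib1', 'lib1_1')]
```

The same ID may appear in different libraries (for example, round1 in both lib1 and lib2), which is why the check uses the full (SM, LB, ID) triple.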
What is the sample file?
The sample file is the most important specification as it lists all fastq files to map and their aggregation into libraries and samples.
In addition, it states the minimum mapping quality to retain reads, and it specifies the sequencing platform from which the reads were obtained.
The name of this file has to be specified in the config file.
How to create your sample file?
The sample file is a plain text file that contains 4 or 5 columns (for single- and paired-end data, respectively), as in the examples below. The columns have to be separated by spaces or tabs.
Example of a sample file for single-end libraries:
SM LB ID Data
ind1 a_L2 a_L2_R1_001 reads/a_L2_R1_001.fastq.gz
ind1 a_L2 a_L2_R1_002 reads/a_L2_R1_002.fastq.gz
ind1 b_L2 b_L2_R1_001 reads/b_L2_R1_001.fastq.gz
ind1 b_L2 b_L2_R1_002 reads/b_L2_R1_002.fastq.gz
Example of a sample file for paired-end libraries.
For a mix of paired-end and single-end libraries, you should use the paired-end format and indicate NULL in the column corresponding to the second fastq file (Data2).
SM LB ID Data1 Data2
ind1 a_L2 a_L2_R1_001 reads/a_L2_R1_001.fastq.gz reads/a_L2_R2_001.fastq.gz
ind1 a_L2 a_L2_R1_002 reads/a_L2_R1_002.fastq.gz reads/a_L2_R2_002.fastq.gz
ind1 b_L2 b_L2_R1_002 reads/b_L2_R1_002.fastq.gz NULL
In the first example, four fastq files will be mapped. They were generated from two different libraries (here, labelled as a_L2 and b_L2) from a single sample (ind1). The reads will be mapped and retained if the mapping quality is above 30 (MAPQ column).
In the second example, there is still only one sample (ind1), and two libraries, sequenced in paired-end (a_L2) and single-end (b_L2) mode.
The columns SM, LB, ID and PL will be used to annotate the header of the BAM files produced (SM, LB, RG and PL tags, respectively).
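To illustrate how these columns relate to the BAM header, the sketch below builds a read group (@RG) header line in the tab-separated layout used by SAM/BAM files. It is only an illustration: the function name `read_group_line` is hypothetical, and using the ID column unchanged as the read group identifier is an assumption, since the exact string mapache writes is not described here.

```python
def read_group_line(sm, lb, rg_id, pl=None):
    """Build a SAM-style @RG header line from sample-file values.

    Assumption: the ID column is used as-is for the read group ID.
    The PL field is only added when a platform is given.
    """
    fields = ["@RG", f"ID:{rg_id}", f"SM:{sm}", f"LB:{lb}"]
    if pl is not None:
        fields.append(f"PL:{pl}")
    return "\t".join(fields)


# One line per row of the sample file
print(read_group_line("ind1", "a_L2", "a_L2_R1_001", pl="ILLUMINA"))
```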
The columns of the sample file are:
- SM: Sample name. Libraries are merged according to this name.
- LB: Library name. Fastq files are merged according to this name.
- ID: An ID for the fastq file, or pair of files for paired-end data (examples: id1, fq_1, ind1_lib1_fq2, etc.)
- Data (single-end format): Path to the fastq file. The file may be gzipped or not. Path may be absolute or relative to the working directory.
- Data1 (paired-end format): Path to the forward fastq file (R1) for paired-end data or the fastq file for single-end data. The file may be gzipped or not. Path may be absolute or relative to the working directory.
- Data2 (paired-end format): Path to the reverse fastq file (R2) for paired-end data or
NULL
for single-end data. The file may be gzipped or not. Path may be absolute or relative to the working directory.
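The merging hierarchy described by the SM and LB columns (fastq files into libraries, libraries into samples) can be pictured as a nested grouping. This is a minimal sketch of that idea only; `group_rows` is a hypothetical helper and does not reflect how mapache is implemented internally.

```python
from collections import defaultdict


def group_rows(rows):
    """Group fastq IDs by library (LB), and libraries by sample (SM)."""
    tree = defaultdict(lambda: defaultdict(list))
    for r in rows:
        tree[r["SM"]][r["LB"]].append(r["ID"])
    # Convert to plain dicts for easier printing and comparison
    return {sm: dict(libs) for sm, libs in tree.items()}


# Rows taken from the single-end example above
rows = [
    {"SM": "ind1", "LB": "a_L2", "ID": "a_L2_R1_001"},
    {"SM": "ind1", "LB": "a_L2", "ID": "a_L2_R1_002"},
    {"SM": "ind1", "LB": "b_L2", "ID": "b_L2_R1_001"},
    {"SM": "ind1", "LB": "b_L2", "ID": "b_L2_R1_002"},
]
print(group_rows(rows))
```

Here the four fastq files end up in two libraries, and both libraries in the single sample ind1, which corresponds to one final BAM file.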
Please note
- The order of the columns is free, but the column names are specific.
- ID names have to be unique within the same library (LB).
- Names in ID, LB and SM may be anything, but may not contain points ('.').
- Commented lines (#) are ignored.
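The rules listed above can be sketched as a small reader for the sample file. This is an illustrative example under stated assumptions, not mapache's own parser: the function name `parse_sample_file` is hypothetical, comments are assumed to start at the beginning of a line, and NULL in Data2 is converted to `None` for convenience.

```python
def parse_sample_file(text):
    """Parse sample-file text into a list of dicts (one per fastq entry)."""
    # Skip empty lines and commented lines
    lines = [ln for ln in text.splitlines()
             if ln.strip() and not ln.lstrip().startswith("#")]
    if not lines:
        raise ValueError("sample file is empty")
    header = lines[0].split()  # columns are separated by spaces or tabs
    missing = {"SM", "LB", "ID"} - set(header)
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    rows = []
    for ln in lines[1:]:
        # Mapping by column name keeps the column order free
        row = dict(zip(header, ln.split()))
        for col in ("SM", "LB", "ID"):
            if col not in row:
                raise ValueError(f"incomplete line: {ln!r}")
            if "." in row[col]:
                raise ValueError(f"{col} may not contain '.': {row[col]!r}")
        if row.get("Data2") == "NULL":
            row["Data2"] = None  # single-end entry in the paired-end format
        rows.append(row)
    return rows


example = """\
# sample file with a mix of paired-end and single-end libraries
SM LB ID Data1 Data2
ind1 a_L2 a_L2_R1_001 reads/a_L2_R1_001.fastq.gz reads/a_L2_R2_001.fastq.gz
ind1 b_L2 b_L2_R1_002 reads/b_L2_R1_002.fastq.gz NULL
"""
for row in parse_sample_file(example):
    print(row["SM"], row["LB"], row["ID"], row["Data2"])
```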
Remote files
Mapache supports fastq files defined as an ftp download link (e.g., from ENA). The files are downloaded automatically and kept, even if temporary files are set to be removed. If an md5sum is specified in an additional column (MD5 for SE reads, MD5_1/MD5_2 for PE reads), the downloads are tested for completeness:
SM LB ID Data MD5
ind1 a_L2 a_L2_R1_001 reads/a_L2_R1_001.fastq.gz
ind2 ftp_lib ftp_id ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR106/095/ERR10675895/ERR10675895.fastq.gz 06a3243190c072ea4dce55b8fecb7e8
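The completeness test amounts to comparing the md5sum of the downloaded file with the expected value. How mapache performs this check internally is not described here; the following is a minimal sketch of an equivalent manual check using Python's standard library, with the hypothetical helper name `md5_matches`.

```python
import hashlib


def md5_matches(path, expected_md5, chunk_size=1 << 20):
    """Return True if the MD5 checksum of the file equals `expected_md5`.

    The file is read in chunks so that large fastq files do not need
    to fit in memory.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_md5.strip().lower()
```

A call such as `md5_matches("reads/my_download.fastq.gz", "<expected md5sum>")` (both arguments are placeholders) returns `False` for a truncated or corrupted download.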
What is next?
You need to edit your config file (config/config.yml) and indicate the path to your sample file.
Assuming you saved this file as my_samples.txt, the original config file has to be modified from this:
sample_file: config/samples.tsv
to this:
sample_file: my_samples.txt