2. Sample file - sneuensc/mapache GitHub Wiki

Preface

Here we introduce some of the nomenclature used in mapache. If you are familiar with the concepts of sample, library, read group, mapping quality and sequencing platform, you may go straight to the description of the sample file for mapache.

Sample

In a BAM file, this corresponds to the SM tag found in the header.

As the goal of mapache is to map and filter sequencing reads, the final product includes a BAM file per sample.

Although from a technical point of view a DNA sample can have many definitions, in the documentation of mapache, a sample usually refers to a single individual.

That said, users are free to define samples in whatever way suits their research questions.

Examples of different definitions of samples

samples: ind1, ind2, ind3

Here, you sequenced the genomes of three individuals, and you want a BAM for each of them. Each sample would correspond to the name or ID of the individual.

samples: tooth1, tooth2

Imagine that two teeth 🦷🦷 were excavated from the same archaeological site, very close to each other. They may or may not belong to the same individual. In this case, you might want two different BAM files (tooth1.bam and tooth2.bam) for downstream analyses, each representing one tooth.

samples: Summer2012, Winter2021

Let's say that you took a water sample from a lake in Summer 2012, and then again in Winter 2021. Now you are wondering whether one or more specific microbes that were present in 2012 are still there in 2021. In this case, you would like to get a BAM file for each time point, probably called Summer2012.bam and Winter2021.bam.

Library

In a BAM file, this corresponds to the LB tag found in the header.

The best way to know about the library building process and how many libraries were built is to ask your lab manager.

Once the biological sample is taken and the DNA is extracted and purified, it is time to build sequencing libraries. Usually one library is built and sequenced per sample. However, in many ancient DNA labs, it is common practice to build more than one library for different reasons (a protocol was updated, the researcher needed to sequence more DNA, the quality of the initial library was not good enough, etc.).

When is this information relevant/critical?

This depends on the project type and research questions.

If you are interested in assessing the quality of different libraries, then it is important to know which FASTQ files correspond to which libraries. For example, in libraries built from ancient samples, one might need to take a closer look at the yield, duplication rate, and adapter content per library. More importantly, mapache can identify, mark or remove duplicates (via picardtools) for each library specified in the sample file.

On the other hand, while working with fresh DNA samples (like saliva), as the quality of this material differs from that of degraded samples, some researchers might be willing to accept a few duplicated reads in their BAM files, considering that identifying duplicates is a time-consuming step.

ID

In a BAM file from mapache, this corresponds to the RG tag found in the header to specify read groups.

Finally, we describe the ID label.

Once a library has been built, it can be sequenced one or more times. Sometimes, even if it was sequenced only once, you might receive multiple FASTQ files for a single sequencing run.

In mapache, the ID is a user-defined identifier used to track a single FASTQ file (or a pair of FASTQ files, for paired-end data).

Examples

Assume that DNA was extracted for a museum's specimen, labelled as museum_139. Two libraries were prepared from this sample (lib1 and lib2), and they were sequenced on Illumina platforms. The library lib1 is a single-end library, and lib2 is a paired-end library. Each library was sequenced twice, and the sequencing center delivered the following files:

museum_139_lib1_S1_L001_R1.fastq.gz
museum_139_lib1_S1_L002_R1.fastq.gz
museum_139_lib2_S1_L001_R1.fastq.gz, museum_139_lib2_S1_L001_R2.fastq.gz
museum_139_lib2_S1_L002_R1.fastq.gz, museum_139_lib2_S1_L002_R2.fastq.gz

The idea of assigning an ID to the (pairs of) FASTQ files in mapache is to easily keep track of them during their processing. Thus, we recommend setting meaningful IDs for the files.

In the example above, the user could define different types of IDs; for example, labelling the files by sequencing round:

SM            LB      ID        Data1                                  Data2
museum_139    lib1    round1    museum_139_lib1_S1_L001_R1.fastq.gz    NULL
museum_139    lib1    round2    museum_139_lib1_S1_L002_R1.fastq.gz    NULL
museum_139    lib2    round1    museum_139_lib2_S1_L001_R1.fastq.gz    museum_139_lib2_S1_L001_R2.fastq.gz
museum_139    lib2    round2    museum_139_lib2_S1_L002_R1.fastq.gz    museum_139_lib2_S1_L002_R2.fastq.gz

They could also be labelled with a simple suffix:

SM            LB      ID        Data1                                  Data2
museum_139    lib1    lib1_1    museum_139_lib1_S1_L001_R1.fastq.gz    NULL
museum_139    lib1    lib1_2    museum_139_lib1_S1_L002_R1.fastq.gz    NULL
museum_139    lib2    lib2_1    museum_139_lib2_S1_L001_R1.fastq.gz    museum_139_lib2_S1_L001_R2.fastq.gz
museum_139    lib2    lib2_2    museum_139_lib2_S1_L002_R1.fastq.gz    museum_139_lib2_S1_L002_R2.fastq.gz

In this sense, the ID can take many values as long as they are meaningful to the user. The only condition is that IDs must be unique within a given library of a sample. In the example above, it would not be valid to assign lib1_1 to both files belonging to lib1.
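The uniqueness rule can be checked before launching a run. The following is a minimal sketch, not part of mapache; the function name duplicated_ids and the row dictionaries are hypothetical and only illustrate the rule stated above.

```python
from collections import Counter

def duplicated_ids(rows):
    """Return the (SM, LB, ID) triples that occur more than once."""
    counts = Counter((r["SM"], r["LB"], r["ID"]) for r in rows)
    return sorted(key for key, n in counts.items() if n > 1)

# Hypothetical rows: the first two reuse the same ID within lib1.
rows = [
    {"SM": "museum_139", "LB": "lib1", "ID": "lib1_1"},
    {"SM": "museum_139", "LB": "lib1", "ID": "lib1_1"},
    {"SM": "museum_139", "LB": "lib2", "ID": "round1"},
]
print(duplicated_ids(rows))
```

An empty list means every ID is unique within its library. Reusing an ID such as round1 in both lib1 and lib2 is not reported, because those rows belong to different libraries.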

What is the sample file?

The sample file is the most important specification as it lists all fastq files to map and their aggregation into libraries and samples. In addition, it states the minimum mapping quality to retain reads, and it specifies the sequencing platform from which the reads were obtained. The name of this file has to be specified in the config file.

How to create your sample file?

The sample file is a plain text file that contains 4 or 5 columns (for single- and paired-end data, respectively), as listed below. The columns have to be separated by spaces or tabs.

Example of a sample file for single-end libraries:

SM          LB        ID           Data
ind1        a_L2      a_L2_R1_001  reads/a_L2_R1_001.fastq.gz
ind1        a_L2      a_L2_R1_002  reads/a_L2_R1_002.fastq.gz
ind1        b_L2      b_L2_R1_001  reads/b_L2_R1_001.fastq.gz
ind1        b_L2      b_L2_R1_002  reads/b_L2_R1_002.fastq.gz

Example of a sample file for paired-end libraries:

For a mix of paired-end and single-end libraries, you should use the paired-end format and indicate NULL in the column corresponding to the second fastq file (Data2).

SM          LB        ID           Data1                       Data2
ind1        a_L2      a_L2_R1_001  reads/a_L2_R1_001.fastq.gz  reads/a_L2_R2_001.fastq.gz
ind1        a_L2      a_L2_R1_002  reads/a_L2_R1_002.fastq.gz  reads/a_L2_R2_002.fastq.gz
ind1        b_L2      b_L2_R1_002  reads/b_L2_R1_002.fastq.gz  NULL

In the first example, four fastq files will be mapped. They were generated from two different libraries (here, labelled as a_L2 and b_L2) from a single sample (ind1). The reads will be mapped and retained if the mapping quality is above 30 (MAPQ column).

In the second example, there is still only one sample (ind1), and two libraries, sequenced in paired-end (a_L2) and single-end (b_L2) mode.
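To make the aggregation explicit, here is a small sketch of how the rows of the first example nest into samples and libraries. It is only an illustration of the SM/LB/ID hierarchy, not mapache's implementation; group_by_sample_and_library is a hypothetical helper.

```python
from collections import defaultdict

def group_by_sample_and_library(rows):
    """Nest FASTQ IDs as {sample: {library: [ID, ...]}}."""
    tree = defaultdict(lambda: defaultdict(list))
    for row in rows:
        tree[row["SM"]][row["LB"]].append(row["ID"])
    # Convert to plain dicts for easier printing and comparison.
    return {sm: dict(libs) for sm, libs in tree.items()}

# Rows taken from the single-end example above.
rows = [
    {"SM": "ind1", "LB": "a_L2", "ID": "a_L2_R1_001"},
    {"SM": "ind1", "LB": "a_L2", "ID": "a_L2_R1_002"},
    {"SM": "ind1", "LB": "b_L2", "ID": "b_L2_R1_001"},
    {"SM": "ind1", "LB": "b_L2", "ID": "b_L2_R1_002"},
]
print(group_by_sample_and_library(rows))
```

The result has one sample (ind1) holding two libraries, each with two FASTQ IDs, which mirrors the order in which files are merged: FASTQ files into libraries, then libraries into samples.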

The columns SM, LB, ID and PL will be used to annotate the header of the BAM files produced (SM, LB, RG and PL tags, respectively).
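For readers unfamiliar with BAM headers, the sketch below parses a read-group (@RG) header line into its tags, following the tab-separated TAG:VALUE layout of the SAM specification. The header line itself is a made-up example; the exact values mapache writes may differ.

```python
def parse_rg_line(line):
    """Split one @RG header line into a {tag: value} dictionary."""
    fields = line.rstrip("\n").split("\t")
    if fields[0] != "@RG":
        raise ValueError("not an @RG header line")
    # Each remaining field looks like TAG:VALUE; split on the first colon only.
    return dict(field.split(":", 1) for field in fields[1:])

# Hypothetical header line, for illustration only.
header_line = "@RG\tID:round1\tSM:museum_139\tLB:lib1\tPL:ILLUMINA"
rg = parse_rg_line(header_line)
print(rg["SM"], rg["LB"], rg["ID"], rg["PL"])
```

In practice, a header can be printed with samtools view -H and the @RG lines inspected in the same way.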

The columns of the sample file are:

SM: Sample name. Libraries are merged according to this name.
LB: Library name. Fastq files are merged according to this name.
ID: An ID for the fastq file, or pair of fastq files (examples: id1, fq_1, ind1_lib1_fq2, etc.)
Data (single-end format): Path to the fastq file. The file may be gzipped or not. Path may be absolute or relative to the working directory.
Data1 (paired-end format): Path to the forward fastq file (R1) for paired-end data or the fastq file for single-end data. The file may be gzipped or not. Path may be absolute or relative to the working directory.
Data2 (paired-end format): Path to the reverse fastq file (R2) for paired-end data or NULL for single-end data. The file may be gzipped or not. Path may be absolute or relative to the working directory.

Please note

The order of the columns is free, but the column names are specific.
ID names have to be unique within the same library (LB).
Names in ID, LB and SM may be anything, but must not contain dots ('.').
Commented lines (#) are ignored.
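The rules above can be turned into a quick sanity check. This is a minimal sketch under the assumptions stated in this section (whitespace-separated columns, free column order, '#' comments, no dots in SM, LB or ID); read_sample_file is a hypothetical helper and not mapache's own parser.

```python
import io

REQUIRED = {"SM", "LB", "ID"}

def read_sample_file(handle):
    """Parse a whitespace-separated sample file into a list of dicts."""
    # Drop blank lines and commented lines.
    lines = [ln for ln in handle if ln.strip() and not ln.lstrip().startswith("#")]
    if not lines:
        raise ValueError("empty sample file")
    header = lines[0].split()
    cols = set(header)
    has_data = "Data" in cols or {"Data1", "Data2"} <= cols
    if not (REQUIRED <= cols and has_data):
        raise ValueError("missing required columns")
    rows = []
    for ln in lines[1:]:
        # Keying by header name makes the column order irrelevant.
        row = dict(zip(header, ln.split()))
        for col in sorted(REQUIRED):
            if "." in row.get(col, ""):
                raise ValueError(f"{col} must not contain '.': {row[col]}")
        rows.append(row)
    return rows

text = """\
# a commented line is skipped
LB    SM    ID           Data1                       Data2
a_L2  ind1  a_L2_R1_001  reads/a_L2_R1_001.fastq.gz  reads/a_L2_R2_001.fastq.gz
b_L2  ind1  b_L2_R1_002  reads/b_L2_R1_002.fastq.gz  NULL
"""
rows = read_sample_file(io.StringIO(text))
print(len(rows), rows[1]["Data2"])
```

The example deliberately swaps the SM and LB columns to show that rows are read by column name, not by position.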

Remote files

Mapache supports fastq files specified as an FTP download link (e.g., from ENA). These files are downloaded automatically and kept, even if temporary files are set to be removed. If an MD5 checksum is supplied in an additional column (MD5 for single-end reads; MD5_1 and MD5_2 for paired-end reads), the downloads are checked for completeness:

SM          LB        ID           Data                                                                            MD5
ind1        a_L2      a_L2_R1_001  reads/a_L2_R1_001.fastq.gz
ind2        ftp_lib   ftp_id       ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR106/095/ERR10675895/ERR10675895.fastq.gz  06a3243190c072ea4dce55b8fecb7e8
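If you want to verify a file by hand, the same kind of check can be reproduced with Python's standard library. This is only a sketch of an MD5 comparison, not the code mapache runs; the function names are hypothetical.

```python
import hashlib

def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def download_is_complete(path, expected_md5):
    """Return True if the file's checksum matches the expected one."""
    # Compare case-insensitively, since checksums may be written in upper case.
    return md5_of_file(path) == expected_md5.strip().lower()
```

For example, download_is_complete("reads/file.fastq.gz", checksum) returns False for a truncated download, where the path and checksum are placeholders for your own values.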

What is next?

You need to edit your config file (config/config.yml) and indicate the path to your sample file. Assuming you saved this file as my_samples.txt, the original config file has to be modified from this:

sample_file: config/samples.tsv

to this:

sample_file: my_samples.txt