Cleaning raw reads - CGRL-QB3-UCBerkeley/seqCapture GitHub Wiki
Introduction:
This is the fist step of the workflow and is to clean up raw data (fastq.gz), which includes trimming for quality, removing adapters, merging overlapping reads, removing duplicates and reads sourced from contamination. This module not only works for sequence capture data, but also works for WGS and RNAseq data (an example given below)
Paired end reads:
Command and options:
(seqCap) $ seqCapture cleanpe
Usage: seqCapture cleanpe [options]
Options:
-f DIR Path to the folder of raw fastq files
-o DIR Path to the results folder
-c FILE Contaminant file
-m FLOAT percent similarity to contaminant file to
consider a read to be a contaminant
[0.90]
-R INT 1=remove duplicates; 0 = keep all duplicates [1]
Note: for DGE analysis using RNAseq data, keeping
or removing duplicates is still on debate.
-k INT Number of threads needed for bowtie2 alignment
(remove contamination), pigz (compress files),
trimmomatic (trimming adapters and low qual), read
error correction (for de novo assemblies) [4]
-h INT Trimmomatic trimming length cutoff [36]
-q INT Quality trimming threhold [20]
-d CHAR The particular library that you like to process,
if "all" is specified, all libraries in the folder
(-f) will be processed sequenctially [all]
-M INT Also keep unmerged overlapping PE reads?
1 = yes (for DGE)
0 = no (for SNP calling) [0]
-g FLOAT Maximum allowed ratio between the number of
mismatched base pairs and the overlap length [0.1]
-z If z is supplied, use fastQC to evaluate
cleaned sequence reads [null]
#### The above assumes trimming Truseq adapters only
#### For trimming Nextera PE transposase adapters
-n If n is supplied assuming trimming Nextera PE transposase
Adapters ONLY [null]
#### Error correction is only recommended for de novo assemblies ####
#### DO NOT run error correction for variant calling ####
-N INT error correction
0 = no error correction -- default
1 = using bfc (when number of reads is < 30 million)
2 = using Rcorrector (when number of reads is >= 30 million)
[0]
-G INT if N=1 approx genome size (k/m/g allowed)
for transcriptomic data set this to 50m [50m]
#### rescale qualty scores for museum DNA:
-A Sequence are from museum samples [null]
-B FILE a reference genome for alignment (for popgen projects)
-C FOLDER a folder with individual assemblies as references
(for phylogenetic projects)
-D INT Maximum alignment score acceptable for the best alignment.
Recommend use 150-180
(roughly 5-6 mismatches are allowed per read) [180].
-S save DNA damage stats report? [null]
Note: 1) do not do error correction (-N) for museum DNA samples!
Prepare input for the run:
Store fastq reads (e.g. Sample1_L001_R1_001.fastq.gz, Sample1_L001_R2_001.fastq.gz, Sample2_L001_R1_001.fastq.gz, Sample2_L001_R2_001.fastq.gz...) for each and all samples in the same folder "raw_reads_dir". Store sequences of possible source contaminations in a multi-fasta format in "contaminant.fasta".
Raw reads naming convention:
- In order for the module to recognize the input fastq files, they must be named as LibraryName_L00x_R[1|2]_00x.fastq.gz, where x can be any numbers.
Special cases of input libraries:
- Sometimes you may get multiple sets for each forward or reverse reads: Sample1_L001_R1_001.fastq.gz, Sample1_L001_R1_002.fastq.gz, Sample1_L001_R1_003.fastq.gz ... Sample1_L001_R2_001.fastq.gz, Sample1_L001_R2_002.fastq.gz, Sample1_L001_R2_003.fastq.gz ...
- Sometimes you may get the same libraries sequenced in multiple lanes: Sample1_ L001_ R1_001.fastq.gz, Sample1_ L002_ R1_001.fastq.gz, Sample1_ L001_ R2_001.fastq.gz, Sample1_ L002_ R2_001.fastq.gz ...
In the above cases, simply place all these files in "raw_reads_dir" and this script will combine data derived from multiple sets or multiple lanes before deduplication.
Usage examples:
Removing PCR/Optical duplicates (-R 1
); using 10 threads (-k 10
); merging overlapping paired end reads (-M 0
); trimmomatic trimming length cutoff of 50bp (-h 50
); assuming Truseq adapters (without using -n
); base quality trimming threshold of Phred Q20 (-q 20
); evaluating and summarizing quality of cleaned reads using FASTQC (-z
); output cleaned reads to "cleaned_reads_dir":
(seqCapture) $ seqCapture cleanpe -f /path/to/raw_reads_dir -o /path/to/cleaned_reads_dir -c contaminant.fasta -R 1 -q 20 -M 0 -k 10 -h 50 -z
Note that -R 1
, -M 0
, and -q 20
are default values. Therefore the above command can be simply written as:
(seqCapture) $ seqCapture cleanpe -f /path/to/raw_reads_dir -o /path/to/cleaned_reads_dir -c contaminant.fasta -q -k 10 -h 50 -z
If Nextera adapters (-n
) are used and needed to be trimmed:
(seqCapture) $ seqCapture cleanpe -f /path/to/raw_reads_dir -o /path/to/cleaned_reads_dir -c contaminant.fasta -n ...[other desired options]...
For RNAseq reads you may wish to keep unmerged PE reads (-M 1
):
(seqCapture) $ seqCapture cleanpe -f /path/to/raw_reads_dir -o /path/to/cleaned_reads_dir -c contaminant.fasta -M 1 ...[other desired options]...
If read error correction is desired (using bfc (-N 1
) as an example; assuming 25m target size including targeted and flanking sequences (-G 25m
)):
(seqCapture) $ seqCapture cleanpe -f /path/to/raw_reads_dir -o /path/to/cleaned_reads_dir -c contaminant.fasta -N 1 -G 25m ...[other desired options]...
For sequence data derived from museum samples (-A
), this module can call mapDamage2.0 to recalibrate base quality of the cleaned reads. A reference genome (-B ref.fa
) is required for alignment. Note this part of function has not been rigorously tested
(seqCapture) $ seqCapture cleanpe -f /path/to/raw_reads_dir -o /path/to/cleaned_reads_dir -c contaminant.fasta -A -B ref.fa
Output
This module creates a diretory named "pre-clean" inside "raw_reads_dir". In "pre-clean" there is another diretory "original". Original raw fastq reads for each sample are slightly reformatted, with renamed sequence headers and stored in "original" and their deduplicated copies are stored in "pre-clean".
The resulting cleaned reads are stored in "cleaned_reads_dir", four files are generated per sample. For example for Sample1:
(seqCapture) $ ls cleaned_reads_dir/
Sample1_1_final.fq (cleaned forward reads that can not be merged)
Sample1_2_final.fq (cleaned reverse reads that can not be merged)
Sample1_u_final.fq (merged overlapping cleaned PE reads)
Sample1.contam.out (reads likely sourced from contaminations and depleted from the above files)
If FASTQC is called then an additional diretory "evaluation" can be seen in "cleaned_reads_dir". FASTQC report for each fastq file is stored in this diretory
Single end reads:
Command and options:
(seqCapture) $ seqCapture cleanse
Usage: seqCapture cleanse [options]
Options:
-f DIR Path to the folder of raw fastq files
-o DIR Path to the results folder
-c FILE Contaminant file
-m FLOAT percent similarity to contaminant file to
consider a read to be a contaminant
[0.90]
-k INT Number of threads needed for bowtie2 alignment
(remove contamination), skewer(trimming adapters),
trimmomatic (trimming adapters and low qual), read
error correction (for de novo assemblies) [4]
-q INT Quality trimming threhold [20]
-d CHAR The particular library that you like to process,
if "all" is specified, all libraries in the folder
(-f) will be processed sequenctially [all]
-h INT Trimmomatic trimming length cutoff [36]
-z If z is supplied, use fastQC to evaluate
cleaned sequence reads [null]
#### Error correction is only recommended for de novo assemblies ####
#### DO NOT run error correction for alignment based analyses ####
-N INT error correction
0 = no error correction -- default
1 = using bfc (when number of reads is < 30 million)
2 = using Rcorrector (when number of reads is >= 30 million)
[0]
-G INT if N=1 approx genome size (k/m/g allowed)
for transcriptomic data set this to 50m [50m]
Prepare input for the run:
Store fastq reads (e.g. Sample1_L001_R1_001.fastq.gz, Sample2_L001_R1_001.fastq.gz, Sample3_L001_R1_001.fastq.gz,...) for each and all samples in the same folder "raw_reads_dir". For other input files please refer to paired "Prepare input for the run" for the PE reads.
Usage examples:
The usage is similar to those that are shown above in processing PE reads. However, reads deduplication and base quality recalibration for museum DNA is not applicable for SE reads. Nextera adapter trimming is not implemented in this module as it is rarely used.
A simple example: here we use most default values for various options; we also would like to generate quality report of cleaned reads using FASTQC (-z
):
(seqCapture) $ seqCapture cleanse -f /path/to/raw_reads_dir -o /path/to/cleaned_reads_dir -c contaminant.fasta -z
Output
This module creates a diretory named "pre-clean" inside "raw_reads_dir". Original raw fastq reads for each sample are slightly reformatted, with renamed sequence headers and stored in "pre-clean"
The resulting cleaned reads for each sample are stored in "cleaned_reads_dir", two files are generated per sample. For example for Sample1:
(seqCapture) $ ls resultdir/
Sample1_1_final.fq (cleaned forward reads)
Sample1.contam.out (reads likely sourced from contaminations and depleted from Sample1_1_final.fq)
If FASTQC is called then an additional diretory "evaluation" can be seen in "cleaned_reads_dir". FASTQC report for each fastq file is stored in this diretory