Metagenomics Sequencing Pre processing - LangilleLab/microbiome_helper GitHub Wiki

Filtering out Contaminants and Low-Quality Sequences with kneadData

kneaddata is a helpful wrapper script for a number of pre-processing tools, including Bowtie2 to screen out contaminant sequences and Trimmomatic to exclude low-quality sequences. We also have written wrapper scripts to run these tools (see below), but using kneaddata allows for more flexibility in options.

A basic kneaddata command can be run like this (the \ characters just break the command over multiple lines to make it easier to read):

kneaddata -i sample_R1.fastq -i sample_R2.fastq -o kneaddata_out \
-db /home/shared/bowtiedb/GRCh38_PhiX --trimmomatic /usr/local/prg/Trimmomatic-0.36/ \
-t 4 --trimmomatic-options "SLIDINGWINDOW:4:20 MINLEN:50" \
--bowtie2-options "--very-sensitive --dovetail" --remove-intermediate-output

The options above are:

  • -i Input FASTQ file. This option is given twice for paired-end data.
  • -o Output directory.
  • -db Path to bowtie2 database.
  • --trimmomatic Path to Trimmomatic folder containing jarfile.
  • -t 4 Number of threads to use (4 in this case).
  • --trimmomatic-options Options for Trimmomatic to use, in quotations ("SLIDINGWINDOW:4:20 MINLEN:50" in this case). See the Trimmomatic website for more options.
  • --bowtie2-options Options for bowtie2 to use, in quotations. The options "--very-sensitive" and "--dovetail" in this set alignment parameters to be very sensitive and sets cases where mates extend past each other to be concordant (i.e. they will be called as contaminants and be excluded).
  • --remove-intermediate-output Intermediate files, including large FASTQs, will be removed.

To run an example on all FASTQs on the Virtual Box Image you can use a command like this:

parallel -j 1 --link 'kneaddata -i {1} -i {2} -o kneaddata_out/ \
-db /home/shared/bowtiedb/GRCh38_PhiX --trimmomatic /usr/local/prg/Trimmomatic-0.36/ \
-t 4 --trimmomatic-options "SLIDINGWINDOW:4:20 MINLEN:50" \
--bowtie2-options "--very-sensitive --dovetail" --remove-intermediate-output' \
 ::: raw_data/*_R1.fastq ::: raw_data/*_R2.fastq

As always with parallel commands, there are options specific to parallel and specific to the program being run (kneaddata in this case). It's a good idea to use the parallel option --dry-run when running large commands like this to double-check that the commands are being run as you'd like (e.g. the R1 and R2 FASTQs are being paired correctly). This option will cause the commands to be printed to screen without running them.

Below is the same command as above, but with the kneaddata command being run collapsed to "KNEADDATA".

parallel -j 1 --link 'KNEADDATA' ::: raw_data/*_R1.fastq ::: raw_data/*_R2.fastq

The parallel options specified are -j 1 to specify that one job should be run at a time and --link to specify that two different sets of input files will be provided. These two sets of input files are the ::: raw_data/*_R1.fastq and ::: raw_data/*_R2.fastq at the end of the command. Since --link was specified each parallel command will run the files in set1 and set2 together (specified as {1} and {2} in the full command). Importantly, if you do not set the --link option all the files will just be looped over as one set.

If you have stitched reads or single-end data you can pre-process your data using this command instead:

parallel -j 1 'kneaddata -i {} -o kneaddata_out/ \
-db /home/shared/bowtiedb/GRCh38_PhiX --trimmomatic /usr/local/prg/Trimmomatic-0.36/ \
-t 4 --trimmomatic-options "SLIDINGWINDOW:4:20 MINLEN:50" \
--remove-intermediate-output' \
 ::: raw_data/*.fastq 

You can then get a summary table of reads that passed at each pre-processing step per sample with this command:

kneaddata_read_count_table --input kneaddata_out --output kneaddata_read_counts.txt

Wrapper script for Bowtie2

Our run_contaminant_filter.pl script wraps Bowtie2 to screen out human sequences. You can use the script like so:

run_contaminant_filter.pl -p 4 -o screened_reads/ stitched_reads/*.assembled*

Note that the GRCh38_PhiX (or whatever reference you're using) Bowtie2 index files need to be in /home/shared/bowtiedb/GRCh38_PhiX and bowtie2 needs to be in your PATH if you want to filter out human reads.

Alternatively you could use run_deconseq.pl instead, but it is much slower and the required databases are no longer available online so we have been unable to add them to the Virtual Box Image.

Wrapper script for Trimmomatic

run_trimmomatic.pl is a wrapper script that will run Trimmomatic on specified FASTQs. This script will automatically identify forward and reverse FASTQ pairs from the filenames. Note: Trimmomatic assumes that input forward and reverse reads are in the same order! See an example command below, you can type run_trimmomatic.pl -h to see all the options.

run_trimmomatic.pl -l 5 -t 5 -r 15 -w 4 -m 70 -j /usr/local/prg/Trimmomatic-0.36/trimmomatic-0.36.jar \
--thread 1 -o trimmomatic_filtered screened_reads/*fastq