Module 5 Metagenome preprocessing and assembly - HRGV/Marmics_Metagenomics GitHub Wiki

Let's assemble!

REMEMBER: CAPITAL letters indicate names, variables, etc. that you can set yourself. Choose clarity over brevity! REMEMBER: The read files are the big files that start with MMMSYLT20-XX and end in fastq.gz

You can find the Illumina read files in /bioinf/transfer/Marmics_SPMG2021/MGs_2020/. The PacBio HiFi read file is in /bioinf/transfer/Marmics_SPMG2021/MGs_PacBio/

To be able to use all the tools we need, please activate the conda environment with all the preinstalled software packages.

We have installed the key tools, e.g. bbtools, SPAdes, etc., to prevent user-based installation hiccups.

NOTE: if no conda environment is indicated with (xxxxx) before the prompt, then you need to set up conda first. Please add the path to the anaconda installation to your .bashrc by adding the line export PATH=/software/ANACONDA/anaconda3/bin:$PATH. Then restart the terminal and execute conda init bash. If this is successful, please restart the terminal again as indicated. Now you should see a base environment (xxxx) before your prompt. If not, get help from the instructors.

Now you can move into the environment with all the installed tools:

conda activate /bioinf/transfer/Marmics_SPMG2021/SPMG_software_env/conda_test

To make the environment signature in the prompt more readable, you can use

conda config --set env_prompt '({name})'

Basic scripting

Create a script file

The aim of this exercise is to write a script that contains your read cleanup and assembly pipeline. To start, we need to create a script file, e.g. with

touch MYSCRIPT.sh

Edit the script - it is a text file!

There are various text editing programs available; the most common one is nano. You can open a second console to edit in nano while testing the different versions of your script in the other console. http://www.howtogeek.com/howto/42980/the-beginners-guide-to-nano-the-linux-command-line-text-editor/ is a neat guide on how to get work done in nano

nano MYSCRIPT.sh

First, add a shebang as the first line of your script. It tells the system which interpreter should run the script (the shebang alone does not make the file executable - for that you would use chmod +x MYSCRIPT.sh):

#!/bin/bash

Run the script

To run a bash script, pass it to the bash interpreter:

bash MYSCRIPT.sh

Comments

Comments can be inserted with the # character. You can also use the # to 'comment out' commands you do not want to run, e.g. in a test run.

# this is a comment
# echo 'Hello world' is not executed

PacBio pipeline

This pipeline is very easy - only two commands are necessary, so it is a good way to try an entry-level bash script

Assembly of the PacBio reads using flye

The PacBio reads we have are high-fidelity reads generated via circular consensus sequencing (CCS). The CCS reads are already preprocessed and flye can use them directly as input. For the use of flye, please read the documentation at https://github.com/fenderglass/Flye/blob/flye/docs/USAGE.md.

DO NOT FORGET - We must only use a subset of the reads for the test runs. For the PacBio reads, we can simply do this by subsampling, e.g. the first 10,000 reads, using a zcat, head, and pigz pipe for on-the-fly decompression, subsampling, and compression. Each FASTQ record takes 4 lines, so 10,000 reads correspond to 40,000 lines:

zcat PACBIOREADS.fastq.gz | head -n 40000 | pigz -c -p 8 > PACBIOREADS_10k.fq.gz

You can then assemble using flye with

flye --pacbio-hifi PACBIOREADS_10k.fq.gz -o flye_test_10k --threads 8  

This uses 8 CPUs and writes the assembly to a new folder called flye_test_10k.
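The subsampling step above can be checked end to end on synthetic data. The sketch below builds a fake gzipped FASTQ file, subsamples it, and counts the reads that come out; gzip stands in for pigz (same format, pigz is just a parallel implementation), and the flye call is shown commented out since it needs the real reads.

```shell
#!/bin/bash
# Demonstrate the subsampling step on a synthetic FASTQ file.
# Each FASTQ record is 4 lines, so head -n 40000 keeps 10000 reads.

# build a fake gzipped FASTQ with 20000 minimal reads
for i in $(seq 1 20000); do
    printf '@read_%s\nACGT\n+\nIIII\n' "$i"
done | gzip -c > PACBIOREADS.fastq.gz

# subsample the first 10000 reads (40000 lines); gzip stands in for pigz
zcat PACBIOREADS.fastq.gz | head -n 40000 | gzip -c > PACBIOREADS_10k.fq.gz

# verify: count reads in the subsampled file
nreads=$(( $(zcat PACBIOREADS_10k.fq.gz | wc -l) / 4 ))
echo "$nreads"   # prints 10000

# the assembly step would then be (commented out - needs the real reads):
# flye --pacbio-hifi PACBIOREADS_10k.fq.gz -o flye_test_10k --threads 8
```

This is a good pattern for sanity-checking any pipe before pointing it at a multi-gigabyte file.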

Illumina pipeline

This pipeline is still easy - only a few commands are necessary, but we are now using variables; everything else is more of the same command-after-command stuff

Let's be lazy and define some variables

Variables are very handy as they save you typing work and add consistency. You can later also use variables to make scripts more flexible.

Useful examples for your script could be

lib=1385_A
readlimit=10000
threads=5

If you then want to access your variables, you call them with a $ sign in front.

$lib 
bbmap reads=$readlimit

If you want to append something to your variable, put the variable name in curly braces ${VARIABLE} so that the shell knows where the variable name ends.

${lib}_trimmed_reads.fastq.gz

This is called variable expansion and it is much more powerful - read on, e.g. at https://en.wikibooks.org/wiki/Bourne_Shell_Scripting/Variable_Expansion

Trim the reads

You trim adapters from the left, then from the right, in two separate runs, using BBDuk http://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/bbduk-guide/.

To be able to trim, we need a reference file for the adapters. You can find the file (phix-and-truseq.fa) in your Marmics lecture folder. For convenience, you can create a softlink in your local folder that points to that file.

ln -s /bioinf/transfer/Marmics_SPMG2021/lecture_5_assembly/phix-and-truseq.fa . 

We only use a subset of the reads for the test runs. We do this by using the reads switch with the $readlimit variable.

bbduk.sh ref=phix-and-truseq.fa interleaved=t ktrim=l mink=11 hdist=1 in=./${lib}PUT_YOUR_FORWARD_READS_FILENAME_WITHOUT_THE_LIBRARY_NAME_HERE in2=./${lib}PUT_YOUR_REVERSE_READS_FILENAME_WITHOUT_THE_LIBRARY_NAME_HERE out=${lib}_all_fr_ktriml.fq.gz overwrite=t reads=$readlimit threads=$threads

ref= defines the adapter reference file
interleaved=t tells the BBTools that the read files are processed as paired-end files in interleaved format
ktrim=l sets the trimming to the left side of the read
mink=11 sets the minimal size of kmers to search with at the edge of reads
hdist=1 allows one mismatch (a hamming distance of 1)
in= and in2= define the read files for read1 and read2
out= sets the output file for the trimmed reads; we use one output, which will be in interleaved format
overwrite=t allows the BBTools to overwrite files that are already present USE WITH CAUTION
reads= sets the number of reads to be processed. This counts read pairs, so setting it to 10000 will process 10000 reads from read1 and from read2
threads= defines the number of CPUs that are used

In the second adapter-trimming run, we can also get the quality trimming done, here to a minimal phred quality of 2, and we also set a minimal length of 50

bbduk.sh ref=phix-and-truseq.fa interleaved=t ktrim=r trimq=2 qtrim=rl minlength=50 mink=11 hdist=1 in=${lib}_all_fr_ktriml.fq.gz out=${lib}_all_fr_ktrimlr_q2.fq.gz overwrite=t threads=$threads

ktrim=r sets the adapter trimming to the right end
trimq=2 sets a minimal quality of 2 for the quality-based trimming
qtrim=rl sets the quality trimming to be performed from both sides of the read
minlength=50 sets the minimal length of reads to keep to 50 bp
hdist=1 sets the hamming distance to 1

Read error correction and assembly pipeline: SPAdes

We will assemble our reads using SPAdes http://cab.spbu.ru/software/spades/, a multi-k-mer de Bruijn graph assembler with flexible support for Illumina data. SPAdes can occupy the whole server, so we take only 10 processors and at most 256 GB of RAM. The maximum k-mer to assemble with depends on the read length of your library; 99 would be a good choice for the 150 bp libraries, 127 for the 250 bp libraries.

It comes in handy to define the output folder (set with -o) via a variable that you can later reuse.

as=${lib}_SP310 #rename for your assembly name of choice
spades.py -k 21,33,55,77 -o $as --12 ./${lib}_all_fr_ktrimlr_q2.fq.gz -t 10 -m 256 --phred-offset 33  

-k defines the k-mers to use, you have to provide a list separated with commas
-o sets the output directory
--12 sets the paired-end reads that come as an interleaved file
-t sets the threads, we use 10
-m sets the Gigabytes of RAM to use, 256 is a safe setting for several of you sharing a single server
--phred-offset sets the quality score encoding to the Sanger typical encoding
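To see how the pieces fit together, the sketch below writes one possible complete pipeline into MYSCRIPT.sh via a heredoc, so it can be inspected without the tools installed. The R1/R2 read file names are hypothetical placeholders that you must replace with your actual file names; all flags are the ones used above.

```shell
#!/bin/bash
# Write a sketch of the full Illumina pipeline into a script file.
# The _R1/_R2 read file names are placeholders - adapt to your library.
cat > MYSCRIPT.sh <<'EOF'
#!/bin/bash
lib=1385_A
readlimit=10000
threads=5

# trim adapters from the left
bbduk.sh ref=phix-and-truseq.fa interleaved=t ktrim=l mink=11 hdist=1 \
    in=./${lib}_R1.fastq.gz in2=./${lib}_R2.fastq.gz \
    out=${lib}_all_fr_ktriml.fq.gz overwrite=t reads=$readlimit threads=$threads

# trim adapters from the right plus quality trimming (Q2, min length 50)
bbduk.sh ref=phix-and-truseq.fa interleaved=t ktrim=r trimq=2 qtrim=rl \
    minlength=50 mink=11 hdist=1 in=${lib}_all_fr_ktriml.fq.gz \
    out=${lib}_all_fr_ktrimlr_q2.fq.gz overwrite=t threads=$threads

# assemble with SPAdes
as=${lib}_SP310
spades.py -k 21,33,55,77 -o $as --12 ./${lib}_all_fr_ktrimlr_q2.fq.gz \
    -t 10 -m 256 --phred-offset 33
EOF

chmod +x MYSCRIPT.sh
echo "wrote MYSCRIPT.sh"
```

The quoted 'EOF' delimiter keeps ${lib} and friends from being expanded while writing the file, so they stay as variables inside the script.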

OPTIONAL k-mer spectral analysis

This module is optional and could be run before the assembly to filter reads we do not want to assemble - e.g. the host. First, we need to generate a k-mer spectrum. This can be done with BBNorm.

bbnorm.sh -Xmx64g in=${lib}_all_R1.fq.gz in2=${lib}_all_R2.fq.gz  hist=${lib}_all_khist.txt peaks=${lib}_all_khist_peaks.txt threads=12 passes=1

You can now plot the k-mer histogram file on a log/log plot and analyze it, e.g. using R, Calc, or Excel. Compare this to the peaks analysis performed by BBNorm (the peaks= output file).
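If you want to prepare the log/log data on the command line before importing it into a plotting tool, a quick awk transform works. The sketch below uses synthetic two-column data (depth, count) in place of the real khist file; real histogram files may contain header lines starting with #, which the grep skips.

```shell
#!/bin/bash
# Transform a k-mer histogram (depth<TAB>count) to log10/log10 columns.
# Synthetic data stands in for the real ${lib}_all_khist.txt file.
printf '1\t100000\n10\t500\n100\t2000\n1000\t10\n' > demo_khist.txt

# skip possible header lines, then take log10 of both columns
grep -v '^#' demo_khist.txt | \
    awk -F'\t' '{ printf "%.3f\t%.3f\n", log($1)/log(10), log($2)/log(10) }' \
    > demo_khist_loglog.txt

head -n 1 demo_khist_loglog.txt   # prints 0.000<TAB>5.000
```

The transformed file can then be plotted with linear axes, which is equivalent to plotting the raw data on log/log axes.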

Then you can remove all reads with k-mer counts below a defined cutoff that you can define with highbindepth.

bbnorm.sh -Xmx64g in=./${lib}_all_fr_ktrimlr_q2.fq.gz lowbindepth=10 highbindepth=25 outhigh=${lib}_all_fr_ktrimlr_q2_k31higher25.fq.gz passes=1 threads=24 overwrite=t

Map the reads to the scaffolds files generated by the assemblers to get coverage+GC stats for gbtoolsLite

Read the BBMap manual at https://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/bbmap-guide/ and try to find a solution yourself. Think about PacBio vs. Illumina reads - they need different versions of bbmap!
If possible, try to use a stringent mapping ID, e.g. 98 %
Make sure to get the following output when using bbmap:

  • bam file = compressed file format for aligned reads
  • covstats = statistics on the coverage
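Once you have worked through the manual, one possible invocation might look like the sketch below. It is written to a file rather than executed, since it needs your real assembly and reads; minid=, covstats=, and bamscript= are documented BBMap options, but the file names here are placeholders you must adapt.

```shell
#!/bin/bash
# Sketch of mapping reads back to the assembly with BBMap (written to a
# file for inspection, not executed - it needs the real data and tools).
cat > map_reads.sh <<'EOF'
#!/bin/bash
lib=1385_A
as=${lib}_SP310

# Illumina reads: stringent 98% identity, SAM output plus coverage stats;
# bamscript= writes a helper script that converts SAM to sorted BAM
# via samtools, which we then run.
bbmap.sh ref=$as/scaffolds.fasta in=${lib}_all_fr_ktrimlr_q2.fq.gz \
    interleaved=t minid=0.98 out=${lib}_mapped.sam \
    covstats=${lib}_covstats.txt bamscript=to_bam.sh
sh to_bam.sh

# PacBio reads would use mapPacBio.sh instead of bbmap.sh
EOF
echo "wrote map_reads.sh"
```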