Basic Metatranscriptomics Workflow - meyermicrobiolab/Meyer_Lab_Resources GitHub Wiki
Each bash script just needs to be saved as a plain text file; the file ending is flexible (.txt, .sh, and .bash all work). I prefer .txt for ease in opening the script to edit, and I avoid .sh and .bash because if I download the script and click on it to open, the default on my Mac asks me if I want to run the script, which I don't. Place the script in the HiPerGator directory you wish to perform the job in. From the terminal, run the following to start a job:
sbatch [path to script]
See HiPerGator Cheat Sheet for more details.
Steps
Step 1 is always Quality Control
We use fastqc to assess read quality and Trimmomatic to quality-filter the raw reads and remove Illumina adaptor sequences (ICBR does NOT remove adaptors). To begin, run fastqc to assess quality and get the number of reads in your dataset before trimming adapters. Note that I collect qc data on the raw reads at the beginning and on Trimmomatic-processed reads at the end. Best practice is to keep all of this qc data, as you should be reporting, at a minimum, how many reads went into your assemblies.
#!/bin/bash
#SBATCH --job-name=fastqc
#SBATCH --output=fastqc%j.log
#SBATCH --error=fastqc%j.err
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=[ufl email]
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=5
#SBATCH --mem-per-cpu=5gb
#SBATCH --time=4-00:00:00
#SBATCH --account=juliemeyer
#SBATCH --qos=juliemeyer-b
export OMP_NUM_THREADS=5
pwd; hostname; date
module load fastqc
fastqc [path to directory]*.fastq.gz -t 5
Next, you'll need to run MultiQC to summarize the fastqc reports into one .html file. You can run MultiQC on a development node. To log into a development session, run:
module load ufrc
srundev --time=1:00:00
Once on the development node, change to the directory with all the fastqc reports, then run MultiQC in the terminal:
module load multiqc
multiqc .
This will give you a multiqc report in the same directory you were working in. Save this file.
Note: if you want to leave the development node and return to the login mode to submit a bash script to SLURM, simply type "exit" and hit return on the command line.
After this, run Trimmomatic on your fastq.gz files:
#!/bin/bash
#SBATCH --job-name=trim_ssid
#SBATCH --output=trim_ssid%j.log
#SBATCH --error=trim_ssid%j.err
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=[email protected]
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=5
#SBATCH --mem-per-cpu=5gb
#SBATCH --time=4-00:00:00
#SBATCH --account=juliemeyer
#SBATCH --qos=juliemeyer-b
export OMP_NUM_THREADS=5
module load trimmomatic
# Note: TruSeq3-SE.fa is Trimmomatic's single-end adapter file; for paired-end
# reads, ${HPC_TRIMMOMATIC_ADAPTER}/TruSeq3-PE.fa is the usual choice.
trimmomatic PE -phred33 -basein /blue/juliemeyer/share/SSID_RNAseq_rawreads/TCP102_S26_L002_R1_001.fastq.gz TCP102_R1_trimmmed_Q20_1P.fastq.gz TCP102_R1_trimmmed_Q20_1U.fastq.gz TCP102_R2_trimmmed_Q20_2P.fastq.gz TCP102_R2_trimmmed_Q20_2U.fastq.gz ILLUMINACLIP:${HPC_TRIMMOMATIC_ADAPTER}/TruSeq3-SE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
trimmomatic PE -phred33 -basein /blue/juliemeyer/share/SSID_RNAseq_rawreads/TCP101_S25_L002_R1_001.fastq.gz TCP101_R1_trimmmed_Q20_1P.fastq.gz TCP101_R1_trimmmed_Q20_1U.fastq.gz TCP101_R2_trimmmed_Q20_2P.fastq.gz TCP101_R2_trimmmed_Q20_2U.fastq.gz ILLUMINACLIP:${HPC_TRIMMOMATIC_ADAPTER}/TruSeq3-SE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
trimmomatic PE -phred33 -basein /blue/juliemeyer/share/SSID_RNAseq_rawreads/TCP100_S24_L002_R1_001.fastq.gz TCP100_R1_trimmmed_Q20_1P.fastq.gz TCP100_R1_trimmmed_Q20_1U.fastq.gz TCP100_R2_trimmmed_Q20_2P.fastq.gz TCP100_R2_trimmmed_Q20_2U.fastq.gz ILLUMINACLIP:${HPC_TRIMMOMATIC_ADAPTER}/TruSeq3-SE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
trimmomatic PE -phred33 -basein /blue/juliemeyer/share/SSID_RNAseq_rawreads/TCP99_S23_L002_R1_001.fastq.gz TCP99_R1_trimmmed_Q20_1P.fastq.gz TCP99_R1_trimmmed_Q20_1U.fastq.gz TCP99_R2_trimmmed_Q20_2P.fastq.gz TCP99_R2_trimmmed_Q20_2U.fastq.gz ILLUMINACLIP:${HPC_TRIMMOMATIC_ADAPTER}/TruSeq3-SE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
trimmomatic PE -phred33 -basein /blue/juliemeyer/share/SSID_RNAseq_rawreads/TCP98_S22_L002_R1_001.fastq.gz TCP98_R1_trimmmed_Q20_1P.fastq.gz TCP98_R1_trimmmed_Q20_1U.fastq.gz TCP98_R2_trimmmed_Q20_2P.fastq.gz TCP98_R2_trimmmed_Q20_2U.fastq.gz ILLUMINACLIP:${HPC_TRIMMOMATIC_ADAPTER}/TruSeq3-SE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
trimmomatic PE -phred33 -basein /blue/juliemeyer/share/SSID_RNAseq_rawreads/TCP97_S21_L002_R1_001.fastq.gz TCP97_R1_trimmmed_Q20_1P.fastq.gz TCP97_R1_trimmmed_Q20_1U.fastq.gz TCP97_R2_trimmmed_Q20_2P.fastq.gz TCP97_R2_trimmmed_Q20_2U.fastq.gz ILLUMINACLIP:${HPC_TRIMMOMATIC_ADAPTER}/TruSeq3-SE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
trimmomatic PE -phred33 -basein /blue/juliemeyer/share/SSID_RNAseq_rawreads/TCP96_S20_L002_R1_001.fastq.gz TCP96_R1_trimmmed_Q20_1P.fastq.gz TCP96_R1_trimmmed_Q20_1U.fastq.gz TCP96_R2_trimmmed_Q20_2P.fastq.gz TCP96_R2_trimmmed_Q20_2U.fastq.gz ILLUMINACLIP:${HPC_TRIMMOMATIC_ADAPTER}/TruSeq3-SE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
#and so on with your own samples
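Rather than copy-pasting one command per sample, the repeated Trimmomatic lines can be generated with a bash loop. This is only a sketch using the example sample IDs above; substitute your own sample_lane pairs and paths. The echo prints each command so you can review it first; delete the echo to actually run them.

```shell
# Sketch: build the per-sample Trimmomatic commands in a loop.
# Sample IDs and RAW path are from the example above; use your own.
RAW=/blue/juliemeyer/share/SSID_RNAseq_rawreads
for pair in TCP102_S26 TCP101_S25 TCP100_S24 TCP99_S23 TCP98_S22 TCP97_S21 TCP96_S20; do
  sample=${pair%%_*}    # strips the _S## suffix, e.g. TCP102
  echo trimmomatic PE -phred33 -basein "${RAW}/${pair}_L002_R1_001.fastq.gz" \
    "${sample}_R1_trimmmed_Q20_1P.fastq.gz" "${sample}_R1_trimmmed_Q20_1U.fastq.gz" \
    "${sample}_R2_trimmmed_Q20_2P.fastq.gz" "${sample}_R2_trimmmed_Q20_2U.fastq.gz" \
    "ILLUMINACLIP:${HPC_TRIMMOMATIC_ADAPTER}/TruSeq3-SE.fa:2:30:10" \
    LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
done
```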
Please note that Trimmomatic needs specific input and output files for each sample. For paired-end data, two input files and four output files are specified per sample: two for the 'paired' output where both reads survived the processing, and two for the corresponding 'unpaired' output where one read survived but its partner did not. With the -basein flag you name only the forward input file; Trimmomatic determines the reverse file automatically by looking for common patterns of file naming and changing the appropriate character to reference the reverse file. You must explicitly name all four output files.
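The forward-to-reverse naming pattern that -basein relies on can be mimicked in plain bash (illustrative only; Trimmomatic's own detection covers several common naming patterns):

```shell
# Derive the reverse-read filename from the forward one by swapping
# _R1_ for _R2_ -- the same kind of pattern -basein detects.
fwd=TCP102_S26_L002_R1_001.fastq.gz
rev=${fwd/_R1_/_R2_}
echo "$rev"    # TCP102_S26_L002_R2_001.fastq.gz
```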
After this, you should move your trimmed files into a different folder/directory using the mv (move) command. You will then run fastqc and generate a multiqc html file for your paired, trimmed sequences. Save this data.
Step 2. Remove rRNA from our data
Because we are doing metatranscriptomics, we want to be analyzing only mRNA. To do this, we use SortMeRNA, which filters, maps, and clusters our reads against a given sequence database. Here, its purpose is to filter rRNA out of metatranscriptomic data. SortMeRNA takes as input files of reads (fasta, fastq, fasta.gz, fastq.gz) and one or more rRNA database file(s), and sorts aligned and rejected reads into separate files. Technically, we want the rejected reads (written to the "other" files), as SortMeRNA is looking for rRNA (see User Manual)!
To run SortMeRNA, submit something like this:
#!/bin/bash
#SBATCH --job-name=sortme_ssid_5
#SBATCH --output=sortme_ssid_5%j.log
#SBATCH --error=sortme_ssid_5%j.err
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --mem-per-cpu=36gb
#SBATCH --cpus-per-task=6
#SBATCH --mail-user=[@ufl.edu]
#SBATCH --time=30-00:00:00
#SBATCH --account=juliemeyer
#SBATCH --qos=juliemeyer
export OMP_NUM_THREADS=6
# https://github.com/biocore/sortmerna/wiki/User-manual-v4.0#Example_multiple_databases_and_the_fastest_alignment_option
# for silva, LS=large subunit, SS=small subunit
module load sortmerna/4.3.6
sortmerna \
--workdir /blue/juliemeyer/r.howard/94_sortme/ \
--ref /blue/juliemeyer/share/SSID_RNAseq_rawreads/trimmed/SILVA_138.1_LSURef_NR99_tax_silva_trunc.fasta \
--ref /blue/juliemeyer/share/SSID_RNAseq_rawreads/trimmed/SILVA_138.1_SSURef_NR99_tax_silva_trunc.fasta \
--reads /blue/juliemeyer/share/SSID_RNAseq_rawreads/trimmed/TCP94_R1_trimmmed_Q20_1P.fastq.gz \
--reads /blue/juliemeyer/share/SSID_RNAseq_rawreads/trimmed/TCP94_R2_trimmmed_Q20_2P.fastq.gz \
--sam \
--fastx --out2 --other --paired_out \
--blast 1 \
--num_alignments 1 \
--threads 6 \
-v
rm -rf /blue/juliemeyer/r.howard/94_sortme/kvdb/
#and so on with your own samples...
Once you have finished running SortMeRNA on all of your samples, you will have one output folder per sample, named whatever you set --workdir to in your code (for this example, my output folder was named 94_sortme). Inside each, you will see 3 folders. One will be labeled "out"; this is where your output .fq.gz files are. We want the ones we specified as "other" in our code, as this is where our non-rRNA reads end up.
Here's where it gets a little tricky. We were able to specify the output working directory, but not the output names of our desired .fq.gz files, so we have to rename these files one by one on the command line. To do that for this example:
cd /blue/juliemeyer/r.howard/sortmerna/94_sortme/out/
mv other_fwd.fq.gz TCP94_fwd.fq.gz
mv other_rev.fq.gz TCP94_rev.fq.gz
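Renaming by hand for many samples is error-prone, so the renaming can be scripted. The sketch below assumes per-sample workdirs named like 94_sortme, 95_sortme under one parent folder (an assumption based on this example); here it runs against throwaway files in a temporary directory so you can see the effect safely.

```shell
# Sketch: rename SortMeRNA "other" outputs for several samples at once.
# PARENT is a temporary stand-in for /blue/juliemeyer/r.howard/sortmerna;
# the NN_sortme workdir naming is an assumption from this example.
PARENT=$(mktemp -d)
for n in 94 95; do                        # your sample numbers here
  mkdir -p "$PARENT/${n}_sortme/out"      # fake the SortMeRNA output layout
  touch "$PARENT/${n}_sortme/out/other_fwd.fq.gz" \
        "$PARENT/${n}_sortme/out/other_rev.fq.gz"
done
for n in 94 95; do
  mv "$PARENT/${n}_sortme/out/other_fwd.fq.gz" "$PARENT/${n}_sortme/out/TCP${n}_fwd.fq.gz"
  mv "$PARENT/${n}_sortme/out/other_rev.fq.gz" "$PARENT/${n}_sortme/out/TCP${n}_rev.fq.gz"
done
ls "$PARENT/94_sortme/out"   # now contains TCP94_fwd.fq.gz and TCP94_rev.fq.gz
```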
You need to repeat this process for every sample; otherwise they will all be left labeled as "other" files. The next step is moving the files into another working directory, so that we can continue downstream analysis with just these files. To do that for all of these samples, run something like this:
find /blue/juliemeyer/r.howard/sortmerna/ -type f -name "TCP*" -exec mv {} /blue/juliemeyer/r.howard/sortme_output/ \;
This command will move all files that start with "TCP" in the parent folder (sortmerna) to the destination folder (sortme_output).
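You can convince yourself of what the find/-exec mv pattern does on throwaway files first; the temporary directories below are stand-ins for the real sortmerna and sortme_output folders.

```shell
# Sanity check of the find/-exec mv pattern on scratch files.
SRC=$(mktemp -d); DEST=$(mktemp -d)   # stand-ins for sortmerna/ and sortme_output/
mkdir -p "$SRC/94_sortme/out"
touch "$SRC/94_sortme/out/TCP94_fwd.fq.gz" "$SRC/94_sortme/out/TCP94_rev.fq.gz"
find "$SRC" -type f -name "TCP*" -exec mv {} "$DEST/" \;
ls "$DEST"                            # TCP94_fwd.fq.gz and TCP94_rev.fq.gz
```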
I recommend running SortMeRNA on 5-10 samples at a time; these jobs can take around 2-3 days.