Step 2: Filtering - srkoppolu/SK_RNA-Seq GitHub Wiki

The second step in the pipeline removes ribosomal RNA (rRNA) sequences and filters the remaining reads by base quality.

Removal of Ribosomal Sequences:

SortMeRNA is a local sequence alignment tool for filtering, mapping and clustering. It is widely used for sensitive analysis of NGS reads.

The main application of SortMeRNA is to filter rRNA from metatranscriptomic data. It takes a fasta/fastq file of sequence reads and one or more rRNA database files as input, and separates aligned and rejected reads into two files as specified by the user. Clustering and taxonomy assignment are also available through QIIME.

SortMeRNA works with Illumina, Ion Torrent and PacBio data, and can produce SAM and BLAST-like alignments. For further information, see the SortMeRNA documentation.

To install SortMeRNA in bash:

# get binary distro (substitute the release version as desired)
wget https://github.com/biocore/sortmerna/releases/download/v3.0.3/sortmerna-3.0.3-linux.sh

# view the installer usage
chmod +x ./sortmerna-3.0.3-linux.sh
./sortmerna-3.0.3-linux.sh --help

# set PATH
export PATH=$HOME/sortmerna/bin:$PATH

# test the installation
sortmerna --version

# view help
sortmerna -h

To run SortMeRNA in bash:

# In case of multiple fastq files for paired-end reads, unzip them and merge the forward and reverse files into a single fq file per sample.
gunzip *.gz
merge-paired-reads.sh forward-reads.fq reverse-reads.fq merged-reads.fq

# Run sortmerna for each individual sample
sortmerna --ref $SORTMERNA_DB --reads sample_1_merged.fq --paired_in -a 16 --log --fastx --aligned sample_1_rRNA --other sample_1_sortmerna 

# Unmerge the paired reads
unmerge-paired-reads.sh sample_1_sortmerna.fq sample_1_sortmerna_1.fq sample_1_sortmerna_2.fq

# Look at the log file 
more sortmerna/sample_1_rRNA.log

The merge-paired-reads.sh and unmerge-paired-reads.sh scripts can be found in the SortMeRNA repository on GitHub.
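
To make the merge step concrete: merging here means interleaving, so that each forward read is immediately followed by its reverse mate in a single file. The function below is only an illustrative sketch of that idea, not the script from the SortMeRNA repository. The name interleave_fastq is hypothetical, and the sketch assumes plain 4-line FASTQ records listed in the same order in both files.

```bash
# Interleave two FASTQ files: one forward record, then its reverse mate.
# Hypothetical helper for illustration; use merge-paired-reads.sh in practice.
# Assumes 4 lines per record and identical read order in both files.
interleave_fastq() {
  awk -v rev="$2" '
    { print }                         # emit the current forward line
    NR % 4 == 0 {                     # a forward record (4 lines) is complete
      for (i = 0; i < 4; i++)         # copy the matching reverse record
        if ((getline line < rev) > 0) print line
    }
  ' "$1"
}
```

For example, interleave_fastq forward-reads.fq reverse-reads.fq > merged-reads.fq writes the interleaved records to merged-reads.fq.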

To run SortMeRNA for multiple samples, please edit and run the sortmeRNAfilter.sh script.
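
As a rough idea of what a multi-sample run can look like, the sketch below loops over interleaved files and repeats the single-sample commands shown above for each one. It is not the contents of sortmeRNAfilter.sh. The function names run and filter_rrna, the DRY_RUN switch and the <sample>_merged.fq naming scheme are assumptions made for illustration.

```bash
# Print the command instead of running it when DRY_RUN=1 (hypothetical switch).
run() {
  if [ "${DRY_RUN:-0}" = 1 ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Filter rRNA from every interleaved file given as an argument.
# Assumes files are named <sample>_merged.fq and that SORTMERNA_DB is set
# exactly as for the single-sample command.
filter_rrna() {
  for reads in "$@"; do
    sample=$(basename "$reads" _merged.fq)
    run sortmerna --ref "$SORTMERNA_DB" --reads "$reads" --paired_in -a 16 \
        --log --fastx --aligned "${sample}_rRNA" --other "${sample}_sortmerna"
    # Split the non-rRNA reads back into forward and reverse files.
    run unmerge-paired-reads.sh "${sample}_sortmerna.fq" \
        "${sample}_sortmerna_1.fq" "${sample}_sortmerna_2.fq"
  done
}
```

Setting DRY_RUN=1 before calling filter_rrna *_merged.fq prints the commands for review; unset it to actually run them.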


Base Filtering:

The next step after the removal of rRNA sequences is quality trimming and adapter clipping.

Trim Galore:

Trim Galore is a wrapper around Cutadapt and FastQC that consistently applies adapter and quality trimming to fastq files, with extra functionality for RRBS data. For it to work properly, both tools must be installed and the trim_galore script must be copied to a location on your PATH.

For example:

# Check that cutadapt is installed
cutadapt --version  

# Check that FastQC is installed
fastqc -v  

# Install Trim Galore
curl -fsSL https://github.com/FelixKrueger/TrimGalore/archive/0.6.0.tar.gz -o trim_galore.tar.gz
tar xvzf trim_galore.tar.gz  

# Run Trim Galore
~/TrimGalore-0.6.0/trim_galore
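
Since Trim Galore depends on other tools being on PATH, it can help to check for all of them at once before starting a long run. The function below is a minimal sketch of such a check. check_tools is a hypothetical name; it relies only on the shell built-in command -v.

```bash
# Report each required tool and return non-zero if any is not on PATH.
# Hypothetical helper for illustration.
check_tools() {
  missing=0
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "found:   $tool"
    else
      echo "missing: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

For example, check_tools cutadapt fastqc trim_galore || exit 1 at the top of a pipeline script stops early when a dependency is missing.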

For further instructions on Trim Galore, please refer to the User Guide.

Trimmomatic:

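Another way to do quality trimming and adapter clipping for Illumina sequencing data is to use Trimmomatic.

You often don't need leading and trailing clipping. When working with paired-end data, keepBothReads can be useful: in palindrome mode the reverse read is normally dropped once an adapter read-through is detected, because it carries the same information as the forward read. Keeping it retains redundant information, but this will likely make your pipelines more manageable. Both keepBothReads and the minimum adapter length are optional trailing fields of the ILLUMINACLIP step, which has the form ILLUMINACLIP:<adapter fasta>:<seed mismatches>:<palindrome clip threshold>:<simple clip threshold>:<min adapter length>:<keepBothReads>. Note the additional :2 in front of keepBothReads; this is the minimum adapter length in palindrome mode, which you can even set to 1 (the default is a very conservative 8).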
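
To show where the two optional fields sit, the sketch below assembles an ILLUMINACLIP step string from its parts. illuminaclip_step is a hypothetical helper, and the value passed for the keepBothReads field in the usage note is an assumption; check the Trimmomatic manual for your version to see which value it expects.

```bash
# Build an ILLUMINACLIP step string for Trimmomatic.
# Arguments: adapter fasta, seed mismatches, palindrome clip threshold,
# simple clip threshold, then optionally: min adapter length, keepBothReads.
# Hypothetical helper for illustration.
illuminaclip_step() {
  step="ILLUMINACLIP:${1}:${2}:${3}:${4}"
  # Append the two optional fields only when both are given.
  if [ -n "${5:-}" ] && [ -n "${6:-}" ]; then
    step="${step}:${5}:${6}"
  fi
  printf '%s\n' "$step"
}
```

For example, illuminaclip_step TruSeq3-PE.fa 2 30 10 2 True prints ILLUMINACLIP:TruSeq3-PE.fa:2:30:10:2:True, while leaving out the last two arguments gives the shorter four-field form.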

Run Trimmomatic on each sample individually (wrap the command in a bash script to process many samples efficiently):

# Run Trimmomatic
trimmomatic PE -threads 8 -phred64 sortmerna/sample_1_sortmerna_1.fq sortmerna/sample_1_sortmerna_2.fq sample_1_sortmerna_trimmomatic_1.fq sample_1_sortmerna_unpaired_1.fq sample_1_sortmerna_trimmomatic_2.fq sample_1_sortmerna_unpaired_2.fq ILLUMINACLIP:/usr/share/Trimmomatic-0.33/adapters/TruSeq3-PE.fa:3:30:10 SLIDINGWINDOW:5:20 MINLEN:50

To run Trimmomatic for multiple samples, please edit and run the trimmomaticfilter.sh script.
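
A per-sample loop for Trimmomatic might look like the sketch below, which repeats the single-sample command above for each sample name. It is not the contents of trimmomaticfilter.sh. The names run and trim_samples, the DRY_RUN switch and the ADAPTERS variable are assumptions for illustration, and the input layout sortmerna/<sample>_sortmerna_1.fq and _2.fq is taken from the command above.

```bash
# Print the command instead of running it when DRY_RUN=1 (hypothetical switch).
run() {
  if [ "${DRY_RUN:-0}" = 1 ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Trim each sample named on the command line with the single-sample settings.
# ADAPTERS (hypothetical variable) overrides the adapter file path.
# -phred64 is copied from the command above; use -phred33 instead if your
# FASTQ files use Phred+33 quality encoding.
trim_samples() {
  adapters=${ADAPTERS:-/usr/share/Trimmomatic-0.33/adapters/TruSeq3-PE.fa}
  for sample in "$@"; do
    run trimmomatic PE -threads 8 -phred64 \
        "sortmerna/${sample}_sortmerna_1.fq" "sortmerna/${sample}_sortmerna_2.fq" \
        "${sample}_sortmerna_trimmomatic_1.fq" "${sample}_sortmerna_unpaired_1.fq" \
        "${sample}_sortmerna_trimmomatic_2.fq" "${sample}_sortmerna_unpaired_2.fq" \
        "ILLUMINACLIP:${adapters}:3:30:10" SLIDINGWINDOW:5:20 MINLEN:50
  done
}
```

With DRY_RUN=1 set, trim_samples sample_1 sample_2 prints one Trimmomatic command per sample so the file names can be checked before anything is run.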