4A. Day 3. Genome and transcriptome assembly methods and strategies - bioinfokushwaha/Livestock_Genomics GitHub Wiki

The following steps are involved:

  • Data download
  • Quality assessment and control (optional: not need for exercise)
  • de novo RNASeq assembly and Evaluation
  • Genome assembly and Evaluation analysis

Long read Assembly Exercise

Long_read_assembly.pdf

Short read Assembly Exercise

Transcriptome Assembly

==============================================

Transcriptome_Assembly_Final.pdf

About the Tutorial

This tutorial is based on the Trinity toolkit script and programs along with other software and tools. For the exercise, a small part of the original dataset have been used. So, participants have the opportunity to complete an assembly without using whole dataset. But, already QC and de novo assembly have been performed for exercise to save time.

Availability of data

Pair end Illumina data for exercise can be download from here: Download (CB:rnaseq-tutorial)

1. RNAseq data download

Download the input data to your tutorial working directory by using following unix commands

$ mkdir Day5 && cd Day5 && mkdir Data && 
$  cp ~/Workshop/Day5/Data/*gz Data/

May be, it is better to change file name for convenience

mkdir Raw
cp Data/hcc1395_normal_rep1_r1.fastq.gz Raw/NR1_r1.fastq.gz
cp Data/hcc1395_normal_rep1_r2.fastq.gz Raw/NR1_r2.fastq.gz
cp Data/hcc1395_normal_rep2_r1.fastq.gz Raw/NR2_r1.fastq.gz
cp Data/hcc1395_normal_rep2_r2.fastq.gz Raw/NR2_r2.fastq.gz
cp Data/hcc1395_normal_rep3_r1.fastq.gz Raw/NR3_r1.fastq.gz
cp Data/hcc1395_normal_rep3_r2.fastq.gz Raw/NR3_r2.fastq.gz
cp Data/hcc1395_tumor_rep1_r1.fastq.gz Raw/TR1_r1.fastq.gz
cp Data/hcc1395_tumor_rep1_r2.fastq.gz Raw/TR1_r2.fastq.gz
cp Data/hcc1395_tumor_rep2_r1.fastq.gz Raw/TR2_r1.fastq.gz
cp Data/hcc1395_tumor_rep2_r2.fastq.gz Raw/TR2_r2.fastq.gz
cp Data/hcc1395_tumor_rep3_r1.fastq.gz Raw/TR3_r1.fastq.gz
cp Data/hcc1395_tumor_rep3_r2.fastq.gz Raw/TR3_r2.fastq.gz

It can be difficult when you have large number of files for processing. So it is better to make a script for reproducible work: Rename.sh

#!/bin/bash
mkdir Raw
for f in Data/*.fastq.gz \
do  file=`echo ${f}|sed 's/Data\///;s/hcc1395_//;s/normal_/N/;s/rep/R/;s/tumor_/T/'|tr "\n" " "`;\
cp ${f} Raw/${file} ;\
done;
## Deleting two replicate to reduce computational time. But complete results are already computed
rm Raw/NR3* Raw/NR2* Raw/TR3* Raw/TR2*;

2. Quality assessment and control (Optional exercise: not needed for exercise)

Run FastQC on your fastq files before and after trimming: Before choose ur nice number and stick with this number for whole workshop

mkdir -p FastQC/FastQC_Raw
nice -n 1 find Raw -name '*.gz' | xargs fastqc -o FastQC/FastQC_Raw/ -t 1
ls FastQC/FastQC_Raw

Trimmomatic will be used for read trimming and quality control: stick with your nice number

# Creating new folder for trimming files
mkdir Trim

# Runing trimming on reads
find Raw/ -name "*.gz"|grep "_r1"|sort|while read READ_FW; do FILEBASE=$(basename "${READ_FW%.gz}"); Sample=$(echo $FILEBASE|sed 's/_.*//' ); echo $READ_FW $FILEBASE $Sample; \
nice -n 1 java -jar /bioinfo/Trimmomatic-0.38/trimmomatic-0.38.jar PE -phred33 Raw/$Sample"_r1.fastq.gz" Raw/$Sample"_r2.fastq.gz" Trim/$Sample"_PE_R1.fq" Trim/$Sample"_SE_R1.fq" Trim/$Sample"_PE_R2.fq" Trim/$Sample"_SE_R2.fq"  \
ILLUMINACLIP:/bioinfo/Trimmomatic-0.38/adapters/TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36; done 

# Remove single end reads after trimming from Trim folder
rm Trim/*SE* 

# Runing fastqc on trim reads
mkdir -p FastQC/FastQC_Trim
nice -n 1 find Trim -name '*.fq' |xargs fastqc -o FastQC/FastQC_Trim/ -t 1
ls FastQC/FastQC_Trim

3. Compilation of QA and QC report through multiQC tool

mkdir -p MultiQC/MultiQC_Raw
mkdir -p MultiQC/MultiQC_Trim

nice -n 1 multiqc FastQC/FastQC_Raw/ -o MultiQC/MultiQC_Raw
nice -n 1 multiqc FastQC/FastQC_Trim/ -o MultiQC/MultiQC_Trim/

## Copy ur multiqc raw and trim files from remote server to ur desktop
1. Through scp command
2. Through filezilla 

Assignment: make a brief summary report for your performed QC analysis 
Note: it is recommend to make bash script for QA and QC for reproducible work: QC.sh

4. De novo assembly through Trinity (Optional exercise: Not needed for exercise)

Trinity is one of the RNASeq assembler; designed specifically for transcriptome assembly. Trinity combines three independent software modules: Inchworm, Chrysalis, and Butterfly,applied sequentially to process large volumes of RNAseq reads. Trinity assembles a transcriptome by first extending individual RNA-seq reads into longer contigs, building many de Bruijn graphs from these contigs, and then deriving all the splicing-isoform-representing paths in each graph. One of the good things about the Trinity toolkit is that it allows post assembly analysis i.e. assembly evaluation, quantification and differential expression analysis. Trinity latest version can be found here

Make a transcriptome assembly through extracted illumina reads with trinityrnaseq-Trinity-v2.6.5. Lower and higher version may be have slightly different processing options.


nice -n 1 $TRINITY_HOME/Trinity --seqType fq \
--left Trim/NR1_PE_R1.fq Trim/NR1_PE_R2.fq Trim/NR2_PE_R1.fq Trim/NR3_PE_R1.fq Trim/TR1_PE_R1.fq Trim/TR1_PE_R2.fq Trim/TR2_PE_R1.fq Trim/TR3_PE_R1.fq \
--right Trim/NR1_PE_R2.fq Trim/NR2_PE_R1.fq Trim/NR2_PE_R2.fq Trim/NR3_PE_R2.fq Trim/TR1_PE_R2.fq Trim/TR2_PE_R1.fq Trim/TR2_PE_R2.fq Trim/TR3_PE_R2.fq \
--CPU 1 --max_memory 2G --full_cleanup

OR
cat Trim/*PE_R1.fq >Left.fq
cat Trim/*PE_R2.fq >Right.fq
nice -n 1 $TRINITY_HOME/Trinity --seqType fq --CPU 1 --max_memory 2G --full_cleanup

Note: In this tutorial, we will use only one replicate of normal and tumour tissues (NR1 and TR1) with CPU=1 and --max_memory 2G.
Please remember your nice number and stick with this before submission of job

5. Transcript abundance estimation

For transcriptome abundance estimation, Trinity include alignment-based methods (reads aligned to the transcript assembly i.e. RSEM and eXpress) and alignment-free methods ( k-mer based abundances i.e.kallisto and salmon). Good thing is that Trinity comes with a script to facilitate running your choice tools to quantify transcript abundance.

If you don't get executable in path, export the path of bowtie2, RSEM, kallisto and Salmon

export PATH=$PATH:/bioinfo/bowtie2-2.3.4.3-linux-x86_64/ export PATH=$PATH:/bioinfo/RSEM-1.3.1/ export PATH=$PATH:/bioinfo/salmon-0.11.3-linux_x86_64/bin/ export PATH=$PATH:/bioinfo/kallisto_linux-v0.45.0/

5.1 Copy Trinity assembly and gene-transcript mapping file from Day5 folder

cp ~/Workshop/Day5/Assembly/Trinity.fasta* ./

Prepare a tab-delimited text file indicating biological replicate and file relationships i.e. samples.txt

Condition Replicate FR RR
Normal NR1 NR1_r1.fastq.gz NR1_r2.fastq.gz
Normal NR2 NR2_r1.fastq.gz NR2_r2.fastq.gz
Normal NR3 NR3_r1.fastq.gz NR3_r2.fastq.gz
Tumor TR1 TR1_r1.fastq.gz TR1_r2.fastq.gz
Tumor TR2 TR2_r1.fastq.gz TR2_r2.fastq.gz
Tumor TR3 TR3_r1.fastq.gz TR3_r2.fastq.gz

Note: In this tutorial, one replicate only.

5.2 Prepare reference for alignment and read count

nice -n 1 $TRINITY_HOME/util/align_and_estimate_abundance.pl --transcripts Trinity.fasta \
--est_method RSEM --aln_method bowtie2 --trinity_mode --prep_reference --thread_count 1

4.3 Run read mapping and abundance estimation

Essential Input
--transcripts: Genome/transcript fasta file
--seqType:fq|fa

If Paired-end:
--left: FR.fastq
--right: RR.fastq
OR
Single-end
--single: single.fastq

--samples_file:samples.txt

--est_method:
alignment_based:  RSEM|eXpress       
alignment_free: kallisto|salmon


you can explore other options of script through --help

$TRINITY_HOME/util/align_and_estimate_abundance.pl --help

Abundance estimation using RSEM: RSEM uses strategy of expectation maximization to assign reads according to maximum likelihood estimates. The script will use Bowtie aligner to align reads, and RSEM will estimate expression values for each transcripts individually for each replicate.

nice -n 1 $TRINITY_HOME/util/align_and_estimate_abundance.pl --transcripts Trinity.fasta --seqType fq \
--left Trim/NR1_PE_R1.fq --right Trim/NR1_PE_R2.fq --thread 1 --est_method RSEM --aln_method bowtie2 ±
--output_dir NR1

RSEM output: RSEM generates two output files containing the abundance estimation information andTop lines of files shown below in table.

    RSEM.genes.results : EM read counts on a per-Trinity-gene
    RSEM.isoforms.results : EM read counts per Trinity transcript 
transcript_id(s) gene_id length effective_length expected_count TPM FPKM
TRINITY_DN0_c0_g1_i1 TRINITY_DN0_c0_g1_i1 476.00 280.08 1.00 14.02 16.01
TRINITY_DN10000_c0_g1_i1 TRINITY_DN10000_c0_g1_i1 391.00 195.11 1.00 20.12 22.98
TRINITY_DN10001_c0_g1_i1 TRINITY_DN10001_c0_g1_i1 343.00 147.21 0.00 0.00 0.00
TRINITY_DN10003_c0_g1_i1 TRINITY_DN10003_c0_g1_i1 208.00 22.28 0.00 0.00 0.00
TRINITY_DN10004_c0_g1_i1 TRINITY_DN10004_c0_g1_i1 585.00 389.07 3.00 30.28 34.58
TRINITY_DN10005_c0_g1_i1 TRINITY_DN10005_c0_g1_i1 407.00 211.10 2.00 37.20 42.49
TRINITY_DN10006_c0_g1_i1 TRINITY_DN10006_c0_g1_i1 646.00 450.07 1.00 8.72 9.96
TRINITY_DN10007_c0_g1_i1 TRINITY_DN10007_c0_g1_i1 577.00 381.07 1.00 10.30 11.77
TRINITY_DN10008_c0_g1_i1 TRINITY_DN10008_c0_g1_i1 268.00 73.58 0.00 0.00 0.00
TRINITY_DN10009_c0_g1_i1 TRINITY_DN10009_c0_g1_i1 626.00 430.07 1.00 9.13 10.43

To facilitate running of multiple library, its good to make a script

#!/bin/bash
for f in Trim/*_PE_R1.fq; do file=${f%_*}; sample=`echo $file|sed 's/Trim\///;s/_PE//'`; tail1=_R1.fq; tail2=_R2.fq; nice -n 1 $TRINITY_HOME/util/align_and_estimate_abundance.pl --seqType fq --transcripts Trinity.fasta --est_method RSEM --aln_method bowtie2 --left $file$tail1 --right $file$tail2 --thread 1 --output_dir $sample; done; 


After running this script,you need to change common file name into sample name

for f in TR*/*.isoforms.results ; do  DIR=$(dirname $f); echo $DIR; tail=.isoforms.results; cp $f $DIR$tail; done;
for f in NR*/*.isoforms.results ; do  DIR=$(dirname $f); echo $DIR; tail=.isoforms.results; cp $f $DIR$tail;  done; 

Similarly, you can do it for gene count files

Mapping of all fastq files will take time. But, we can get the rest of the files from the pre-computed example data

cp ~/Workshop/Day5/Count/*.results ./

4.4 Gene and transcript expression matrices of samples

We will make combine gene and transcript count for samples and perform normalization through Trinity toolkit script abundance_estimates_to_matrix.pl. First we list genes/isofroms files through unix command

    ls |grep "genes"|tr "\n" " "

Trinity script support two type normalization i.e. TMM and UpperQuartile. Now we can execute script to generate raw count and normalized matrix bu using either of method.

Essential Parameters:

--est_method: RSEM|eXpress|kallisto|salmon (For exercise only RSEM and salmon)

nice -n 1 $TRINITY_HOME/util/abundance_estimates_to_matrix.pl --est_method RSEM NR1.isoforms.results NR2.isoforms.results NR3.isoforms.results TR1.isoforms.results TR2.isoforms.results TR3.isoforms.results --out_prefix Trans --gene_trans_map Trinity.fasta.gene_trans_map 

Output: Script generate two output files

Raw Count: RSEM.genes.counts.matrix/RSEM.isoforms.counts.matrix

Transcripts NR1 NR2 NR3 TR1 TR2 TR3
TRINITY_DN1959_c0_g3_i1 7.00 3.00 2.00 8.00 5.00 2.00
TRINITY_DN7093_c0_g1_i1 4.00 2.00 1.00 0.00 0.00 0.00
TRINITY_DN7456_c0_g1_i1 1.00 3.00 0.00 2.00 1.00 1.00
TRINITY_DN10426_c0_g1_i1 0.00 0.00 0.00 0.00 1.00 0.00
TRINITY_DN2030_c1_g1_i2 0.00 0.00 0.00 1.00 1.00 0.00
TRINITY_DN17_c0_g1_i1 1.00 1.00 0.00 0.00 0.00 0.00
TRINITY_DN5616_c0_g1_i1 1.00 3.00 3.00 0.00 0.00 0.00
TRINITY_DN6991_c0_g1_i1 0.00 0.00 0.00 1.00 2.00 0.00
TRINITY_DN2064_c5_g12_i1 0.00 1.00 0.00 3.00 2.00 1.00

Normalized Count Matrix: Genes.gene.TMM.EXPR.matrix/Isoforms.isoforms.TMM.EXPR.matrix

Transcripts NR1 NR2 NR3 TR1 TR2 TR3
TRINITY_DN1959_c0_g3_i1 56.394 24.351 17.910 62.554 37.675 16.092
TRINITY_DN7093_c0_g1_i1 93.475 46.972 25.859 0.000 0.000 0.000
TRINITY_DN7456_c0_g1_i1 31.210 93.974 0.000 58.319 28.035 29.835
TRINITY_DN10426_c0_g1_i1 0.000 0.000 0.000 0.000 80.908 0.000
TRINITY_DN2030_c1_g1_i2 0.000 0.000 0.000 31.244 30.023 0.000
TRINITY_DN17_c0_g1_i1 38.320 38.425 0.000 0.000 0.000 0.000
TRINITY_DN5616_c0_g1_i1 35.626 107.199 117.856 0.000 0.000 0.000
TRINITY_DN6991_c0_g1_i1 0.000 0.000 0.000 57.095 109.506 0.000
TRINITY_DN2064_c5_g12_i1 0.000 13.266 0.000 38.027 24.411 13.017

5. Evaluation of biological replicates of samples

RNA sequence generation is chain of steps. So, it is very important to examine any confounding issues caused in whole process. Conceptually, replicates should be more similar to each other than different replicate of other samples because we assuming that the signal in the data is more stronger than noise. Trinity facilitate evaluation of replicate through 'PtR' script which generate various charts and plots based on a matrix input file in pdf format.

5.1 Compare your replicates within sample

Run PtR script to compare your biological replicates of samples:

nice -n 1 $TRINITY_HOME/Analysis/DifferentialExpression/PtR -m Trans.gene.counts.matrix -s samples.txt  --log2 --compare_replicates 

Output: Script will generate two pdf report (Normal.rep_compare.pdf, Tumor.rep_compare.pdf )for each of sample. View each of these pdf-formatted reports

$ scp 
OR 
$ FileZilla

Result:1. Sum of fragments of replicates: A plot showing the sum of mapped fragments

Result:2. MA plots of replicates:Pairwise MA plots (x-axis: mean log(CPM), y-axis log(fold_change))

Result:3. Clustering of replicates:Clustering and heatmap by Pearson correlation score

5.2 Compare your replicates across all samples

PtR script generate a replicate correlation plot which contains all replicates across all samples:

nice -n 1 $TRINITY_HOME/Analysis/DifferentialExpression/PtR -m Trans.gene.counts.matrix  -s samples.txt --log2 --sample_cor_matrix

As we can see in the image, the replicates are more highly correlated within samples than between samples
(Generated pdf Trans.gene.counts.matrix.log2.sample_cor_matrix.pdf)

5.3 Comparing samples through PCA

nice -n 1 $TRINITY_HOME/Analysis/DifferentialExpression/PtR -m Trans.gene.counts.matrix -s samples.txt --log2 --CPM --prin_comp 3

In this PCA image of sampling, the replicates cluster tightly according to sample type, which is very reassuring. (Generated pdf: Trans.gene.counts.matrix.CPM.log2.prcomp.principal_components.pdf)

6. Differential Gene Expression Analysis

Differentially expressed transcripts or genes will be identified through running run_DE_analysis.pl script of Trinity pipeline. Script will perform pairwise comparisons among samples. We will use gene/transcripts.counts.matrix file to analyze transcripts.

Input: 
--method: DESeq2/edgeR
--matrix: Raw count matrix
--samples_file: Name of sample files

nice -n 1 $TRINITY_HOME/Analysis/DifferentialExpression/run_DE_analysis.pl --method DESeq2 --matrix Trans.gene.counts.matrix --samples_file samples.txt 

Output The output DE analysis can be found in the output directory you specified (Default directory name is used method name). File with extension .DE_results contains log fold change and statistical significance of Transcripts

cd DESeq2....
ls 
Trans.gene.counts.matrix.Normal_vs_Tumor.DESeq2.DE_results.MA_n_Volcano.pdf
Trans.gene.counts.matrix.Normal_vs_Tumor.DESeq2.DE_results    

Next step in analyzing differential expression is to extract most differentially expressed transcripts with significant FDR and fold-change parameters and cluster the transcripts according to their patterns of differential expression across the samples.

nice -n 1 $TRINITY_HOME/Analysis/DifferentialExpression/analyze_diff_expr.pl --matrix ../Trans.gene.TMM.EXPR.matrix --samples ../samples.txt -P 0.05 -C 2 

#  -P <float>             p-value cutoff for FDR  (default: 0.001)
#  -C <float>             min abs(log2(a/b)) fold change (default: 2  (meaning 2^(2) or 4-fold).

Output

  1. diffExpr.P0.05_C2.matrix
  2. diffExpr.P0.05_C2.matrix.log2.centered.genes_vs_samples_heatmap.pdf
  3. diffExpr.P0.05_C2.matrix.log2.centered.sample_cor_matrix.pdf

Assignment

Perform differential gene expression analysis at log2foldchange 2 and p-value cut-off 0.05 for below given combinations. Make a report of overlapping DEGs through venny and find the reason for differences.

  1. Abundance estimation by RSEM, cross_sample_norm by TMM and DESeq2
  2. Abundance estimation by RSEM, cross_sample_norm by UpperQuartile and DESeq2
  3. Abundance estimation by RSEM, cross_sample_norm by TMM and edgeR
  4. Abundance estimation by RSEM, cross_sample_norm by UpperQuartile and edgeR

Bibliography

Trinity: https://github.com/trinityrnaseq/trinityrnaseq/wiki
RSEM: https://github.com/deweylab/RSEM/blob/master/README.md
Salmon: http://salmon.readthedocs.io/en/latest/salmon.html
DESeq2: http://bioconductor.org/packages/release/bioc/manuals/DESeq2/man/DESeq2.pdf

============================================================

2. Genome assembly and Evaluation analysis

==========================================================

⚠️ **GitHub.com Fallback** ⚠️