The following steps are involved:

Data download
Quality assessment and control (optional: not need for exercise)
de novo RNASeq assembly and Evaluation
Genome assembly and Evaluation analysis

Long read Assembly Exercise

Long_read_assembly.pdf

Short read Assembly Exercise

ShortRead_Gnome_Assembly_Final.pdf

Transcriptome Assembly

==============================================

Transcriptome_Assembly_Final.pdf

About the Tutorial

This tutorial is based on the Trinity toolkit script and programs along with other software and tools. For the exercise, a small part of the original dataset have been used. So, participants have the opportunity to complete an assembly without using whole dataset. But, already QC and de novo assembly have been performed for exercise to save time.

Availability of data

Pair end Illumina data for exercise can be download from here: Download (CB:rnaseq-tutorial)

1. RNAseq data download

Download the input data to your tutorial working directory by using following unix commands

$ mkdir Day5 && cd Day5 && mkdir Data && 
$  cp ~/Workshop/Day5/Data/*gz Data/

May be, it is better to change file name for convenience

mkdir Raw
cp Data/hcc1395_normal_rep1_r1.fastq.gz Raw/NR1_r1.fastq.gz
cp Data/hcc1395_normal_rep1_r2.fastq.gz Raw/NR1_r2.fastq.gz
cp Data/hcc1395_normal_rep2_r1.fastq.gz Raw/NR2_r1.fastq.gz
cp Data/hcc1395_normal_rep2_r2.fastq.gz Raw/NR2_r2.fastq.gz
cp Data/hcc1395_normal_rep3_r1.fastq.gz Raw/NR3_r1.fastq.gz
cp Data/hcc1395_normal_rep3_r2.fastq.gz Raw/NR3_r2.fastq.gz
cp Data/hcc1395_tumor_rep1_r1.fastq.gz Raw/TR1_r1.fastq.gz
cp Data/hcc1395_tumor_rep1_r2.fastq.gz Raw/TR1_r2.fastq.gz
cp Data/hcc1395_tumor_rep2_r1.fastq.gz Raw/TR2_r1.fastq.gz
cp Data/hcc1395_tumor_rep2_r2.fastq.gz Raw/TR2_r2.fastq.gz
cp Data/hcc1395_tumor_rep3_r1.fastq.gz Raw/TR3_r1.fastq.gz
cp Data/hcc1395_tumor_rep3_r2.fastq.gz Raw/TR3_r2.fastq.gz

It can be difficult when you have large number of files for processing. So it is better to make a script for reproducible work: Rename.sh

#!/bin/bash
mkdir Raw
for f in Data/*.fastq.gz \
do  file=`echo ${f}|sed 's/Data\///;s/hcc1395_//;s/normal_/N/;s/rep/R/;s/tumor_/T/'|tr "\n" " "`;\
cp ${f} Raw/${file} ;\
done;
## Deleting two replicate to reduce computational time. But complete results are already computed
rm Raw/NR3* Raw/NR2* Raw/TR3* Raw/TR2*;

2. Quality assessment and control (Optional exercise: not needed for exercise)

Run FastQC on your fastq files before and after trimming: Before choose ur nice number and stick with this number for whole workshop

mkdir -p FastQC/FastQC_Raw
nice -n 1 find Raw -name '*.gz' | xargs fastqc -o FastQC/FastQC_Raw/ -t 1
ls FastQC/FastQC_Raw

Trimmomatic will be used for read trimming and quality control: stick with your nice number

# Creating new folder for trimming files
mkdir Trim

# Runing trimming on reads
find Raw/ -name "*.gz"|grep "_r1"|sort|while read READ_FW; do FILEBASE=$(basename "${READ_FW%.gz}"); Sample=$(echo $FILEBASE|sed 's/_.*//' ); echo $READ_FW $FILEBASE $Sample; \
nice -n 1 java -jar /bioinfo/Trimmomatic-0.38/trimmomatic-0.38.jar PE -phred33 Raw/$Sample"_r1.fastq.gz" Raw/$Sample"_r2.fastq.gz" Trim/$Sample"_PE_R1.fq" Trim/$Sample"_SE_R1.fq" Trim/$Sample"_PE_R2.fq" Trim/$Sample"_SE_R2.fq"  \
ILLUMINACLIP:/bioinfo/Trimmomatic-0.38/adapters/TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36; done 

# Remove single end reads after trimming from Trim folder
rm Trim/*SE* 

# Runing fastqc on trim reads
mkdir -p FastQC/FastQC_Trim
nice -n 1 find Trim -name '*.fq' |xargs fastqc -o FastQC/FastQC_Trim/ -t 1
ls FastQC/FastQC_Trim

3. Compilation of QA and QC report through multiQC tool

mkdir -p MultiQC/MultiQC_Raw
mkdir -p MultiQC/MultiQC_Trim

nice -n 1 multiqc FastQC/FastQC_Raw/ -o MultiQC/MultiQC_Raw
nice -n 1 multiqc FastQC/FastQC_Trim/ -o MultiQC/MultiQC_Trim/

## Copy ur multiqc raw and trim files from remote server to ur desktop
1. Through scp command
2. Through filezilla 

Assignment: make a brief summary report for your performed QC analysis 
Note: it is recommend to make bash script for QA and QC for reproducible work: QC.sh

4. De novo assembly through Trinity (Optional exercise: Not needed for exercise)

Trinity is one of the RNASeq assembler; designed specifically for transcriptome assembly. Trinity combines three independent software modules: Inchworm, Chrysalis, and Butterfly,applied sequentially to process large volumes of RNAseq reads. Trinity assembles a transcriptome by first extending individual RNA-seq reads into longer contigs, building many de Bruijn graphs from these contigs, and then deriving all the splicing-isoform-representing paths in each graph. One of the good things about the Trinity toolkit is that it allows post assembly analysis i.e. assembly evaluation, quantification and differential expression analysis. Trinity latest version can be found here

Make a transcriptome assembly through extracted illumina reads with trinityrnaseq-Trinity-v2.6.5. Lower and higher version may be have slightly different processing options.


nice -n 1 $TRINITY_HOME/Trinity --seqType fq \
--left Trim/NR1_PE_R1.fq Trim/NR1_PE_R2.fq Trim/NR2_PE_R1.fq Trim/NR3_PE_R1.fq Trim/TR1_PE_R1.fq Trim/TR1_PE_R2.fq Trim/TR2_PE_R1.fq Trim/TR3_PE_R1.fq \
--right Trim/NR1_PE_R2.fq Trim/NR2_PE_R1.fq Trim/NR2_PE_R2.fq Trim/NR3_PE_R2.fq Trim/TR1_PE_R2.fq Trim/TR2_PE_R1.fq Trim/TR2_PE_R2.fq Trim/TR3_PE_R2.fq \
--CPU 1 --max_memory 2G --full_cleanup

OR
cat Trim/*PE_R1.fq >Left.fq
cat Trim/*PE_R2.fq >Right.fq
nice -n 1 $TRINITY_HOME/Trinity --seqType fq --CPU 1 --max_memory 2G --full_cleanup

Note: In this tutorial, we will use only one replicate of normal and tumour tissues (NR1 and TR1) with CPU=1 and --max_memory 2G.
Please remember your nice number and stick with this before submission of job

5. Transcript abundance estimation

For transcriptome abundance estimation, Trinity include alignment-based methods (reads aligned to the transcript assembly i.e. RSEM and eXpress) and alignment-free methods ( k-mer based abundances i.e.kallisto and salmon). Good thing is that Trinity comes with a script to facilitate running your choice tools to quantify transcript abundance.

If you don't get executable in path, export the path of bowtie2, RSEM, kallisto and Salmon

export PATH=$PATH:/bioinfo/bowtie2-2.3.4.3-linux-x86_64/ export PATH=$PATH:/bioinfo/RSEM-1.3.1/ export PATH=$PATH:/bioinfo/salmon-0.11.3-linux_x86_64/bin/ export PATH=$PATH:/bioinfo/kallisto_linux-v0.45.0/

5.1 Copy Trinity assembly and gene-transcript mapping file from Day5 folder

cp ~/Workshop/Day5/Assembly/Trinity.fasta* ./

Prepare a tab-delimited text file indicating biological replicate and file relationships i.e. samples.txt

Condition	Replicate	FR	RR
Normal	NR1	NR1_r1.fastq.gz	NR1_r2.fastq.gz
Normal	NR2	NR2_r1.fastq.gz	NR2_r2.fastq.gz
Normal	NR3	NR3_r1.fastq.gz	NR3_r2.fastq.gz
Tumor	TR1	TR1_r1.fastq.gz	TR1_r2.fastq.gz
Tumor	TR2	TR2_r1.fastq.gz	TR2_r2.fastq.gz
Tumor	TR3	TR3_r1.fastq.gz	TR3_r2.fastq.gz

Note: In this tutorial, one replicate only.

5.2 Prepare reference for alignment and read count

nice -n 1 $TRINITY_HOME/util/align_and_estimate_abundance.pl --transcripts Trinity.fasta \
--est_method RSEM --aln_method bowtie2 --trinity_mode --prep_reference --thread_count 1

4.3 Run read mapping and abundance estimation

Essential Input
--transcripts: Genome/transcript fasta file
--seqType:fq|fa

If Paired-end:
--left: FR.fastq
--right: RR.fastq
OR
Single-end
--single: single.fastq

--samples_file:samples.txt

--est_method:
alignment_based:  RSEM|eXpress       
alignment_free: kallisto|salmon


you can explore other options of script through --help

$TRINITY_HOME/util/align_and_estimate_abundance.pl --help

Abundance estimation using RSEM: RSEM uses strategy of expectation maximization to assign reads according to maximum likelihood estimates. The script will use Bowtie aligner to align reads, and RSEM will estimate expression values for each transcripts individually for each replicate.

nice -n 1 $TRINITY_HOME/util/align_and_estimate_abundance.pl --transcripts Trinity.fasta --seqType fq \
--left Trim/NR1_PE_R1.fq --right Trim/NR1_PE_R2.fq --thread 1 --est_method RSEM --aln_method bowtie2 ±
--output_dir NR1

RSEM output: RSEM generates two output files containing the abundance estimation information andTop lines of files shown below in table.

    RSEM.genes.results : EM read counts on a per-Trinity-gene
    RSEM.isoforms.results : EM read counts per Trinity transcript

transcript_id(s)	gene_id	length	effective_length	expected_count	TPM	FPKM
TRINITY_DN0_c0_g1_i1	TRINITY_DN0_c0_g1_i1	476.00	280.08	1.00	14.02	16.01
TRINITY_DN10000_c0_g1_i1	TRINITY_DN10000_c0_g1_i1	391.00	195.11	1.00	20.12	22.98
TRINITY_DN10001_c0_g1_i1	TRINITY_DN10001_c0_g1_i1	343.00	147.21	0.00	0.00	0.00
TRINITY_DN10003_c0_g1_i1	TRINITY_DN10003_c0_g1_i1	208.00	22.28	0.00	0.00	0.00
TRINITY_DN10004_c0_g1_i1	TRINITY_DN10004_c0_g1_i1	585.00	389.07	3.00	30.28	34.58
TRINITY_DN10005_c0_g1_i1	TRINITY_DN10005_c0_g1_i1	407.00	211.10	2.00	37.20	42.49
TRINITY_DN10006_c0_g1_i1	TRINITY_DN10006_c0_g1_i1	646.00	450.07	1.00	8.72	9.96
TRINITY_DN10007_c0_g1_i1	TRINITY_DN10007_c0_g1_i1	577.00	381.07	1.00	10.30	11.77
TRINITY_DN10008_c0_g1_i1	TRINITY_DN10008_c0_g1_i1	268.00	73.58	0.00	0.00	0.00
TRINITY_DN10009_c0_g1_i1	TRINITY_DN10009_c0_g1_i1	626.00	430.07	1.00	9.13	10.43

To facilitate running of multiple library, its good to make a script

#!/bin/bash
for f in Trim/*_PE_R1.fq; do file=${f%_*}; sample=`echo $file|sed 's/Trim\///;s/_PE//'`; tail1=_R1.fq; tail2=_R2.fq; nice -n 1 $TRINITY_HOME/util/align_and_estimate_abundance.pl --seqType fq --transcripts Trinity.fasta --est_method RSEM --aln_method bowtie2 --left $file$tail1 --right $file$tail2 --thread 1 --output_dir $sample; done;

After running this script,you need to change common file name into sample name

for f in TR*/*.isoforms.results ; do  DIR=$(dirname $f); echo $DIR; tail=.isoforms.results; cp $f $DIR$tail; done;
for f in NR*/*.isoforms.results ; do  DIR=$(dirname $f); echo $DIR; tail=.isoforms.results; cp $f $DIR$tail;  done;

Similarly, you can do it for gene count files

Mapping of all fastq files will take time. But, we can get the rest of the files from the pre-computed example data

cp ~/Workshop/Day5/Count/*.results ./

4.4 Gene and transcript expression matrices of samples

We will make combine gene and transcript count for samples and perform normalization through Trinity toolkit script abundance_estimates_to_matrix.pl. First we list genes/isofroms files through unix command

    ls |grep "genes"|tr "\n" " "

Trinity script support two type normalization i.e. TMM and UpperQuartile. Now we can execute script to generate raw count and normalized matrix bu using either of method.

Essential Parameters:

--est_method: RSEM|eXpress|kallisto|salmon (For exercise only RSEM and salmon)

nice -n 1 $TRINITY_HOME/util/abundance_estimates_to_matrix.pl --est_method RSEM NR1.isoforms.results NR2.isoforms.results NR3.isoforms.results TR1.isoforms.results TR2.isoforms.results TR3.isoforms.results --out_prefix Trans --gene_trans_map Trinity.fasta.gene_trans_map

Output: Script generate two output files

Raw Count: RSEM.genes.counts.matrix/RSEM.isoforms.counts.matrix

Transcripts	NR1	NR2	NR3	TR1	TR2	TR3
TRINITY_DN1959_c0_g3_i1	7.00	3.00	2.00	8.00	5.00	2.00
TRINITY_DN7093_c0_g1_i1	4.00	2.00	1.00	0.00	0.00	0.00
TRINITY_DN7456_c0_g1_i1	1.00	3.00	0.00	2.00	1.00	1.00
TRINITY_DN10426_c0_g1_i1	0.00	0.00	0.00	0.00	1.00	0.00
TRINITY_DN2030_c1_g1_i2	0.00	0.00	0.00	1.00	1.00	0.00
TRINITY_DN17_c0_g1_i1	1.00	1.00	0.00	0.00	0.00	0.00
TRINITY_DN5616_c0_g1_i1	1.00	3.00	3.00	0.00	0.00	0.00
TRINITY_DN6991_c0_g1_i1	0.00	0.00	0.00	1.00	2.00	0.00
TRINITY_DN2064_c5_g12_i1	0.00	1.00	0.00	3.00	2.00	1.00

Normalized Count Matrix: Genes.gene.TMM.EXPR.matrix/Isoforms.isoforms.TMM.EXPR.matrix

Transcripts	NR1	NR2	NR3	TR1	TR2	TR3
TRINITY_DN1959_c0_g3_i1	56.394	24.351	17.910	62.554	37.675	16.092
TRINITY_DN7093_c0_g1_i1	93.475	46.972	25.859	0.000	0.000	0.000
TRINITY_DN7456_c0_g1_i1	31.210	93.974	0.000	58.319	28.035	29.835
TRINITY_DN10426_c0_g1_i1	0.000	0.000	0.000	0.000	80.908	0.000
TRINITY_DN2030_c1_g1_i2	0.000	0.000	0.000	31.244	30.023	0.000
TRINITY_DN17_c0_g1_i1	38.320	38.425	0.000	0.000	0.000	0.000
TRINITY_DN5616_c0_g1_i1	35.626	107.199	117.856	0.000	0.000	0.000
TRINITY_DN6991_c0_g1_i1	0.000	0.000	0.000	57.095	109.506	0.000
TRINITY_DN2064_c5_g12_i1	0.000	13.266	0.000	38.027	24.411	13.017

5. Evaluation of biological replicates of samples

RNA sequence generation is chain of steps. So, it is very important to examine any confounding issues caused in whole process. Conceptually, replicates should be more similar to each other than different replicate of other samples because we assuming that the signal in the data is more stronger than noise. Trinity facilitate evaluation of replicate through 'PtR' script which generate various charts and plots based on a matrix input file in pdf format.

5.1 Compare your replicates within sample

Run PtR script to compare your biological replicates of samples:

nice -n 1 $TRINITY_HOME/Analysis/DifferentialExpression/PtR -m Trans.gene.counts.matrix -s samples.txt  --log2 --compare_replicates

Output: Script will generate two pdf report (Normal.rep_compare.pdf, Tumor.rep_compare.pdf )for each of sample. View each of these pdf-formatted reports

$ scp 
OR 
$ FileZilla

Result:1. Sum of fragments of replicates: A plot showing the sum of mapped fragments

Result:2. MA plots of replicates:Pairwise MA plots (x-axis: mean log(CPM), y-axis log(fold_change))

Result:3. Clustering of replicates:Clustering and heatmap by Pearson correlation score

5.2 Compare your replicates across all samples

PtR script generate a replicate correlation plot which contains all replicates across all samples:

nice -n 1 $TRINITY_HOME/Analysis/DifferentialExpression/PtR -m Trans.gene.counts.matrix  -s samples.txt --log2 --sample_cor_matrix

As we can see in the image, the replicates are more highly correlated within samples than between samples
(Generated pdf Trans.gene.counts.matrix.log2.sample_cor_matrix.pdf)

5.3 Comparing samples through PCA

nice -n 1 $TRINITY_HOME/Analysis/DifferentialExpression/PtR -m Trans.gene.counts.matrix -s samples.txt --log2 --CPM --prin_comp 3

In this PCA image of sampling, the replicates cluster tightly according to sample type, which is very reassuring. (Generated pdf: Trans.gene.counts.matrix.CPM.log2.prcomp.principal_components.pdf)

6. Differential Gene Expression Analysis

Differentially expressed transcripts or genes will be identified through running run_DE_analysis.pl script of Trinity pipeline. Script will perform pairwise comparisons among samples. We will use gene/transcripts.counts.matrix file to analyze transcripts.

Input: 
--method: DESeq2/edgeR
--matrix: Raw count matrix
--samples_file: Name of sample files

nice -n 1 $TRINITY_HOME/Analysis/DifferentialExpression/run_DE_analysis.pl --method DESeq2 --matrix Trans.gene.counts.matrix --samples_file samples.txt

Output The output DE analysis can be found in the output directory you specified (Default directory name is used method name). File with extension .DE_results contains log fold change and statistical significance of Transcripts

cd DESeq2....
ls 
Trans.gene.counts.matrix.Normal_vs_Tumor.DESeq2.DE_results.MA_n_Volcano.pdf
Trans.gene.counts.matrix.Normal_vs_Tumor.DESeq2.DE_results

Next step in analyzing differential expression is to extract most differentially expressed transcripts with significant FDR and fold-change parameters and cluster the transcripts according to their patterns of differential expression across the samples.

nice -n 1 $TRINITY_HOME/Analysis/DifferentialExpression/analyze_diff_expr.pl --matrix ../Trans.gene.TMM.EXPR.matrix --samples ../samples.txt -P 0.05 -C 2 

#  -P <float>             p-value cutoff for FDR  (default: 0.001)
#  -C <float>             min abs(log2(a/b)) fold change (default: 2  (meaning 2^(2) or 4-fold).

Output

diffExpr.P0.05_C2.matrix
diffExpr.P0.05_C2.matrix.log2.centered.genes_vs_samples_heatmap.pdf
diffExpr.P0.05_C2.matrix.log2.centered.sample_cor_matrix.pdf

Assignment

Perform differential gene expression analysis at log2foldchange 2 and p-value cut-off 0.05 for below given combinations. Make a report of overlapping DEGs through venny and find the reason for differences.

Abundance estimation by RSEM, cross_sample_norm by TMM and DESeq2
Abundance estimation by RSEM, cross_sample_norm by UpperQuartile and DESeq2
Abundance estimation by RSEM, cross_sample_norm by TMM and edgeR
Abundance estimation by RSEM, cross_sample_norm by UpperQuartile and edgeR

Bibliography

Trinity: https://github.com/trinityrnaseq/trinityrnaseq/wiki
RSEM: https://github.com/deweylab/RSEM/blob/master/README.md
Salmon: http://salmon.readthedocs.io/en/latest/salmon.html
DESeq2: http://bioconductor.org/packages/release/bioc/manuals/DESeq2/man/DESeq2.pdf

============================================================

2. Genome assembly and Evaluation analysis

==========================================================

4A. Day 3. Genome and transcriptome assembly methods and strategies - bioinfokushwaha/Livestock_Genomics GitHub Wiki

The following steps are involved:

Long read Assembly Exercise

Short read Assembly Exercise

ShortRead_Gnome_Assembly_Final.pdf

Transcriptome Assembly

About the Tutorial

Availability of data

1. RNAseq data download

2. Quality assessment and control (Optional exercise: not needed for exercise)

3. Compilation of QA and QC report through multiQC tool

4. De novo assembly through Trinity (Optional exercise: Not needed for exercise)

5. Transcript abundance estimation

5.1 Copy Trinity assembly and gene-transcript mapping file from Day5 folder

5.2 Prepare reference for alignment and read count

4.3 Run read mapping and abundance estimation

4.4 Gene and transcript expression matrices of samples

5. Evaluation of biological replicates of samples

5.1 Compare your replicates within sample

Result:1. Sum of fragments of replicates: A plot showing the sum of mapped fragments

Result:2. MA plots of replicates:Pairwise MA plots (x-axis: mean log(CPM), y-axis log(fold_change))

Result:3. Clustering of replicates:Clustering and heatmap by Pearson correlation score

5.2 Compare your replicates across all samples

5.3 Comparing samples through PCA

6. Differential Gene Expression Analysis

2. Genome assembly and Evaluation analysis

Use the following credentials to access the internet and NIAB server

⚠️ GitHub.com Fallback ⚠️

4A. Day 3. Genome and transcriptome assembly methods and strategies - bioinfokushwaha/Livestock_Genomics GitHub Wiki

The following steps are involved:

Long read Assembly Exercise

Short read Assembly Exercise

ShortRead_Gnome_Assembly_Final.pdf

Transcriptome Assembly

About the Tutorial

Availability of data

1. RNAseq data download

2. Quality assessment and control (Optional exercise: not needed for exercise)

3. Compilation of QA and QC report through multiQC tool

4. De novo assembly through Trinity (Optional exercise: Not needed for exercise)

5. Transcript abundance estimation

5.1 Copy Trinity assembly and gene-transcript mapping file from Day5 folder

5.2 Prepare reference for alignment and read count

4.3 Run read mapping and abundance estimation

4.4 Gene and transcript expression matrices of samples

5. Evaluation of biological replicates of samples

5.1 Compare your replicates within sample

Result:1. Sum of fragments of replicates: A plot showing the sum of mapped fragments

Result:2. MA plots of replicates:Pairwise MA plots (x-axis: mean log(CPM), y-axis log(fold_change))

Result:3. Clustering of replicates:Clustering and heatmap by Pearson correlation score

5.2 Compare your replicates across all samples

5.3 Comparing samples through PCA

6. Differential Gene Expression Analysis

2. Genome assembly and Evaluation analysis

Use the following credentials to access the internet and NIAB server

⚠️ **GitHub.com Fallback** ⚠️

⚠️ GitHub.com Fallback ⚠️