Quantify gene expression with RSEM - Persilian/WGCNA GitHub Wiki

Once you have RNAseq read files and checked their quality, it’s time to map the reads to a reference genome of your choice and quantify gene expression. There are several ways how to map reads and quantify gene expression, here we will use bowtie2 to map reads and RSEM to quantify gene expression.

Generate RSEM reference

Generate an RSEM reference before running RSEM. Use rsem-prepare-reference with the adequate options fitting your resources. If you have a genome fasta available, you can use a gff3 annotation file to directly generate the reference. If you have a fasta file containing only transcripts of your genome, you will have to generate a gene-to-transcript file first.

Generate a gene-to-transcript file that specifies which transcripts belong to which genes. The gene-to-transcript file has to be a two-column file separated by a tab of the form:

gene_id \t transcript_id

For example:

AT1G01010 \t AT1G01010.1

If you have produced a gff3 file from your genome assembly, you can directly provide it in rsem-calculate-expression, or you may use the script GFF2gene2trans.sh to generate a gene-to-transcript file. If you have produced a .gtf file from your genome assembly, you may use the script gtf2gene2trans.sh to generate this gene-to-transcript file.

Quantify gene expression

Using the reference index you have created, you can now quantify gene expression with your RNAseq read files for example like so:

Single-end reads

rsem-calculate-expression --bowtie2 --num-threads 8 --no-bam-output \
path/to/read_files \
path/to/reference-name \
path/to/sample-name

For example using an array job script on a cluster.

RSEM="rsem-calculate-expression --bowtie2 --num-threads 8 --no-bam-output \
${fastqs}/${sam[$idx]}.fastq.gz \
~/path/to/RSEM-reference \
${sam}"

Submit the array job script together with the .json file like so:

sbatch submit_array_job.sh submit_array_job.json

Paired-end reads

rsem-calculate-expression --bowtie2 --num-threads 8 --paired-end --strandedness reverse \
upstream_read_file(s) \
downstream_read_file(s) \
reference-name \
sample-name

For example using an array job script on a cluster. For Illumina paired-end reads mind the adequate option of "--strandedness reverse".

RSEM="rsem-calculate-expression --bowtie2 --num-threads 8 --paired-end --strandedness reverse \
${fastqs}/${sam[$idx]}.fq.1.gz \
${fastqs}/${sam[$idx]}.fq.2.gz \
~/path/to/RSEM-reference \
${sam}"

Submit the array job script together with the .json file like so:

sbatch submit_array_job.sh submit_array_job.json

Generate TPM and TMM-normalized gene expression matrices using the TRINITY framework.

For some reason specifying a "gene_trans_map" here will prevent outputting a TMM-normalized expression matrix. A TMM-normalized expression matrix is the input for the weighted gene co-expression network analysis pipeline. Execute the code below in the RSEM-output directory that contains the .genes.results files.
Note: You can create TPM and TMM-normalized expression matrices using the .isoforms.results files, if you want to investigate differentially expressed splice variants along with differentially expressed genes (DEGs). Be aware that your reference genome has to be of good quality to reliably model splicing variants. Moreover, there will usually be much more DE-isoforms than DEGs and if you want to reduce the complexity of your data, more is not better.

abundance_estimates_to_matrix.pl --est_method RSEM \
--gene_trans_map none \
--cross_sample_norm TMM \
--out_prefix name_your_output \
sample1.genes.results \
sample2.genes.results \
sample3.genes.results

Optionally you can do a differential gene expression analysis using the TRINITY framework.

Generate a "samples.txt" that states which samples (RNAseq libraries) belong to which treatment, of the form:

control \t control1
control \t control2
cold \t cold1
cold \t cold2

Generate a "treatment_comparison_file.txt" (contrasts), where you state the comparisons you would like to make for differential gene expression. The file is a simple tab-delimited file of the form:

cold \t control
heat \t control

If you don't provide a "contrasts-file", all possible comparisons will be made. Including cold-heat, which is often not of interest.
Note: that the direction of the contrasts matters. That is, cold-control will yield positive log-fold-changes (logFCs) for genes upregulated in cold, while negative logFCs will represent genes down-regulated in cold. If you reverse that order in the contrasts file, the sign of the logFCs will also be reversed!

Use these files to run differential gene expression analysis like so:

run_DE_analysis.pl \
--matrix ~/path/to/RSEM.genes.counts.matrix \
--method edgeR \
--samples_file ~/path/to/list/of/samples/samples.txt \
--output ~/path/to/DE_analysis_output \
--contrasts ~/path/to/treatment_comparison_file.txt

Extract differentially expressed genes

Do this in the directory where the results from the "run_DE_analysis.pl" are located. The files you need are labled with your output prefix and look something like this: WGCNA_course.genes.counts.matrix.cold_vs_control.DESeq2.DE_results.

Extract >2-fold (-C 1) differentially expressed genes at an adjusted p-value of <0.001 (-P 1e-3).

analyze_diff_expr.pl \
--matrix ~/path/to/RSEM.genes.TMM.EXPR.matrix \
--samples ~/path/to/list/of/samples/samples.txt \
-P 1e-3 -C 1 \
--output name_of_your_output