Differential Expression - Bioinformatics-Institute/transcriptomics_WBC GitHub Wiki

RNA-seq Flowchart - Module 4

3-ii. Differential Expression

Use Cuffmerge and Cuffdiff to compare the tumor and normal conditions. Refer to the Cufflinks manual for a more detailed explanation:

Cuffmerge basic usage:

cuffmerge [options]* <assembly_GTF_list.txt>
  • "assembly_GTF_list.txt" is a text file "manifest" with a list (one per line) of GTF files that you would like to merge together into a single GTF file.

Extra options specified below:

  • '-p 8' tells cuffmerge to use eight CPUs
  • '-o' tells cuffmerge to write output to a particular directory
  • '-g' tells cuffmerge where to find reference gene annotations. It will use these annotations to gracefully merge novel isoforms (for de novo runs) and known isoforms and maximize overall assembly quality.
  • '-s' tells cuffmerge where to find the reference genome files

Merge all 6 cufflinks results so that they will have the same set of transcripts for comparison purposes

cd $RNAWORKING
mkdir cuffmerge

ls cufflinks/*_R*/transcripts.gtf > cuffmerge/assembly_GTF_list.txt

cuffmerge -p 2 -o cuffmerge -g annotation/genes_chr22_ERCC92.gtf -s fasta/chr22_ERCC92.fa cuffmerge/assembly_GTF_list.txt

Cuffdiff basic usage:

cuffdiff [options] <transcripts.gtf> <sample1_hits.sam> <sample2_hits.sam> [... sampleN_hits.sam]
  • Supply replicate SAMs as comma separated lists for each condition:
  • Example: sample1_rep1.sam,sample1_rep2.sam,...sample1_repM.sam
  • '-p 8' tells cuffdiff to use eight CPUs
  • '-L' tells cuffdiff the labels to use for samples

Create necessary directories:

cd $RNAWORKING
mkdir -p cuffquant

## Generate the cuffquant binary format files for cuffdiff
for sample in HBR UHR
do
    for rep in 1 2 3
    do

    cuffquant -p 2 --library-type fr-firststrand --frag-len-mean 262 --frag-len-std-dev 80 --no-update-check -o cuffquant/${sample}_R${rep} cuffmerge/merged.gtf tophat/${sample}_R${rep}/accepted_hits.bam

    done
done

Perform UHR vs. HBR comparison, using all replicates, for known (reference only mode) transcripts:

cuffdiff -p 2 -L UHR,HBR -o cuffdiff --library-type fr-firststrand --frag-len-mean 262 --frag-len-std-dev 80 --no-update-check cuffmerge/merged.gtf cuffquant/UHR_R1/abundances.cxb,cuffquant/UHR_R2/abundances.cxb,cuffquant/UHR_R3/abundances.cxb cuffquant/HBR_R1/abundances.cxb,cuffquant/HBR_R1/abundances.cxb,cuffquant/HBR_R3/abundances.cxb

edgeR Analysis

In this tutorial you will:

  • Make use of the raw counts you generate above using htseq-count
  • edgeR is a bioconductor package designed specifically for differential expression of count-based RNA-seq data
  • This is an alternative to using cufflinks/cuffmerge/cuffdiff to find differentially expressed genes

First, create a mapping file to go from ENSG IDs (which htseq-count output) to Symbols:

mkdir edgeR
perl -ne 'if ($_=~/gene_id\s\"(ENSG\S+)\"\;\sgene_name\s\"(\S+)\"\;/){print "$1\t$2\n";} elsif ($_=~/gene_id\s\"(ERCC\S+)\"/){print "$1\t$1\n";}' annotation/genes_chr22_ERCC92.gtf | sort | uniq > edgeR/ENSG_ID2Name.txt

Then, launch Rstudio:

rstudio $RNASCRIPTS/edgeR.R

OPTIONAL

What does the raw output from Cuffdiff look like?

cd $RNAWORKING
ls -l cuffdiff
head cuffdiff/isoform_exp.diff
grep -P "gene_id|OK" cuffdiff/isoform_exp.diff | cut -f 2-6,8-10,12 | sort -k 9,9 | less -S

Press 'q' to exit the 'less' display

How many genes are there on this chromosome?

grep -v gene_id cuffdiff/gene_exp.diff | wc -l

How many were detected above 0 in UHR or HBR (take the sum of expression values for both and check for greater than 0)?

grep -v gene_id cuffdiff/gene_exp.diff | perl -ne '@line=split("\t", $_); $sum=$line[7]+$line[8]; if ($sum > 0){print "$sum\n";}' | wc -l

How many differentially expressed genes were found on this chromosome (p-value < 0.05)?

grep -v gene_id cuffdiff/gene_exp.diff | cut -f 12 | perl -ne 'if ($_ < 0.05){print "$_"}' | wc -l

Display the top 20 DE genes. Look at some of those genes in IGV - do they make sense?

grep -P "OK|gene_id" cuffdiff/gene_exp.diff | sort -k 12n,12n | head -n 20 | cut -f 3,5,6,8,9,10,12,13,14

Save all genes with P<0.05 to a new file.

grep -P "OK|gene_id" cuffdiff/gene_exp.diff | sort -k 12n,12n | cut -f 3,5,6,8,9,10,12,13,14 | perl -ne '@data=split("\t", $_); if ($data[6]<=0.05){print;}' > cuffdiff/DE_genes.txt
Previous Section This Section Next Section
Expression Differential Expression DE Visualization