Example pipeline

Here we provide a full example of going from RNA-Seq reads to gene-level differential expression results. I'm going to use the yeast example from our paper, assembled with Trinity, because it's small enough to run quickly, but you can replace the file names with whatever data you have.

You can download the yeast data from here. You will need to download files SRR453566 to SRR453571. SRA provides tools to convert from their format to others here. Use:

fastq-dump --split-3 SRR453566.sra

on each file to get paired-end fastq files.
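
For example, a short loop over all six accessions (a sketch, assuming bash and the .sra files in the current directory):

for SRA in SRR4535{66..71}.sra ; do
        fastq-dump --split-3 $SRA
done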

Prepare the reads

It's worthwhile cleaning the reads before running the de novo assembler. I like to use Trimmomatic to trim off any ends which have low quality scores. It's fast and handles paired-end reads well.

trimmomatic PE -phred33 SRR453566_1.fastq SRR453566_2.fastq SRR453566_P1.fastq SRR453566_U1.fastq SRR453566_P2.fastq SRR453566_U2.fastq LEADING:20 TRAILING:20 MINLEN:50

P here means paired and U means unpaired. We will ignore the unpaired reads for the remainder of this tutorial.
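
To clean all six samples, you can wrap the same command in a loop (a sketch, assuming the file naming from the fastq-dump step above):

for F in SRR4535{66..71} ; do
        trimmomatic PE -phred33 ${F}_1.fastq ${F}_2.fastq ${F}_P1.fastq ${F}_U1.fastq ${F}_P2.fastq ${F}_U2.fastq LEADING:20 TRAILING:20 MINLEN:50
done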

Perform the de novo assembly

In this example we will use Trinity to assemble the transcripts. You will have more power to detect transcripts if all the samples are pooled, therefore the first step is to concatenate the fastq files together:

cat *_P1.fastq | sed 's/ HWI/\/1 HWI/g' > all_1.fastq
cat *_P2.fastq | sed 's/ HWI/\/2 HWI/g' > all_2.fastq

The middle part of this code fixes the read IDs for Trinity by adding a /1 to the end of the first end's name and a /2 to the end of the second end's name. This is a hack that I discovered is necessary for running Trinity on SRA data. Without fixing the read IDs, Trinity will lose the paired-end information. For Illumina HiSeq reads, I've found that Trinity works fine without doing this. You can check that the read ends are being interpreted correctly by looking at the read IDs in the both.fa file that Trinity creates. If something has gone wrong you'll see an ID like ">SRR453566.19/H", where the "H" should clearly be a "1" or "2".
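
If you want to check the fix worked before launching Trinity, inspect the first read name (a quick check, assuming SRA-style read IDs as above):

head -n 1 all_1.fastq
# expect something like @SRR453566.1/1 HWI-..., with /1 inserted before the space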

Now run Trinity:

<path to Trinity>/Trinity.pl --JM 10G --seqType fq --left all_1.fastq --right all_2.fastq

Note that this might take several hours to run. The result should be a fasta file of transcripts called Trinity.fasta.
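
As a quick sanity check (assuming Trinity's default output name), you can count how many transcripts were assembled:

grep -c ">" Trinity.fasta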

A couple of tricks in getting Trinity to work (these are combined into a single example command after this list):

  • I have always needed to set ulimit -s unlimited
  • I have often needed to increase the memory used by Jellyfish, e.g. --JM 100G
  • I like to use the --full_cleanup option so I don't get hundreds of big files left over.
  • Speed up the assembly by running in parallel, e.g. --CPU 16. However, sometimes when you do this you'll find that the job fails at the end because too many processes want memory. If this happens, rerun Trinity with a smaller number of CPUs. It should restart where it stopped.
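
Putting those tips together, a typical invocation might look like the following (a sketch; adjust the memory and CPU numbers to suit your machine):

ulimit -s unlimited
<path to Trinity>/Trinity.pl --JM 100G --CPU 16 --full_cleanup --seqType fq --left all_1.fastq --right all_2.fastq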

Map reads back to the transcriptome

Now you will need to align the reads back to the assembled transcriptome. You may use either salmon equivalence classes for this (the fastest option) or any regular aligner which produces bam files, such as bowtie or bowtie2. Note that a splice-aware aligner is neither necessary nor advised.

Using salmon

Salmon is a fast tool for quantifying transcript abundances. For the purposes of corset (from version 1.07 up), we can use one of salmon's outputs, called equivalence classes, to cluster transcripts and provide gene-level abundances. To run salmon for corset, start by indexing the assembled transcriptome:

salmon index --index Trinity --transcripts Trinity.fasta

Then quantify the reads (it's very important to specify the --dumpEq flag, so the equivalence classes are output):

FILES=`ls SRR*_P1.fastq | sed 's/_P1.fastq//g'`
for F in $FILES ; do
        R1=${F}_P1.fastq
        R2=${F}_P2.fastq
        salmon quant --index Trinity --libType A --dumpEq --hardFilter --skipQuant -1 $R1 -2 $R2 --output ${F}.out
done

The relevant files will be called eq_classes.txt.gz in the subdirectory aux_info. You will need to unzip the eq_classes.txt file if it is gzipped. You may want to adjust the salmon options for your own dataset, for example the library type or the number of threads. However, you should ensure that range factorization is switched off (e.g. with --hardFilter) or the eq_classes.txt files will not be compatible with corset.
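
For example, to unzip them all in one go (assuming the output directory names produced by the loop above):

gunzip SRR*.out/aux_info/eq_classes.txt.gz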

Using bowtie

When using other aligners, you will need to multi-map the reads back to the transcriptome. In this example I will use bowtie. The bowtie option --all means that all alignments are reported. If you have a large dataset, you might like to limit the number of reported alignments to a large (but finite) number, as this will be faster and the resulting bam files will be smaller. In bowtie you could do this with -k 40 (40 reported alignments).
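
For example, a -k version of the per-sample mapping command used below might look like this (a sketch for a single sample):

bowtie -k 40 -S Trinity -1 SRR453566_P1.fastq -2 SRR453566_P2.fastq > SRR453566.sam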

In the paper, and in this example, we map the reads back as paired-end, meaning that if only one read end maps to a transcript and not both, neither is reported. There may be some disadvantages to this, but we have not experimented with it. Feel free to try different mapping options, such as single-end, and let us know if it makes a difference!

Build the bowtie index:

bowtie-build Trinity.fasta Trinity

Map each file to the transcriptome. Here is a short bash script for doing that:

FILES=`ls SRR*_P1.fastq | sed 's/_P1.fastq//g'`
for F in $FILES ; do
        R1=${F}_P1.fastq
        R2=${F}_P2.fastq
        bowtie --all -S Trinity -1 $R1 -2 $R2 > ${F}.sam  
        samtools view -S -b ${F}.sam > ${F}.bam
done

Speed this up by running bowtie with multiple threads, e.g. using the flag -p 16.

You might want to delete the .sam files at this point if they take up too much space.
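
Alternatively, you can avoid writing the intermediate .sam files altogether by piping bowtie's output straight into samtools (a sketch of the same loop as above):

FILES=`ls SRR*_P1.fastq | sed 's/_P1.fastq//g'`
for F in $FILES ; do
        bowtie --all -p 16 -S Trinity -1 ${F}_P1.fastq -2 ${F}_P2.fastq | samtools view -S -b - > ${F}.bam
done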

Run corset

In our example, the first three sample files (in numeric order) belong to one experimental group (batch growing conditions) and the final three belong to another (chemostat growing conditions). We provide this information to Corset to improve its power when splitting differentially expressed paralogs (the -g option below gives these groupings). We also provide a name, A1..B3, for each of the six samples, which will appear in the header of the counts file. -n is an optional argument, and if -g is left off, corset assumes that every file belongs to a separate group. For bam files, run corset like so:

corset -g 1,1,1,2,2,2 -n A1,A2,A3,B1,B2,B3 *.bam

or, if we aligned using salmon:

corset -g 1,1,1,2,2,2 -n A1,A2,A3,B1,B2,B3 -i salmon_eq_classes SRR*/aux_info/eq_classes.txt

Note that salmon input files provide equivalence class information (which tells us how many fragments align to a set of transcripts). Corset can also output equivalence class information, given one or more bam files (from version 1.05 up), like so:

for FILE in `ls *.bam` ; do
   corset -r true-stop $FILE &
done
wait

This will start six corset jobs concurrently. Each job will process one bam file and output a file named <original bam name>.corset-reads. The jobs will stop before any clustering or counting is performed. The wait command pauses the script until all the background jobs have finished.

The corset-reads files can then be processed for clustering and counting in a similar manner to the salmon files:

corset -g 1,1,1,2,2,2 -n A1,A2,A3,B1,B2,B3 -i corset *.corset-reads

Running corset on .bam and then .corset-reads files like this (rather than in a single command) has a couple of advantages. Firstly, it makes it possible to parallelise the part of corset that reads in bam files, because each file can be processed concurrently (corset doesn't support multiple threads). Secondly, it makes re-running corset much faster if necessary, because equivalence class files are much quicker for corset to read in than bam files.

As a side note, if you plan on assembling superTranscripts for each of your clusters so that you can look for differential transcript usage etc., you should run corset with the -D parameter set high. For example:

corset -D 99999999999 -g 1,1,1,2,2,2 -n A1,A2,A3,B1,B2,B3 -i corset *.corset-reads

This will ensure that transcripts within a gene with differential transcript usage are never split into different clusters. Then follow Step 4 onwards from the Lace instructions at https://github.com/Oshlack/Lace/wiki/Example%3A-Differential-Transcript-Usage-on-a-non-model-organism#step-4-create-your-supertranscripts-using-lace.

Run edgeR

edgeR is a Bioconductor package in R which will be used to test the count data for significant differential expression. The snippet of code below shows how you can get a list of the top differentially expressed clusters.

library(edgeR)

count_data<-read.delim("counts.txt",row.names=1)
group<-factor(c(1,1,1,2,2,2))

dge<-DGEList(counts=count_data,group=group)

#Optional: Exclude rows with fewer than 1 read per million in at least 3 samples
#and re-populate dge with filtered records + update library sizes:
keep <- rowSums(cpm(dge)>1) >= 3
dge <- dge[keep,,keep.lib.sizes=FALSE]

dge<-calcNormFactors(dge)
dge<-estimateCommonDisp(dge)
dge<-estimateTagwiseDisp(dge)

results<-exactTest(dge)
topTags(results)

Annotation

Clusters of interest can then be examined further and annotated.

Adam Taranto has written a nice python script to help with this (fetchClusterSeqs.py). It will take the output from corset (clusters.txt), the assembly fasta file, and a file containing a list of clusters of interest, and generate a new fasta file containing only the contigs of interest (with their cluster ID appended to the contig ID).

So, for example, you could output all the differentially expressed clusters found using edgeR. In R:

allTags<-topTags(results,n=dim(results$table)[1])[[1]]
deClusters=rownames(allTags)[allTags$FDR<0.05]
write.table(deClusters,"clusters_of_interest.txt",quote=F,row.names=F,col.names=F)

Extract the contig sequences (assuming the python script is in the same directory). On the command line:

./fetchClusterSeqs.py -i Trinity.fasta -t clusters_of_interest.txt -o contigs_of_interest.fasta -c clusters.txt

and then run Blast2GO on the file contigs_of_interest.fasta to work out what the genes are.

Note: You may have greater success in annotating these transcripts if you first predict an optimal ORF (based on homology to related species or conserved protein domains) using a tool such as TransDecoder.
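
For example, a minimal TransDecoder run might look like this (a sketch; see the TransDecoder documentation for the homology-based options):

TransDecoder.LongOrfs -t contigs_of_interest.fasta
TransDecoder.Predict -t contigs_of_interest.fasta --single_best_only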

Bpipe Pipeline

bpipe is a nice tool for managing pipelines, and Simon Sadedin has provided an example of running the steps above through bpipe: https://code.google.com/p/bpipe/wiki/RNASeqCorset