Example: Differential Transcript Usage on a non model organism - Oshlack/Lace GitHub Wiki

Introduction

On this wiki page, the aim is to go through how a differential transcript usage analysis of a non-model organism might look like from start to finish.

If you don't want to go into the gruesome details, here is the short summary of what to do.

Quick Overview:

  1. DeNovo assemble raw reads into contigs (e.g with Trinity)
  2. Cluster contigs (e.g. with Corset or RapClust)
  3. Construct superTranscripts (e.g with Lace)
  4. Map reads to superTranscriptome (e.g. with STAR)
  5. Count reads per exon/block feature (e.g. with featureCounts)
  6. Run DTU analysis on counts.txt (e.g. DexSeq2 or diffSplice in limma packages)

Step 1: Get the data

There are not many public datasets with non-model organisms with alternative transcript usage, so for this example we will take RNA seq of drosophila melanogaster (which has known alternative splicing leading to sex determination). But we will treat it as if the genome is unknown and present the pipeline as if it were a non-model organism. The data and information on the study can be found here. But for this tutorial i simply took 3 random treated and 3 random untreated samples.

Step 2: DeNovo Assemble contigs using Trinity (or some other assembly program)

#Group all samples into one (this allows for a better de novo assembly)
cat *.fastq | sed 's/ HWI/\/1 HWI/g' > all.fastq

#Trinitise
Trinity --max_memory 20G --CPU 16  -seqType fq --single all_1.fastq  --full_cleanup
mv trinity_out_dir.Trinity.fasta Trinity.fasta
rm all.fastq #Remove fastq file 

Step 3: Clustering with Corset (or something else)

Contigs need to be clustered into gene groups before Lace can be used to assemble them into a superTranscript. Corset, Grouper, CD-HIT or Trinity's own clustering could be used for this. Here we describe using corset which was created by the same authors as Lace and it what we used for all results in our paper. To run corset you first need to align the reads to the de novo assembled transcriptome. We recommend doing this with either salmon (fast) or bowtie (accurate):

Running corset with salmon alignment

#build the index
salmon index --index Trinity --transcripts Trinity.fasta  
for FILE in `ls SRR*.fastq` ; do
    salmon quant --index Trinity --libType U --dumpEq -r $FILE --output ${FILE}.out
done

corset -D 9999999 -g 1,1,1,2,2,2 -n A1,A2,A3,B1,B2,B3 -i salmon_eq_classes SRR*out/aux_info/eq_classes.txt

Running corset with bowtie alignment

bowtie-build Trinity.fasta Trinity

for FILE in `ls SRR*.fastq` ; do
    bowtie --all -S Trinity $FILE > ${FILE}.sam  
    samtools view -S -b ${FILE}.sam > ${FILE}.bam
done
corset -D 9999999 -g 1,1,1,2,2,2 -n A1,A2,A3,B1,B2,B3 *.bam

It is important to specify the --dumpEq flag if running salmon, or the --all flag for bowtie. The -D flag is also important for corset if you are performing differential transcript usage analysis, but should be left off if you are interested in superTranscripts for other reasons.

Step 4: Create your superTranscripts using Lace

Here we now have DeNovo assembled contigs, Trinity.fasta from our raw reads as well as a clusters.txt file with information about how the contigs cluster together. These are our inputs to Lace.

python Lace.py Trinity.fasta clusters.txt -t --cores 16 -o FlyLace/

Caveat: As is mentioned in the usage documentation, when dealing with DeNovo assembled transcriptome sometimes Corset clusters too large clusters or trinity constructs monster contigs with several 100 kbps of sequences. This can cause problems for BLAT and lace unless your server has enormous ammounts of RAM. So if you run it to such troubles think about removing the large contig or the cluster with many contigs, since they are more than likely flooded with a deluge of junk.

Step 5: Map reads to superTranscriptome using STAR

Now we need to map reads to the superTranscriptome, again for the examples sake we just do a 1-pass STAR mapping.

#First Index
starindex="STAR_ST_Index"
mkdir $starindex

STAR --runMode genomeGenerate --runThreadN 16 --genomeDir $starindex --genomeFastaFiles FlyLace/SuperDuper.fasta --sjdbGTFfile FlyLace/SuperDuper.gff --sjdbGTFtagExonParentTranscript gene_id


#Now map each file with STAR
outdir="Map_ST"
mkdir $outdir
ls *.fastq | rev | cut -c 7- | rev | uniq > myfiles #Just get prefix to
for i in `cat myfiles`; do
    echo "Performing alignment on file $i"

    STAR --runMode alignReads \
    --twopassMode Basic \
    --runThreadN 20 \
    --outSAMtype BAM SortedByCoordinate \
    --genomeDir $starindex \
    --readFilesIn "$i".fastq \
    --outFileNamePrefix $outdir/"$i" ;
done

Step 6: Construct dynamic blocks annotation [Optional]

As an alternate annotation strategy one can create dynamic blocks which makes a block based on the split reads aligned to the superTranscriptome using a python script packaged with Lace, this script works with Sj.out.tab default output of STAR.

cd Map_ST
cat *SJ.out.tab > SJ.out.tab
python Mobius.py SJ.out.tab ../FlyLace/SuperDuper.fasta

Step 7: Sort and Index bam files [Optional]

You will notice that in the STAR output i explicitly tell STAR to output the bam files as unsorted. This makes the STAR alignment faster as well as the counting of reads per feature (exon/gene) in the next step. However we need to sort and index if we want to view the bams and annotation in IGV.

Step 8: Count reads per feature

So for a differential exon usage analysis we need to count how many reads per block, so we provide our SuperDuper.gff annotation file and specify to count by exon (-f) and handle double counting and reads which overlap multiple blocks (-B -O --fraction):

mkdir Counts
featureCounts -T 20 -p -B -O --fraction -a Spliced.gtf  -f -o Counts/counts.txt *.out.bam

Step 9: Run your DTU pipeline

Now you can run you standard DTU pipeline, i recommend DexSeq or DiffSplice in BioConductor on the counts.txt file. (There is a DexSeq script to do a standard analysis which is supplementary to the superTranscript paper). But here i will use diff Splice which from my experience is much faster but also more conservative, so perhaps explore both. There is alot more filtering/normalisation one can do with this kind of analysis, but as baseline here is my R script to run a rudimentary diff splice analysis.

voom_diff.R:

library(edgeR)
#Read in data
counts <- read.table("counts.txt",header=TRUE,sep="\t")

#Define groups
treatment= c('1','1','1','2','2','2')
sample = c('1','2','3','4','5','6')

#Make DGElist and normalise
dx <- DGEList(counts[,c(7:12)])
dx<-calcNormFactors(dx,group=treatment)

#Make exon id
eid = paste0(counts$Chr,":",counts$Start)

#Define design matrix
design <- model.matrix(~treatment)

#Voom transform
vx <- voom(dx,design)
#Fit with limma
fx <- lmFit(vx,design)

ex <- diffSplice(fx,geneid=counts$Chr,exonid=eid)
res<-topSplice(ex,number="20")
write.table(res[,c("GeneID","P.Value","FDR")],"VoomSplice.txt",col.names = FALSE,row.names=FALSE,quote=FALSE,sep="\t")

#Plot Splice [OPTIONAL]
plotSplice(ex) #Plots the top gene

The above is just one of the top 5 clusters which we observe significant DTU, the red points are the exons which have significant DTU. If one is considering an organism which is similar to another model organism species one could then use BLAT to align the cluster sequence to the model organism genome in order to ascertain which part of the genome or which genes the cluster may be similar too.