5. Differential Expression analyses - Sara-SL/GenomeAnalysis GitHub Wiki

Method

Since there was a problem with running the Maker pipeline we got the instructions to start working on the DEseq analysis with the original scaffold (sel4_NW_015503979.fna) and the original annotation (GCF_001595765.1_Mnat.v1_genomic.gff). To do the DEseq I first needed to run Htseq. For this I used the accepted_hits.bam file (tophat output file) that I got before running Trinity and the original annotation. The Tophat output file was thereby based on the original scaffold and all reads from all samples. I run Htseq using a batch script with the following command

htseq-count -f bam -r pos -i ID -s yes --samout=htseq_out /home/sarasl/git/GenomeAnalysis/3_rna_assembly/tophat_out/accepted_hits.bam /home/sarasl/git/GenomeAnalysis/Data/GCF_001595765.1_Mnat.v1_genomic.gff

With this command I got an error saying "Sequence of paired-end alignments expected, but got single-end alignment." so I separated the paired end and single end files using these commands

samtools view -bf 1 /home/sarasl/git/GenomeAnalysis/3_rna_assembly/tophat_out/accepted_hits.bam > accepted_hits.paired-end.bam

samtools view -bF 1 /home/sarasl/git/GenomeAnalysis/3_rna_assembly/tophat_out/accepted_hits.bam > accepted_hits.single-end.bam

I looked at the output and saw that the paired end by far were the majority, so I just ignored the single end reads and run Htseq again with the accepted_hits.paired-end.bam. From the slurm file I could see how many no-feature counts I had but not the feature count. Therefore I created a python script counting the sum of feature counts. Since that generated a decent number that was good enough for the annotation I had, I moved on to the next step.

Next step was to run Htseq for each replicate separately since that would improve the statistical robustness when using them as replicates in DEseq2. I created a batch script containing all 18 tophat runs needed for this. The output was then used in a new batch script running Htseq for each replicate. I only used the paired end reads when running Htseq for each replicate.

Last but not least I created an R script, reading the feature count files from Htseq(slurm files) and running DEseq2. I used this webpage as guidance to generate the R script. Since Htseq only outputs the gene ID I also used the original annotation to extract the gene name for each gene ID. In this way I found which genes that were differently expressed in forelimb vs hindlimb.

Result

From DEseq2 analysis I got 22 genes that had significant high differential expression between forelimb and hindlimb (p-value cutoff 0,05). The genes can be seen in the table 1 below. In this table one can see that 9 genes was down-regulated in the forelimb compared to hindlimb while 13 where up-regulated. With the same p-value cutoff as in the paper, 0,01, I only found 12 genes.

Table 1

table

In the figures below each dot represent one replicate. In figure 1 we see that there was a clear down regulation of the gene TBX4 (id4452) in forelimb in all samples. Looking at figure 2 we see that even though there was an over all significant differential expression combining all replicates there wasn't a clear difference in all samples. With this in mind I plotted all the 22 gene IDs and only 4 had a clear separation for all samples. All 4 belong to the same gene name: TBX4.

Figure 1

good separation

Figure 2

bad separation

Discussion

TBX4 is a transcription factor involved in DNA binding and transcriptional regulation. This gene has previously been shown to be differentially expressed in bats[1-3] and was also found in the paper among other genes involved in DNA binding and transcriptional regulation. This implies that the gene most likely have a function in the differential development between forelimb and hindlimb. For further analysis it could therefore be interesting to look at the pathway of this gene in order to get a more deeper understanding of the development of bat limbs.

In the plots where the clusters were not clearly separated one would need some extra analysis to see if this was due to the different development stages or just artifacts. In general this dataset was generated from restricted sample size, biological variation and gross tissue sampling which might have reduced the power of some of the analyses. Moreover, this data analysis was only based on a subset of the original dataset which explains why I only found 12 genes that were differently expressed in forelimb and hindlimb while the paper found 2452 genes(with the same p-value cutoff).

References

  1. Wang, Z. et al. Unique expression patterns of multiple key genes associated with the evolution of mammalian flight. Proc. Biol. Soc. 281, 20133133 (2014).
  2. Mason, M.K. et al. Retinoic acid–independent expression of Meis2 during autopod patterning in the developing bat and mouse limb. Evodevo 6, 6 (2015).
  3. Wang, Z. et al. Digital gene expression tag profiling of bat digits provides robust candidates contributing to wing formation. BMC Genomics 11, 619 (2010).