Assignment 1 - shrijas/UjjwalNeogiLab GitHub Wiki

Aim: to identify the transcripts that are differentially expressed between the wildtype and the mutant, given 3 samples of each.

Introduction

Transcripts from wildtype and mutant samples may be expressed at different levels, and this can be due to various factors. To find the differentially expressed transcripts, we need to install the required R packages and process the Salmon quantification output.
Done using the following:

Because many transcripts can arise from a single gene, it is useful to associate each transcript with its gene so that a gene-level count table can be built easily.
After creating a vector of paths to the quantification files, the samples are read in and each transcript is associated with a gene ID for summarisation at gene level. The data are then loaded with tximport.
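The gene-level summarisation described above can be sketched in a few lines. This is only an illustration of the summing step, written in plain Python rather than R; the transcript IDs, gene IDs, counts, and the helper name `summarise_to_gene` are all invented for the example. tximport itself does more than this (it also handles abundances and transcript lengths).

```python
from collections import defaultdict

# Hypothetical transcript-to-gene map; the IDs are invented for illustration.
tx2gene = {
    "tx1": "geneA",
    "tx2": "geneA",
    "tx3": "geneB",
}

# Hypothetical estimated counts per transcript for two samples.
tx_counts = {
    "tx1": [10.0, 4.0],
    "tx2": [5.0, 0.0],
    "tx3": [0.0, 7.0],
}


def summarise_to_gene(tx_counts, tx2gene):
    """Sum transcript-level counts into one row per gene."""
    n_samples = len(next(iter(tx_counts.values())))
    gene_counts = defaultdict(lambda: [0.0] * n_samples)
    for tx, counts in tx_counts.items():
        gene = tx2gene.get(tx)
        if gene is None:
            continue  # transcripts missing from the map are skipped
        row = gene_counts[gene]
        for i, c in enumerate(counts):
            row[i] += c
    return dict(gene_counts)


gene_counts = summarise_to_gene(tx_counts, tx2gene)
print(gene_counts)
```

Here the two transcripts of `geneA` collapse into a single row, which is the shape of table that the later counting and filtering steps work on.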

Note that two kinds of data are available: the extdata, which contains the annotation files, and the test data (which we have at the moment). Both are brought into RStudio. Any information on the code is given here.

Methods

Here I describe the methods used (through an R script) for the differential expression analysis of transcripts. The idea behind DESeq2 is simple: it helps us find the differentially expressed genes/transcripts. How?
Suppose we want to compare two groups (as in this example). We build a model for the observed counts. The count data are presented as a table that gives the number of sequencing reads/fragments assigned to each gene. Zero counts are a possible source of error, because the log2 of zero is minus infinity, so the log2 transformation has to account for them (commonly by adding a small pseudocount first). The log2 transformation is applied to the counts per gene.
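A minimal sketch of the zero-count problem and one common workaround, written in plain Python for illustration. The pseudocount of 1 and the helper name `log2_counts` are assumptions made for the example, not values taken from the assignment script.

```python
import math


def log2_counts(counts, pseudocount=1.0):
    """Return log2(count + pseudocount) for each count.

    Adding the pseudocount keeps zero counts finite, since log2(0)
    is not defined. The default of 1.0 is an assumed value.
    """
    return [math.log2(c + pseudocount) for c in counts]


raw = [0, 1, 3, 1023]           # hypothetical counts for one gene
transformed = log2_counts(raw)  # zero maps to 0.0 instead of failing
print(transformed)
```

With a pseudocount of 1, a zero count maps to 0 on the log2 scale, so undetected genes stay in the table without producing infinite values.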

The first step was loading the given data, which contained both the mutant and the wildtype samples in .sf format (6 files in total).
After that, the files were named, and tximport and readr were loaded with library() to import the files from local storage into RStudio. We then counted the number of samples/abundances in each class.
It was important to know the dimensions of the data (209026 transcripts across the 6 samples), because after the various filtering and normalisation steps the number of transcripts would be reduced to a much smaller, significant set.
Summary statistics of the counts were also calculated to see the deviation and how the data were distributed.
A histogram of the counts was then plotted.
A log2 transformation was applied to the counts per gene.
Undetected genes were eliminated, since weakly expressed genes may have counts so low that they are effectively undetected. This is done by counting the genes that fall below a threshold and then deleting them from the count table.
The percentage of genes with a null count in each sample was calculated and drawn as a barplot.
Normalisation was done by modelling the read counts with a negative binomial distribution.
The mean-variance relationship was calculated.
A volcano plot was drawn to find the differentially expressed transcripts.
The differentially expressed genes were identified, with the most significant and the most repressed transcripts selected using the padj value (FDR-adjusted p-value) and other criteria.
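
Three of the steps above (the percentage of null counts per sample, removal of undetected genes, and FDR adjustment of p-values) can be sketched in plain Python. This is an illustration under stated assumptions, not the R script used for the assignment: the gene names, counts, p-values, the `min_total` threshold, and the 0.05 cutoff are all invented for the example. The adjustment shown is the Benjamini-Hochberg procedure, which is a standard FDR method and is assumed here to be the one behind padj.

```python
def percent_zero_per_sample(counts):
    """Percentage of genes with a zero count, computed per sample."""
    rows = list(counts.values())
    n_genes = len(rows)
    n_samples = len(rows[0])
    return [
        100.0 * sum(1 for r in rows if r[j] == 0) / n_genes
        for j in range(n_samples)
    ]


def drop_undetected(counts, min_total=1):
    """Keep genes whose total count reaches min_total (assumed threshold)."""
    return {g: r for g, r in counts.items() if sum(r) >= min_total}


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical count table: gene -> counts in four samples.
counts = {
    "g1": [0, 0, 0, 0],
    "g2": [5, 0, 3, 1],
    "g3": [10, 12, 0, 9],
    "g4": [0, 0, 2, 0],
}
zero_pct = percent_zero_per_sample(counts)  # values for the barplot
detected = drop_undetected(counts)          # g1 is removed

# Hypothetical raw p-values, one per tested gene.
pvals = [0.01, 0.04, 0.03, 0.5]
padj = benjamini_hochberg(pvals)
significant = [i for i, p in enumerate(padj) if p < 0.05]  # assumed cutoff
print(zero_pct, sorted(detected), padj, significant)
```

Filtering before testing reduces the number of hypotheses, which matters because the adjustment scales each p-value by the number of tests.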