Filtering the RNAs expression matrix of the contrasting genotypes - labbces/sugarcane_RNAome GitHub Wiki
Why is it necessary?
I quantified the samples from all genotypes for each of the selected papers as described here.
Calculating the correlation between pairs of genes is a challenging task when dealing with millions of genes. The calculation involves combinations of pairs, where the total number of gene pairs is represented by:
$C(n, 2) = \frac{n \cdot (n-1)}{2}$
The expression matrix I am working with has 3,485,561
genes, resulting in 6,074,565,999,580
pairs :dizzy_face:
$C(3,485,561, 2) = \frac{3,485,561 \cdot (3,485,561 -1)}{2} = 6,074,565,999,580$
This makes calculating the correlation between all these pairs computationally impossible challenging. Therefore, it was necessary to apply the strategies below to reduce the size of the matrix before calculating the gene correlation.
Pipeline
I developed the following pipeline in R to import the quantification matrix using tximport
and utilized the file with the pan-transcriptome groups (clustering by similarity of coding RNAs + clustering by similarity of non-coding RNAs) as tx2gene
.
From now on, these groups will be referred to as genes.
Next, I employed DESeq2
to create a DESeqDataSet (DDS) object
using the following function:
dds <- DESeqDataSetFromTximport(txi, colData = samples, design = ~ 1)
Following that, replicates were collapsed using the collapseReplicates
function:
ddsColl <- collapseReplicates(dds, dds$Run, dds$Accession)
After this process, from the initial matrix of 14,982,054
transcripts, I retained 3,485,561
genes (coding and non-coding - tx2gene).
Removing degraded samples
The pipeline for quantifying transcripts against the pan-transcriptome includes a step to remove sequencing adapters, ribosomal RNA (rRNA) sequences, and filter transcripts for quality (Q>20) before quantification.
As we can observe in the metadata tables, Hoang, Correr and Perlo, some samples exhibit high abundances of rRNA. Consequently, samples from genotypes that had more than 30% of sequences removed during quality control were excluded.
withoutDegradedSamples_ddsColl <- ddsColl[, ddsColl$X..Trimmed <= 30]
Removing rows (genes) 100% zeros (no expression betweenn conditions)
Rows (genes) with 100% zeros were removed:
# *** Calculate the proportion of zeros in each row ***
zero_prop <- rowSums(assay(withoutDegradedSamples_ddsColl) == 0) / ncol(assay(withoutDegradedSamples_ddsColl))
# *** Set a threshold for zeros ***
threshold <- 1
# *** Select rows with 100% zeros ***
keep <- zero_prop < threshold
withoutDegradedSamplesAndZeros_ddsColl <- withoutDegradedSamples_ddsColl[keep,]
Due to the size of the dataset we are working with, even after removing genes with zero expression, we still have a substantial number of genes in the expression matrix:
Hoang2017 dataset (1,684,750 coding and non-coding genes):
withoutDegradedSamplesAndZeros_ddsColl
class: DESeqDataSet
dim: 1684750 15
Correr2020 dataset (2,219,820 coding and non-coding genes):
withoutDegradedSamplesAndZeros_ddsColl
class: DESeqDataSet
dim: 2219820 12
Pelo2022 dataset (2,985,198 coding and non-coding genes):
withoutDegradedSamplesAndZeros_ddsColl
class: DESeqDataSet
dim: 2985198 63
Applying Variance Stabilizing Transformation (VST) on the counts
When analyzing RNA-seq data, two crucial properties to consider are the presence of extreme values and the mean-variance dependency (heteroscedasticity). The use of logarithmic transformation or variance-stabilizing transformation (VST) is common to address these issues. Log and VST methods help mitigate extreme values in the skewed distribution of RNA-seq data.
Heteroscedasticity is observed in RNA-seq data, where genes with higher average expression exhibit larger observed variances across samples, indicating varying expression levels from sample to sample. This phenomenon is visualized through per-gene standard deviation plotted against the rank of average expression.
The figure below shows the standard deviation of the transformed data, across samples, against the mean, using the raw counts, the shifted logarithm transformation (log2(n + 1)) and the variance stabilizing transformation.
The shifted logarithm has elevated standard deviation in the lower count range, while for the variance stabilized data the standard deviation is roughly constant along the whole dynamic range.
We chose to apply the variance-stabilizing transformation to the counts after filtering degraded samples and removing zeros, using the following code:
# *** Adding a pseudocount (+1) to counts ***
pseudocount <- 1
dds_counts <- counts(withoutDegradedSamplesAndZeros_ddsColl)
dds_counts_pseudo <- dds_counts + pseudocount
# *** Creating a new DESeqDataSet object with adjusted counts ***
dds_pseudo <- DESeqDataSetFromMatrix(countData = dds_counts_pseudo,
colData = colData(withoutDegradedSamplesAndZeros_ddsColl),
design = ~ 1)
# *** Applying VST to DESeqDataSet object ***
dds_vst <- varianceStabilizingTransformation(dds_pseudo)