Filtering the RNAs expression matrix of the contrasting genotypes - labbces/sugarcane_RNAome GitHub Wiki

Why is it necessary?

I quantified the samples from all genotypes for each of the selected papers as described here.

Calculating the correlation between pairs of genes is a challenging task when dealing with millions of genes. Every gene has to be compared with every other gene, so for $n$ genes the total number of unordered gene pairs is:

$C(n, 2) = \frac{n \cdot (n-1)}{2}$

The expression matrix I am working with has 3,485,561 genes, resulting in 6,074,565,999,580 pairs :dizzy_face:

$C(3{,}485{,}561,\ 2) = \frac{3{,}485{,}561 \cdot (3{,}485{,}561 - 1)}{2} = 6{,}074{,}565{,}999{,}580$

This makes calculating the correlation between all these pairs computationally ~~impossible~~ challenging. Therefore, it was necessary to apply the strategies below to reduce the size of the matrix before calculating the gene correlation.
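The pair count above can be checked with a few lines of base R. The storage estimate is only an illustration: it assumes one 8-byte double per correlation coefficient, which is my assumption and not a figure from the pipeline.

```r
# Number of unordered gene pairs: C(n, 2) = n * (n - 1) / 2
n_genes <- 3485561
n_pairs <- n_genes * (n_genes - 1) / 2

# Print the full number instead of scientific notation
format(n_pairs, big.mark = ",", scientific = FALSE)

# Rough storage estimate, ASSUMING 8 bytes (one double) per coefficient
bytes_needed <- n_pairs * 8
terabytes_needed <- bytes_needed / 1e12
round(terabytes_needed, 1)
```

Under that assumption, storing one coefficient per pair would already require tens of terabytes, before any computation time is considered.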

Pipeline

I developed the following pipeline in R. The quantification matrix is imported with tximport, using the file with the pan-transcriptome genes (coding genes + ncRNA genes) as tx2gene.

Next, I employed DESeq2 to create a DESeqDataSet (DDS) object using the following function:

dds <- DESeqDataSetFromTximport(txi, colData = samples, design = ~ 1)

Following that, replicates were collapsed using the collapseReplicates function:

ddsColl <- collapseReplicates(dds, dds$Run, dds$Accession)

After this process, from the initial matrix of 14,982,054 transcripts, I retained 3,485,561 genes (coding and non-coding - tx2gene).
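To make the collapsing step concrete, here is a minimal base R sketch of what summing replicate columns looks like on a toy matrix. It only imitates the summing behaviour of collapseReplicates and is not the DESeq2 implementation; the gene, run and sample names are hypothetical.

```r
# Toy count matrix: 3 genes x 4 sequencing runs (filled column by column)
counts <- matrix(c(10, 0, 5,
                   12, 1, 4,
                    3, 0, 0,
                    7, 2, 1),
                 nrow = 3,
                 dimnames = list(c("geneA", "geneB", "geneC"),
                                 c("run1", "run2", "run3", "run4")))

# Hypothetical grouping: which sample each run belongs to
group <- c("sample1", "sample1", "sample2", "sample2")

# rowsum() sums rows by group, so transpose, sum, and transpose back
collapsed <- t(rowsum(t(counts), group))
collapsed
```

The total number of counts is unchanged by this operation; only the number of columns shrinks from runs to samples.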

Removing degraded samples

The pipeline for quantifying transcripts against the pan-transcriptome includes a step that removes sequencing adapters and ribosomal RNA (rRNA) sequences and filters reads by quality (Q>20) before quantification.

[!NOTE] As we can observe in the metadata tables, Hoang, Correr and Perlo, some samples exhibit high abundances of rRNA. Consequently, samples from genotypes that had more than 30% of sequences removed during quality control were excluded.

withoutDegradedSamples_ddsColl <- ddsColl[, ddsColl$X..Trimmed <= 30]
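The same column filter can be tried on a toy example in base R. The sample names and percentages below are made up for illustration, and `pct_trimmed` is a hypothetical stand-in for the `X..Trimmed` metadata column.

```r
# Hypothetical sample metadata: percentage of sequences removed during QC
samples <- data.frame(
  sample = c("s1", "s2", "s3", "s4"),
  pct_trimmed = c(5.2, 41.7, 30.0, 12.9)
)

# Toy count matrix with one column per sample
counts <- matrix(1:12, nrow = 3,
                 dimnames = list(paste0("gene", 1:3), samples$sample))

# Keep samples with at most 30% of sequences removed
keep_samples <- samples$pct_trimmed <= 30
filtered <- counts[, keep_samples, drop = FALSE]
colnames(filtered)
```

Note that a sample at exactly 30% is kept, which matches the rule of excluding only samples with more than 30% removed.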

Removing rows (genes) with 100% zeros (no expression in any condition)

Rows (genes) with 100% zeros were removed:

# *** Calculate the proportion of zeros in each row ***
zero_prop <- rowSums(assay(withoutDegradedSamples_ddsColl) == 0) / ncol(assay(withoutDegradedSamples_ddsColl))

# *** Set a threshold for zeros ***
threshold <- 1

# *** Keep rows that are not 100% zeros ***
keep <- zero_prop < threshold
withoutDegradedSamplesAndZeros_ddsColl <- withoutDegradedSamples_ddsColl[keep,]
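A small base R example shows how the zero-proportion filter behaves. The matrix is invented for illustration, and the stricter 0.5 threshold at the end is a hypothetical alternative that is not used in this pipeline.

```r
# Toy count matrix: geneA is all zeros, geneB is half zeros, geneC has none
counts <- matrix(c(0, 0, 0, 0,
                   0, 3, 0, 1,
                   8, 2, 5, 9),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("geneA", "geneB", "geneC"),
                                 paste0("s", 1:4)))

# Proportion of zeros per gene (rowMeans of a logical matrix)
zero_prop <- rowMeans(counts == 0)

# Threshold of 1: drop only genes that are zero in every sample
keep <- zero_prop < 1
filtered <- counts[keep, , drop = FALSE]
rownames(filtered)

# Hypothetical stricter rule: drop genes that are zero in half or more samples
keep_strict <- zero_prop < 0.5
rownames(counts)[keep_strict]
```

With a threshold of 1, a gene expressed in a single sample is still retained, which is one reason the matrices remain large after this step.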

[!NOTE] Due to the size of the dataset we are working with, even after removing genes with zero expression, we still have a substantial number of genes in the expression matrix:

Hoang2017 dataset (1,684,750 coding and non-coding genes):

withoutDegradedSamplesAndZeros_ddsColl

class: DESeqDataSet 
dim: 1684750 15

Correr2020 dataset (2,219,820 coding and non-coding genes):

withoutDegradedSamplesAndZeros_ddsColl

class: DESeqDataSet 
dim: 2219820 12

Perlo2022 dataset (2,985,198 coding and non-coding genes):

withoutDegradedSamplesAndZeros_ddsColl

class: DESeqDataSet 
dim: 2985198 63

Applying Variance Stabilizing Transformation (VST) on the counts

When analyzing RNA-seq data, two crucial properties to consider are the presence of extreme values and the dependence of the variance on the mean (heteroscedasticity). A logarithmic transformation or a variance-stabilizing transformation (VST) is commonly used to address both, since either one tempers the extreme values in the skewed distribution of RNA-seq counts.

Heteroscedasticity is observed in RNA-seq data, where genes with higher average expression exhibit larger observed variances across samples, indicating varying expression levels from sample to sample. This phenomenon is visualized through per-gene standard deviation plotted against the rank of average expression.

The figure below shows the standard deviation of the transformed data, across samples, against the mean, using the raw counts, the shifted logarithm transformation (log2(n + 1)) and the variance stabilizing transformation.

The shifted logarithm has elevated standard deviation in the lower count range, while for the variance stabilized data the standard deviation is roughly constant along the whole dynamic range.
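The mean-variance dependency can be reproduced with a small simulation in base R. This is only a sketch on simulated data, not the sugarcane counts: it assumes negative binomial counts with a fixed, arbitrarily chosen dispersion, and three hypothetical genes with low, medium and high mean expression.

```r
set.seed(1)

n_samples <- 50
mu <- c(low = 10, mid = 100, high = 1000)  # hypothetical mean expression levels
size <- 10                                 # ASSUMED NB size (dispersion = 1 / size)

# One simulated gene per mean level: rows are genes, columns are samples
sim <- t(sapply(mu, function(m) rnbinom(n_samples, mu = m, size = size)))

# Per-gene standard deviation on raw counts and on the shifted logarithm
raw_sd <- apply(sim, 1, sd)
log_sd <- apply(log2(sim + 1), 1, sd)

raw_sd
log_sd

# How much the SD varies between the lowest and highest values
max(raw_sd) / min(raw_sd)
max(log_sd) / min(log_sd)
```

On the raw counts the standard deviation grows with the mean, while after log2(n + 1) the spread between genes is much smaller. The simulation does not run a VST itself; it only illustrates why some transformation is needed before computing correlations.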

We chose to apply the variance-stabilizing transformation to the counts after filtering degraded samples and removing zeros, using the following code:

# *** Adding a pseudocount (+1) to counts ***
pseudocount <- 1
dds_counts <- counts(withoutDegradedSamplesAndZeros_ddsColl)
dds_counts_pseudo <- dds_counts + pseudocount

# *** Creating a new DESeqDataSet object with adjusted counts ***
dds_pseudo <- DESeqDataSetFromMatrix(countData = dds_counts_pseudo,
                                     colData = colData(withoutDegradedSamplesAndZeros_ddsColl),
                                     design = ~ 1)

# *** Applying VST to DESeqDataSet object ***
dds_vst <- varianceStabilizingTransformation(dds_pseudo)