Should I remove PCR duplication for RNA seq analysis - bcb420-2023/Jielin_Yang GitHub Wiki
Date: 2023-03-05
Should I remove PCR duplication for RNA-seq analysis?
Although it does not seem to be a major problem for most high-quality RNA-seq data, I have several times observed a considerable level of duplicated reads in my RNA-seq data during pre-alignment quality control. The concern with PCR duplication arises from how gene counting works: we count the number of reads that map to a gene, and these reads are most likely different fragments of the same transcript, produced by random fragmentation during library preparation. However, the very last step of library preparation is PCR amplification, which increases the amount of material in the library by copying each fragment multiple times. Sequencing then essentially takes a random sample of the amplified library. When several copies of the same amplified fragment are sampled, that fragment is counted multiple times once alignment is performed. How do we know whether a read count reflects the true number of fragments or is a result of technical bias?
Interestingly, a 2016 study by Parekh et al. employed three different library preparation protocols and tested whether computational methods can effectively distinguish PCR duplicates from genuine biological duplicates (likely due to high expression). The results showed that, because computational methods only have access to the sequences of the reads, a method based on genome coordinates cannot restrict itself to duplicates that arise solely from PCR amplification. Instead of increasing power, removing duplicates worsened both the power of the analysis and the type I error rate. This is because, although PCR amplification is a major source of duplicated reads, it is not the only contributing factor. Removing duplicates therefore likely decreases the complexity of the data and, with it, the power of the analysis.
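To make this limitation concrete, here is a minimal Python sketch of coordinate-based duplicate removal. The read records, the `molecule` field, and the function `dedup_by_coordinate` are hypothetical and only for illustration; they are not taken from Parekh et al. or from any specific tool. The `molecule` label is ground truth that a real deduplication tool never gets to see.

```python
def dedup_by_coordinate(reads):
    """Keep the first read seen at each (chrom, start, strand) position."""
    seen = set()
    kept = []
    for read in reads:
        key = (read["chrom"], read["start"], read["strand"])
        if key not in seen:
            seen.add(key)
            kept.append(read)
    return kept


# Toy data (assumed): three reads map to the same position of a highly
# expressed gene. Molecules "A" and "B" are independent transcripts that
# happened to fragment at the same site; the second "A" read is a PCR copy.
reads = [
    {"chrom": "chr1", "start": 100, "strand": "+", "molecule": "A"},
    {"chrom": "chr1", "start": 100, "strand": "+", "molecule": "A"},
    {"chrom": "chr1", "start": 100, "strand": "+", "molecule": "B"},
    {"chrom": "chr1", "start": 250, "strand": "+", "molecule": "C"},
]

kept = dedup_by_coordinate(reads)
true_molecules = {read["molecule"] for read in reads}

print("reads before dedup:", len(reads))
print("reads after dedup: ", len(kept))
print("true molecules:    ", len(true_molecules))
```

In this toy case the coordinate key removes the PCR copy of molecule A, but it also discards molecule B, a genuine biological duplicate, so the deduplicated count undershoots the true number of molecules.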
But is there a method that specifically identifies PCR duplicates? Yes: one approach, now commonly used for single-cell RNA-seq and for bulk RNA-seq with high sequencing depth, is to add a unique molecular identifier (UMI) to the 5' end of each read. The UMI serves as a molecular barcode. Because it is added early in library preparation, directly to the minimally processed starting RNA and before amplification, it separates PCR duplicates from genuine biological duplicates: reads with the same mapping position and the same UMI can be treated as PCR copies of one molecule, whereas reads with the same position but different UMIs come from distinct molecules. However, this method complicates library preparation and does not seem to substantially increase the power of differential expression analysis. Presumably, removing duplicates is not necessary for differential gene expression analysis that compares two conditions within the same experimental setup. Rather, duplicates are more likely to matter when comparing absolute expression levels between genes, where gene counts are affected not only by gene length but also by how PCR acts on transcript fragments of different relative abundance.
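The idea of UMI-based deduplication can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the read records, the `umi` field, and the function `dedup_by_umi` are hypothetical, and UMIs are compared by exact match only.

```python
def dedup_by_umi(reads):
    """Keep one read per (chrom, start, strand, umi) combination."""
    seen = set()
    kept = []
    for read in reads:
        key = (read["chrom"], read["start"], read["strand"], read["umi"])
        if key not in seen:
            seen.add(key)
            kept.append(read)
    return kept


# Toy data (assumed): the UMI was attached before amplification, so PCR
# copies inherit the UMI of the molecule they were copied from.
umi_reads = [
    {"chrom": "chr1", "start": 100, "strand": "+", "umi": "ACGT"},  # molecule 1
    {"chrom": "chr1", "start": 100, "strand": "+", "umi": "ACGT"},  # PCR copy of molecule 1
    {"chrom": "chr1", "start": 100, "strand": "+", "umi": "TTAG"},  # molecule 2, same position
    {"chrom": "chr1", "start": 250, "strand": "+", "umi": "ACGT"},  # molecule 3, other position
]

kept_umi = dedup_by_umi(umi_reads)
print("reads before dedup:", len(umi_reads))
print("reads after dedup: ", len(kept_umi))
```

Only the PCR copy is removed here; the two molecules that share a mapping position are both retained because their UMIs differ. A production workflow would also need to tolerate sequencing errors within the UMI itself, which this exact-match sketch ignores.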
The problem of PCR duplication essentially comes back to the complexity of the library. With a small amount of starting material, PCR increases the presence of transcripts that already have a high abundance. When a random subset of the over-amplified library is sequenced, those transcripts are more likely to be sampled than transcripts of low abundance. PCR duplication is therefore not a major problem in terms of increasing the false discovery of differentially expressed genes. Rather, a high level of duplication reduces the chance of discovering genes that are differentially expressed yet present at low abundance. When analyzing RNA-seq data, it is thus important to consider the quantitative metric of duplication rate together with the bench-top process of RNA and library preparation, in order to judge how PCR duplication affects the identification of differentially expressed genes of low abundance.
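The link between library complexity and duplication rate can be illustrated with a small simulation. This is a rough sketch, not a model of any real protocol: the function `duplication_rate` and all numbers are hypothetical, and it assumes that every starting molecule is amplified equally and to such a high copy number that sequencing is equivalent to sampling molecules with replacement. The duplication rate is computed as one minus the fraction of reads that come from distinct molecules.

```python
import random


def duplication_rate(n_molecules, n_reads, seed=0):
    """Fraction of duplicated reads when sampling an over-amplified library.

    Assumption: uniform, saturating amplification, so each read is drawn
    from the starting molecules with replacement.
    """
    rng = random.Random(seed)
    sampled = [rng.randrange(n_molecules) for _ in range(n_reads)]
    unique = len(set(sampled))
    return 1 - unique / n_reads


# Same sequencing depth, different amounts of starting material (assumed values).
high_input = duplication_rate(n_molecules=10_000, n_reads=2_000)
low_input = duplication_rate(n_molecules=500, n_reads=2_000)

print(f"duplication rate, high-input library: {high_input:.2f}")
print(f"duplication rate, low-input library:  {low_input:.2f}")
```

At the same depth, the low-input library yields far more duplicated reads, because most reads resample molecules that have already been seen. Those reads add no new information, which is why rare transcripts are less likely to be captured.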
References
DNA Technologies & Expression Analysis Core Laboratory. Should I remove PCR duplicates from my RNA-seq data? [Internet]. [cited 2023 Mar 5]. Available from: https://dnatech.genomecenter.ucdavis.edu/faqs/should-i-remove-pcr-duplicates-from-my-rna-seq-data/
Parekh S, Ziegenhain C, Vieth B, Enard W, Hellmann I. The impact of amplification on differential expression analyses by RNA-seq. Sci Rep. 2016;6:25533. Published 2016 May 9. doi:10.1038/srep25533
Bansal V. A computational method for estimating the PCR duplication rate in DNA and RNA-seq experiments. BMC Bioinformatics. 2017;18(Suppl 3):43. Published 2017 Mar 14. doi:10.1186/s12859-017-1471-9
Fu Y, Wu PH, Beane T, Zamore PD, Weng Z. Elimination of PCR duplicates in RNA-seq and small RNA-seq using unique molecular identifiers. BMC Genomics. 2018;19(1):531. Published 2018 Jul 13. doi:10.1186/s12864-018-4933-1