Standardization of Gene and Gene Product Nomenclature - bcb420-2023/Jielin_Yang GitHub Wiki

Date: 2023-02-20

AC002456: Does it mean anything to you?

Without any context, this is simply a combination of letters and digits that carries no meaning for a human reader. Now, what if I tell you that this string represents a gene? Can you infer anything from it? I bet the answer is still no. The first thing I do when I see such a letter-digit combination is search for it online, identify which database it is associated with, and, more importantly, find out which gene it refers to. That this is the instinct of anyone who works in bioinformatics is a clear indication of how critical it is to have uniquely identified gene IDs and a standardized nomenclature for gene names that can be mapped to these IDs in a consistent manner.

However, the process of mapping gene IDs to gene names is not as straightforward as it seems. A recent study reported that over 30% of published studies with supplemental gene lists contain errors in gene names (Abeysooriya et al., 2021). Most of these errors will not directly affect the understanding of the biological processes, since they are commonly associated with lowly expressed or non-differentially expressed genes, but they can still be a source of confusion and can lead to incorrect conclusions. In addition, the curation of genes and gene products is a time-consuming process and is typically done by several databases independently because of funding and resource constraints. As I investigated some of the common databases, including NCBI and the European Nucleotide Archive (ENA), I found that their gene records are mostly provided by the community of researchers. As such, once the submitting researcher modifies some of the information associated with a record, the system usually generates a new version number for it automatically. Although final curation will eventually merge redundant versions into a single record, this can take a long time. Meanwhile, many tools, such as those for alignment and gene counting, are likely to be updated more frequently than the records are curated. As a result, these tools may not incorporate a stable version of the annotation sources.

For example, when I was performing the initial data processing of my chosen dataset for BCB420, I discovered that although most genes were already given as HGNC symbols, many were still referred to by ENA IDs with version numbers. Even though the dataset was only a few months past publication, many of these remaining IDs could not be found by searching the ENA database. Therefore, when non-standardized annotation sources are used, it is likely that transcripts from different locations of the same gene are mapped to different versions of a record (which the authors are clearly treating as distinct genes) simply because of duplicated submissions of the gene record or some minor fix to its content.
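One way to spot this situation in a count table is to split each identifier into its accession and version and check whether the same accession appears under more than one version. The following is a minimal sketch, not the procedure used for the dataset: it assumes versions are written as a numeric suffix after the last dot, and the function names and example row names (including the version numbers) are hypothetical.

```python
from collections import defaultdict


def split_version(identifier):
    """Split 'ACCESSION.VERSION' into (accession, version).

    Returns (identifier, None) when there is no numeric version suffix.
    Assumption: the version is the decimal number after the last dot.
    """
    cleaned = identifier.strip()
    base, dot, suffix = cleaned.rpartition(".")
    if dot and base and suffix.isdecimal():
        return base, int(suffix)
    return cleaned, None


def versions_per_accession(identifiers):
    """Map each versioned accession to the sorted list of versions seen."""
    seen = defaultdict(set)
    for identifier in identifiers:
        base, version = split_version(identifier)
        if version is not None:
            seen[base].add(version)
    return {base: sorted(versions) for base, versions in seen.items()}


# Hypothetical row names: the version suffixes are invented for illustration.
row_names = ["AC002456.1", "AC002456.2", "CLDN12", "CDK14"]

# Accessions that occur under more than one version are candidates for
# records that are being counted as separate genes.
multi_version = {
    base: versions
    for base, versions in versions_per_accession(row_names).items()
    if len(versions) > 1
}
print(multi_version)
```

Accessions flagged this way are only candidates for manual review; whether their counts should be merged depends on what the versions actually represent.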

Another problem for ID mapping between databases is that the mapping may not be unique, or that the mapping is itself an error. Let us return to the example from the beginning. AC002456 was originally indexed in the European Nucleotide Archive (ENA) as Homo sapiens BAC clone CTB-13L3. Searching for it in Ensembl shows that it is mapped to ENSG00000157224 and ENSG00000058091, which represent CLDN12 and CDK14, respectively. Not only are the two mapped genes involved in unrelated cellular processes, but the original ID also does not provide a clear description of what the gene is and does not seem to be correctly mapped to the associated genes.
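Non-unique mappings of this kind can be flagged before any downstream analysis. The sketch below assumes the mapping is already available as a list of (source ID, target ID) pairs, for instance exported from a mapping table; the function name `find_ambiguous` is hypothetical, and the only pairs used are the two mappings described above.

```python
from collections import defaultdict


def find_ambiguous(pairs):
    """Return the source IDs that map to more than one distinct target ID."""
    targets = defaultdict(set)
    for source, target in pairs:
        targets[source].add(target)
    return {
        source: sorted(mapped)
        for source, mapped in targets.items()
        if len(mapped) > 1
    }


# The AC002456 mappings described in the text.
mapping_pairs = [
    ("AC002456", "ENSG00000157224"),  # CLDN12
    ("AC002456", "ENSG00000058091"),  # CDK14
]

ambiguous = find_ambiguous(mapping_pairs)
print(ambiguous)
# {'AC002456': ['ENSG00000058091', 'ENSG00000157224']}
```

The function only reports that a mapping is one-to-many; deciding which target, if any, is correct still requires looking at the records themselves.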

So, what are some of the solutions? As mentioned above, the lack of standardization exists across databases, and this problem cannot be solved by the end users of the annotation resources. However, these problems do not appear in very well-characterized genes, which means that most of them occur in lowly expressed genes that can be filtered out. Therefore, it is always important to use the most up-to-date annotation source, such as the latest release of the reference genome, and to properly remove genes that are determined to be statistically insignificant. If the problem still persists, keeping the more unique gene IDs, such as Ensembl gene IDs, can help reduce the ambiguity of gene names when it is clear that the mapped IDs do not refer to the same gene. This preserves the possibility of discovering less characterized genes that may be involved in the biological process of interest, so that we do not simply lose a lowly expressed yet physiologically important gene. I believe that it is also critical to preserve the history of gene ID processing so that we can trace back to the original transcripts identified during alignment and gene counting. This will help reduce the possibility of errors in the gene ID mapping process.
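The last two suggestions, falling back to the Ensembl gene ID when a symbol is ambiguous and keeping a record of how each label was chosen, could be sketched as follows. This is an illustration under stated assumptions rather than an established workflow: `label_genes` and the `ENSG_HYPOTHETICAL_*` identifiers are invented for the example, and the symbol lookup is assumed to be a plain dictionary.

```python
from collections import Counter


def label_genes(ensembl_ids, symbol_of):
    """Choose a display label per gene and record why it was chosen.

    A symbol is used only when exactly one Ensembl ID in the dataset
    carries it; otherwise the Ensembl ID itself is kept as the label.
    """
    symbol_counts = Counter(
        symbol_of[gene_id] for gene_id in ensembl_ids if symbol_of.get(gene_id)
    )
    labels = {}
    history = []  # one (ensembl_id, symbol, label, reason) entry per gene
    for gene_id in ensembl_ids:
        symbol = symbol_of.get(gene_id)
        if not symbol:
            label, reason = gene_id, "no symbol available"
        elif symbol_counts[symbol] > 1:
            label, reason = gene_id, "symbol shared by several Ensembl IDs"
        else:
            label, reason = symbol, "unique symbol"
        labels[gene_id] = label
        history.append((gene_id, symbol, label, reason))
    return labels, history


# The first two IDs come from the text; the rest are hypothetical placeholders.
ensembl_ids = [
    "ENSG00000157224",
    "ENSG00000058091",
    "ENSG_HYPOTHETICAL_A",
    "ENSG_HYPOTHETICAL_B",
    "ENSG_HYPOTHETICAL_C",
]
symbol_of = {
    "ENSG00000157224": "CLDN12",
    "ENSG00000058091": "CDK14",
    "ENSG_HYPOTHETICAL_A": "GENE_X",  # hypothetical shared symbol
    "ENSG_HYPOTHETICAL_B": "GENE_X",
}

labels, history = label_genes(ensembl_ids, symbol_of)
for entry in history:
    print(entry)
```

Writing the `history` entries to a file alongside the results keeps the original IDs recoverable, so that a label can always be traced back to the identifier produced by alignment and gene counting.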

References

Abeysooriya M, Soria M, Kasu MS, Ziemann M. Gene name errors: Lessons not learned. PLoS Comput Biol. 2021;17(7):e1008984. Published 2021 Jul 30. doi:10.1371/journal.pcbi.1008984

Fundel K, Zimmer R. Gene and protein nomenclature in public databases. BMC Bioinformatics. 2006;7:372. Published 2006 Aug 9. doi:10.1186/1471-2105-7-372

Fujiyoshi K, Bruford EA, Mroz P, et al. Opinion: Standardizing gene product nomenclature-a call to action. Proc Natl Acad Sci U S A. 2021;118(3):e2025207118. doi:10.1073/pnas.2025207118

Yang J. Data cleaning and identifier mapping. (2023) https://github.com/bcb420-2023/Jielin_Yang/wiki/Dataset-Cleaning-and-Identifier-Mapping

ajay nair. (2020) BioStar: Different Ensembl Ids point to the same gene symbol. https://www.biostars.org/p/389804/

Tobias Haar. (2019) ResearchGate. How to deal with multiple ensemble IDs mapping to one gene symbol in a RNA-Seq dataset? https://www.researchgate.net/post/How-to-deal-with-multiple-ensemble-IDs-mapping-to-one-gene-symbol-in-a-RNA-Seq-dataset