Standardization in bioinformatics data deposition - bcb420-2023/Jielin_Yang GitHub Wiki

Date: 2023-02-06

During the past week, I have been searching the NCBI Gene Expression Omnibus (GEO) for a gene expression dataset that would allow me to perform a secondary analysis of published research. Since its establishment in the early 2000s, GEO has curated many studies, including multiple datasets from my own lab. However, despite the diversity of the data, I have had a hard time identifying a suitable dataset in a systematic way. In particular, keyword searches do not return the most relevant results, and the completeness of the information deposited in GEO requires further consideration.

One part of the dataset selection process is searching the database. Gene expression datasets are expected to be less diverse than the publications indexed in PubMed, but GEO still lacks a controlled vocabulary to describe the content and type of study associated with a dataset. Medical Subject Headings (MeSH) are a set of biomedical terms organized as a tree, which gives a detailed view of the definition, categorization, and relationships of a set of keywords. However, MeSH, which is highly useful for systematic literature searches in PubMed, is not adopted by GEO, so we cannot use one or a few terms to describe and retrieve the set of all studies related to a certain field.
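To make the benefit of a tree-structured vocabulary concrete, here is a minimal sketch of how a single term could retrieve every dataset annotated with that term or any narrower one. The vocabulary, tree numbers, and dataset IDs are all hypothetical; they are not real MeSH codes or GEO accessions, and GEO does not expose such an interface.

```python
# Toy controlled vocabulary: term -> tree number.
# ASSUMPTION: these terms and codes are made up for illustration, not real MeSH entries.
VOCAB = {
    "Gene Expression": "X01",
    "Transcriptome": "X01.100",
    "RNA-Seq": "X01.100.200",
    "Microarray Analysis": "X01.100.300",
    "Proteome": "X02",
}

# Hypothetical dataset records, each annotated with one vocabulary term.
DATASETS = {
    "DS-A": "RNA-Seq",
    "DS-B": "Microarray Analysis",
    "DS-C": "Proteome",
    "DS-D": "Transcriptome",
}


def descendants(term):
    """Return the term itself plus every narrower term beneath it in the tree."""
    root = VOCAB[term]
    return {
        name
        for name, code in VOCAB.items()
        if code == root or code.startswith(root + ".")
    }


def search(term):
    """Return the IDs of all datasets annotated with the term or a narrower term."""
    wanted = descendants(term)
    return sorted(ds for ds, annotation in DATASETS.items() if annotation in wanted)


if __name__ == "__main__":
    # One broad term retrieves datasets tagged with any of its narrower terms.
    print(search("Transcriptome"))
```

With free-text keywords, a search for "transcriptome" would miss a dataset described only as "RNA-seq"; with the tree, the broader term covers it automatically.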

A second problem with the GEO database, and the one I found most challenging when looking for a valid dataset, is the semi-standardized process of what to deposit. Many guidelines, such as the [FAIR Guiding Principles](https://www.nature.com/articles/sdata201618), describe the information required when depositing bioinformatics data, and recent GEO entries have been properly checked by the database curators. However, how the data are deposited (i.e. the format, the sample labelling, and the biological information for each sample) is not strictly controlled. As an example, RNA-seq analysis follows a very standardized process at the beginning of the pipeline, which involves quality control (QC), alignment, and then gene counting for the sequenced samples. Further downstream steps, such as trimming, filtering, normalization, and in particular differential gene expression analysis, are hard to reverse, which makes it difficult for another researcher to understand the original data and reproduce the result with an alternative pipeline or statistical method. Many datasets deposited in GEO contain only normalized counts or the results of differential gene expression analysis, and these formats limit reproducibility. In addition, the lack of a clear description of the biological information for each sample further limits the interpretation of between-sample variation, particularly if the samples were obtained from human participants.
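The irreversibility of normalization can be illustrated with a small sketch. It uses counts per million (CPM) as one simple normalization, chosen here only as an example, and the raw counts are invented: two samples with very different sequencing depths produce identical normalized values, so the raw counts cannot be recovered from the normalized table alone.

```python
def counts_per_million(raw_counts):
    """Scale raw gene counts so that they sum to one million (CPM)."""
    total = sum(raw_counts)
    return [count * 1_000_000 / total for count in raw_counts]


# ASSUMPTION: invented raw counts for three genes in two samples.
# sample_deep was sequenced ten times deeper than sample_shallow.
sample_shallow = [10, 30, 60]
sample_deep = [100, 300, 600]

cpm_shallow = counts_per_million(sample_shallow)
cpm_deep = counts_per_million(sample_deep)

if __name__ == "__main__":
    print(cpm_shallow)
    print(cpm_deep)
    # Both samples yield the same CPM values even though the raw counts differ.
    print(cpm_shallow == cpm_deep)
```

Because the library size is discarded, a method that models raw counts directly cannot be applied to a dataset deposited only in this form, which is why depositing the raw count matrix alongside any processed values helps reuse.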

Therefore, the problems mentioned above arise from subjective, rather than objective, interpretations of the data deposition guidelines and from an existing "buffer region" in how those guidelines are enforced. I believe that one promising way to resolve the ambiguities in data deposition is to define a set of clear terms that turn most subjective interpretations into objective ones. Although this may be a high-cost process, adopting a defined set of scientific terms (an ontology), such as MeSH, in a database like GEO would at least allow newly curated datasets to be categorized and described with a correct set of terms. This would in turn make each study easier to interpret, make searching easier, and make finding relationships (e.g. similarities) between datasets a less strenuous process.

References

Chervitz SA, Deutsch EW, Field D, et al. Data standards for Omics data: the basis of data sharing and reuse. Methods Mol Biol. 2011;719:31-69. https://doi.org/10.1007/978-1-61779-027-0_2.

Wilkinson MD, Dumontier M, Aalbersberg IJ, et al. The FAIR Guiding Principles for scientific data management and stewardship [published correction appears in Sci Data. 2019 Mar 19;6(1):6]. Sci Data. 2016;3:160018. https://doi.org/10.1038/sdata.2016.18.

Standardizing data. Nat Cell Biol 2008;10:1123–1124. https://doi.org/10.1038/ncb1008-1123.

Data Deposition and Standardization. Nucleic Acids Res. https://academic.oup.com/nar/pages/data_deposition_and_standardization.