Gene Expression Dataset Selection - bcb420-2023/Jielin_Yang GitHub Wiki

Objectives

  • Selecting an appropriate RNA-seq dataset from GEO using the GEOmetadb package

Time management

Time estimated: 4h, taken: 10h.

Start date: 2023-01-25, End date: 2023-02-02.


Resources

Link to R Notebook

Gene-Expression-Dataset-Selection.Rmd

Software and Packages

  • RStudio and base environment under docker image risserlin/bcb420-base-image (see here)
  • GEOmetadb package from Bioconductor

Procedure and Results

Install the GEOmetadb package and generate the GEOmetadb.sqlite database

The GEOmetadb package is installed from Bioconductor.

The GEOmetadb.sqlite database is generated in R, with the following specifications at the time of download:

  • Size: 13592906752 bytes (approximately 13.6 GB, or 12.7 GiB)
  • Date: 2021-01-25
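
The installation and download steps above can be sketched as follows. This is a minimal sketch, not the notebook's actual code: it assumes BiocManager is available and that the database is downloaded into the working directory with GEOmetadb::getSQLiteFile(). The helper names ensure_geometadb and describe_db_file are hypothetical.

```r
# Hypothetical helper: install GEOmetadb if needed and download the database once.
# Assumption: the unpacked file is named "GEOmetadb.sqlite" inside destdir.
ensure_geometadb <- function(destdir = getwd()) {
  if (!requireNamespace("GEOmetadb", quietly = TRUE)) {
    BiocManager::install("GEOmetadb")
  }
  db_path <- file.path(destdir, "GEOmetadb.sqlite")
  if (!file.exists(db_path)) {
    # Large download: fetches the compressed database and unpacks it
    GEOmetadb::getSQLiteFile(destdir = destdir)
  }
  db_path
}

# Hypothetical helper: report the size and modification date of a file
describe_db_file <- function(path) {
  info <- file.info(path)
  list(
    bytes = info$size,
    gib = round(info$size / 1024^3, 1),
    date = as.Date(info$mtime)
  )
}
```

Calling describe_db_file(ensure_geometadb()) would then report the size and date of the local copy. Neither function is run automatically here because of the size of the download.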

To connect to the database, the following packages are required:

library(DBI)
library(RSQLite)
# Open a connection to the database; the queries below use it as `con`
con <- DBI::dbConnect(RSQLite::SQLite(), "GEOmetadb.sqlite")

These packages are already included in the docker image.

Basic database queries

The database contains 11 tables.

> DBI::dbListTables(con)
 [1] "gds"               "gds_subset"        "geoConvert"        "geodb_column_desc"
 [5] "gpl"               "gse"               "gse_gpl"           "gse_gsm"          
 [9] "gsm"               "metaInfo"          "sMatrix"           

where the gse table contains the metadata for the series, and the gsm table contains the metadata for the samples.

SQL queries can be run on the database, using the dbGetQuery function.

DBI::dbGetQuery(con, "SELECT * FROM gse")

The second argument is a string containing the SQL query. For dataset selection, the query is first built as a separate string and then passed to the dbGetQuery function.
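
As a sketch of that pattern, the example below stores a query in a string and passes it to dbGetQuery. To stay self-contained it uses an in-memory SQLite database with a toy gse table whose rows are made up for illustration. With the real database, con would be the connection to GEOmetadb.sqlite.

```r
library(DBI)
library(RSQLite)

# Toy stand-in for GEOmetadb.sqlite: an in-memory database with a minimal gse table
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbWriteTable(con, "gse", data.frame(
  gse = c("GSE_A", "GSE_B", "GSE_C"),
  title = c("Heart failure study", "Liver study", "Cardiac regeneration study"),
  submission_date = c("2019-05-01", "2020-02-10", "2017-11-30"),
  stringsAsFactors = FALSE
))

# Build the query as a string first, then pass it to dbGetQuery
sql <- paste(
  "SELECT gse, title, submission_date",
  "FROM gse",
  "WHERE submission_date > '2018-01-01'",
  "ORDER BY gse"
)
recent <- DBI::dbGetQuery(con, sql)
print(recent)

DBI::dbDisconnect(con)
```

Keeping the query in its own variable makes it easy to print, review, and reuse before it is run against the full database.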

Constructing the query with specific criteria

The GEOmetadb database is searched with the following SQL query:

SELECT DISTINCT
    gse.title,
    gse.gse,
    gpl.title,
    gse.submission_date,
    gse.supplementary_file
FROM
    gse
JOIN
    gse_gpl
ON
    gse_gpl.gse = gse.gse
JOIN
    gpl
ON
    gse_gpl.gpl = gpl.gpl
WHERE
    gse.submission_date > '2018-01-01'
AND
    (
        gse.title LIKE '%heart%'
        OR gse.title LIKE '%cardiac%'
        OR gse.title LIKE '%cardiomyocyte%'
    )
AND
    (
        gse.title LIKE '%regenerat%'
        OR gse.title LIKE '%proliferat%'
        OR gse.title LIKE '%fail%'
    )
AND
    gpl.technology LIKE '%high-throughput%'
AND
    gpl.organism LIKE '%Homo sapiens%'

Here, we are looking for unique series that:

  • were submitted from 2018 onwards (within the last 5 years)
  • use high-throughput sequencing technology
  • originate from human samples

To identify datasets of particular interest, keywords are used to filter the results, where the correct combination of keywords must appear in the title of the series:

  • include one of the keywords related to heart: heart, cardiac, cardiomyocyte
  • AND include one of the keywords related to regeneration: regenerat, proliferat, fail

The search strategy allows for the inclusion of datasets that are not directly related to heart regeneration, but are related to heart failure or cardiac cell proliferation.
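
The keyword logic above can also be generated from R vectors rather than typed by hand. The following is a minimal sketch. The helper like_any is hypothetical, and it interpolates keywords directly into the SQL text, so it is only suitable for fixed, trusted keywords such as the ones listed here.

```r
# Hypothetical helper: turn a vector of keywords into a parenthesised
# group of OR-ed LIKE conditions on one column
like_any <- function(column, keywords) {
  conditions <- sprintf("%s LIKE '%%%s%%'", column, keywords)
  paste0("(", paste(conditions, collapse = " OR "), ")")
}

heart_terms <- c("heart", "cardiac", "cardiomyocyte")
process_terms <- c("regenerat", "proliferat", "fail")

# A title must match at least one term from each group
where_keywords <- paste(
  like_any("gse.title", heart_terms),
  like_any("gse.title", process_terms),
  sep = " AND "
)
cat(where_keywords, "\n")
```

The resulting string can be pasted into the WHERE clause of the query, which makes it easier to try other keyword combinations in later searches.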

Query Results

The query returns 21 entries, some of which are duplicates. This is because the same series can be associated with multiple platforms, and the query returns one entry for each platform.
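
One way to collapse such duplicates is to keep one row per series and merge the platform names. The sketch below uses a toy result set with made-up accessions and platform names. The column names gse and platform are assumptions for illustration and do not match the query output exactly.

```r
# Toy result set: one series appears twice because it uses two platforms
results <- data.frame(
  gse = c("GSE_A", "GSE_A", "GSE_B"),
  platform = c("Platform 1", "Platform 2", "Platform 1"),
  stringsAsFactors = FALSE
)

# Collapse to one row per series, keeping all platform names
unique_series <- aggregate(
  platform ~ gse,
  data = results,
  FUN = function(x) paste(x, collapse = "; ")
)
print(unique_series)
```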

The methods behind each series, and how its samples were collected and categorised, were determined by reviewing the series details on the GEO website.

Although the query restricts the platform technology to high-throughput sequencing, it does not restrict the type of experiment, so the results include series that used high-throughput sequencing for expression analysis, methylation analysis, and protein-binding analysis.

When reviewing the series in detail, the following criteria are used to exclude series that are not compatible with the downstream analysis:

  • The series contains fewer than 10 samples
  • The study utilized single-cell RNA-seq, which is not compatible with the downstream analysis due to the size of the dataset
  • A grouping variable for the samples is not available, i.e., we cannot effectively compare the samples for differential expression analysis
  • Replicates are not available, or there are fewer than 3 replicates for each group
  • Supplementary files are not available, or the supplementary files are not compatible with the downstream analysis
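
Of these criteria, only the sample count can be checked directly from the metadata. The others require manual review of each series. The sketch below counts samples per series, assuming the gse_gsm table links series to samples through columns named gse and gsm. It runs on a toy in-memory table with made-up accessions rather than the real database.

```r
library(DBI)
library(RSQLite)

# Toy stand-in for the gse_gsm table: one row per (series, sample) pair
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbWriteTable(con, "gse_gsm", data.frame(
  gse = c(rep("GSE_A", 12), rep("GSE_B", 4)),
  gsm = paste0("GSM_", 1:16),
  stringsAsFactors = FALSE
))

# Keep only series with at least min_samples samples
min_samples <- 10
sql <- paste(
  "SELECT gse, COUNT(gsm) AS n_samples",
  "FROM gse_gsm",
  "GROUP BY gse",
  "HAVING COUNT(gsm) >=", min_samples
)
large_enough <- DBI::dbGetQuery(con, sql)
print(large_enough)

DBI::dbDisconnect(con)
```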

After applying the above criteria, 3 datasets remain and are tentatively selected for downstream analysis:

| Title | GSE | Submission Date |
| --- | --- | --- |
| Screening in Human Cardiac Organoids Identifies a Requirement for the Mevalonate Pathway in Cardiomyocyte Proliferation | GSE111853 | 2019-03-20 |
| Multi-level transcriptome sequencing identifies COL1A1 as a candidate marker in human heart failure progression | GSE135055 | 2020-01-08 |
| RNA sequencing of the left ventricle from non-failing donors and heart failure samples from the MAGNet consortium | GSE141910 | 2019-12-13 |

Of the three, the first dataset addresses the more interesting question, as it captures changes in gene expression patterns when cells are exposed to different compounds. However, this dataset has a relatively low number of replicates. The second dataset provides a good comparison between normal and failing samples, and the third dataset provides a good comparison between failing samples from different patients. The last two datasets have a considerable number of replicates, so their sample sizes are adequate for statistical analysis from a clinical research perspective.

Upon detailed examination, none of the three datasets is accepted, because each one either is not provided as unnormalized counts or lacks a related publication.

We performed additional searches with the database, but the new studies we obtained were again not compatible with the downstream analysis. Therefore, we decided to search directly on the GEO website.

GEO website search

The GEO website search is performed with the following keywords:

  • heart regeneration
  • heart failure
  • cardiac fibrosis

We limited the search to the last 5 years and filtered for series with at least 10 samples.

Many studies were found, but the majority are not compatible with the downstream analysis for the same reasons as above. However, we identified one suitable study:

The dataset is titled Reduction of Cardiac Fibrosis by Interference With YAP-Dependent Transactivation, uploaded on Jun 30, 2022. The related paper has been published in Circulation Research: Reduction of Cardiac Fibrosis by Interference With YAP-Dependent Transactivation.

This dataset contains a total of 24 samples, with 6 replicates for each group. Each group involves a different treatment of cardiosphere-derived primitive cardiac stromal cells, which makes the dataset a good candidate for analyzing the effect of these treatments on cardiac fibrosis. The dataset is provided as unnormalized counts. Therefore, this dataset is accepted for downstream analysis.
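
A quick sanity check for whether a downloaded count table looks unnormalized is to test that every value is a non-negative whole number. The sketch below is only a heuristic, and the helper name looks_like_raw_counts is hypothetical. Passing the check does not prove that the data are untransformed, and the example matrices are made up.

```r
# Hypothetical helper: TRUE if every value is a non-negative whole number
looks_like_raw_counts <- function(counts) {
  values <- as.matrix(counts)
  is.numeric(values) &&
    !anyNA(values) &&
    all(values >= 0) &&
    all(values == round(values))
}

# Made-up examples: integer counts versus fractional (normalized-looking) values
raw_example <- matrix(c(0, 15, 230, 4), nrow = 2)
normalized_example <- matrix(c(0.0, 1.37, 12.5, 0.42), nrow = 2)

looks_like_raw_counts(raw_example)
looks_like_raw_counts(normalized_example)
```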

Conclusion

Finding an appropriate dataset for downstream analysis is a challenging task. In particular, there is no controlled vocabulary for describing datasets, and there are no MeSH terms to properly categorize the main point of each study. Therefore, checking the primary literature related to a dataset is an important way to ensure its quality, and it is also critical to download the dataset to see whether it is compatible with the desired analysis.

Outlook

Think about what is important to consider when depositing a dataset so that it promotes reusability and reproducibility of the results. Additionally, since RNA-seq makes up only a minor part of the selected study, what is the best way to analyze the data to support the study's main point?

References

Garoffolo G, Casaburo M, Amadeo F, et al. Reduction of Cardiac Fibrosis by Interference With YAP-Dependent Transactivation. Circ Res. 2022;131(3):239-257. doi:10.1161/CIRCRESAHA.121.319373