Dataset Cleaning and Identifier Mapping - bcb420-2023/Jielin_Yang GitHub Wiki

Objectives

  • Download the selected dataset from GEO
  • Perform basic exploratory data analysis
  • Identifier mapping to HGNC Symbols
  • Perform distribution analysis of the dataset
  • Normalize the dataset using a suitable method

Time management

Time estimated: 48 hrs, taken: 72 hrs.

Start date: 2023-02-08, End date 2023-02-14.


Resources

This journal page documents the progress of assignment 1 in the ongoing basis. Part of the process (dataset selection) is documented in the following journal pages and the R Notebook:

Dataset Selection:

Journal page: 2.1 Gene Expression Dataset Selection

R Notebook: Gene-Expression-Dataset-Selection.Rmd

Dataset Cleaning and Identifier Mapping

The current journal page documents progress consistant with the dataset selection and identifer mapping part of assignment 1, where the R notebook is liked as follows: Assignment 1 - Data cleaning and identifier mapping

Software and Packages

  • RStudio and base environment under docker image risserlin/bcb420-base-image (see here)
  • GEOquery
  • edgeR
  • limma
  • biomaRt
  • stringr
  • kableExtra
  • dplyr
  • DT
  • ggplot2
  • reshape2
  • here

Procedure and Results

Downloading the dataset

In the previous part of the process, the dataset GSE203358 is selected. It is downloaded from the GEO server using the GEOquery packaged.

Note that the data is in the supplimentary file, which is different from the download data option in GEOquery.

This dataset contain raw counts of 24 samples, each is labelled by the treatment conditions of the samples. The samples are labelled as follows:

  • CTRL: control
  • VTP: verteporfin treatment only
  • TGFb: TGFb treatment only
  • TGFB_VTP: TGFb and verteporfin treatment

Exploratory data analysis

We observed the original IDs of the genes in the dataset. There was no duplicated gene ID in the dataset. This is because that the genes are labelled in mix of HGNC symbols and ENA accession numbers. This accesion numbers is the same as the Genbank id, which are referred interchangeably in the report.

Formart of the accession numbers:

The formats are in a mix of the current and legacy versions:

  • Current version: AC002456.1
  • Legacy version: U00096.2

The current version uses 2 letters followed by 6 digits, while the legacy version uses 1 letter followed by 5 digits. The current version is the most common format in the dataset. Therefore selecting both the current and legacy version in the dataset requires the following regular expression:

[A-Z]{1,2}[0-9]{5,6}.[0-9]{1,2}

However we observed some other genes that do not match this regular expression. Yet it is correctly indexed in the ENA database. Therefore we have to manually check the accession numbers that do not match the regular expression.

Version Numbers:

Interestingly, somehow the dataset contain reads that are mapped to multiple versions of the same gene. This means that there are entries in the dataset that have the same gene ID but different version numbers.

We have checked a few of those entries manually and discovered that the version numbers are not the same as the version numbers in the ENA database. Some of the versions of the genes are not able to be found in the ENA database.

We did not have the resource to manually check whether all of those version of duplicated genes are correct, i.e. we cannot search each of them online manually.

Some manual checking with the gene expression values of the genes indicate that such genes have usually very low expression values. Therefore we decided combind the duplicated genes with the same gene ID but different version numbers into one gene ID, and the expression value (raw counts) of the gene is the sum of the expression values of the duplicated genes. This is performed after the identifier mapping.

Identifier mapping

We used the biomaRt package to map the accession numbers to HGNC symbols. This means that the Ensembl database was used.

Parameters used in the mapping:

  • dataset = "hsapiens_gene_ensembl": the dataset used in the mapping
  • filters = "embl": this is the filter for ENA accession numbers
  • attributes = c("embl", "hgnc_symbol"): the ENA accession number and the HGNC symbol both needs to be returned to allow merging the identifiers back to the dataset

A first attempt to map the accession numbers to HGNC symbols did not retrieve any result. This because that embl only matches the first 6-8 characters of the accession number. The version number cannot be included in the mapping. Therefore we have to trim the version number from the accession numbers before mapping.

After mapping, the result shows a lot more entries then the number of ENA id we sent for query. This is because that many ENA accession numbers are mapped to multiple HGNC symbols. Further manual checking shows that the gene also maps to multiple ensemble IDs. The following is an example of the search:

Gene stable ID Gene stable ID version Transcript stable ID Transcript stable ID version European Nucleotide Archive ID
ENSG00000157224 ENSG00000157224.16 ENST00000416322 ENST00000416322.5 AC002456
ENSG00000157224 ENSG00000157224.16 ENST00000394604 ENST00000394604.5 AC002456
ENSG00000157224 ENSG00000157224.16 ENST00000427904 ENST00000427904.1 AC002456
ENSG00000058091 ENSG00000058091.17 ENST00000449528 ENST00000449528.5 AC002456
ENSG00000058091 ENSG00000058091.17 ENST00000446224 ENST00000446224.5 AC002456
ENSG00000058091 ENSG00000058091.17 ENST00000430760 ENST00000430760.5 AC002456
ENSG00000058091 ENSG00000058091.17 ENST00000456689 ENST00000456689.5 AC002456
ENSG00000058091 ENSG00000058091.17 ENST00000380050 ENST00000380050.8 AC002456
ENSG00000058091 ENSG00000058091.17 ENST00000446790 ENST00000446790.5 AC002456
ENSG00000058091 ENSG00000058091.17 ENST00000265741 ENST00000265741.7 AC002456

The same ENA accession number AC002456 maps to 10 different ensemble transcript IDs and 2 gene IDs. This is because that the same gene can have multiple transcripts.

For these multi-mapped ENA accession numbers, we decided not to map them to HGNC symbols. Instead we will keep the ENA accession numbers in the dataset. This is because that the ENA accession numbers are more specific than the HGNC symbols. The HGNC symbols are not unique, while the ENA accession numbers are unique.

Normalization

We examined the distribution of the log2 transformed CPM values of the genes. Both boxplots and density plots show relative consistent distributions across the samples. Since we are operating on the count values and assume that most genes are not differentially expressed, we decided to use the Trimmed Mean of M values (TMM) method to normalize the count values. This method is designed to handle the case where the distribution of the count values is not normal.

The TMM method is implemented in the edgeR package.

After normalization, the distribution of the samples are more consistent with reduced variations.

Multidimensional scaling

We generated MDS plot using the limma package. The plot shows that the samples are clustered by the treatment groups. This is highly signficiant for the control group. The other three treatment groups are not as distinct as the control group. Yet, they are still clustered together. We did not see a clear cluster for the same sample to be clustered together. This suggests that biological variation is not the main reason for the clustering.

Knitting the document

The R Notebook was knitted to a markdown file using the bcb420 docker image:

docker run -v ${PWD}:/home/rstudio/projects --user rstudio risserlin/bcb420-base-image /usr/local/bin/R -e "rmarkdown::render('/home/rstudio/projects/Jielin_Yang/Dataset selection and initial processing/Assignment_1_Data_cleaning_and_identifier_mapping.Rmd',output_file='/home/rstudio/projects/Jielin_Yang/Dataset selection and initial processing/Assignment_1_Data_cleaning_and_identifier_mapping.html')" > processing_output_filename

Note: this creates a new docker container based on the image. The orignal command does not run, where the first two parameters are removed. To let the command run, the entire file path must be encosed in single quotes. Also, the {PWD} must not be quoted.

Conclusion

We claim that the dataset has sufficient quality and data coverage to perform downstream differential gene expression analysis. It has several problems with ID mapping, since we do not have the original Ensembl ID generated during alignment. However, we do expect that our diffentially express genes have the correct HGNC identifiers. We also expect that the gene expression values are normalized and ready for downstream analysis.

Outlook

ID mapping is the most complicated part of the process, especially when the original ID is not available. The mapping using the ENA id are not the optimal option. We will need to find a better way to map the ENA id to the ensemble id, in particular, these two curators of gene do not regularly check for their effectiveness for ID convertion. Additional resources for ID mapping could be considered to better map the ENA id to HGNC symbols.

References

Savarese G, Becher PM, Lund LH, Seferovic P, Rosano GM, Coats AJ. Global burden of heart failure: A comprehensive and updated review of epidemiology. Cardiovascular research. 2022;118(17):3272-3287.

Garoffolo G, Casaburo M, Amadeo F, et al. Reduction of cardiac fibrosis by interference with YAP-dependent transactivation. Circulation Research. 2022;131(3):239-257.

Porter KE, Turner NA. Cardiac fibroblasts: At the heart of myocardial remodeling. Pharmacology & therapeutics. 2009;123(2):255-278.

Tallquist MD, Molkentin JD. Redefining the identity of cardiac fibroblasts. Nature Reviews Cardiology. 2017;14(8):484-491.

Perestrelo AR, Silva AC, Oliver-De La Cruz J, et al. Multiscale analysis of extracellular matrix remodeling in the failing heart. Circulation research. 2021;128(1):24-38.

Brusatin G, Panciera T, Gandin A, Citron A, Piccolo S. Biomaterials and engineered microenvironments to control YAP/TAZ-dependent cell behaviour. Nature materials. 2018;17(12):1063-1075.

Panciera T, Azzolin L, Cordenonsi M, Piccolo S. Mechanobiology of YAP and TAZ in physiology and disease. Nature reviews Molecular cell biology. 2017;18(12):758-770.

Elosegui-Artola A, Andreu I, Beedle AE, et al. Force triggers YAP nuclear entry by regulating transport across nuclear pores. Cell. 2017;171(6):1397-1410.

Francisco J, Zhang Y, Jeong JI, et al. Blockade of fibroblast YAP attenuates cardiac fibrosis and dysfunction through MRTF-a inhibition. Basic to Translational Science. 2020;5(9):931-945.

Giraud J, Molina-Castro S, Seeneevassen L, et al. Verteporfin targeting YAP1/TAZ-TEAD transcriptional activity inhibits the tumorigenic properties of gastric cancer stem cells. International Journal of Cancer. 2020;146(8):2255-2267.

Piersma B, Bank RA, Boersema M. Signaling in fibrosis: TGF-β , WNT, and YAP/TAZ converge. Frontiers in medicine. 2015;2:59.

Wang J, Karra R, Dickson AL, Poss KD. Fibronectin is deposited by injury-activated epicardial cells and is necessary for zebrafish heart regeneration. Developmental biology. 2013;382(2):427-435.

Hua X, Wang YY, Jia P, et al. Multi-level transcriptome sequencing identifies COL1A1 as a candidate marker in human heart failure progression. BMC medicine. 2020;18(1):1-16.

Umbarkar P, Ejantkar S, Tousif S, Lal H. Mechanisms of fibroblast activation and myocardial fibrosis: Lessons learned from FB-specific conditional mouse models. Cells. 2021;10(9):2412.

Sha Y, Phan JH, Wang MD. Effect of low-expression gene filtering on detection of differentially expressed genes in RNA-seq data. In: 2015 37th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC). IEEE; 2015:6461-6464.

Anders S, McCarthy DJ, Chen Y, et al. Count-based differential expression analysis of RNA sequencing data using r and bioconductor. Nature protocols. 2013;8(9):1765-1786.

Pohjolainen L, Ruskoaho H, Talman V. Transcriptomics reveal stretched human pluripotent stem cell-derived cardiomyocytes as an advantageous hypertrophy model. Journal of Molecular and Cellular Cardiology Plus. 2022;2:100020.

Deschamps-Francoeur G, Simoneau J, Scott MS. Handling multi-mapped reads in RNA-seq. Computational and structural biotechnology journal. 2020;18:1569-1576.

Li B, Ruotti V, Stewart RM, Thomson JA, Dewey CN. RNA-seq gene expression estimation with read mapping uncertainty. Bioinformatics. 2010;26(4):493-500.

Hart T, Komori HK, LaMere S, Podshivalova K, Salomon DR. Finding the active genes in deep RNA-seq gene expression studies. BMC genomics. 2013;14:1-7.

Robinson MD, Oshlack A. A scaling normalization method for differential expression analysis of RNA-seq data. Genome biology. 2010;11(3):1-9.

Ahmed A, Syed JN, Chi L, et al. KDM8 epigenetically controls cardiac metabolism to prevent initiation of dilated cardiomyopathy. Nature Cardiovascular Research. Published online February 13, 2023. doi:10.1038/s44161-023-00214-0