ENCODE: Preparing a Database for a Foundations Model - s-joshid/bioinformatics_projects GitHub Wiki

ENCODE

Background

ENCODE (the Encyclopedia of DNA Elements) is an international collaboration, funded by the National Human Genome Research Institute (NHGRI), that aims to identify all functional elements in the human and mouse genomes. It provides vast amounts of freely available data, all of which adhere to its standardization guidelines. Key assay types include RNA-seq, ChIP-seq, DNase-seq, ATAC-seq, WGBS, and RAMPAGE. For this project I focused on total RNA sequencing data from human donors. This assay captures both coding and non-coding RNA transcripts, which can aid in discovering novel RNA types and provides insight into the mechanisms behind transcription and translation. I developed two scripts: one for downloading and formatting the data and another for preliminary data exploration.

This project was done with a classmate for the BF550 course at Boston University in December 2024.

Filtering and Formatting

My download script, get_total_rna_seq.py, filters the available files down to the ones I want to download. In this case I keep total RNA-seq files from Homo sapiens tissue samples that have no sequencing warnings or errors. Each file gets up to three download attempts before it is skipped. Once all files have been downloaded, they are combined into one large .tsv file containing all of the experimental data along with its source.
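The retry-then-skip behaviour and the final merge can be sketched as follows. This is a minimal illustration, not the actual contents of get_total_rna_seq.py: the function names, the `fetch` callable, and the added `source_file` column are all assumptions made for the example, and it uses only the standard library.

```python
import csv
import time
from pathlib import Path


def download_with_retries(fetch, url, dest, attempts=3, delay=0.0):
    """Call fetch(url, dest) up to `attempts` times; return True on success.

    `fetch` is a hypothetical stand-in for whatever performs the download.
    """
    for attempt in range(1, attempts + 1):
        try:
            fetch(url, dest)
            return True
        except OSError as err:  # network and file errors derive from OSError
            print(f"attempt {attempt}/{attempts} failed for {url}: {err}")
            if attempt < attempts:
                time.sleep(delay)
    return False  # the caller skips this file


def combine_tsvs(paths, out_path):
    """Concatenate TSV files that share a header, adding a source_file column."""
    header = None
    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out, delimiter="\t", lineterminator="\n")
        for path in paths:
            with open(path, newline="") as handle:
                reader = csv.reader(handle, delimiter="\t")
                file_header = next(reader, None)
                if file_header is None:
                    continue  # empty file, nothing to add
                if header is None:
                    header = file_header
                    writer.writerow(header + ["source_file"])
                elif file_header != header:
                    raise ValueError(f"{path} has a different header")
                for row in reader:
                    writer.writerow(row + [Path(path).name])
```

In practice `fetch` could be something like `urllib.request.urlretrieve`, whose errors are subclasses of `OSError`. Files for which `download_with_retries` returns False would simply be left out of the list passed to `combine_tsvs`.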

Preliminary data exploration

data_analysis_RNA_seq.py explores the resulting combined .tsv. The total RNA-seq results are given as transcripts per million (TPM) and are therefore already normalized. In this script I identify the longest gene in the dataset, the genes most commonly expressed in each life stage (adult, child, and embryonic), and which of these common genes are unique to each life stage.
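For readers unfamiliar with TPM, the usual definition divides each gene's read count by its length and then rescales so that the values in a sample sum to one million. The sketch below illustrates that definition only. The downloaded files already contain TPM values, so nothing like this is needed in the analysis script, and the `counts` and `lengths_kb` inputs are hypothetical.

```python
def tpm(counts, lengths_kb):
    """Compute TPM from raw read counts and gene lengths in kilobases.

    counts and lengths_kb are dicts keyed by gene ID (illustrative inputs).
    """
    # Reads per kilobase: corrects for longer genes collecting more reads.
    rates = {gene: counts[gene] / lengths_kb[gene] for gene in counts}
    total = sum(rates.values())
    # Rescale so every sample sums to one million, correcting for depth.
    return {gene: rate / total * 1_000_000 for gene, rate in rates.items()}
```

Because every sample sums to the same total, TPM values can be compared as proportions within a sample without further depth or length correction.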

The longest gene was HELLPAR, a non-coding RNA associated with HELLP syndrome, a rare and serious pregnancy complication.

Exploring the most common genes in this data, I found 15,671 transcripts expressed in all embryonic samples, 604 of which were uniquely expressed in this life stage. There were 19,120 transcripts expressed in all of the child samples, of which 3,935 were unique to this life stage. For adults, 10,325 transcripts were expressed in all samples, and none were uniquely expressed.
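One way to arrive at counts like these is with set operations: intersect the expressed transcripts across all samples of a life stage, then subtract the common sets of the other stages. The sketch below makes two assumptions that may differ from the actual script: a transcript counts as "expressed" when its TPM is above a threshold (0 by default), and "unique" means present in one stage's common set but in no other stage's common set. The function names are illustrative.

```python
def expressed_in_all(samples, threshold=0.0):
    """Return transcripts with TPM above threshold in every sample.

    samples is a list of {transcript_id: tpm} dicts, one per sample.
    """
    expressed = [
        {gene for gene, value in sample.items() if value > threshold}
        for sample in samples
    ]
    return set.intersection(*expressed) if expressed else set()


def unique_per_stage(common_by_stage):
    """For each stage, keep transcripts absent from every other stage's set."""
    unique = {}
    for stage, genes in common_by_stage.items():
        others = set().union(
            *(g for s, g in common_by_stage.items() if s != stage)
        )
        unique[stage] = genes - others
    return unique
```

Under this definition, a stage whose common set is entirely contained in the other stages' common sets ends up with no unique transcripts, which is the pattern seen for adults above.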

To showcase how this database can be explored in detail, I chose to explore the expression of the TP53 gene across different tissues and ages. I was interested in this gene because it encodes p53, an important regulatory protein that is often mutated in cancer patients. For this exploration I added a life stage by separating adults into under and over 50, as age can be a confounding factor for gene expression. The expression of TP53 across these four life stages is shown below.
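Splitting adults at 50 and summarizing expression per group can be sketched like this. It assumes ages have already been converted to years, and it places donors aged exactly 50 in the older group, which is an arbitrary choice since the text only says "under and over 50". The group labels and function names are illustrative.

```python
from collections import defaultdict
from statistics import mean


def age_group(life_stage, age_years=None, cutoff=50):
    """Map a sample to one of four groups, splitting adults at `cutoff`."""
    if life_stage != "adult":
        return life_stage  # embryonic and child are kept as they are
    if age_years is None:
        return "adult (age unknown)"
    return "adult under 50" if age_years < cutoff else "adult 50 and over"


def mean_tpm_by_group(records):
    """Average TPM per group.

    records is an iterable of (life_stage, age_years, tpm) tuples,
    for example the TP53 rows of the combined table.
    """
    grouped = defaultdict(list)
    for life_stage, age_years, tpm_value in records:
        grouped[age_group(life_stage, age_years)].append(tpm_value)
    return {group: mean(values) for group, values in grouped.items()}
```

The same grouping key can be paired with tissue type to produce the per-tissue breakdown.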

I then plotted TP53 expression across tissue types, subset by age group, to explore how expression of this gene varies between tissues. This plot is shown below.