Predicting functional non-coding single nucleotide polymorphisms (ncSNPs) using multi-modal deep learning - rahulmohan/ncSNP-predict GitHub Wiki

Motivation

Non-coding single nucleotide polymorphisms (ncSNPs) play an important role in chromatin structure and variation and, more generally, can affect gene regulation. Unlike coding SNPs, which can alter amino acids, ncSNPs can directly affect the transcription of neighboring genes. Their mechanisms include altering transcription factor binding affinity, promoter methylation, and the function of other regulatory elements. Furthermore, ncSNPs have been shown to regulate chromatin structure in a sequence-dependent fashion. Distinguishing functional ncSNPs from benign ones has important implications for identifying and treating disease.

Methods

We aim to incorporate various types of intra-genomic features into a single framework that classifies ncSNPs as functional or benign (i.e., multi-modal learning). Furthermore, we hope to bypass manual feature construction by using deep learning to learn novel representations from raw genomic data.

Dataset:


Because our task is binary classification (functional ncSNP or not), we had to construct a positive and a negative set. Our positive set consisted of ncSNPs from the GWAS (Genome-Wide Association Studies) catalog. The noncoding GWAS catalog consists of 15,399 ncSNPs, each of which is highly statistically significant and associated with a particular disease. Generally speaking, the noncoding GWAS catalog is the strongest available set of regulatory variants, second only to HGMD (Human Gene Mutation Database). Our negative set consisted of 15,300 ncSNPs randomly sampled from the 1000 Genomes Project.
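As a rough illustration of how such a labeled set could be assembled, here is a minimal sketch. The function name `build_labeled_dataset`, the use of SNP identifiers as plain strings, the fixed random seed, and the step that removes GWAS SNPs from the negative pool are all assumptions made for the example and are not taken from the project code.

```python
import random


def build_labeled_dataset(positive_snps, candidate_negative_snps, n_negatives, seed=0):
    """Return a shuffled list of (snp_id, label) pairs; label 1 = functional, 0 = benign."""
    rng = random.Random(seed)  # fixed seed so the sampling is reproducible
    positives = set(positive_snps)
    # Assumption: a SNP already in the positive set is never sampled as a negative.
    pool = sorted(set(candidate_negative_snps) - positives)
    negatives = rng.sample(pool, n_negatives)  # raises ValueError if the pool is too small
    dataset = [(snp, 1) for snp in sorted(positives)]
    dataset += [(snp, 0) for snp in negatives]
    rng.shuffle(dataset)
    return dataset
```

With the set sizes described above, `n_negatives` would be 15,300 and the positive list would hold the 15,399 catalog ncSNPs, which gives a roughly balanced two-class problem.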

Features:


DNA Sequence

The motivation for using DNA sequence is to investigate whether functional ncSNPs affect sequence specificities differently from benign ones. We extract the DNA sequence from a region centered on each ncSNP. The sequence consists of four letters (A, C, G, and T), one for each nucleotide base in DNA. We experimented with various sequence lengths, that is, with the size of the window centered on each ncSNP. We featurized each sequence with a one-hot encoding over the four nucleotides: each sequence has 4 binary vectors associated with it, one per nucleotide, marking the positions where that nucleotide is present. For example, a DNA sequence of length 500 would be encoded as a 4 x 500 binary matrix.
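The window extraction and encoding described above can be sketched as follows. This is a minimal pure-Python illustration, not the project's implementation: the function names, the 0-based SNP position, the row order A, C, G, T, and the choice to leave unknown bases such as N as all-zero columns are assumptions.

```python
NUCLEOTIDES = "ACGT"  # assumed row order of the encoded matrix


def extract_window(chromosome_sequence, snp_position, window_size):
    """Return window_size bases around a 0-based SNP position.

    The SNP lands at index window_size // 2 of the returned string.
    """
    start = snp_position - window_size // 2
    end = start + window_size
    if start < 0 or end > len(chromosome_sequence):
        raise ValueError("window extends past the chromosome boundary")
    return chromosome_sequence[start:end]


def one_hot_encode(sequence):
    """Encode a DNA string as a 4 x L binary matrix (rows: A, C, G, T)."""
    sequence = sequence.upper()
    matrix = [[0] * len(sequence) for _ in NUCLEOTIDES]
    for position, base in enumerate(sequence):
        row = NUCLEOTIDES.find(base)
        if row >= 0:  # assumption: unknown bases (e.g. N) stay all-zero
            matrix[row][position] = 1
    return matrix


window = extract_window("AACGTT", snp_position=3, window_size=4)
encoded = one_hot_encode(window)
```

For a window of 500 bases, `one_hot_encode` returns 4 rows of 500 entries each, matching the 4 x 500 matrix in the example above.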

Chromatin Accessibility

Functional ncSNPs should lie in regions of high chromatin accessibility, because an ncSNP has to be located in open chromatin in order to have a regulatory effect. We used the DNase I signal from ENCODE for each of the DNA sequences we extracted, with one chromatin accessibility value per base pair. While it is generally known that ncSNPs affect chromatin accessibility, another goal was to quantify how much of the variation in chromatin accessibility is explained by functional ncSNPs.
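One way to obtain a value per base pair is to expand interval-based signal over the same window used for the sequence. The sketch below assumes the signal is available as bedGraph-style `(start, end, value)` intervals with 0-based, half-open coordinates on a single chromosome, and that bases without coverage default to 0.0; the function name and these conventions are illustrative assumptions, not the project's pipeline.

```python
def per_base_signal(intervals, window_start, window_end, default=0.0):
    """Expand (start, end, value) intervals into one value per base.

    Coordinates are 0-based and half-open; the result covers
    [window_start, window_end). Later intervals overwrite earlier ones.
    """
    signal = [default] * (window_end - window_start)
    for start, end, value in intervals:
        lo = max(start, window_start)  # clip the interval to the window
        hi = min(end, window_end)
        for position in range(lo, hi):
            signal[position - window_start] = value
    return signal


dnase_window = per_base_signal([(0, 2, 1.5), (3, 5, 4.0)], window_start=1, window_end=5)
```

The resulting list has the same length as the sequence window, so it lines up column by column with the encoded sequence.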

Proximity to Regulatory and Evolutionarily Important Sites

We extracted the distance from each SNP to regulatory, evolutionarily conserved, and evolutionarily selected sites. Our intuition is that GWAS SNPs should generally be closer to these sites than non-functional ncSNPs are. Here are the sites we selected for the distance computations:

  • DNase I Hypersensitivity Sites (DHS)
    • Biological motivation - Contain all open chromatin regions and are directly related to transcriptional activity.
  • Distal Regulatory Module (DRM) Enhancers
    • Biological motivation - Regulate the expression of genes located in non-adjacent sites. Specifically, these enhancers are not located in the flanking sequences of genes.
  • Cis Regulatory Module (CRM) Enhancers
    • Biological motivation - Regulate the expression of genes located in adjacent sites. Specifically, these enhancers are located in the flanking sequences of genes.
  • Promoters
    • Biological motivation - Initiate the transcription of a particular gene. ncSNPs might affect the binding affinity of promoters to transcription factors.
  • Transcription Factor Binding Sites
    • Biological motivation - Directly affect the regulation of genes. The intuition is that ncSNPs are more likely to be functional if in the proximity of a binding peak.
  • Nucleosomes
    • Biological motivation - The positioning of nucleosomes plays a key role in determining the access of transcription factors to their binding sites and is thus very important for gene regulation and activation/repression dynamics. The relationship between nucleosome positioning and functional ncSNPs has not been explored; identifying a statistically significant relationship between the two could prove to be a key new biological insight.
  • CpG Islands
    • Biological motivation - The mutation of CpG dinucleotides, the most common location for DNA methylation, has been explored as a mechanism through which SNPs can affect gene regulation. ncSNPs could lead to aberrant methylation in these regions.
  • Histone Modifications
    • Biological motivation - The histone marks H3K27ac, H3K4me1, and H3K4me3 (active chromatin) and H3K27me3 and H3K9me3 (repressed chromatin) play an important role in determining chromatin state. ncSNPs can affect the binding affinities of these histones while also potentially altering chromatin state.
  • Ultra-conserved Regions
    • Biological motivation - Regions that are highly conserved across many phylogenetic branches have been identified. Functional ncSNPs are likely to be closer to such regions than non-functional ncSNPs are.
  • Ultra-selective Regions
    • Biological motivation - Regions that are under high selective pressure, compared to other non-coding regions, have been identified. Although the relationship between selective regions and functional/non-functional ncSNPs is unclear, one of the goals of including this feature is to determine whether such a relationship exists.
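The distance feature for any one of the site types above can be sketched as a nearest-interval lookup. This is a minimal illustration under stated assumptions: sites are `(start, end)` intervals with 0-based, half-open coordinates, sorted and non-overlapping, all on the same chromosome as the SNP, and a SNP inside a site gets distance 0. The function name is hypothetical and the project may define distance differently (for example, to a site's midpoint).

```python
from bisect import bisect_right


def distance_to_nearest_site(snp_position, sites):
    """Distance in base pairs from a SNP to the nearest site, or None if there are no sites.

    sites: sorted, non-overlapping (start, end) half-open intervals.
    """
    if not sites:
        return None
    starts = [start for start, _ in sites]  # could be precomputed once per site type
    index = bisect_right(starts, snp_position)
    candidates = []
    if index > 0:
        # Nearest site that starts at or before the SNP.
        _, end = sites[index - 1]
        candidates.append(0 if snp_position < end else snp_position - (end - 1))
    if index < len(sites):
        # Nearest site that starts after the SNP.
        candidates.append(sites[index][0] - snp_position)
    return min(candidates)


example_sites = [(10, 20), (50, 60)]
distances = [distance_to_nearest_site(p, example_sites) for p in (5, 15, 25, 70)]
```

Running this once per site type (DHS, enhancers, promoters, and so on) would give one distance value per type for each ncSNP.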