Annotate and Score DMRs - GianlucaMattei/methyl.O GitHub Wiki

Annotate DMRs:

R:

Genes and gene features can be annotated with methyl.O using the function annotateDMRs. An example dataset is included in the package and can be loaded with:

data("DMRsSubset", package="methyl.O")

By default, the function annotates with the Ensembl database and the hg19 version of the human assembly. Ensembl can also be used with hg38, and UCSC can be used with hg19. At the moment UCSC cannot be used to annotate the hg38 version of the human assembly. The annotation settings allow the results to be customized. Running annotateDMRs allows you:
- to modify the length of promoters (prom.length, default 1500 bp) and the length of the heads (head.length, default 1500 bp),
- to decide whether to use the longest transcript for each gene or to perform the analysis on all transcripts for each gene (longest.trx, default TRUE),
- to define the database to be used for annotation (annotation, default Ensembl),
- to set the assembly version (hg, default hg19), the beta threshold for each DMR to be considered in the annotation (thr.beta, default 0.3) and the percentage threshold for CpG Islands (CGIs) to be considered differentially methylated (thr.cgis, default 0.4).

The annotation.fast option controls how gene symbols are returned. The input ranges are first annotated with Ensembl gene IDs, if Ensembl is used, or with Entrez gene IDs, if UCSC is used. These IDs can then be converted to symbols 1:1, which speeds up the process, or 1:many (or many:1) for a more accurate ID conversion (annotation.fast, default TRUE).

The last three options are for input data that are not formatted as expected. col.betadiff (default 4) selects the column number holding the beta difference, which is required for the analysis. col.beta1 and col.beta2 (default NULL) select the columns holding the beta values of the two compared samples, which are recommended but not required.

annotatedDMRs <- annotateDMRs(DMRsSubset, prom.length = 1500, head.length = 1500, longest.trx = TRUE,
    annotation = 'ensembl', hg = 'hg19', annotation.fast = TRUE, thr.beta = .3, thr.cgis = .4,
    col.betadiff = 4, col.beta1 = NULL, col.beta2 = NULL)
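
When the input table does not follow the default layout, the column options described above tell annotateDMRs where to look. The following is a minimal sketch, not a reference usage. It assumes a hypothetical table `myDMRs` (coordinates and beta values copied from Table 4 of this page) in which the beta difference sits in column 5 instead of column 4, and it assumes that the function accepts a plain data frame. The base R preview also assumes that thr.beta is applied to the absolute beta difference, which this page does not state explicitly.

```r
# Hypothetical input: the beta difference is in column 5, not the default 4.
# Coordinates and beta values are taken from Table 4 of this page.
myDMRs <- data.frame(
  chr      = c("chr7", "chr17", "chr17", "chr16"),
  start    = c(1894240L, 26712105L, 42462135L, 3067875L),
  end      = c(1895255L, 26712449L, 42462494L, 3068113L),
  width    = c(1016L, 345L, 360L, 239L),
  betadiff = c(-0.4587633, 0.4203297, 0.4608637, 0.3508772),
  stringsAsFactors = FALSE
)

# Preview of the rows that reach a stricter threshold
# (assumption: the threshold is compared with the absolute beta difference)
thr.beta <- 0.4
passing <- myDMRs[abs(myDMRs$betadiff) >= thr.beta, ]

# Run the annotation only if methyl.O is installed
if (requireNamespace("methyl.O", quietly = TRUE)) {
  annotatedMine <- methyl.O::annotateDMRs(
    myDMRs, annotation = "ensembl", hg = "hg19",
    thr.beta = thr.beta, col.betadiff = 5
  )
}
```

Here col.beta1 and col.beta2 are left at their default (NULL) because the toy table has no per-sample beta columns.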

GUI:

The DMRs can be annotated in the “Annotate Methylated Regions” tab. The same page also displays the ranked list of genes with their scores, which is explained in the next section. The main settings are placed on the left side, while the results are displayed in the main part of the page.

| Command | Description |
| --- | --- |
| Select Annotations to Use | Specifies the database to use for annotation |
| Select Assembly version | Sets the assembly version to use |
| Use Longest transcript | Yes to use the longest transcript for each gene; otherwise all the transcripts of each gene are used |
| Compute Fast Annotation | Selects the type of conversion to gene symbols from Ensembl gene IDs, if Ensembl is used, or from Entrez gene IDs, if UCSC is used. It can be set to 1:1, which speeds up the process, or to 1:many (or many:1) for a more accurate ID conversion |
| Select Promoter Lengths | Sets the promoter lengths |
| Select Head Lengths | Sets the head lengths |
| Length Percentage of Altered Methylated CGIs | Sets the length, as a percentage, of the overlap between the DMRs and the CGIs to be considered during the analysis |
| Beta Diff. Threshold | The beta threshold for each DMR to be considered in the annotation |
| Column Position of Beta Diff. | Selects the column number where to find the beta difference |
| Column's Position of Beta Values of Sample 1 | Selects the column number where to find the beta values of sample 1 |
| Column's Position of Beta Values of Sample 2 | Selects the column number where to find the beta values of sample 2 |
Table 2: Parameters for Annotate Methylated Regions tab

In the GUI we also implemented some plots that give a quick overview of the results. In order, these plots show: the distribution of beta difference values by chromosome, the number of DMRs by chromosome, and the distributions of widths, ranks, beta values and database scores for each annotation (Genes, Heads, TSS surrounding, Promoters, Exons, 5’UTRs, 3’UTRs, Introns). Two additional annotations can be plotted (First exons, First introns). The last plot compares the number of DMRs for each annotated region. Each plot has its own customization settings, which are opened by clicking on the red gear button. Other common settings are placed in the left column of the page:

| Command | Description |
| --- | --- |
| Met Width Min | Minimum methylation width threshold in bp |
| Met Width Max | Maximum methylation width threshold in bp |
| Selected Feature Percentage Min | Minimum methylation width threshold in % |
| Selected Feature Percentage Max | Maximum methylation width threshold in % |
| Feature Rank Min | Minimum rank threshold |
| Feature Rank Max | Maximum rank threshold |
Table 3: Additional parameters for plots and for results table in Annotate Methylated Regions tab

The last two options, Feature Rank Min and Feature Rank Max, refer to the position of a feature in the gene model. For example, for exons, the first exon has rank 1, the second rank 2, and so on; the same applies to introns. These parameters do not apply to promoters, heads and other features that cannot be ranked, so they do not affect the plots related to those features.
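
The effect of the Min/Max settings can be pictured as a simple range filter on the results table. The sketch below uses base R only and is not how the GUI is implemented; the table `exons`, its values and the helper `filterByRange` are hypothetical and serve only to illustrate rank and width thresholds.

```r
# Toy exon-level results (hypothetical values); only `width` and `rank` matter
exons <- data.frame(
  tag   = c("dmr1", "dmr2", "dmr3", "dmr4"),
  width = c(120, 450, 800, 300),  # DMR width in bp
  rank  = c(1, 2, 5, 1),          # position of the exon in the gene model
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep rows whose `column` value lies in [min, max]
filterByRange <- function(x, column, min = -Inf, max = Inf) {
  x[x[[column]] >= min & x[[column]] <= max, , drop = FALSE]
}

# Feature Rank Min = 1, Feature Rank Max = 1: DMRs on first exons only
firstExons <- filterByRange(exons, "rank", min = 1, max = 1)

# Met Width Min = 200, Met Width Max = 500: DMRs between 200 and 500 bp wide
midWidth <- filterByRange(exons, "width", min = 200, max = 500)
```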

Score Methylated Regions:

methyl.O implements a customizable scoring system that aims to suggest the DMRs with the greatest impact. The scoring system is based on both databases and overlapped regions. The databases report the involvement of the overlapped genes in certain pathologies, as well as the presence of regulatory elements such as CGIs and TFs, while the different overlapped regions may affect expression in different ways. The returned score aims to integrate these two pieces of information. The overlapped regions considered to affect gene expression can be set and must be one or more elements from the annotation results, first exons and first introns. The two pieces of information can be weighted with the score modifier option. The score modifier ranges from 0 to 1, where 0 returns a score based on the databases only and 1 returns a score based on the overlapped features only, focusing the results on the effects of DMRs on gene expression.
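
To make the role of the score modifier concrete, the sketch below blends two components linearly. This is an assumption made purely for illustration: the actual formula used by methyl.O is described in the paper and may differ. The function `blendScore` and its arguments are hypothetical.

```r
# Hypothetical linear blend; NOT the formula used by methyl.O.
# score.modifier = 0 keeps only the database component,
# score.modifier = 1 keeps only the overlapped-feature component.
blendScore <- function(database.score, feature.score, score.modifier = 0.5) {
  stopifnot(score.modifier >= 0, score.modifier <= 1)
  (1 - score.modifier) * database.score + score.modifier * feature.score
}

dbOnly      <- blendScore(3, 1, score.modifier = 0)    # database component only
featureOnly <- blendScore(3, 1, score.modifier = 1)    # feature component only
balanced    <- blendScore(3, 1, score.modifier = 0.5)  # equal weights
```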

R:

The function scoreAnnotatedDMRs takes as input the list returned by annotateDMRs. Its options are active.features, which specifies the features considered to affect gene expression the most, and score.modifier:

annotatedDMRs <- scoreAnnotatedDMRs(annotatedDMRs, active.features = c("promoters", "heads"), score.modifier = 0.5)
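
To see how the weighting changes the ranking, the same annotation can be scored with several modifiers. This is a minimal sketch that runs only if methyl.O is installed; it uses the example dataset and the default annotation settings, and the names `modifiers`, `annotated` and `scored` are introduced here for illustration.

```r
# Score the same annotation with three different weights
modifiers <- c(0, 0.5, 1)
scored <- list()

if (requireNamespace("methyl.O", quietly = TRUE)) {
  data("DMRsSubset", package = "methyl.O")
  annotated <- methyl.O::annotateDMRs(DMRsSubset)  # default settings
  for (m in modifiers) {
    # One scored result per modifier, stored under the modifier's value
    scored[[as.character(m)]] <- methyl.O::scoreAnnotatedDMRs(
      annotated,
      active.features = c("promoters", "heads"),
      score.modifier = m
    )
  }
}
```

Comparing the elements of `scored` shows which DMRs move up or down as the weight shifts from the databases to the overlapped features.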

GUI:

The score is computed automatically during the annotation process and is displayed in the Annotate Methylated Regions tab. Active features can be selected from the red dashboard at the top of the page, while the score weights can be modified with the slider in the left panel.

Results of annotateDMRs and scoreAnnotatedDMRs:

R / GUI:

In R, the annotation results are stored in a list object in which each element corresponds to a feature. The same object is shown in the GUI, where each element (that is, each feature) can be displayed by clicking on the red gear button. The results shown are affected by the parameters explained in Table 3. An example of the results retrieved in R or in the GUI is shown below. The table shows the annotations for genes, the first element of the results list, but the same columns are found in the other elements of the list.

seqnames start end width beta gene.start gene.end gene.width gene.strand gene.id tx.name genes.perc tag dgv gnomad NCG NCG_type COSMIC cosmic.CGIs hacer CGIs TF symbol database.score others
chr7 1894240 1895255 1016 -0.4587633 1855430 2274378 418949 - ENSG00000002822 ENST00000406869 0.2425116 chr7_1894240_1895255 1 1 0 0 0 1 0 1 MAD1L1 3
chr17 26712105 26712449 345 0.4203297 26689878 26725151 35274 + ENSG00000004139 ENST00000379061 0.9780575 chr17_26712105_26712449 1 1 0 0 0 0 0 1 SARM1 3
chr17 42462135 42462494 360 0.4608637 42449550 42468373 18824 - ENSG00000005961 ENST00000353281 1.9124522 chr17_42462135_42462494 1 1 0 0 0 0 0 1 ITGA2B 3
chr16 3067875 3068113 239 0.3508772 3066946 3072087 5142 + ENSG00000006327 ENST00000573001 4.6479969 chr16_3067875_3068113 1 1 0 0 0 0 0 1 TNFRSF12A 3
chr1 55266381 55266722 342 0.3076923 55245385 55268440 23056 - ENSG00000006555 ENST00000371276 1.4833449 chr1_55266381_55266722 1 1 0 0 0 0 1 1 TTC22 3
chr19 42056859 42057411 553 -0.4423077 42054386 42093196 38811 + ENSG00000007129 ENST00000407170 1.4248538 chr19_42056859_42057411 1 1 0 0 0 0 0 1 CEACAM21 3
Table 4: Resulting annotation for genes

The first five columns (seqnames, start, end, width and beta) refer to the annotated DMRs. symbol refers to the overlapping gene; score is the ranking score assigned to the DMRs and is specific to genes only; others contains all the additional information found in the input table. gene start, gene end, gene width and gene strand are the coordinates and characteristics of the overlapped gene. For other elements these columns are specific to the selected feature. For example, for introns they are intron start, intron end, intron width and intron strand.

The second part of the table contains the gene id and transcript name according to the database used for annotation. genes perc is the percentage of the gene (or of the currently selected feature) overlapped by the DMR, and tag is an ID for the DMR. dgv, gnomad, NCG, NCG type, COSMIC, hacer, CGIs and TF hold the information retrieved by querying the databases, where 1 means that the DMR’s range is present in the database. Finally, database score is the score computed from the databases, as described in the paper.

Depending on the selected element/feature, additional columns are shown: one returning the overlap in bp between the DMR and the feature, and rank, which returns the position of the annotated element in the gene model. In addition, the GUI allows the displayed table to be sorted by clicking on the desired column, and searched for terms such as gene IDs or transcript IDs.
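
The relationship between some of these columns can be checked against the first rows of Table 4. The sketch below is base R only and rebuilds a small data frame by hand rather than calling methyl.O. It assumes that width is end - start + 1, that genes.perc is 100 * width / gene.width and that tag is built as seqnames_start_end; these assumptions reproduce the values shown in the table but are not taken from the package source. The sorting and searching at the end mimic, in R, what the GUI table offers.

```r
# First three rows of Table 4 (a subset of columns), typed in by hand
genes <- data.frame(
  seqnames   = c("chr7", "chr17", "chr17"),
  start      = c(1894240L, 26712105L, 42462135L),
  end        = c(1895255L, 26712449L, 42462494L),
  beta       = c(-0.4587633, 0.4203297, 0.4608637),
  gene.width = c(418949L, 35274L, 18824L),
  symbol     = c("MAD1L1", "SARM1", "ITGA2B"),
  stringsAsFactors = FALSE
)

# Derived columns (assumed definitions, see the text above)
genes$width      <- genes$end - genes$start + 1
genes$genes.perc <- 100 * genes$width / genes$gene.width
genes$tag        <- paste(genes$seqnames, genes$start, genes$end, sep = "_")

# Sort by decreasing absolute beta, as when clicking a column header in the GUI
sorted <- genes[order(-abs(genes$beta)), ]

# Search for a gene symbol, as with the GUI search box
hit <- genes[grepl("SARM1", genes$symbol, fixed = TRUE), ]
```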