Annotate and Score DMRs - GianlucaMattei/methyl.O GitHub Wiki

Annotate DMRs:

R:

methyl.O annotates genes and gene features with the function annotateDMRs. An example dataset is included in the package and can be loaded with:

data("DMRsSubset", package="methyl.O")

By default the function annotates with the Ensembl database and the hg19 version of the human assembly. Ensembl can also be used with hg38, and UCSC can be used with hg19. At the moment UCSC cannot be used to annotate the hg38 version of the human assembly.

The annotation settings allow the results to be customized. Running annotateDMRs allows:

- to modify the length of promoters (prom.length, default 1500 bp) and the length of the heads (head.length, default 1500 bp),
- to decide whether to use the longest transcript for each gene or to perform the analysis on all transcripts for each gene (longest.trx, default TRUE),
- to define the database to be used for annotation (annotation, default Ensembl),
- to set the assembly version (hg, default hg19),
- to set the beta threshold for each DMR to be considered in the annotation (thr.beta, default 0.3),
- to set the percentage threshold for CpG Islands (CGIs) to be considered differentially methylated (thr.cgis, default 0.4).

The annotation.fast option (default TRUE) controls how gene symbols are returned. The input ranges are first annotated as Ensembl gene IDs if Ensembl is used, or as Entrez gene IDs if UCSC is used. These IDs can then be converted to symbols either 1:1, which speeds up the process, or 1:many (or many:1), which gives a more accurate ID conversion.

The last three options are for input data that are not in the expected format. col.betadiff (default 4) selects the column number holding the beta difference, which is required for the analysis. col.beta1 and col.beta2 (default NULL) select the columns holding the beta values of the two compared samples, which are recommended but not required.

annotatedDMRs <- annotateDMRs(DMRsSubset, prom.length = 1500, head.length = 1500, longest.trx = TRUE,
                              annotation = 'ensembl', hg = 'hg19', annotation.fast = TRUE, thr.beta = 0.3, thr.cgis = 0.4,
                              col.betadiff = 4, col.beta1 = NULL, col.beta2 = NULL)
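When the input table does not follow the default layout, the column positions for col.betadiff, col.beta1 and col.beta2 can be looked up by name rather than counted by hand. The following is a minimal sketch in base R. The table `dmrs`, its column names and its values are hypothetical, and the commented call assumes that annotateDMRs accepts a table of this shape, which should be checked against the package documentation.

```r
# Hypothetical DMR table in which the beta difference is NOT in column 4.
# All names and values below are invented for illustration.
dmrs <- data.frame(
  chr      = c("chr7", "chr17", "chr16"),
  start    = c(1894240, 26712105, 3067875),
  end      = c(1895255, 26712449, 3068113),
  beta.s1  = c(0.70, 0.20, 0.25),   # beta values, sample 1
  beta.s2  = c(0.24, 0.62, 0.60),   # beta values, sample 2
  betadiff = c(-0.46, 0.42, 0.35),  # beta differences
  stringsAsFactors = FALSE
)

# Look up the column positions by name instead of hard-coding them.
col.betadiff <- match("betadiff", names(dmrs))
col.beta1    <- match("beta.s1", names(dmrs))
col.beta2    <- match("beta.s2", names(dmrs))

print(c(col.betadiff = col.betadiff, col.beta1 = col.beta1, col.beta2 = col.beta2))

# The positions would then be passed to annotateDMRs (not run here):
# annotatedDMRs <- annotateDMRs(dmrs, col.betadiff = col.betadiff,
#                               col.beta1 = col.beta1, col.beta2 = col.beta2)
```

Looking positions up by name keeps the call valid if columns are later added or reordered in the input file.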

GUI:

The DMRs can be annotated in the “Annotate Methylated Regions” tab. The same page also displays the ranked list of genes with their scores, which is explained in the next section. The main settings are on the left side, while the results are displayed in the main part of the page.

Command	Description
Select Annotations to Use	Specify the database to use for annotation
Select Assembly version	Set the assembly version to use
Use Longest transcript	Yes to use the longest transcript for each gene; otherwise all the transcripts for each gene are used
Compute Fast Annotation	Select the type of conversion to gene symbols from Ensembl gene IDs, if Ensembl is used, or from Entrez gene IDs, if UCSC is used. It can be set to 1:1, speeding up the process, or to 1:many (or many:1) for more accurate ID conversion
Select Promoter Lengths	Set the promoter lengths
Select Head Lengths	Set the head lengths
Length Percentage of Altered Methylated CGIs	Set the length, as a percentage, of the overlap between the DMRs and the CGIs to be considered during the analysis
Beta Diff. Threshold	Set the beta threshold for each DMR to be considered in the annotation
Column Position of Beta Diff.	Select the column number where to find the beta difference
Column's Position of Beta Values of Sample 1	Select the column number where to find the beta values of sample 1
Column's Position of Beta Values of Sample 2	Select the column number where to find the beta values of sample 2

Table 2: Parameters for Annotate Methylated Regions tab

The GUI also provides some plots for a quick overview of the results. In order, these plots are: the distribution of beta difference values by chromosome; the number of DMRs by chromosome; and, for each annotation (Genes, Heads, TSS surrounding, Promoters, Exons, 5’UTRs, 3’UTRs, Introns), the distribution of widths, ranks, beta values and database scores. Two additional annotations can be plotted (First exons, First introns). The last plot compares the number of DMRs for each annotated region. Each plot has its own customization settings, reached by clicking the red gear button. Other common settings are placed in the left column of the page:

Command	Description
Met Width Min	Minimum methylation width threshold in bp
Met Width Max	Maximum methylation width threshold in bp
Selected Feature Percentage Min	Minimum methylation width threshold in %
Selected Feature Percentage Max	Maximum methylation width threshold in %
Feature Rank Min	Minimum rank threshold
Feature Rank Max	Maximum rank threshold

Table 3: Additional parameters for plots and for results table in Annotate Methylated Regions tab

The last two options, Feature Rank Min and Feature Rank Max, refer to the ranking position of the features in the gene model. For example, for exons, the first exon has rank 1, the second rank 2, and so on. The same applies to introns. These parameters are not useful for promoters, heads and other features that cannot be ranked, so they do not affect the plots related to those features.
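The effect of the six settings in Table 3 can be pictured as a range filter applied to the rows of a feature table. The sketch below uses base R on an invented exon table. The column names (`feat.perc` in particular), the threshold values and the inclusive bounds are assumptions made for illustration, not the GUI's actual implementation.

```r
# Hypothetical exon annotations; names and values are illustrative only.
exons <- data.frame(
  symbol    = c("GENE_A", "GENE_A", "GENE_B", "GENE_C"),
  width     = c(240, 1200, 350, 90),     # DMR width in bp
  feat.perc = c(35.0, 80.0, 5.0, 60.0),  # % of the exon overlapped by the DMR
  rank      = c(1, 3, 2, 1),             # exon position in the gene model
  stringsAsFactors = FALSE
)

# Thresholds mirroring the six settings of Table 3 (values invented).
met.width <- c(min = 100, max = 1000)
feat.perc <- c(min = 10,  max = 100)
feat.rank <- c(min = 1,   max = 2)

# Keep rows that fall inside all three ranges (bounds assumed inclusive).
keep <- exons$width     >= met.width[["min"]] & exons$width     <= met.width[["max"]] &
        exons$feat.perc >= feat.perc[["min"]] & exons$feat.perc <= feat.perc[["max"]] &
        exons$rank      >= feat.rank[["min"]] & exons$rank      <= feat.rank[["max"]]

filtered <- exons[keep, ]
print(filtered)
```

For a feature without ranks, such as promoters, the rank condition would simply be left out of `keep`.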

Score Methylated Regions:

methyl.O implements a customizable scoring system that aims to suggest the DMRs with the greatest impact. The scoring system is based on both databases and overlapped regions. The databases report the implication of the overlapped genes in certain pathologies, as well as the presence of regulatory elements such as CGIs and TF, while the different overlapped regions may affect expression in different ways. The returned score aims to integrate these two pieces of information. The overlapped regions considered to affect gene expression can be set, and must be one or more elements from the annotation results, first exons and first introns. The two pieces of information can be weighted with the score modifier option. The score modifier ranges from 0 to 1, where 0 returns a score based on the databases only and 1 returns a score based on the overlapped features only, focusing the results on the effects of DMRs on gene expression.
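One way to picture the role of the score modifier is as a weight that moves the result between the two sources of information. The sketch below uses a plain linear blend, which is only an illustration of the two endpoints described above and is not the formula used by methyl.O (that formula is described in the paper). The function `blend_score`, both input vectors and the assumption that the two scores share a 0 to 1 scale are hypothetical.

```r
# Illustrative only: a linear blend, NOT the formula used by methyl.O.
# Both inputs are assumed to be on the same 0-1 scale.
blend_score <- function(database.score, feature.score, score.modifier = 0.5) {
  stopifnot(score.modifier >= 0, score.modifier <= 1)
  (1 - score.modifier) * database.score + score.modifier * feature.score
}

db   <- c(1.0, 0.5, 0.2)  # invented database-derived scores
feat <- c(0, 1, 1)        # invented feature-derived scores (1 = overlaps an active feature)

print(blend_score(db, feat, 0))    # modifier 0: database information only
print(blend_score(db, feat, 1))    # modifier 1: overlapped features only
print(blend_score(db, feat, 0.5))  # equal weight to both
```

With this kind of weighting, a DMR with strong database support but no overlap with an active feature drops in rank as the modifier approaches 1.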

R:

The function scoreAnnotatedDMRs takes as input the list returned by annotateDMRs. Its options are active.features, which specifies the features considered to affect gene expression the most, and score.modifier:

annotatedDMRs <- scoreAnnotatedDMRs(annotatedDMRs, active.features = c("promoters", "heads"), score.modifier = 0.5)

GUI:

The score is computed automatically during the annotation process and is displayed in the Annotate Methylated Regions tab. Active features can be selected from the red dashboard at the top of the page, while the score weights can be modified with the slider in the left panel.

Results of annotateDMRs and scoreAnnotatedDMRs:

R / GUI:

In R, the annotation results are stored in a list object where each element corresponds to a feature. The same object is shown in the GUI, where each element (that is, each feature) can be displayed by clicking the red gear button. The results shown are affected by the parameters explained in Table 3. An example of the results retrieved by R or by the GUI is shown below. The table shows the annotations for genes, the first element of the list of results, but the same columns can be found in the other elements of the list.

seqnames	start	end	width	beta	gene.start	gene.end	gene.width	gene.strand	gene.id	tx.name	genes.perc	tag	dgv	gnomad	NCG	hacer	CGIs	TF	symbol	database.score	others
chr7	1894240	1895255	1016	-0.4587633	1855430	2274378	418949	-	ENSG00000002822	ENST00000406869	0.2425116	chr7_1894240_1895255	1	1	1	0	1	MAD1L1	3
chr17	26712105	26712449	345	0.4203297	26689878	26725151	35274	+	ENSG00000004139	ENST00000379061	0.9780575	chr17_26712105_26712449	1	1	0	0	1	SARM1	3
chr17	42462135	42462494	360	0.4608637	42449550	42468373	18824	-	ENSG00000005961	ENST00000353281	1.9124522	chr17_42462135_42462494	1	1	0	0	1	ITGA2B	3
chr16	3067875	3068113	239	0.3508772	3066946	3072087	5142	+	ENSG00000006327	ENST00000573001	4.6479969	chr16_3067875_3068113	1	1	0	0	1	TNFRSF12A	3
chr1	55266381	55266722	342	0.3076923	55245385	55268440	23056	-	ENSG00000006555	ENST00000371276	1.4833449	chr1_55266381_55266722	1	1	0	1	1	TTC22	3
chr19	42056859	42057411	553	-0.4423077	42054386	42093196	38811	+	ENSG00000007129	ENST00000407170	1.4248538	chr19_42056859_42057411	1	1	0	0	1	CEACAM21	3

Table 4: Resulting annotation for genes

The first five columns (seqnames, start, end, width and beta) refer to the annotated DMRs. symbol is the overlapping gene; score is the ranking score assigned to the DMRs and is specific to genes only; others contains all the additional information found in the input table. gene start, gene end, gene width and gene strand are the coordinates and characteristics of the overlapped gene. For other elements these columns are specific to the selected feature: for introns, for example, they are intron start, intron end, intron width and intron strand.

The second part of the table contains the gene id and transcript name according to the database used for annotation. genes perc is the percentage of the gene (or of the currently selected feature) overlapped by the DMR, and tag is an ID for the DMR. dgv, gnomad, NCG, NCG type, COSMIC, hacer, CGIs and TF hold the information retrieved by querying the databases, where 1 means that the DMR’s range is present in the database. Finally, database score is the score computed from the databases as described in the paper.

Depending on the selected element/feature, additional columns are shown: one reporting the overlap in bp between the DMR and the feature, and rank, which reports the position of the annotated element in the gene model. The GUI also allows the displayed table to be sorted by clicking the desired column, and searched for words such as gene IDs or transcript IDs.
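The sorting and searching offered by the GUI can be approximated in R on the genes element of the results. The sketch below rebuilds three rows of Table 4 as a plain data frame (a subset of the columns) so that it runs without the package. The relation between genes.perc, width and gene.width used here is inferred from those rows, where each DMR lies entirely inside its gene, and is not taken from the package source.

```r
# Three rows copied from Table 4 (subset of columns).
genes <- data.frame(
  seqnames   = c("chr7", "chr17", "chr17"),
  start      = c(1894240, 26712105, 42462135),
  end        = c(1895255, 26712449, 42462494),
  width      = c(1016, 345, 360),
  beta       = c(-0.4587633, 0.4203297, 0.4608637),
  gene.width = c(418949, 35274, 18824),
  genes.perc = c(0.2425116, 0.9780575, 1.9124522),
  symbol     = c("MAD1L1", "SARM1", "ITGA2B"),
  stringsAsFactors = FALSE
)

# For these rows genes.perc agrees with 100 * DMR width / gene width.
recomputed <- 100 * genes$width / genes$gene.width
print(round(recomputed, 7))

# Sort by the share of the gene covered by the DMR, largest first.
by.perc <- genes[order(genes$genes.perc, decreasing = TRUE), ]
print(by.perc$symbol)

# Search for a gene symbol, similar to typing it in the GUI search box.
hit <- genes[grepl("SARM1", genes$symbol, fixed = TRUE), ]
print(hit)
```

The same pattern applies to the other elements of the results list, using the column names of the selected feature.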