Annotate and Score DMRs - GianlucaMattei/methyl.O GitHub Wiki
Annotate DMRs:
R:
Genes and gene features can be annotated with the methyl.O function annotateDMRs. An example dataset is included in the package and can be loaded with:
data("DMRsSubset", package="methyl.O")
By default, the function annotates with the Ensembl database and the hg19 version of the human assembly. Ensembl can also be used with hg38, and the UCSC database can be used with hg19. At the moment UCSC cannot be used to annotate the hg38 version of the human assembly. The annotation settings allow the results to be customized. Running annotateDMRs allows:
- to modify the length of promoters (prom.length, default 1500 bp) and the length of heads (head.length, default 1500 bp),
- to decide whether to use only the longest transcript of each gene or to run the analysis on all transcripts of each gene (longest.trx, default TRUE),
- to define the database used for annotation (annotation, default Ensembl),
- to set the assembly version (hg, default hg19),
- to set the beta threshold for a DMR to be considered in the annotation (thr.beta, default 0.3) and the percentage threshold for CpG Islands (CGIs) to be considered differentially methylated (thr.cgis, default 0.4).

The annotation.fast option controls how gene symbols are returned. The input ranges are first annotated as Ensembl gene IDs, if Ensembl is used, or as Entrez gene IDs, if UCSC is used. These IDs can then be converted to symbols 1:1, which speeds up the process, or 1:many (or many:1) for a more accurate ID conversion (annotation.fast, default TRUE).

The last three options are for input data that are not in the expected format. col.betadiff (default 4) selects the column number holding the beta difference, which is required for the analysis. col.beta1 and col.beta2 (default NULL) select the columns holding the beta values of the two compared samples, which are recommended but not required.
annotatedDMRs <- annotateDMRs(DMRsSubset, prom.length = 1500, head.length = 1500, longest.trx = TRUE,
    annotation = 'ensembl', hg = 'hg19', annotation.fast = TRUE, thr.beta = 0.3, thr.cgis = 0.4,
    col.betadiff = 4, col.beta1 = NULL, col.beta2 = NULL)
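The column options above can be illustrated with a small base R sketch. The table below is made up for illustration, and two points are assumptions rather than documented behaviour: that the input is a plain data frame with chromosome, start and end in the first three columns, and that thr.beta is compared against the absolute beta difference.

```r
# Hypothetical DMR table whose columns are not in the default order.
# Column names and values are invented for this sketch.
myDMRs <- data.frame(
  chr   = c("chr1", "chr7", "chr17"),
  start = c(1000L, 5000L, 9000L),
  end   = c(1500L, 5800L, 9300L),
  beta1 = c(0.80, 0.20, 0.55),   # beta values of sample 1
  beta2 = c(0.35, 0.65, 0.45),   # beta values of sample 2
  stringsAsFactors = FALSE
)

# Assumption: the beta difference is sample 1 minus sample 2.
myDMRs$betadiff <- myDMRs$beta1 - myDMRs$beta2

# Look up the column positions to pass to annotateDMRs.
col.beta1    <- match("beta1", names(myDMRs))
col.beta2    <- match("beta2", names(myDMRs))
col.betadiff <- match("betadiff", names(myDMRs))

# Preview which rows would pass thr.beta = 0.3, assuming the threshold
# is applied to the absolute beta difference.
passes <- abs(myDMRs$betadiff) >= 0.3

# With methyl.O installed, the call would then look like:
# annotatedDMRs <- annotateDMRs(myDMRs, col.betadiff = col.betadiff,
#                               col.beta1 = col.beta1, col.beta2 = col.beta2)
```

Looking the positions up by name avoids hard-coding column numbers that change when the input table is rearranged.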
GUI:
In the “Annotate Methylated Regions” tab it is possible to annotate the DMRs. The same page also displays the ranked list of genes with their scores, which is described in the next section. The main settings are placed on the left side, while the results are displayed in the main part of the page.
Command | Description |
---|---|
Select Annotations to Use | Set the database to use for annotation |
Select Assembly version | Set the assembly version to use |
Use Longest transcript | Select Yes to use the longest transcript of each gene, or No to use all the transcripts of each gene |
Compute Fast Annotation | Select the type of conversion to gene symbols from Ensembl gene IDs, if Ensembl is used, or from Entrez gene IDs, if UCSC is used. It can be set to 1:1, speeding up the process, or to 1:many (or many:1) for a more accurate ID conversion |
Select Promoter Lengths | Set the promoter lengths |
Select Head Lengths | Set the head lengths |
Length Percentage of Altered Methylated CGIs | Set the length, in percentage, of the overlap between the DMRs and the CGIs to be considered during the analysis |
Beta Diff. Threshold | Set the beta threshold for a DMR to be considered in the annotation |
Column Position of Beta Diff. | Select the column number where to find the beta difference |
Column's Position of Beta Values of Sample 1 | Select the column number where to find the beta values of sample 1 |
Column's Position of Beta Values of Sample 2 | Select the column number where to find the beta values of sample 2 |
Table 2: Parameters for Annotate Methylated Regions tab
The GUI also provides some plots for a quick overview of the results. In order, these plots show: the distribution of beta difference values by chromosome, the number of DMRs by chromosome, and the distribution of widths, ranks, beta values and database scores for each annotation (Genes, Heads, TSS surrounding, Promoters, Exons, 5’UTRs, 3’UTRs, Introns). Two additional annotations can be plotted (First exons, First introns). The last plot compares the number of DMRs for each annotated region. Each plot has its own customization settings, available by clicking the red gear button. Other common settings are placed in the left column of the page:
Command | Description |
---|---|
Met Width Min | Minimum methylation width threshold in bp |
Met Width Max | Maximum methylation width threshold in bp |
Selected Feature Percentage Min | Minimum methylation width threshold, as a percentage of the selected feature |
Selected Feature Percentage Max | Maximum methylation width threshold, as a percentage of the selected feature |
Feature Rank Min | Minimum rank threshold |
Feature Rank Max | Maximum rank threshold |
Table 3: Additional parameters for plots and for results table in Annotate Methylated Regions tab
The last two options, Feature Rank Min and Feature Rank Max, refer to the ranking position of the features in the gene model. For example, for exons, the first exon is rank 1, the second is rank 2, and so on. The same applies to introns. These parameters are not meaningful for promoters, heads and other non-rankable features, so they do not affect the plots related to those features.
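The same width, percentage and rank limits can be applied to a result table in R. The following is a minimal base R sketch on an invented exon table. The helper filterFeatures and the column name exons.perc are hypothetical and are not part of methyl.O.

```r
# Hypothetical exon-level results. Column names other than width and
# rank are assumptions made for this sketch.
exons <- data.frame(
  symbol     = c("GENE_A", "GENE_A", "GENE_B", "GENE_C"),
  width      = c(120L, 900L, 450L, 2500L),  # methylation width in bp
  exons.perc = c(10, 75, 40, 100),          # % of the exon overlapped
  rank       = c(1L, 3L, 1L, 2L),           # exon position in the gene model
  stringsAsFactors = FALSE
)

# Hypothetical helper mirroring the six GUI limits of Table 3.
filterFeatures <- function(x, perc.col, width.min = 0, width.max = Inf,
                           perc.min = 0, perc.max = 100,
                           rank.min = 1, rank.max = Inf) {
  keep <- x$width >= width.min & x$width <= width.max &
    x[[perc.col]] >= perc.min & x[[perc.col]] <= perc.max
  # Rank limits only apply to rankable features (exons, introns).
  if ("rank" %in% names(x)) {
    keep <- keep & x$rank >= rank.min & x$rank <= rank.max
  }
  x[keep, , drop = FALSE]
}

# Keep first exons with at least 200 bp of methylation width.
firstExons <- filterFeatures(exons, "exons.perc", width.min = 200, rank.max = 1)
```

Because the rank limits are skipped when no rank column is present, the same helper can be used on non-rankable features such as promoters.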
Score Methylated Regions:
methyl.O implements a customizable scoring system that aims to highlight the most impactful DMRs. The score is based on both databases and overlapped regions. The databases report the involvement of the overlapped genes in certain pathologies, as well as the presence of regulatory elements such as CGIs and TF, while the different overlapped regions may affect gene expression in different ways. The returned score aims to integrate these two pieces of information. The overlapped regions considered to affect gene expression can be set, and must be one or more elements from the annotation results, first exons and first introns. The two pieces of information are weighted by the score modifier option. The score modifier ranges from 0 to 1, where 0 returns a score based on the databases only and 1 returns a score based on the overlapped features, focusing the results on the effects of DMRs on gene expression.
R:
The function scoreAnnotatedDMRs takes as input the list returned by annotateDMRs, together with two options: active.features, which specifies the features considered to affect gene expression the most, and score.modifier:
annotatedDMRs <- scoreAnnotatedDMRs(annotatedDMRs, active.features = c("promoters", "heads"), score.modifier = 0.5)
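To give an intuition for how a modifier between 0 and 1 can trade off two sources of evidence, here is a toy linear blend. It is only an illustration of the two extremes described above and is not the formula used by scoreAnnotatedDMRs, which is described in the paper. The function blendScore and its arguments are hypothetical.

```r
# Illustrative only: a linear blend between a database-based score and a
# feature-based score. This is NOT the formula used by methyl.O.
blendScore <- function(database.score, feature.score, score.modifier = 0.5) {
  stopifnot(score.modifier >= 0, score.modifier <= 1)
  (1 - score.modifier) * database.score + score.modifier * feature.score
}

dbOnly      <- blendScore(3, 1, score.modifier = 0)    # database score only
featureOnly <- blendScore(3, 1, score.modifier = 1)    # feature score only
balanced    <- blendScore(3, 1, score.modifier = 0.5)  # equal weight
```

In this toy version a modifier of 0 reproduces the database score, a modifier of 1 reproduces the feature score, and intermediate values move linearly between the two.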
GUI:
The score is computed automatically during the annotation process and is displayed in the Annotate Methylated Regions tab. Active features can be selected from the red dashboard at the top of the page, while the score weights can be adjusted with the slider in the left panel.
Results of annotateDMRs and scoreAnnotatedDMRs:
R / GUI:
In R, the annotation results are stored in a list object where each element corresponds to a feature. The same object is shown in the GUI, where each element (that is, each feature) can be displayed by clicking the red gear button. The results shown are affected by the parameters explained in Table 3. An example of the results retrieved in R or in the GUI is shown below. The table shows the annotations for genes, the first element of the list of results, but the same columns can be found in the other elements of the list.
seqnames | start | end | width | beta | gene.start | gene.end | gene.width | gene.strand | gene.id | tx.name | genes.perc | tag | dgv | gnomad | NCG | NCG_type | COSMIC | cosmic.CGIs | hacer | CGIs | TF | symbol | database.score | others |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
chr7 | 1894240 | 1895255 | 1016 | -0.4587633 | 1855430 | 2274378 | 418949 | - | ENSG00000002822 | ENST00000406869 | 0.2425116 | chr7_1894240_1895255 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | MAD1L1 | 3 |
chr17 | 26712105 | 26712449 | 345 | 0.4203297 | 26689878 | 26725151 | 35274 | + | ENSG00000004139 | ENST00000379061 | 0.9780575 | chr17_26712105_26712449 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | SARM1 | 3 |
chr17 | 42462135 | 42462494 | 360 | 0.4608637 | 42449550 | 42468373 | 18824 | - | ENSG00000005961 | ENST00000353281 | 1.9124522 | chr17_42462135_42462494 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | ITGA2B | 3 |
chr16 | 3067875 | 3068113 | 239 | 0.3508772 | 3066946 | 3072087 | 5142 | + | ENSG00000006327 | ENST00000573001 | 4.6479969 | chr16_3067875_3068113 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | TNFRSF12A | 3 |
chr1 | 55266381 | 55266722 | 342 | 0.3076923 | 55245385 | 55268440 | 23056 | - | ENSG00000006555 | ENST00000371276 | 1.4833449 | chr1_55266381_55266722 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | TTC22 | 3 |
chr19 | 42056859 | 42057411 | 553 | -0.4423077 | 42054386 | 42093196 | 38811 | + | ENSG00000007129 | ENST00000407170 | 1.4248538 | chr19_42056859_42057411 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | CEACAM21 | 3 |
Table 4: Resulting annotation for genes
The first five columns (seqnames, start, end, width and beta) refer to the annotated DMR. symbol is the overlapping gene, score is the ranking score assigned to the DMR and is specific to genes only, and others contains all the additional information found in the input table. gene.start, gene.end, gene.width and gene.strand are the coordinates and characteristics of the overlapped gene. For other elements, these columns are specific to the selected feature. For example, for introns they are intron start, intron end, intron width and intron strand.

The second part of the table contains the gene ID and transcript name (gene.id and tx.name) according to the database used for annotation. genes.perc is the percentage of the gene (or of the currently selected feature) overlapped by the DMR, and tag is an ID for the DMR. dgv, gnomad, NCG, NCG_type, COSMIC, hacer, CGIs and TF hold the information retrieved by querying the databases, where 1 means that the DMR’s range is present in the database. Finally, database.score is the score computed from the databases, as described in the paper. Depending on the selected element or feature, additional columns are shown: one returning the overlap in bp between the DMR and the feature, and rank, which returns the position of the annotated element in the gene model. The GUI also allows the displayed table to be sorted by clicking the desired column, and to be searched for words such as gene IDs or transcript IDs.
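In R, sorting and searching equivalent to the GUI can be done with ordinary data frame operations. The sketch below uses a mock result list so that it runs without methyl.O. The element name genes and the reduced set of columns are assumptions, while the values are copied from Table 4.

```r
# Mock of the result list. The element name "genes" and the reduced
# column set are assumptions. Values are taken from Table 4.
annotatedDMRs <- list(
  genes = data.frame(
    seqnames = c("chr7", "chr17", "chr16"),
    start    = c(1894240L, 26712105L, 3067875L),
    end      = c(1895255L, 26712449L, 3068113L),
    beta     = c(-0.4587633, 0.4203297, 0.3508772),
    symbol   = c("MAD1L1", "SARM1", "TNFRSF12A"),
    stringsAsFactors = FALSE
  )
)

genes <- annotatedDMRs$genes

# Sort by absolute beta value, largest first.
genes <- genes[order(abs(genes$beta), decreasing = TRUE), ]

# Search for a gene symbol, similar to the search box in the GUI.
hit <- genes[genes$symbol == "SARM1", ]
```

The same pattern applies to the other elements of the list, replacing genes with the feature of interest.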