GRIN2 - stjude/proteinpaint GitHub Wiki
GRIN2 (Genomic Random Interval, version 2) identifies genes that are recurrently hit by genomic lesions (copy-number gains/losses, SNV/indels, fusions, and structural variants) across a cohort, and tests whether that recurrence is greater than expected by chance. It produces a sortable Top Genes table and a genome-wide Manhattan plot.
It is based on the GRIN method: Pounds S, et al. A genomic random interval model for statistical analysis of genomic lesion data. Bioinformatics 2013;29(17):2088–95. doi:10.1093/bioinformatics/btt372.
Running GRIN2
- Open the GRIN2 chart for a dataset (e.g. from the chart menu in the Mass app).
- Apply a cohort filter to choose which samples to analyze (e.g. a diagnosis or subtype).
- In the controls, check the data types to include (SNV/indel, CNV, Fusion, SV; only those available for the dataset are shown) and set their options (below).
- Click Run GRIN2.
The analysis runs server-side and results are cached, so re-running the same filter + options returns instantly.
Options
SNV/indel
- Consequences: which mutation classes to include (missense, frameshift, nonsense, splice, etc.). Defaults to the protein-changing set (plus start-lost/stop-lost). Use Select All / Clear All / Default to adjust. If none are selected, all classes are included.
- MAF filter (if the dataset provides it): keep only mutations whose variant allele fraction (VAF) passes a threshold (e.g. VAF > 0.1). The filter term can pool allele counts across assays; for the ASH dataset, for example, the Tumor DNA term sums WGS + WES read counts. You can also filter on a single assay.
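As a rough illustration of the pooled filter, the sketch below sums reference and variant read counts across assays before computing the VAF. The `AssayCounts` structure, the function names, and the read counts are hypothetical; only the idea of summing WGS + WES counts and testing VAF > 0.1 comes from the description above.

```python
from dataclasses import dataclass


@dataclass
class AssayCounts:
    """Read counts for one mutation in one assay (hypothetical structure)."""
    ref: int  # reads supporting the reference allele
    alt: int  # reads supporting the variant allele


def pooled_vaf(assays):
    """VAF after summing read counts across assays; None if there is no coverage."""
    ref = sum(a.ref for a in assays)
    alt = sum(a.alt for a in assays)
    total = ref + alt
    return alt / total if total > 0 else None


def passes_vaf(assays, min_vaf=0.1):
    """True when the pooled VAF is strictly greater than min_vaf."""
    vaf = pooled_vaf(assays)
    return vaf is not None and vaf > min_vaf


wgs = AssayCounts(ref=27, alt=3)   # 3 / 30 = 0.10 on its own
wes = AssayCounts(ref=60, alt=10)  # 10 / 70 on its own
print(pooled_vaf([wgs, wes]))      # 13 / 100
print(passes_vaf([wgs]), passes_vaf([wgs, wes]))
```

With these made-up counts, the mutation fails the threshold on WGS alone but passes once the WES reads are pooled in, which is the practical difference between the pooled and single-assay terms.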
CNV
How a CNV segment becomes a gain/loss lesion depends on how the dataset quantifies its CNV
values. The dataset declares this once via ds.queries.cnv.type, and the GRIN2 controls adapt
their thresholds and labels to match. Supported types:
| type | Value meaning | Baseline | Gain | Loss |
|---|---|---|---|---|
| log2ratio | segment log2 ratio (default if unset) | 0 | value ≥ Gain | value ≤ Loss |
| segmean | segment mean of log2 ratios (same as above) | 0 | value ≥ Gain | value ≤ Loss |
| copyNumber | absolute integer copy number | 2 | value ≥ Gain | value ≤ Loss |
| category | qualitative call (gain/loss), no value | n/a | class is a gain | class is a loss |
A segment counts as a lesion when it crosses the relevant threshold; segments between the thresholds are treated as neutral and dropped.
Controls
- Loss Threshold: a segment is a loss when its value ≤ this threshold.
  - log2ratio/segmean: default −0.4 (range −5 to 0).
  - copyNumber: default 1 (range 0 to 2), i.e. CN ≤ 1 is a loss.
- Gain Threshold: a segment is a gain when its value ≥ this threshold.
  - log2ratio/segmean: default 0.4 (range 0 to 5); a dataset may set its own default, e.g. 0.3.
  - copyNumber: default 3 (range 2 to 20), i.e. CN ≥ 3 is a gain, CN = 2 is neutral.
- For category data the Loss/Gain threshold controls are hidden; the gain/loss call comes directly from the segment's class.
- Max Segment Length: segments longer than this (in bp) are dropped before analysis (default 2,000,000 = 2 Mb; set to 0 to disable). This applies to all CNV types and prevents a single very large passenger CNV from inflating significance for every gene it spans.
Tightening the thresholds (e.g. −0.1/0.1 → −0.4/0.3 for log2 ratio data) removes low-amplitude calls; the Max Segment Length cap controls broad arm/chromosome-scale events.
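The CNV rules above can be summarized in a short sketch. This is not the ProteinPaint implementation: the function name, argument names, and the use of `stop - start` as the segment length are assumptions made for illustration, while the default thresholds and the 2 Mb cap are the values listed above.

```python
from typing import Optional

# Default thresholds per CNV type, as listed in the controls above.
DEFAULT_THRESHOLDS = {
    "log2ratio": {"loss": -0.4, "gain": 0.4},
    "segmean": {"loss": -0.4, "gain": 0.4},
    "copyNumber": {"loss": 1, "gain": 3},
}


def classify_segment(
    cnv_type: str,
    start: int,
    stop: int,
    value: Optional[float] = None,
    seg_class: Optional[str] = None,
    loss: Optional[float] = None,
    gain: Optional[float] = None,
    max_segment_length: int = 2_000_000,
) -> Optional[str]:
    """Return "gain", "loss", or None when the segment is dropped."""
    # The length cap applies to every CNV type; 0 disables it.
    if max_segment_length > 0 and (stop - start) > max_segment_length:
        return None
    if cnv_type == "category":
        # Qualitative data: the call comes straight from the segment's class.
        return seg_class if seg_class in ("gain", "loss") else None
    if value is None:
        return None
    defaults = DEFAULT_THRESHOLDS[cnv_type]
    loss = defaults["loss"] if loss is None else loss
    gain = defaults["gain"] if gain is None else gain
    if value >= gain:
        return "gain"
    if value <= loss:
        return "loss"
    return None  # between the thresholds: neutral, dropped


print(classify_segment("log2ratio", 1_000_000, 1_500_000, value=-0.7))  # loss
print(classify_segment("copyNumber", 1_000_000, 1_500_000, value=2))    # None (neutral)
print(classify_segment("copyNumber", 0, 50_000_000, value=5))           # None (too long)
print(classify_segment("category", 10, 20, seg_class="gain"))           # gain
```

Checking the length cap first mirrors the statement that over-long segments are dropped before analysis, regardless of their amplitude.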
Fusion / SV
Included as lesions when the corresponding data type is checked; no per-type cutoffs.
Exclude artifact genes (recommended, on by default)
Removes genes that sit in known artifact regions before the statistics run, so the table is not dominated by non-driver loci that recur for technical or germline reasons (olfactory-receptor clusters, HLA, FAM90, POTE, GOLGA8, APOBEC3, KANSL1, etc.).
- Exclude artifact genes (checkbox, default on): toggles the mask.
- Min gene overlap (default 0.5): a gene is excluded only when at least this fraction of its span lies inside a masked region. The 0.5 default removes genes that sit inside artifact regions while sparing real drivers that merely abut one (e.g. KRAS overlaps a segmental duplication by ~10% and is kept).
The mask is the union of four hg38 region sets:
| Layer | Source | What it removes |
|---|---|---|
| Blacklist | ENCODE / Kundaje GRCh38 unified blacklist | anomalous-signal / low-mappability regions |
| Segmental duplications | UCSC genomicSuperDups | duplicated, recombination-prone, paralogous regions |
| Assembly gaps | UCSC gap + centromeres | centromeres, telomeres, gaps |
| Common germline CNVs | DGV Gold Standard, frequency ≥ 1% | loci that are copy-number-polymorphic in the normal population |
The germline-CNV layer is essential: loci like OR clusters, HLA class II, and KANSL1 are well-mapped but copy-number-variable in healthy people, so mappability/blacklist alone does not catch them.
When the mask runs, the results panel shows a Region Mask (excluded artifact genes) section reporting how many genes were excluded, a few examples, and the genome fraction masked (~14% with all four layers).
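The Min gene overlap test can be pictured as follows. This is a minimal sketch under stated assumptions: coordinates are half-open `[start, stop)` on a single chromosome, the function names are invented for the example, and regions from the different layers are merged first so that overlapping layers are not counted twice.

```python
def merge_intervals(intervals):
    """Union of half-open [start, stop) intervals on one chromosome."""
    merged = []
    for start, stop in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous region: extend it.
            merged[-1][1] = max(merged[-1][1], stop)
        else:
            merged.append([start, stop])
    return merged


def masked_fraction(gene_start, gene_stop, mask):
    """Fraction of the gene span covered by the merged mask regions."""
    length = gene_stop - gene_start
    if length <= 0:
        return 0.0
    covered = sum(
        max(0, min(gene_stop, stop) - max(gene_start, start))
        for start, stop in merge_intervals(mask)
    )
    return covered / length


def is_excluded(gene_start, gene_stop, mask, min_overlap=0.5):
    """True when at least min_overlap of the gene lies inside the mask."""
    return masked_fraction(gene_start, gene_stop, mask) >= min_overlap


# A gene that merely abuts a masked region (10% overlap) is kept.
print(is_excluded(1000, 2000, [(900, 1100)]))
# Two overlapping layers cover 60% of the gene, so it is excluded.
print(is_excluded(1000, 2000, [(1000, 1400), (1300, 1600)]))
```

The first call corresponds to the KRAS-like case described above (small overlap, kept); the second shows why the union matters, since summing the two layers separately would overstate the covered fraction.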
References: ENCODE blacklist (Amemiya et al., Sci Rep 2019, doi:10.1038/s41598-019-45839-z); segmental duplications (Sharp et al., AJHG 2005, doi:10.1086/431652); DGV (MacDonald et al., NAR 2014, doi:10.1093/nar/gkt958); region-exclusion practice (Ogata et al., excluderanges, Bioinformatics 2023, doi:10.1093/bioinformatics/btad198).
Display
- Max genes to show: number of rows in the Top Genes table (default 500).
- Significance (q-value) threshold: q-values below this (default 0.05) are flagged in the table/tooltips and shown as interactive points in the Manhattan plot.
Reading the results
Top Genes table
For each gene and each included lesion type the table reports:
- P-value (Gain/Loss/Mutation/…): probability of seeing at least the observed number of affected subjects under the random-interval null.
- Q-value (…): the p-value adjusted for multiple testing (FDR). Use the q-value to judge significance (e.g. q < 0.05).
- Subject Count (…): number of subjects with that lesion type overlapping the gene.
When more than one lesion type is analyzed, additional N Lesion Types columns appear (1/2/3 …). These are constellation tests that combine evidence across lesion types; e.g. the "2 Lesion Types" q-value flags genes significant when gain+loss (or mutation+CNV) are considered jointly. A gene driven by a single type will be most significant in that type's column; a gene hit by several types will rise in the multi-type columns.
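One way to see how a multi-type column can outrank the single-type columns is the classical order-statistic argument: if the n per-type p-values were independent and uniform under the null, the k-th smallest would follow a Beta(k, n−k+1) distribution, whose CDF equals a binomial tail. The sketch below applies that textbook result; whether GRIN2 uses exactly this formula is an assumption, and the function name and p-values are made up.

```python
from math import comb


def order_stat_pvalue(pvalues, k):
    """P(k-th smallest of n independent Uniform(0,1) p-values <= observed value)."""
    n = len(pvalues)
    x = sorted(pvalues)[k - 1]  # observed k-th smallest p-value
    # Beta(k, n-k+1) CDF at x, written as P(Binomial(n, x) >= k).
    return sum(comb(n, j) * x**j * (1 - x) ** (n - j) for j in range(k, n + 1))


# Hypothetical per-type p-values for one gene: gain, loss, mutation.
p = [0.01, 0.02, 0.6]
for k in (1, 2, 3):
    print(k, order_stat_pvalue(p, k))
```

In this made-up case the 2-type value (about 0.0012) is smaller than either individual p-value, illustrating how moderate evidence in two lesion types can combine into a stronger joint signal.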
Sort any column by clicking its header. The table is sorted by overall significance by default.
Manhattan plot
Each point is a gene at its genomic position; the y-axis is −log₁₀(q-value). Colors distinguish lesion types (mutation, gain, loss, fusion, SV). Points above the significance threshold are interactive (hover for gene, type, subject count, q-value). The y-axis auto-scales/caps for very significant peaks.
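The y-axis transform is simple enough to sketch. The function names, the `floor` guard against q = 0, and the example cap of 40 are hypothetical; the page only states that the axis is −log₁₀(q), that it is capped for very significant peaks, and that q-values below the threshold (default 0.05) are interactive.

```python
import math


def neg_log10_q(q, cap=None, floor=1e-300):
    """y-axis value for a gene: -log10(q), optionally capped."""
    # The floor avoids log10(0) when a q-value underflows to zero.
    y = -math.log10(max(q, floor))
    return min(y, cap) if cap is not None else y


def is_interactive(q, threshold=0.05):
    """Points with q below the threshold sit above the significance line."""
    return q < threshold


for q in (0.2, 0.05, 0.01, 0.0):
    print(q, neg_log10_q(q, cap=40), is_interactive(q))
```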
How the statistics work
GRIN models each lesion as an interval placed at a random genomic location, and asks how often a gene would be overlapped by chance. For each gene it computes the probability that k or more subjects are hit (a Bernoulli convolution over per-subject hit probabilities), giving a p-value per lesion type. P-values are converted to q-values by Benjamini–Hochberg FDR correction. Multi-type constellation p-values are derived from the per-type p-value order statistics, so a gene recurrently hit by several lesion types can be significant even if no single type is.
Because the null assumes lesions land uniformly at random, artifact-dense regions (low-mappability, segmental-duplication, germline-CNV loci) accumulate spurious recurrence, which is why the Exclude artifact genes mask is on by default.
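The two core steps, the Bernoulli convolution and the Benjamini–Hochberg adjustment, can be sketched in plain Python. This is an illustration rather than the GRIN2 code: the per-subject hit probabilities are taken as given (how they are derived from lesion and gene sizes is outside this sketch), and the function names and numbers are hypothetical.

```python
def prob_at_least_k(hit_probs, k):
    """P(at least k subjects hit), given independent per-subject hit probabilities."""
    # dist[j] = probability that exactly j of the subjects seen so far are hit.
    dist = [1.0]
    for p in hit_probs:
        nxt = [0.0] * (len(dist) + 1)
        for j, mass in enumerate(dist):
            nxt[j] += mass * (1 - p)  # this subject is not hit
            nxt[j + 1] += mass * p    # this subject is hit
        dist = nxt
    return sum(dist[k:])


def bh_qvalues(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the q-values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        q[i] = running_min
    return q


# Three subjects with different chances of hitting this gene at random;
# two of them are observed to carry a lesion in it.
print(prob_at_least_k([0.1, 0.2, 0.05], k=2))  # about 0.033

# Hypothetical p-values for four genes.
print(bh_qvalues([0.001, 0.04, 0.03, 0.5]))
```

The convolution builds the exact distribution of the number of affected subjects one subject at a time, so subjects with more or longer lesions (higher hit probability) contribute less surprise than subjects with few.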
Tips & caveats
- Always read q-values, not p-values, for significance.
- Keep the artifact mask on unless you have a specific reason to inspect raw results; with it off, expect OR/HLA/segdup/germline-CNV genes near the top.
- CNV thresholds and Max Segment Length are the main knobs for controlling how many CNV lesions enter the analysis.
- Genuine drivers that happen to sit ≥50% inside an artifact region would be excluded by the mask; check the excluded-gene summary if a gene you expect is missing.
- Results are cached on the analysis inputs (filter + options); changing a display-only setting reuses the cached statistics.
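The caching behaviour in the last tip can be pictured as a key built only from the analysis inputs. The sketch below is an assumption about how such a key might be formed, not the server's actual scheme: the option names are invented, and it treats the two Display settings as the display-only ones.

```python
import hashlib
import json

# Hypothetical names for the display-only settings (the "Display" options).
DISPLAY_ONLY = {"maxGenesToShow", "qValueThreshold"}


def cache_key(cohort_filter, options):
    """Stable hash of the cohort filter plus the options that affect the statistics."""
    analysis_options = {k: v for k, v in options.items() if k not in DISPLAY_ONLY}
    # sort_keys makes the key independent of dictionary insertion order.
    payload = json.dumps(
        {"filter": cohort_filter, "options": analysis_options},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


base = {"lossThreshold": -0.4, "gainThreshold": 0.4, "maxGenesToShow": 500}
same_stats = dict(base, maxGenesToShow=100)  # display-only change
new_stats = dict(base, lossThreshold=-0.2)   # changes which lesions enter

cohort = {"diagnosis": "ALL"}
print(cache_key(cohort, base) == cache_key(cohort, same_stats))  # True
print(cache_key(cohort, base) == cache_key(cohort, new_stats))   # False
```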