GRIN2 - stjude/proteinpaint GitHub Wiki

GRIN2 (Genomic Random Interval, version 2) identifies genes that are recurrently hit by genomic lesions — copy-number gains/losses, SNV/indels, fusions, and structural variants — across a cohort, and tests whether that recurrence is greater than expected by chance. It produces a sortable Top Genes table and a genome-wide Manhattan plot.

It is based on the GRIN method: Pounds S, et al. A genomic random interval model for statistical analysis of genomic lesion data. Bioinformatics 2013;29(17):2088–95. doi:10.1093/bioinformatics/btt372.

Running GRIN2

Open the GRIN2 chart for a dataset (e.g. from the chart menu in the Mass app).
Apply a cohort filter to choose which samples to analyze (e.g. a diagnosis or subtype).
In the controls, check the data types to include (SNV/indel, CNV, Fusion, SV — only those available for the dataset are shown) and set their options (below).
Click Run GRIN2.

The analysis runs server-side and results are cached, so re-running with the same filter and options returns the cached result without recomputing.

Options

SNV/indel

Consequences — which mutation classes to include (missense, frameshift, nonsense, splice, etc.). Defaults to the protein-changing set (plus start-lost/stop-lost). Use Select All / Clear All / Default to adjust. If none are selected, all classes are included.
MAF filter (if the dataset provides it): keep only mutations whose variant allele fraction (VAF; the control labels it MAF) passes a threshold (e.g. VAF > 0.1). The filter term can pool allele counts across assays: for the ASH dataset, the Tumor DNA term sums WGS + WES read counts. You can also filter on a single assay.
Hypermutator Cutoff (default 8,000; 0 disables) — a sample with more than this many raw SNV/indel records is treated as hypermutated and contributes no SNV/indel lesions, since such samples otherwise dominate the gene-level statistics. Counted on raw records before consequence filtering. The exclusion is per data type — the sample's CNV/fusion/SV lesions still count — and excluded samples are reported in the results Summary.
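The ordering described above (count raw records first, exclude hypermutated samples, then apply the consequence selection) can be sketched as follows. This is a minimal illustration, not the ProteinPaint implementation; the function names and the `(sample_id, consequence)` record shape are assumptions made for the example.

```python
from collections import Counter

def split_hypermutators(records, cutoff=8000):
    """Return (kept_records, excluded_sample_ids).

    records: list of (sample_id, consequence) tuples, one per raw SNV/indel record
             (hypothetical shape, for illustration only).
    cutoff:  a sample with MORE than this many raw records is excluded; 0 disables.
    """
    records = list(records)
    if cutoff == 0:
        return records, set()
    # Count raw records per sample BEFORE any consequence filtering.
    counts = Counter(sample for sample, _ in records)
    excluded = {sample for sample, n in counts.items() if n > cutoff}
    kept = [r for r in records if r[0] not in excluded]
    return kept, excluded

def filter_consequences(records, selected):
    """Keep records whose class is selected; an empty selection keeps all classes."""
    if not selected:
        return list(records)
    return [r for r in records if r[1] in selected]
```

Only the SNV/indel records of an excluded sample are removed here; its CNV, fusion, and SV lesions would be handled by separate per-data-type paths.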

CNV

How a CNV segment becomes a gain/loss lesion depends on how the dataset quantifies its CNV values. Most datasets declare this once via ds.queries.cnv.type, and the GRIN2 controls adapt their thresholds and labels to match.

A dataset may serve more than one CNV file type per case — it declares a list via ds.queries.singleSampleMutation.cnvTypes, and a CNV type selector appears so you can choose which one to analyze (see CNV type below); the selected type's value meaning drives the thresholds.

API-based datasets like GDC instead serve categorical CNV calls (gain/loss) fetched in batch — see API-based datasets (GDC) below. For categorical data no value thresholds apply: the gain/loss call comes straight from the segment's class.

Supported value types:

type	Value meaning	Baseline	Gain	Loss
log2ratio	segment log2 ratio (default if unset)	0	value ≥ Gain	value ≤ Loss
segmean	segment mean of log2 ratios (same as above)	0	value ≥ Gain	value ≤ Loss
copyNumber	absolute integer copy number	2	value ≥ Gain	value ≤ Loss
category	qualitative call (gain/loss), no value	—	class is a gain	class is a loss

A segment counts as a lesion when it crosses the relevant threshold; segments between the thresholds are treated as neutral and dropped.
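The table and rule above can be read as a small classification function. The sketch below is illustrative only: the function name, argument names, and the literal class labels `"gain"` and `"loss"` are assumptions, not identifiers from the codebase.

```python
def classify_cnv(cnv_type, value=None, cls=None, gain=None, loss=None):
    """Return "gain", "loss", or None (neutral, dropped) for one CNV segment.

    cnv_type: "log2ratio", "segmean", "copyNumber", or "category".
    value:    numeric segment value (ignored for "category").
    cls:      qualitative call for "category" data (hypothetical labels).
    gain/loss: thresholds for the numeric types.
    """
    if cnv_type == "category":
        # No value thresholds apply: the call comes straight from the class.
        return cls if cls in ("gain", "loss") else None
    if value >= gain:
        return "gain"
    if value <= loss:
        return "loss"
    return None  # between the thresholds: neutral

# log2 ratio data with thresholds of 0.4 / -0.4
print(classify_cnv("log2ratio", value=0.5, gain=0.4, loss=-0.4))
# absolute copy number with thresholds of 3 / 1: CN = 2 is neutral
print(classify_cnv("copyNumber", value=2, gain=3, loss=1))
```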

Controls

CNV type (only for datasets that declare multiple CNV file types via cnvTypes) — radio buttons to choose which CNV file type to analyze. Only the selected type is fetched and analyzed; switching the radio re-sets the Loss/Gain threshold defaults and ranges to match that type's value meaning. Datasets with a single CNV type show no selector and use ds.queries.cnv.type. GDC does not use this selector — its CNV is categorical (see API-based datasets (GDC) below).
Loss Threshold — a segment is a loss when its value ≤ this threshold.
- log2ratio/segmean: default −0.4 (range −5 to 0).
- copyNumber: default 1 (range 0 to 2) — i.e. CN ≤ 1 is a loss.
Gain Threshold — a segment is a gain when its value ≥ this threshold.
- log2ratio/segmean: default 0.4 (range 0 to 5); a dataset may set its own default, e.g. 0.3.
- copyNumber: default 3 (range 2 to 20) — i.e. CN ≥ 3 is a gain, CN = 2 is neutral.
For category data the Loss and Gain threshold controls are both hidden; the gain/loss call comes directly from the segment's class.
Max Segment Length — segments longer than this (in bp) are dropped before analysis (default 2,000,000 = 2 Mb; set to 0 to disable). This applies to all CNV types and prevents a single very large passenger CNV from inflating significance for every gene it spans.
Hypermutator Cutoff (default 0 = off; opt-in): a sample with more than this many raw CNV segments is excluded from CNV, by the same per-data-type mechanism as the SNV/indel cutoff (the sample's other data types still count; exclusions are reported in the Summary). It is off by default because segment counts vary widely by data source: sparse for categorical CNV (e.g. GDC, ~30/case) but hundreds to thousands for dense native segmentation, where a fixed cutoff would silently drop all CNV lesions from aneuploid, gain-heavy samples. Set it per run when the data source warrants it.
Dropped samples (multi-type datasets): when you pick a specific CNV type, a case that has CNV data but not of the selected type has only its CNV contribution dropped — its SNV/indel, fusion, and SV lesions are still analyzed. The number of such cases is reported as a Dropped (CNV type unavailable) row in the results Summary. This count typically rises when you select the rarer type for a cohort.

Tightening the thresholds (e.g. −0.1/0.1 → −0.4/0.3 for log2 ratio data) removes low-amplitude calls; the Max Segment Length cap controls broad arm/chromosome-scale events.
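As a compact summary of the controls above, the per-type defaults and ranges can be written as a lookup table, together with the Max Segment Length filter. This is a sketch: the names are hypothetical, the segment shape (`start`/`stop` keys) is assumed, and clamping an out-of-range value to the control's range is an assumption about how a range would be enforced.

```python
# (default, minimum, maximum) per control, taken from the Controls list above.
THRESHOLDS = {
    "log2ratio": {"loss": (-0.4, -5.0, 0.0), "gain": (0.4, 0.0, 5.0)},
    "segmean": {"loss": (-0.4, -5.0, 0.0), "gain": (0.4, 0.0, 5.0)},
    "copyNumber": {"loss": (1, 0, 2), "gain": (3, 2, 20)},
}

def default_thresholds(cnv_type):
    """Defaults to apply when the CNV type changes; None for category data."""
    spec = THRESHOLDS.get(cnv_type)
    if spec is None:  # e.g. "category": threshold controls are hidden
        return None
    return {side: default for side, (default, _lo, _hi) in spec.items()}

def clamp_threshold(cnv_type, side, value):
    """Keep a user-entered threshold inside the control's range (assumed behavior)."""
    _default, lo, hi = THRESHOLDS[cnv_type][side]
    return min(max(value, lo), hi)

def drop_long_segments(segments, max_len=2_000_000):
    """Drop segments longer than max_len bp; max_len == 0 disables the filter."""
    if max_len == 0:
        return list(segments)
    return [s for s in segments if s["stop"] - s["start"] <= max_len]
```

A dataset that sets its own default (e.g. a gain threshold of 0.3) would override the corresponding entry rather than change the range.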

Fusion / SV

Included as lesions when the corresponding data type is checked; no per-type cutoffs.

Exclude artifact genes (recommended, on by default)

Removes genes that sit in known artifact regions before the statistics run, so the table is not dominated by non-driver loci that recur for technical or germline reasons (olfactory-receptor clusters, HLA, FAM90, POTE, GOLGA8, APOBEC3, KANSL1, etc.).

Exclude artifact genes (checkbox, default on) — toggles the mask.
Min gene overlap (default 0.5) — a gene is excluded only when at least this fraction of its span lies inside a masked region. The 0.5 default removes genes that sit inside artifact regions while sparing real drivers that merely abut one (e.g. KRAS overlaps a segmental duplication by ~10% and is kept).
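The Min gene overlap rule amounts to an interval-overlap fraction per gene. A minimal sketch, assuming half-open `(start, end)` coordinates on a single chromosome and a mask that has already been merged into non-overlapping intervals (function names are illustrative):

```python
def masked_fraction(gene_start, gene_end, mask):
    """Fraction of the gene span [gene_start, gene_end) covered by the mask.

    mask: non-overlapping (start, end) half-open intervals on the gene's chromosome.
    """
    length = gene_end - gene_start
    covered = sum(
        max(0, min(gene_end, end) - max(gene_start, start))
        for start, end in mask
    )
    return covered / length

def is_excluded(gene_start, gene_end, mask, min_overlap=0.5):
    """A gene is excluded when at least min_overlap of its span is masked."""
    return masked_fraction(gene_start, gene_end, mask) >= min_overlap
```

With the default of 0.5, a gene that overlaps a masked region by about 10% of its span is kept, while a gene lying mostly inside one is removed.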

The mask is the union of four hg38 region sets:

Layer	Source	What it removes
Blacklist	ENCODE / Kundaje GRCh38 unified blacklist	anomalous-signal / low-mappability regions
Segmental duplications	UCSC genomicSuperDups	duplicated, recombination-prone, paralogous regions
Assembly gaps	UCSC gap + centromeres	centromeres, telomeres, gaps
Common germline CNVs	DGV Gold Standard, frequency ≥ 1%	loci that are copy-number-polymorphic in the normal population

The germline-CNV layer is essential: loci like OR clusters, HLA class II, and KANSL1 are well-mapped but copy-number-variable in healthy people, so mappability/blacklist alone does not catch them.

When the mask runs, the results panel shows a Region Mask (excluded artifact genes) section reporting how many genes were excluded, a few examples, and the genome fraction masked (~14% with all four layers).
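Taking the union of several region sets and reporting the genome fraction masked comes down to merging intervals per chromosome. The sketch below uses toy coordinates and hypothetical names; it does not read the actual ENCODE, UCSC, or DGV files.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching (start, end) half-open intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def union_mask(layers):
    """layers: {layer_name: {chrom: [(start, end), ...]}} -> {chrom: merged intervals}."""
    pooled = {}
    for regions_by_chrom in layers.values():
        for chrom, regions in regions_by_chrom.items():
            pooled.setdefault(chrom, []).extend(regions)
    return {chrom: merge_intervals(regions) for chrom, regions in pooled.items()}

def fraction_masked(mask, chrom_sizes):
    """Masked base pairs divided by total genome length."""
    masked_bp = sum(end - start for regions in mask.values() for start, end in regions)
    return masked_bp / sum(chrom_sizes.values())
```

Merging first matters because the layers overlap each other (for example, segmental duplications and common germline CNVs), so summing layer sizes directly would overcount the masked fraction.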

References: ENCODE blacklist — Amemiya et al., Sci Rep 2019, doi:10.1038/s41598-019-45839-z; segmental duplications — Sharp et al., AJHG 2005, doi:10.1086/431652; DGV — MacDonald et al., NAR 2014, doi:10.1093/nar/gkt958; region-exclusion practice — Ogata et al. (excluderanges), Bioinformatics 2023, doi:10.1093/bioinformatics/btad198.

API-based datasets (GDC): batch fetching & coverage

By default GRIN2 fetches each sample's mutations one at a time. For a local (sqlite) dataset that's a fast file read, but for an API-based dataset like GDC each fetch is a network round-trip, so a one-request-per-case loop over a large cohort is prohibitively slow.

When a dataset provides a batch getter (ds.queries.singleSampleMutation.batchGet), GRIN2 fetches SNV/indel and CNV up front in batched requests instead of per sample. For GDC this uses the case-level occurrences endpoints — ssm_occurrences for SNV/indel and segment_cnv_occurrences for CNV — each POST filtering by a list of cases.case_id, so hundreds of cases are retrieved per request (chunks of 300 cases, several chunks in flight at once). Fusion and SV are not batched and are still fetched per sample.
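The chunk-and-fetch pattern described above can be sketched with `asyncio`. This is not the ProteinPaint code and makes no real GDC requests: `fetch_chunk` stands in for whatever performs one POST per chunk, `fake_fetch_chunk` is a stub for the example, and the limit of 4 chunks in flight is an assumed value (the text only says "several").

```python
import asyncio

def chunked(ids, size=300):
    """Split a list of case ids into consecutive chunks of at most `size`."""
    return [ids[i:i + size] for i in range(0, len(ids), size)]

async def fetch_all(case_ids, fetch_chunk, chunk_size=300, max_in_flight=4):
    """Fetch every chunk with at most `max_in_flight` requests running at once."""
    semaphore = asyncio.Semaphore(max_in_flight)

    async def run(chunk):
        async with semaphore:
            return await fetch_chunk(chunk)

    # gather() returns results in the order the chunks were submitted.
    results = await asyncio.gather(*(run(c) for c in chunked(case_ids, chunk_size)))
    return [record for chunk_records in results for record in chunk_records]

async def fake_fetch_chunk(chunk):
    """Stub standing in for one batched request; returns one record per case."""
    await asyncio.sleep(0)
    return [(case_id, "lesion") for case_id in chunk]

case_ids = [f"case{i}" for i in range(650)]
records = asyncio.run(fetch_all(case_ids, fake_fetch_chunk))
print(len(chunked(case_ids)), len(records))
```

For 650 cases this issues three requests (300, 300, and 50 cases) instead of 650 per-case round-trips.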

This is automatic and transparent — there is no control for it, and results are cached on filter + options exactly as for other datasets. What it changes is speed and the coverage rows you see in the Summary.

Details specific to the GDC batch path:

Open-access only. Only open-access SNV/indel is fetched; controlled-access cases are never queried. CNV segment data is open-access and always fetched.
Categorical CNV. GDC batch CNV comes back as qualitative gain/loss calls, so those lesions are classified directly from the call — independent of the Loss/Gain value thresholds described under CNV. (An earlier value-quantified approach — segment-mean / absolute copy number with thresholds — was too slow to fetch at cohort scale, so GDC CNV was switched to these categorical calls.)
Lean records. Only the fields GRIN2 needs are kept (data type, class, position); gene/mname/ref/alt and the raw CNV segment value are dropped to keep the payload small.
Lesion cap. The batched fetch stops once the run's lesion budget is reached, so a huge cohort (or all of GDC) doesn't pull millions of records before analysis. If the cap truncates the fetch, the affected samples are reported as dropped.
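The lesion cap can be pictured as a budget check between chunks. The sketch below is sequential for clarity and uses hypothetical names; how the real fetch interleaves the cap with concurrent requests is not specified here, so treat the exact stopping point as an assumption.

```python
def fetch_with_cap(chunks, fetch_chunk, lesion_cap):
    """Fetch chunks in order until the lesion budget is reached.

    Returns (lesions, dropped_case_ids). Cases in chunks that are skipped
    after the cap is hit are reported as dropped rather than silently lost.
    """
    lesions = []
    dropped = []
    for chunk in chunks:
        if len(lesions) >= lesion_cap:
            dropped.extend(chunk)  # cap already reached: skip this chunk
            continue
        lesions.extend(fetch_chunk(chunk))
    return lesions, dropped
```

The dropped ids returned here correspond to what the Summary would report under a lesion-cap row.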

Because the batch getter resolves each requested sample to a GDC case, the Summary reconciles every sample into one of these coverage rows (each shown only when non-zero):

Row	Meaning
Unmatched samples	mapped to no GDC case, never queried
Controlled-access	SNV/indel is controlled-access, skipped
No open-access SNV/indel	case had no open-access mutations
No mutations	queried, genuinely no SNV/indel
No CNV data	queried, no CNV segment returned
Dropped (lesion cap reached)	in a case skipped after the cap was hit

Display

Max genes to show — number of rows in the Top Genes table (default 500).
Significance (q-value) threshold: genes with a q-value below this threshold (default 0.05) are flagged in the table and tooltips and shown as interactive points in the Manhattan plot.

Reading the results

Top Genes table

For each gene and each included lesion type the table reports:

P-value (Gain/Loss/Mutation/…) — probability of seeing at least the observed number of affected subjects under the random-interval null.
Q-value (…) — the p-value adjusted for multiple testing (FDR). Use the q-value to judge significance (e.g. q < 0.05).
Subject Count (…) — number of subjects with that lesion type overlapping the gene.
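To make the relationship between the P-value and Q-value columns concrete, here is a minimal sketch of the Benjamini-Hochberg adjustment, the FDR procedure named under How the statistics work. The function name is illustrative, and the production code may differ in details such as tie handling.

```python
def bh_qvalues(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues

print(bh_qvalues([0.01, 0.04, 0.03, 0.5]))
```

A q-value is never smaller than its p-value, which is why a gene with p < 0.05 can still fail the q < 0.05 criterion when many genes are tested.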

When more than one lesion type is analyzed, additional N Lesion Types columns appear (1/2/3 …). These are constellation tests that combine evidence across lesion types — e.g. the "2 Lesion Types" q-value flags genes significant when gain+loss (or mutation+CNV) are considered jointly. A gene driven by a single type will be most significant in that type's column; a gene hit by several types will rise in the multi-type columns.
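One standard way to build such multi-type tests from per-type p-values uses order statistics: if the m per-type p-values were independent and uniform under the null, the k-th smallest would follow a Beta(k, m − k + 1) distribution. The sketch below evaluates that tail with the equivalent binomial sum. It is an illustration under those assumptions, and the exact formula used by GRIN2 may differ; the function names are hypothetical.

```python
from math import comb

def order_stat_pvalue(x, k, m):
    """P(k-th smallest of m independent Uniform(0,1) values <= x).

    Equals the Beta(k, m - k + 1) CDF at x, written as a binomial tail sum.
    """
    return sum(comb(m, j) * x**j * (1 - x) ** (m - j) for j in range(k, m + 1))

def constellation_pvalues(per_type_p):
    """One p-value per 'k Lesion Types' column, k = 1..m."""
    ps = sorted(per_type_p)
    m = len(ps)
    return [order_stat_pvalue(ps[k - 1], k, m) for k in range(1, m + 1)]

# Gene with two strongly significant lesion types and one null type.
print(constellation_pvalues([0.01, 0.02, 0.9]))
```

In this toy case the "2 Lesion Types" value is smaller than the "1 Lesion Type" value, matching the behavior described above for genes hit by several types.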

Sort any column by clicking its header. The table is sorted by overall significance by default.

Manhattan plot

Each point is a gene at its genomic position; the y-axis is −log₁₀(q-value). Colors distinguish lesion types (mutation, gain, loss, fusion, SV). Points that pass the significance threshold (q-value below the Display cutoff, so they sit higher on the y-axis) are interactive: hover for gene, type, subject count, and q-value. The y-axis auto-scales/caps for very significant peaks.

How the statistics work

GRIN models each lesion as an interval placed at a random genomic location, and asks how often a gene would be overlapped by chance. For each gene it computes the probability that k or more subjects are hit (a Bernoulli convolution over per-subject hit probabilities), giving a p-value per lesion type. P-values are converted to q-values by Benjamini–Hochberg FDR correction. Multi-type constellation p-values are derived from the per-type p-value order statistics, so a gene recurrently hit by several lesion types can be significant even if no single type is.
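The "k or more subjects are hit" probability is the upper tail of a sum of independent Bernoulli variables with different success probabilities. A minimal sketch of the convolution follows; the per-subject hit probabilities are taken as given inputs (how they are derived from lesion and gene sizes is not shown), and the function name is illustrative.

```python
def prob_at_least(hit_probs, k):
    """P(at least k subjects are hit), given each subject's hit probability.

    Builds the distribution of the hit count by convolving one Bernoulli
    variable per subject, then sums the upper tail.
    """
    dist = [1.0]  # dist[j] = P(exactly j hits among subjects processed so far)
    for p in hit_probs:
        nxt = [0.0] * (len(dist) + 1)
        for j, mass in enumerate(dist):
            nxt[j] += mass * (1 - p)   # this subject is not hit
            nxt[j + 1] += mass * p     # this subject is hit
        dist = nxt
    return sum(dist[k:])

# Three subjects with different chances of overlapping the gene by chance.
print(prob_at_least([0.1, 0.2, 0.3], k=2))
```

The convolution costs O(n²) per gene for n subjects, which is one reason the number of lesions entering the analysis affects run time.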

Because the null assumes lesions land uniformly at random, artifact-dense regions (low-mappability, segmental-duplication, germline-CNV loci) accumulate spurious recurrence — which is why the Exclude artifact genes mask is on by default.

Tips & caveats

Always read q-values, not p-values, for significance.
Keep the artifact mask on unless you have a specific reason to inspect raw results; with it off, expect OR/HLA/segdup/germline-CNV genes near the top.
CNV thresholds and Max Segment Length are the main knobs for controlling how many CNV lesions enter the analysis.
Genuine drivers that happen to sit ≥50% inside an artifact region would be excluded by the mask — check the excluded-gene summary if a gene you expect is missing.
Results are cached on the analysis inputs (filter + options); changing a display-only setting reuses the cached statistics.
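The caching behavior in the last tip can be pictured as a key built only from analysis inputs. The sketch below is hypothetical: the option names are invented for illustration, and it assumes the two Display settings (max genes, q-value threshold) are the display-only ones.

```python
import hashlib
import json

# Hypothetical option names; assumed to affect rendering only, not statistics.
DISPLAY_ONLY = {"maxGenesToShow", "qValueThreshold"}

def cache_key(cohort_filter, options):
    """Stable key over the cohort filter and analysis options only."""
    analysis_options = {k: v for k, v in options.items() if k not in DISPLAY_ONLY}
    payload = json.dumps(
        {"filter": cohort_filter, "options": analysis_options},
        sort_keys=True,  # key order must not change the hash
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

base = {"lossThreshold": -0.4, "gainThreshold": 0.4, "maxGenesToShow": 500}
same_stats = dict(base, maxGenesToShow=100)    # display change: same key
new_stats = dict(base, lossThreshold=-0.2)     # analysis change: new key
print(cache_key({"diagnosis": "ALL"}, base) == cache_key({"diagnosis": "ALL"}, same_stats))
print(cache_key({"diagnosis": "ALL"}, base) == cache_key({"diagnosis": "ALL"}, new_stats))
```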