GRIEVOUS: Methodology

So how do we get around the insidious problem of guaranteeing consistent feature indexing and orientation across our datasets? You use GRIEVOUS! But how does GRIEVOUS work? Let's take a look under the hood.

GRIEVOUS: The Database

The problems posed by variant indexing and orientation are inextricably linked, so what if we aim to handle both at the same time? This is the fundamental tenet upon which GRIEVOUS was constructed. Abstract enough? Great, but for those looking for a tad more detail, let me provide some context.

GRIEVOUS operates on a flexible internal database, backed by a GRIEVOUS indexing standard - CHR:POS:REF:ALT. Initially empty, a GRIEVOUS database is populated iteratively. Specifically, for each genomic dataset that GRIEVOUS is called on, all novel (i.e., not currently in the database) biallelic SNPs are added to the database in the dataset-defined variant orientation (i.e., REF ALT) and are indexed according to the GRIEVOUS indexing standard above. This covers the mechanics of the database, but fails to highlight its utility and importance.
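To make the indexing standard concrete, here is a minimal sketch of how a CHR:POS:REF:ALT index could be built and used to populate an empty database. This is not the GRIEVOUS implementation: the function names, the in-memory dictionary standing in for the database, and the second SNP in the toy dataset are all assumptions made for illustration. The sketch also only checks the forward index, leaving aside the reverse-orientation and position checks described further down.

```python
def grievous_index(chrom, pos, ref, alt):
    """Return the CHR:POS:REF:ALT index for one biallelic SNP."""
    return f"{chrom}:{pos}:{ref}:{alt}"


def add_snps(database, snps):
    """Add SNPs whose index is not yet a key, in the dataset-defined orientation.

    `database` is a plain dict (hypothetical stand-in for the real database)
    mapping index -> (REF, ALT). Returns the list of indices that were added.
    """
    added = []
    for chrom, pos, ref, alt in snps:
        key = grievous_index(chrom, pos, ref, alt)
        if key not in database:
            database[key] = (ref, alt)
            added.append(key)
    return added


database = {}  # a GRIEVOUS database starts out empty
first_dataset = [("7", 15753, "C", "T"), ("1", 1000, "A", "G")]  # toy data
added = add_snps(database, first_dataset)
```

Because the database starts empty, every SNP in the first dataset is novel and defines the orientation that later datasets are compared against.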

Imagine, if you will, the GRIEVOUS database as a variant Rosetta Stone. For each dataset you wish to align, we must demystify its unique indexing and orientation language and subsequently translate it to our standardized language. The GRIEVOUS database is the medium by which we achieve this. For each dataset, prior to updating the database with all novel biallelic SNPs, we construct forward (CHR:POS:REF:ALT) and reverse (CHR:POS:ALT:REF) GRIEVOUS indices for all biallelic SNP candidates in our dataset. We can then query the database and ask whether a SNP exists and, if so, whether it conforms to the database orientation or exists in the reverse orientation. All SNPs that exist in the GRIEVOUS database in the reverse orientation must be reoriented, as they diverge from the orientation of all prior GRIEVOUS aligned datasets.
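The forward and reverse lookup described above can be sketched as follows. This is an illustrative sketch under assumptions, not the GRIEVOUS source: `classify_snp` is a hypothetical helper, and the database is modeled as a plain set of index strings.

```python
def classify_snp(database, chrom, pos, ref, alt):
    """Classify one biallelic SNP against a set of CHR:POS:REF:ALT indices.

    Returns "forward" if the SNP matches the database orientation,
    "reverse" if it exists with REF and ALT swapped (needs reorienting),
    and "novel" if neither index is present.
    """
    forward = f"{chrom}:{pos}:{ref}:{alt}"
    reverse = f"{chrom}:{pos}:{alt}:{ref}"
    if forward in database:
        return "forward"
    if reverse in database:
        return "reverse"
    return "novel"


# Hypothetical database holding a single SNP
database = {"7:15753:C:T"}

same = classify_snp(database, "7", 15753, "C", "T")     # matches orientation
flipped = classify_snp(database, "7", 15753, "T", "C")  # must be reoriented
unseen = classify_snp(database, "7", 20000, "G", "A")   # candidate for addition
```

In this sketch, a SNP classified as "novel" is only a candidate for addition; it has not yet been checked against the database by position.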

Note the final step, Validate Variant Not In Database. For all novel SNPs (i.e., those SNPs that do not exist in the GRIEVOUS database in either the forward or the reverse orientation), we must check that no database entry shares their CHR and POS. If any do, then we must remove them from both our dataset biallelic SNP set and our novel biallelic SNP set before database addition. Why? Well, in such a case the dataset and the database report different alleles for the SNP at that position, and thus fundamentally different information. For example:

GRIEVOUS Database: 7 15753 C T
Dataset: 7 15753 G A ← Diverges from database in alleles and thus cannot be added to database
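A minimal sketch of this position check, using the example above, might look like the following. The helper name `validate_novel` and the set-of-strings database are assumptions for illustration only, not GRIEVOUS internals.

```python
def validate_novel(database, novel_snps):
    """Split novel SNP candidates into those safe to add and those to drop.

    A candidate is dropped when the database already holds any SNP at the
    same CHR and POS, since the two then disagree on alleles.
    """
    occupied = set()
    for key in database:
        chrom, pos, _ref, _alt = key.split(":")
        occupied.add((chrom, pos))

    keep, drop = [], []
    for chrom, pos, ref, alt in novel_snps:
        if (str(chrom), str(pos)) in occupied:
            drop.append((chrom, pos, ref, alt))
        else:
            keep.append((chrom, pos, ref, alt))
    return keep, drop


database = {"7:15753:C:T"}
candidates = [("7", 15753, "G", "A"),  # same position, different alleles
              ("7", 20000, "G", "A")]  # toy SNP at an unoccupied position
keep, drop = validate_novel(database, candidates)
```

Here the 7 15753 G A candidate lands in `drop` and would be removed from the dataset, while the second candidate lands in `keep` and could be added to the database.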

The iterative nature of database alignment and update ensures that all biallelic SNPs for any given $t^{th}$ genomic dataset, upon GRIEVOUS realignment, will align in both index and orientation with all previous $t-1$ GRIEVOUS aligned datasets.

GRIEVOUS allows you, the user, to create multiple databases if desired, for organizational clarity and parallelizability when working on multiple projects, not all of which need dataset consistency with one another. Unsure of what this means? Don't worry about it for now; hopefully it will become clearer as you progress through your GRIEVOUS training in the tutorial.