The Problem: Standardizing Variant Indices and Orientation - jvtalwar/GRIEVOUS GitHub Wiki

Orientation and recovery... these are the banes of anyone who works with variant-level data. The integrity of any multi-cohort GWAS and/or PRS formulation, validation, and deployment is fundamentally dependent upon the quality of the data (i.e., variants) used. For example, imagine the following scenario: you have performed a GWAS and formulated a novel PRS for your phenotype under investigation, and observed remarkable stratification potential, outperforming the currently reported state of the art on a held-out portion of your discovery set. However, come time to assess performance in an independent cohort, you observe horrid generalization (e.g., AUC = 0.5). What could be going wrong? Well, either you markedly overfit to your training cohort's data, in which case you need to rethink your approach, or there is a divergence in the features between your training/validation dataset and your test set. While population differences are expected between different datasets, a significant and systematic divergence in minor allele frequency (MAF) between cohorts is strongly suggestive of feature orientation divergences between your datasets. Let's dive deeper into this hypothetical and examine variant MAF distributions between our two cohorts:

Uh oh, the dreaded V phenomenon! MAF should generally be expected to follow a Y = X relationship (i.e., fall around the dotted line), so a clear subset of variants exhibiting a Y = 1-X relationship strongly suggests that certain variants exist in opposing orientations across your datasets. Why is this happening? Well, let's take a look:
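As a rough illustration of the Y = X versus Y = 1-X pattern described above, here is a minimal sketch of how one might label a single variant from its allele frequency in each cohort. The function name `classify_frequency_pair` and the tolerance of 0.05 are assumptions made for this example, not part of GRIEVOUS, and a frequency comparison like this is only a diagnostic hint: it cannot tell the two relationships apart near a frequency of 0.5.

```python
def classify_frequency_pair(freq_1, freq_2, tol=0.05):
    """Label how one variant's allele frequencies relate across two cohorts.

    freq_1, freq_2: frequency of the counted allele in cohort 1 and cohort 2.
    tol: hypothetical tolerance for "close enough"; tune for real data.
    """
    on_diagonal = abs(freq_1 - freq_2) <= tol              # Y = X
    on_antidiagonal = abs(freq_1 - (1.0 - freq_2)) <= tol  # Y = 1 - X

    if on_diagonal and on_antidiagonal:
        # Near 0.5 both relationships hold, so frequency alone cannot decide.
        return "ambiguous"
    if on_diagonal:
        return "consistent"
    if on_antidiagonal:
        return "possibly flipped"
    return "unclear"


print(classify_frequency_pair(0.20, 0.21))
print(classify_frequency_pair(0.20, 0.80))
print(classify_frequency_pair(0.50, 0.49))
```

A variant labelled "possibly flipped" still needs its REF and ALT alleles checked directly, since real population differences can also move frequencies.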

Certain SNPs common to both cohort 1 and cohort 2 define their reference (REF) and alternative (ALT) alleles backwards from one another, which directly affects the way allele counts are stored. For example, for the variant on chromosome (CHR) 8 at position (POS) 3846976, a count of 0 Cs in cohort 1 equates to a count of 2 Ts in cohort 2, since both describe the same underlying genotype TT. To rectify this problem we must ensure feature consistency between our datasets (i.e., all datasets employ the same REF/ALT orientation) by reorienting all divergently oriented variants to the same REF/ALT standard.
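To make the reorientation step concrete, here is a minimal sketch assuming diploid, biallelic variants whose stored value is the count of the ALT allele (0, 1, or 2). Under that assumption, swapping REF and ALT turns a count `c` into `2 - c`. The function name `reorient_counts` is hypothetical, and the assignment of C and T to REF and ALT in each cohort is assumed for illustration. Strand flips are not handled here.

```python
def reorient_counts(counts, ref, alt, target_ref, target_alt):
    """Return allele counts expressed in the target REF/ALT orientation.

    counts: per-sample counts of `alt` (0, 1, 2, or None for missing).
    Assumes a diploid, biallelic variant on the same strand in both datasets.
    """
    if (ref, alt) == (target_ref, target_alt):
        # Already in the target orientation: nothing to change.
        return list(counts)
    if (ref, alt) == (target_alt, target_ref):
        # REF and ALT are swapped: 0 <-> 2, 1 stays 1, missing stays missing.
        return [None if c is None else 2 - c for c in counts]
    raise ValueError(
        f"Alleles {ref}/{alt} cannot be reoriented to {target_ref}/{target_alt}"
    )


# Assumed for illustration: cohort 2 counts Ts (REF=C, ALT=T),
# while the target standard counts Cs (REF=T, ALT=C), as in cohort 1.
cohort_2_counts = [2, 1, 0, None]
reoriented = reorient_counts(
    cohort_2_counts, ref="C", alt="T", target_ref="T", target_alt="C"
)
print(reoriented)  # [0, 1, 2, None]
```

The first sample carries genotype TT: 2 Ts in cohort 2's orientation, 0 Cs after reorientation, matching the example in the text.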

Okay, so now (hopefully) you comprehend the importance of guaranteeing consistent variant orientation between your datasets. Orientation, though, is only half of the problem. We also need to worry about variant recovery.

Unsystematic variant indexing standards serve as another barrier in the path to genomic enlightenment. In the above figure, note that cohort 1 and cohort 2 use different indices, or IDs, for their variants. Specifically, cohort 1 utilizes a colon-separated format (CHR:POS:REF:ALT) while cohort 2 utilizes rsIDs. Why is this a problem? Well, let me answer your question with another question. If two different datasets index their variants differently, how do you identify the full set of variants common to both datasets? Naïve recovery by common index here will return an empty set, as the two cohorts share no common index.
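The failure of naïve recovery is easy to see with a toy example. The IDs below are placeholders invented for illustration (in particular, the `rs_example_*` strings are not real rsIDs, and the second position is made up):

```python
# Cohort 1 indexes variants as CHR:POS:REF:ALT.
cohort_1_ids = {"8:3846976:T:C", "8:3900000:A:G"}

# Cohort 2 indexes the same variants by rsID (placeholder values).
cohort_2_ids = {"rs_example_1", "rs_example_2"}

# Naive recovery: intersect the raw index strings.
shared = cohort_1_ids & cohort_2_ids
print(len(shared))  # 0
```

Both cohorts may contain exactly the same variants, yet the intersection is empty because the comparison is on labels rather than on what the labels describe.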

Well, what if we convert cohort 2's indexing to follow the convention of cohort 1 (i.e., switch from rsID to CHR:POS:REF:ALT)? We would still fail to recover all variants that exist in opposite orientations in each dataset, due to the discrepant ordering of REF:ALT in each index. Indexing and orientation are inextricably linked with one another, and neither issue can be solved without jointly addressing the other.
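One way to see why the two issues must be handled together is to match variants on a key that ignores REF/ALT order, then record whether each match is in the same or the opposite orientation. The sketch below is only an illustration under several assumptions: cohort 2's rsIDs have already been converted to CHR:POS:REF:ALT (the lookup itself is not shown), both cohorts use the same genome build and strand, and all variants are biallelic. The helper names and the extra example variants are hypothetical and do not describe how GRIEVOUS works internally.

```python
def parse_index(variant_id):
    """Split a CHR:POS:REF:ALT index into its four fields."""
    chrom, pos, ref, alt = variant_id.split(":")
    return chrom, int(pos), ref, alt


def orientation_free_key(variant_id):
    """Key that is identical for A:G and G:A at the same position."""
    chrom, pos, ref, alt = parse_index(variant_id)
    return chrom, pos, frozenset((ref, alt))


def match_variants(ids_1, ids_2):
    """Map each cohort 1 ID to its cohort 2 ID and relative orientation."""
    lookup = {orientation_free_key(v): v for v in ids_2}
    matches = {}
    for v1 in ids_1:
        v2 = lookup.get(orientation_free_key(v1))
        if v2 is not None:
            same = parse_index(v1) == parse_index(v2)
            matches[v1] = (v2, "same" if same else "flipped")
    return matches


cohort_1_ids = ["8:3846976:T:C", "8:3900000:A:G"]
cohort_2_ids = ["8:3846976:C:T", "8:3900000:A:G", "9:100:G:A"]

exact = set(cohort_1_ids) & set(cohort_2_ids)
matched = match_variants(cohort_1_ids, cohort_2_ids)
print(len(exact), len(matched))  # 1 2
```

Exact string matching recovers only the variant stored in the same orientation, while the orientation-free key recovers both and flags the one whose allele counts would need to be reoriented.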

With two cohorts, you could manually search and rigorously check all the ways things can go awry and waste an ungodly amount of time in the process (and potentially still miss a few things), but now imagine you have summary statistics and eight cohorts, all of which might have their own unique orientations and IDs. Good luck…

Throughout my graduate studies, problems in variant indexing and orientation consistency have persistently plagued me, and the continual realization of all the unique ways things can blow up in your face, coupled with the time lost trying to rectify these issues, has frustrated me beyond belief. For the mental welfare of graduate students and, more generally, researchers globally, and to rescue thousands of man-hours, I felt a new approach was needed. Something that could identify and rectify all the ways things could go wrong. Something precise. Something undaunted. Something ruthless. But most importantly, something automated. Enter GRIEVOUS...