Methods Comparison: GRMs - pcgoddard/Burchardlab_Tutorials GitHub Wiki

Methods for Computing Genetic Relatedness Matrices (GRMs)

Pagé Goddard

Background

will insert brief discussion about use of Genetic Relatedness Matrices

Resources

TOPMed Pipeline

GCTA

REAP

GENESIS

GENESIS Vignette
GENESIS Publication (also see the TOPMed slides above)

TOPMed Pipeline

The TOPMed analysis pipeline is a great resource for association study design in general, but it is linked here because it includes a recommended approach for GRM computation. For GRM computation that accounts for confounding by ancestry, TOPMed recommends the GENESIS approach:

KING
PC-AiR
PC-Relate

See below for more details

NB: "Section 3 Computing a GRM" calculates a basic genetic relationship matrix using the SNPRelate package in R, but it does not take ancestry or population structure into account. For the more robust approach, see "Section 4 PC-Relate."

GCTA

commandline program
Importance of GRMs: they allow identification of closely related individuals. The objective of downstream GCTA analysis is to estimate heritability from the genetic variation captured by all SNPs together (vs. GWAS, which estimates the variation captured by single SNPs). Including close relatives could bias the results, because their phenotypic correlations are driven by pedigree.

```shell
# estimate genetic relatedness from SNPs
gcta64 --bfile input.binary --make-grm --out output.files
```
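To make the `--make-grm` step less of a black box, here is a minimal Python sketch of a SNP-based GRM of the kind described in the GCTA publication: each genotype is centered by twice the allele frequency and scaled by the expected variance, then averaged over SNPs. The function names `grm` and `related_pairs`, the toy genotypes, and the 0.025 cutoff are illustrative assumptions. The sketch also applies the off-diagonal formula to every entry and assumes no missing genotypes, so it is not a substitute for GCTA's own output.

```python
def grm(genotypes):
    """Return a SNP-based genetic relatedness matrix.

    genotypes[i][s] is the allele count (0, 1 or 2) of individual i at SNP s.
    Assumes no missing genotypes (a simplification for this sketch).
    """
    n_ind = len(genotypes)
    n_snp = len(genotypes[0])
    # Sample allele frequency of the counted allele at each SNP.
    freqs = [sum(g[s] for g in genotypes) / (2 * n_ind) for s in range(n_snp)]
    # Monomorphic SNPs have zero variance and cannot be standardized.
    usable = [s for s in range(n_snp) if 0 < freqs[s] < 1]
    if not usable:
        raise ValueError("no polymorphic SNPs")
    matrix = [[0.0] * n_ind for _ in range(n_ind)]
    for j in range(n_ind):
        for k in range(j, n_ind):
            total = 0.0
            for s in usable:
                p = freqs[s]
                total += ((genotypes[j][s] - 2 * p) * (genotypes[k][s] - 2 * p)
                          / (2 * p * (1 - p)))
            matrix[j][k] = matrix[k][j] = total / len(usable)
    return matrix


def related_pairs(matrix, cutoff):
    """List index pairs (j, k), j < k, whose relatedness exceeds the cutoff."""
    n = len(matrix)
    return [(j, k) for j in range(n) for k in range(j + 1, n)
            if matrix[j][k] > cutoff]


# Toy data: individuals 0/1 and 2/3 carry identical genotypes.
toy = [
    [0, 1, 2, 1],
    [0, 1, 2, 1],
    [2, 1, 0, 1],
    [2, 1, 0, 1],
]
print(related_pairs(grm(toy), cutoff=0.025))
```

In the toy data the two duplicated pairs are the only ones flagged. Note that the allele frequencies come from the sample itself, which is exactly why this kind of estimate is unreliable when the sample mixes ancestries.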

From the publication: "As a by-product, we provide a function in GCTA to calculate the eigenvectors of the GRM, which is asymptotically equivalent to those from the PCA implemented in EIGENSTRAT because the GRM (Ajk) defined in GCTA is approximately half of the covariance matrix (Jjk) used in EIGENSTRAT. The only purpose of developing this function is to calculate eigenvectors and then include them in the model as covariates to capture variance due to population structure. More sophisticated analyses of the population structure can be found in programs such as EIGENSTRAT and STRUCTURE."

PROs

super easy to run
- takes plink files
- no extra input required
fast

CONs

does not take into account ancestry or any external population structure proxy
- not a reliable estimator for admixed populations
GCTA has been criticized for unreliable estimates (granted, I haven't read into this too much)
I (Pagé) don't really understand the math...

REAP

commandline program
concept: use pre-computed ancestry measures to adjust for population structure
approach: model-based
- calculate global ancestry per individual and allele frequency per ancestral group using something like ADMIXTURE
- calculate relatedness coefficients after adjusting for ancestry
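The two steps above can be sketched in Python. This is only an illustration of the idea, not REAP's implementation: it assumes a ratio-of-sums kinship estimator built on individual-specific allele frequencies, and the names `individual_allele_freqs` and `adjusted_kinship` are hypothetical. Check the REAP publication for the exact estimator it uses.

```python
from math import sqrt


def individual_allele_freqs(admix, pop_freqs):
    """Expected allele frequency per individual and SNP.

    admix[i][k]     : ancestry proportion of individual i from group k
                      (e.g. from ADMIXTURE)
    pop_freqs[k][s] : allele frequency at SNP s in ancestral group k
    Returns mu with mu[i][s] = sum over k of admix[i][k] * pop_freqs[k][s].
    """
    n_snp = len(pop_freqs[0])
    return [
        [sum(a_k * pop_freqs[k][s] for k, a_k in enumerate(a))
         for s in range(n_snp)]
        for a in admix
    ]


def adjusted_kinship(g_i, g_j, mu_i, mu_j):
    """Ancestry-adjusted kinship between two individuals (sketch).

    g_i, g_j   : allele counts (0, 1 or 2) per SNP
    mu_i, mu_j : individual-specific allele frequencies per SNP
    """
    num = 0.0
    den = 0.0
    for x_i, x_j, m_i, m_j in zip(g_i, g_j, mu_i, mu_j):
        # Skip SNPs where either expected frequency is 0 or 1.
        if not (0 < m_i < 1 and 0 < m_j < 1):
            continue
        num += (x_i - 2 * m_i) * (x_j - 2 * m_j)
        den += sqrt(m_i * (1 - m_i) * m_j * (1 - m_j))
    if den == 0:
        raise ValueError("no informative SNPs")
    return num / (4 * den)
```

The key difference from a plain GRM is that each genotype is centered on the frequency expected for that individual's own ancestry mix rather than on a single sample-wide frequency.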

PROs

accounts for population ancestry
designed for admixed populations
easy one-liner once you have your admixture output

CONs

requires additional inputs calculated externally
- not tough to calculate; see my ADMIXTURE Tutorial
- potential for inaccuracies in admixed populations of unknown/poorly defined ancestries
model-based methods can be confounded by familial relatedness because they cannot distinguish between ancestral groups and clusters of close relatives (source)

GENESIS

R
concept: training on an unrelated subpopulation and using PCs to correct for population structure
approach: model-free
- KING-robust: estimate ancestral divergence and apparent relatedness separately
- PC-AiR: use PCs to capture the ethnic components of the population structure and identify the unrelated and related clusters
  - ancestral divergence scores used to ensure the unrelated subpop is representative of the full population's ancestry distribution
- PC-Relate: calculate kinship coefficient for unrelated group first, then extrapolate to the related group, using ancestry-representative PCs to correct for pop structure
  - unrelated first: prevent confounding by related individuals
  - using first n PCs: (at your discretion) to capture pop structure / ethnic diversity
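The "unrelated first" idea can be illustrated with a simplified greedy partition in Python. This is a sketch under my own assumptions, not the GENESIS code: the function name `partition_unrelated`, the thresholds, and the tie-breaking rule (prefer to remove the individual with fewer ancestrally divergent partners) are all illustrative. In practice the partition is produced inside the GENESIS R package.

```python
def partition_unrelated(kinship, divergence, kin_threshold, div_threshold):
    """Split individuals into an unrelated set and a related set (sketch).

    kinship[i][j]    : pairwise kinship estimate (e.g. from KING-robust)
    divergence[i][j] : pairwise ancestry divergence score; strongly negative
                       values suggest different ancestry
    Repeatedly moves the individual with the most relatives still in the
    unrelated set into the related set, until no related pairs remain.
    """
    n = len(kinship)
    unrelated = set(range(n))
    related = []
    # Number of ancestrally divergent partners per individual (fixed up front).
    n_divergent = [
        sum(1 for j in range(n) if j != i and divergence[i][j] < div_threshold)
        for i in range(n)
    ]
    while True:
        n_relatives = {
            i: sum(1 for j in unrelated
                   if j != i and kinship[i][j] > kin_threshold)
            for i in unrelated
        }
        if not n_relatives or max(n_relatives.values()) == 0:
            break
        # Most relatives first; ties go to fewer divergent partners, then index.
        worst = min(unrelated,
                    key=lambda i: (-n_relatives[i], n_divergent[i], i))
        unrelated.remove(worst)
        related.append(worst)
    return sorted(unrelated), sorted(related)
```

Keeping individuals with more divergent partners in the unrelated set is one way to keep that set representative of the full population's ancestry, which is the role the divergence scores play in the description above.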

PROs

~~doesn't need external inputs~~
good track record
- TOPMed
- favored in comparison studies (after more computationally heavy IBD-inference approaches)
designed to work well for both admixed and homogeneous populations

CONs

PC-Relate took about 8 hours to run (RStudio)
claims to not require external input, but it kind of does; that said, it can all be done in R, so you can set up a pipeline for it
- uses KING-robust estimates (commandline)
- or a SNPRelate function (in R)
- requires some reformatting from either output to prep for PC-AiR

Overall winner: GENESIS/TOPMed Pipeline

see tutorial