GENESIS: GRM to GWAS - pcgoddard/Burchardlab_Tutorials GitHub Wiki

GENESIS PC-Relate Tutorial

Pagé Goddard

Sep 28, 2017

Update: Oct 24, 2017


GENESIS is an R-based pipeline developed by TOPMed. It uses PC-AiR for population structure inference that is robust to known or cryptic relatedness, and it uses PC-Relate for accurate relatedness estimation in the presence of population structure, admixutre, and departures from Hardy-Weinberg equilibrium.

Contents

Resources

Genesis

- uses sample HapMap data
- details on the concepts behind the programs used

External packages

Libraries

# be sure to install biocLite
source("https://bioconductor.org/biocLite.R")
biocLite(c('GENESIS',"GWASTools", "SNPRelate"))
libs <- c("GENESIS", "GWASTools", "SNPRelate", "gdsfmt")
lapply(libs, require, character.only = TRUE) 
rm(libs)

Errors: If you are having difficulty installing the necessary packages, you may need to update your R environment. To check version, run version(). In R-studio, run updateR().

1. Input Files

GENESIS uses a model-free approach and thus requires only the genotype file and no externally calculated ancestry proportions. The functions in the GENESIS package read genotype data from a GenotypeData class object created by the GWASTools package. Through the use of GWASTools, a GenotypeData class object can easily be created from:

Your input can be in one of these three forms

PLINK files

The SNPRelate package provides the snpgdsBED2GDS function to convert binary PLINK files into a GDS file.

file description
bed.fn path to PLINK.bed file
bim.fn path to PLINK.bim file
fam.fn path to PLINK.fam file
out.gdsfn path for output GDS file
# convert to gds with SNPRelate
snpgdsBED2GDS(bed.fn = "genotype.bed", bim.fn = "genotype.bim", fam.fn = "genotype.fam", out.gdsfn = "mygenotype.gds")

# Then continue with GDS file instructions:
mygeno <- GdsGenotypeReader(filename = "PATH_TO/mygenotype.gds")
myenoData <- GenotypeData(geno)

GDS files

mygeno <- GdsGenotypeReader(filename = "PATH_TO/mygenotype.gds")
myenoData <- GenotypeData(geno)

R Matrix

file description
genotype matrix of genotype values coded as 0 / 1 / 2; rows index SNPs; columns index samples
snpID integer vector of unique SNP IDs
chromosome integer vector specifying chr of each snp
position integer vector psecifying position of each SNP
scanID vector of unique individual IDs
mygeno <- MatrixGenotypeReader(genotype = genotype, snpID = snpID, chromosome = chromosome, position = position, scanID = scanID)

mygenoData <- GenotypeData(mygeno)

2. Calculating pairwise measures of ancestry divergence

Identifying a mutually unrelated and ancestry representative subset of individuals. KING-robust kinship coefficient estimator provides negative estimates for unrelated pairs with divergent ancestry to prioritize ancestrally-diverse individuals for the representative subset. Can be calculated with KING-robust from GENESIS or snpgdsIBDKING from SNPRelate.

with SNPRelate

prefered since it can be done without leaving R

The vignette states: "Alternative to running the KING software, the snpgdsIBDKING function from the SNPRelate package can be used to calculate the KING-robust estimates directly from a GDS file. The ouput of this function contains a matrix of pairwise estimates, which can be used by the GENESIS functions" this is a lie You must extract the matrix and prepend the IIDs as row and column names.

Input: mygenotype.gds Command: snpgdsIBDKING Output: matrix of pairwise estimates

Options: type ("KING-robust" - for admixed pop,"KING-homo" - for homogeneous pop), sample.id (choose samples subset; default all), snp.id (choose SNPs subset; default all), autosome.only (default TRUE), maf (to use the SNPs with ">=maf" only; default no threshold) family.id (default NULL; all individuals treated as singletons. If provided, within- and between- family relationships are estimated differently), verbose

# snpgdsIBDKING requires a gds object; you cannot just point command to a .gds file
# read in the GDS file you just generated and verify its class
gdsfile <- snpgdsOpen(paste(genotype,".gds",sep=""))
class(gdsfile) #check for "gds.class"
# calculate KING IBD Kinship coefficients
ibd_king <- snpgdsIBDKING(gdsfile, type="KING-robust", verbose=TRUE)
class(ibd_king)
# snpgdsIBDClass (5 elements)
names(ibd_king)
# [1] "sample.id" "snp.id"    "afreq"     "IBS0"      "kinship"
# extract kinship matrix
KINGmat = as.matrix(ibd_king$kinship)

# check output
KINGmat[1:5,1:5]

##   0.5000000000 -0.004695793  0.001941412 -0.0009816093 -0.01903618
##  -0.0046957930  0.500000000 -0.009200767 -0.0070120462 -0.03430987
##   0.0019414117 -0.009200767  0.500000000 -0.0085147706 -0.01809720
##  -0.0009816093 -0.007012046 -0.008514771  0.5000000000 -0.02480543
##  -0.0190361844 -0.034309866 -0.018097203 -0.0248054274  0.50000000

#add row and column labels
rownames(KINGmat) <- c(ibd_king$sample.id)
colnames(KINGmat) <- c(ibd_king$sample.id)
# SAGE2 output using SNPRelate

KINGmat[1:5,1:5]

##                CH30380      VA30171      VA30167       CH30357     VA70028
##  CH30380  0.5000000000 -0.004695793  0.001941412 -0.0009816093 -0.01903618
##  VA30171 -0.0046957930  0.500000000 -0.009200767 -0.0070120462 -0.03430987
##  VA30167  0.0019414117 -0.009200767  0.500000000 -0.0085147706 -0.01809720
##  CH30357 -0.0009816093 -0.007012046 -0.008514771  0.5000000000 -0.02480543
##  VA70028 -0.0190361844 -0.034309866 -0.018097203 -0.0248054274  0.50000000

with KING-robust

KING (external program)
  • input: PLINK binary ped bile genotype.bed
  • command: king -b genotype.bed --kinship
  • output: king.kin, king.kin0

KING Output: .kin and .kin0 text files. king2mat extracts kinship coefficients for GENESIS functions.

king -b genotype.bed --kinship
# sample output

head king.kin # within family
##  FID     ID1     ID2     N_SNP   Z0      Phi     HetHet  IBS0    Kinship Error
##  28      1       2       2359853 0.000   0.2500  0.162   0.0008  0.2459  0
##  28      1       3       2351257 0.000   0.2500  0.161   0.0008  0.2466  0
##  28      2       3       2368538 1.000   0.0000  0.120   0.0634  -0.0108 0
##  117     1       2       2354279 0.000   0.2500  0.163   0.0006  0.2477  0
##  117     1       3       2358957 0.000   0.2500  0.164   0.0006  0.2490  0

head king.kin0 # between family
##  FID1    ID1     FID2    ID2     N_SNP   HetHet  IBS0    Kinship
##  28      3       117     1       2360618 0.143   0.0267  0.1356
##  28      3       117     2       2352628 0.161   0.0009  0.2441
##  28      3       117     3       2354540 0.120   0.0624  -0.0119
##  28      3       1344    1       2361807 0.093   0.1095  -0.2295
##  28      3       1344    12      2367180 0.094   0.1091  -0.2225
Convert to matrix with king2mat (GENESIS package)
  • input: .kin and .kin0 and iid
  • command: king2mat
  • output: matrix

iid = text file containing iids in order from GenotypeData

# read individual IDs from GenotypeData object
iids <- getScanID(genoData)
head(iids)

# create matrix of KING estimates
KINGmat <- king2mat(file.kin0 = system.file("wrkdir", "data.kin0", package="GENESIS"), 
                    file.kin = system.file("wrkdir", "data.kin", package="GENESIS"), 
                    iids = iids)
# sample output

KINGmat[1:5,1:5]

##           NA19919 NA19916 NA19835 NA20282 NA19703
##  NA19919  0.5000 -0.0009 -0.0059 -0.0080  0.0014
##  NA19916 -0.0009  0.5000 -0.0063 -0.0150 -0.0039
##  NA19835 -0.0059 -0.0063  0.5000 -0.0094 -0.0104
##  NA20282 -0.0080 -0.0150 -0.0094  0.5000 -0.0134
##  NA19703  0.0014 -0.0039 -0.0104 -0.0134  0.5000

3. Estimate principle components in related samples PC-AiR

Uses pairwise measure sof kinship and ancestry divergence to determine the unrelated and representative subset for analysis.

The KING-robust estimates are always used as measures of ancestry divergence for unrelated pairs of individuals; can also be used as measures of kinship for relatives (NOTE: they may be biased measures of kinship for admixed relatives with different ancestry)

Input:

input description
genoData GenotypeData class object
kinMat matrix of pairwise kinship coefficient estimates (KING-robust estimates or other source)
divMat matrix of pairwise measures of ancestry divergence (KING-robust estimates)

Command:

# run PC-AiR
mypcair <- pcair(genoData = mygenoData, kinMat = KINGmat, divMat = KINGmat)

You should see the following verbage:

    Partitioning Samples into Related and Unrelated Sets...
    Unrelated Set: 1665 Samples 
    Related Set: 320 Samples
    Running Analysis with 748665 SNPs - in 75 Block(s)
    Computing Genetic Correlation Matrix for the Unrelated Set: Block 1 of 75 ...
    ...
    Computing Genetic Correlation Matrix for the Unrelated Set: Block 75 of 75 ...
    Performing PCA on the Unrelated Set...
    Predicting PC Values for the Related Set: Block 1 of 75 ...
    ...
    Predicting PC Values for the Related Set: Block 75 of 75 ...
    Concatenating Results...

Output:

summary(mypcair)
    Call:
    pcair(genoData = myGenoData, kinMat = KINGmat, divMat = KINGmat)

    PCA Method: PC-AiR 

    Sample Size: 1985 
    Unrelated Set: 1665 Samples 
    Related Set: 320 Samples 

    Kinship Threshold: 0.02209709 
    Divergence Threshold: -0.02209709 

    Principal Components Returned: 20 
    Eigenvalues: 11.286 2.216 1.95 1.924 1.914 1.866 1.849 1.837 1.815 1.797 ...

    MAF Filter: 0.01 
    SNPs Used: 644529

Options

using a reference population alongside sample; perhaps to determine which PCs capture which ancestral groups

    pcair(genoData = mygenoData, unrel.set = IDs) # IDs of individuals from ref panel included in mygenoDate

    pcair(genoData = mygenoData,  kinMat = KINGmat, divMat = KINGmat, unrel.set = IDs) # IDs of individuals from ref panel included in mygenoDate; partition IDs first, then sample

Plotting PC-Air PCs

plot method provided by GENESIS package. Each point represents one individual. Visualization of population structure to identify clusters of individuals with similar ancestry. Can by altered by standard plot function manipulation. Basis and More Details

default: black dots = unrelated subset**;** blue pluses = related subsets

# plot top 2 PCs
plot(mypcair)

# plot PCs 3 and 4
plot(mypcair, vx = 3, vy = 4)

4. PC-Relate estimation of recent relatedness

Provides genetic relatedness estimates. Uses top PCs from PC-Air (ancestry capturing components) to adjust for population structure & individual ancestry in sample.

Input:

input file description
training.set mypcair$unrels vector of IIDs specifying unrelated subset to be used for ancestry adjustment per SNP
genoData mygenoData GenotypeData class object
pcMat mypcair$vectors[,1:n] matrix; columns are PCs 1 - n

Command:

# run PC-Relate
mypcrelate <- pcrelate(genoData = mygenoData, pcMat = mypcair$vectors[,1:2], training.set = mypcair$unrels)

You should see the following verbiage:

    Running Analysis with 748665 SNPs - in 75 Block(s)
    Running Analysis with 1985 Samples - in 1 Block(s)
    Using 2 PC(s) in pcMat to Calculate Adjusted Estimates
    Using 1665 Samples in training.set to Estimate PC effects on Allele Frequencies
    Computing PC-Relate Estimates...
    ...SNP Block 1 of 75 Completed - 7.165 mins
    ...
    ...SNP Block 75 of 75 Completed - 5.876 mins
    Performing Small Sample Correction...
(optional) format output as gds

The pcrelate function will either return an object of class pcrelate (when the input write.to.gds = FALSE) or it will save the output to a GDS file (when the input write.to.gds = TRUE). Saving the output to a GDS file is useful for large samples, as it allows for efficient storage and access to the estimates (see the package gdsfmt for more details). The following command can be used to read in the results from a previous PC-Relate analysis saved to the GDS file “tmp_pcrelate.gds”

  • write.to.gds = FALSE = pcrelate obj (default)
  • write.to.gds = TRUE = gds file tmp_pcrelate.gds

to read tmp_pcrelate.gds:

library(gdsfmt)
mypcrelate <- openfn.gds("tmp_pcrelate.gds")

Output:

command output
pcrelateReadKinship make a table of pairwise relatedness estimates
pcrelateReadInbreed make a table of individual inbreeding coeficients
pcrelateMakeGRM make a genetic relatedness matrix
option description
pcrelObj output from pcrelate; either a class pcrelate object or a GDS file
scan.include vector of individual IDs specifying which individuals to include in the table or matrix; default NULL
kin.thresh minimum kinship coefficient value to include in the table; default NULL which includes all pairs
f.thresh minimum inbreeding coefficient value to include in the table
scaleKin factor to multiply the kinship coefficients by in the GRM; default 2
# make pairwise relatedness estimates table
relatepairs.tbl <- pcrelateReadKinship(pcrelObj = mypcrelate, kin.thresh = NULL)
##         ID1     ID2  nsnp        kin           k0        k1           k2
## 7   NA19919 NA19908 19437 0.24668482 0.0015676062 1.0157801 -0.017347662
## 17  NA19919 NA19909 19626 0.24792942 0.0000000000 0.9911198  0.008880185
## 138 NA20282 NA20301 19765 0.32326286 0.0573556808 0.5780613  0.364583049
## 186 NA19902 NA19901 19640 0.25122134 0.0000000000 0.9883502  0.011649796
## 212 NA19902 NA19900 19771 0.23177090 0.0007574672 1.0228805 -0.023637964
# make inbreeding coefficient table
inbreedcoef.tbl <- pcrelateReadInbreed(pcrelObj = mypcrelate, f.thresh = 2^(-11/2))
##         ID  nsnp          f
## 11 NA20335 19949 0.03210545
## 13 NA19904 19814 0.02276732
## 20 NA20317 19933 0.02302650
## 30 NA20294 19827 0.02705174
## 37 NA20344 19905 0.03607983
# make GRM
mygrm <- pcrelateMakeGRM(pcrelObj = mypcrelate, scan.include = iids[1:5], scaleKin = 2)
# grm output

mygrm[1:5,1:5]

##               CH30380      VA30171      VA30167      CH30357       VA70028
##  0.9921236602  0.001587362  0.004589604  0.002636372  0.0008911428
##  0.0015873618  1.006070553  0.002143034  0.001831966 -0.0025865272
##  0.0045896042  0.002143034  1.002365658 -0.005183818  0.0049334393
##  0.0026363721  0.001831966 -0.005183818  1.003906724  0.0062427112
##  0.0008911428 -0.002586527  0.004933439  0.006242711  1.0005800629

quantile(mygrm)

##             0%           25%           50%           75%          100% 
##  -3.260058e-02 -3.014012e-03 -5.799979e-05  2.950375e-03  1.169809e+00

save files

# for .csv output: sep=',' and change suffix to ".csv"

write.table(mygrm, paste("GENESIS_Kincoef_matrix_",genotype,date,".txt",sep=""), row.names = F, quote = F)
write.table(relatepairs.tbl.nothresh, paste("GENESIS_relatedpairs_",genotype,date,".txt",sep=""), row.names = F, quote = F)
write.table(inbreedcoef.tbl.nothresh, paste("GENESIS_inbreedcoef_",genotype,date,".txt",sep=""), row.names = F, quote = F)

5. Convert to GCTA-friendly format

Our goal is to run a GWAS using our new GRM to correct for population structure and recent genetic relatedness in a more robust way than relying on cryptic relatedness filters in PLINK and correcting for % global African Ancestry.

Approach:

  1. convert GENESIS grm to the gcta grm.gz format
  2. use gcta to convert the grm.gz to a gcta-compatible grm

grm.gz format: Estimate the GRM, save the lower triangle elements to a compressed text file (e.g. test.grm.gz) and save the IDs in a plain text file (e.g. test.grm.id). source; grm.gz description

Output file format test.grm.gz (no header line; columns are indices of pairs of individuals (row numbers of the test.grm.id), number of non-missing SNPs and the estimate of genetic relatedness)

1    1    1000    1.0021
2    1    998     0.0231
2    2    999     0.9998
3    1    1000    -0.0031
...

test.grm.id (no header line; columns are family ID and individual ID)

011      0101
012      0102
013      0103
...

Reformat to GCTA-friendly .grm.gz format

## R

# get id file
ids <- as.data.frame(colnames(grm))
colnames(ids) <- "fid" # since our cohort is (supposedly) unrelated, fid=iid
ids$iid <- colnames(grm)
head(ids)
# fid     iid
# CH30380 CH30380
# VA30171 VA30171
# VA30167 VA30167
# CH30357 CH30357
# VA70028 VA70028
# CH30420 CH30420

write.table(ids, paste(geno,"_",date,"_genesis.grm.id", sep=""), quote = F, row.names = F, col.names = F) # col.names=F excludes header row
# get pairwaise kinship measure in linear dataframe
grm[upper.tri(grm, diag = F)] <- NA # set upper triangle of matrix to NA
cbind <- cbind(which(!is.na(grm),arr.ind = TRUE),na.omit(as.vector(grm)))
cbind <- as.data.frame(cbind)

head(cbind)
# [rownames] row col             V3
#  CH30380     1   1   0.9921236602
#  VA30171     2   1   0.0015873618
#  VA30167     3   1   0.0045896042
#  CH30357     4   1   0.0026363721
#  VA70028     5   1   0.0008911428
#  CH30420     6   1   0.0036559534

Now we are just missing the nsnp column indicating the number of non-missing snps between individuals. This information is stored in the pcrelate object we created earlier.

# take a look
mypcrelate$nsnp[1:5,1:5]
#        [,1]   [,2]   [,3]   [,4]   [,5]
# [1,] 628577     NA     NA     NA     NA
# [2,] 613092 618998     NA     NA     NA
# [3,] 620244 613901 640392     NA     NA
# [4,] 608682 608707 619293 623879     NA
# [5,] 618972 608727 630423 612669 688307

# get nsnp between samples
nsnp <- cbind(which(!is.na(mypcrelate$nsnp),arr.ind = TRUE),na.omit(as.vector(mypcrelate$nsnp)))
nsnp <- as.data.frame(nsnp)

head(nsnp)
#      row col       
# [1,]   1   1 628577
# [2,]   2   1 613092
# [3,]   3   1 620244
# [4,]   4   1 608682
# [5,]   5   1 618972
# [6,]   6   1 556901
# merge into gcta dataframe
gcta <- as.data.frame(cbind$row)
gcta$col <- cbind$col
gcta$nsnp <- nsnp$V3
gcta$kin <- cbind$V3

head(gcta)
# index$X1 X2   nsnp           kin
#       1  2 613092  0.0015873618
#       1  3 620244  0.0045896042
#       1  4 608682  0.0026363721
#       1  5 618972  0.0008911428
#       1  6 556901  0.0036559534
#       1  7 578414 -0.0004089310

write.table(gcta, paste(geno,"_",date,"_genesis.grm.gz", sep=""), quote = F, row.names = F, col.names = F)

Convert from .grm.gz to gcta binary format

## commandline
geno="mydata.prefix"
bin="/media/BurchardRaid01/LabShare/Home/pgoddard/gcta1.26.0"
$bin/gcta64 --grm-gz ${geno} --make-grm --out ${geno}_gcta

You are now ready to do a GWAS in GCTA using your shiny new custom grm


GENESIS Pipeline

Here I have concatenated all the commands for the GENESIS grm development to gwas pipeline with minimal commenting. Pipeline assumes plink bindary files. Extensive explanations of each can be found above in the vignette.

## r
# set up
source("https://bioconductor.org/biocLite.R")
biocLite(c('GENESIS',"GWASTools", "SNPRelate"))
libs <- c("GENESIS", "GWASTools", "SNPRelate", "gdsfmt")
lapply(libs, require, character.only = TRUE) 
rm(libs)

geno = "SAGE_mergedLAT-LATP_030816_rsIDs" # set to your plink data prefix
date=yymmdd # set date
setwd("wrkdir") # set wrkdir

# read in genotype data
snpgdsBED2GDS(bed.fn = paste(geno,".bed",sep=""), bim.fn = paste(geno,".bim",sep=""), fam.fn = paste(geno,".fam",sep=""), out.gdsfn = paste(geno,".gds",sep=""))
mygeno <- GdsGenotypeReader(paste(geno,".gds",sep=""))
mygenoData <- GenotypeData(mygeno)
close(mygeno)
## Pairwise measures of ancestry divergence

# calculate KING scores
gdsfile <- snpgdsOpen(paste(geno,".gds",sep="")) # read in GDS in SNPRelate format
class(gdsfile) # verify class "gds.class"
ibd_king <- snpgdsIBDKING(gdsfile, type="KING-robust", verbose=TRUE) # calculate king coefficients
KINGmat = as.matrix(ibd_king$kinship) # extract kinship matrix
rownames(KINGmat) <- c(ibd_king$sample.id)
colnames(KINGmat) <- c(ibd_king$sample.id)

# run pc-air
mypcair <- pcair(genoData = mygenoData, kinMat = KINGmat, divMat = KINGmat)

# plot PCs
plot(mypcair)
## Estimation of recent relatedness

# run pc-relate # this is VERY slow
mypcrelate <- pcrelate(genoData = myenoData, pcMat = mypcair$vectors[,1:2], training.set = mypcair$unrels)

# get kinship table, inbreeding table and grm
relatepairs.tbl.diag <- pcrelateReadKinship(pcrelObj = mypcrelate, kin.thresh = NULL)
inbreedcoef.tbl <- pcrelateReadInbreed(pcrelObj = mypcrelate, f.thresh = 2^(-11/2))
grm <- pcrelateMakeGRM(pcrelObj = mypcrelate, scaleKin = 2)

write.table(grm, paste("GENESIS_Kincoef_matrix_",geno,"_",date,".txt",sep=""), row.names = F, quote = F)
write.table(relatepairs.tbl, paste("GENESIS_relatedpairs_",geno,"_",date,".txt",sep=""), row.names = F, quote = F)
write.table(inbreedcoef.tbl, paste("GENESIS_inbreedcoef_",geno,"_",date,".txt",sep=""), row.names = F, quote = F)
## Reformat to GCTA-friendly .grm.gz format
ids <- as.data.frame(colnames(grm))
colnames(ids) <- "fid"
ids$iid <- colnames(grm)

# get pairwaise kinship measure in linear dataframe
grm[upper.tri(grm, diag = F)] <- NA
cbind <- cbind(which(!is.na(grm),arr.ind = TRUE),na.omit(as.vector(grm)))

# get nsnp between samples
nsnp <- cbind(which(!is.na(mypcrelate$nsnp),arr.ind = TRUE),na.omit(as.vector(mypcrelate$nsnp)))
nsnp <- as.data.frame(nsnp)

# merge into gcta dataframe
cbind <- as.data.frame(cbind)
gcta <- as.data.frame(cbind$row)
gcta$col <- cbind$col
gcta$nsnp <- nsnp$V3
gcta$kin <- cbind$V3

# save files
write.table(ids, paste(geno,"_",date,"_genesis.grm.id", sep=""), quote = F, row.names = F, col.names = F)
write.table(gcta, paste(geno,"_",date,"_genesis.grm.gz", sep=""), quote = F, row.names = F, col.names = F)
## commandline

# set variables
date="yymmdd" # update
geno="mydata.prefix_${date}" # point to your .grm.gz and .grm.id files
bin="/media/BurchardRaid01/LabShare/Home/pgoddard/gcta1.26.0" # wherever gcta is installed
phenofile="myphenotypes"
covarcat="my_categorical_covariates"
covarquant="my_numeric_covariates"
outname="${geno}_gcta_mlma"
tread="32" # adjust based on your system; typical is 10

# convert from .gz grm to gcta-ready grm
$bin/gcta64 --grm-gz ${geno}_genesis --make-grm --out ${geno}_gcta

# run GWAS in gcta
$bin/gcta64 --mlma --bfile $geno --grm $geno_gcta --pheno $phenofile --covar $covarcat --qcovar $covarquant --out $outname --thread-num $thread

# QED #

note: to look at just the first 5 rows and first 5 columns of matrix

# in shell
cut -d' ' -f 1-5 test.file | head -5 -
# in r
test.file[1:5,1:5]