ADMIXTURE - pcgoddard/Burchardlab_Tutorials GitHub Wiki

ADMIXTURE Tutorial

Pagé Goddard

Sep 28, 2017

Content

ADMIXTURE Overview
Set up: Choose your file locations
Pipeline
Parsing Results
Bonus: Thinning Files

Tutorial for calculating global ancestry proportions from known ancestral populations using ADMIXTURE.

Sample data: SAGE 2

ADMIXTURE manual

ADMIXTURE Summary

Input	Output
test.bed (plink)	test.k.Q (ancestry fractions)
test.bim (plink)	test.k.P (allele freq of inferred pop)
test.fam (plink)
test.pop (generate in R)

test includes cohort of interest AND reference populations
test binary files are 1/2 encoded
k = number of reference populations

File Locations

script variables and directories

date="171231" # update current yymmdd
wkdir="$HOME/wkdir" # choose your working directory
ancestraldir="path/to/your/ancestry.reference.panels.directory"
datadir="path/to/your/genotype.data.dir"

# data files
sagegenofile="$datadir/genotype.data" # I'v set it up this way so that you can make a second object (eg: galagenofile) for another dataset
ancestralgenofile="$ancestraldir/ref.panels" # reference panel genotypes

# all your data should be in PLINK binary format (.bim .bed. fam)

PLINK: format data

ADMIXTURE requires 1 input that contains your population of unknown ancestry and the references of known ancestry. If you do not have a cohort-reference panel file, you must merge them together in PLINK. NB: plink will merge the files in alphabetical order by FID, so you may have your reference panels merged into the middle of the file rather than appended.

# PLINK variables
merge="genotype_ancestry_attempt1_${date}"
ancestry_flip="ref_panels_flippedalleles_${date}"
myadmix="genotype_ancestry_merge_${date}"

# merge cohort + reference pops
plink --bfile $sagegenofile --bmerge $ancestralgenofile.bed $ancestralgenofile.bim $ancestralgenofile.fam --make-bed --out $merge

# if error due to missed snps, flip the problematic snps and merge again
plink --bfile $ancestralgenofile --flip $merge_missnp --make-bed --out $ancestry_flip
plink --bfile $sagegenofile --bmerge $ancestry_flip --make-bed --out $merge

# if error persists, remove the snps with --exclude missnp

# recode Cohort + Ref binary files in 1/2 format
plink --bfile $merge --recode 12 --out $myadmix

optional: thin files for admixture, see bottom

R: build popfile

.pop is required for supervised admixture estimation; Each line of the .pop file corresponds to individual listed on the same line number in the .fam file. If the individual is a population reference, the .pop file line should be a string (beginning with an alphanumeric character) designating the population. If the individual is of unknown ancestry, use “-” (or a blank line, or any non-alphanumeric character) to indicate that the ancestry should be estimated. The final format should be a single column with no header that lines up with the fam file as shown here:

fam_ID_source	popfile
sage	-
sage	-
sage	-
euro_ref	ceu
euro_ref	ceu
afr_ref	yri
afr_ref	yri
sage	-
sage	-

# R variables
setwd("PATH_TO/wkdir")
date <- "171231" # update current yymmdd
mydat <- "genotype_ancestry_merge_" # same prefix as myadmix
merge <- read.table(file=paste(mydat, date, ".fam", sep=""), header = F, sep = ' ') # .fam file has the samples for your population and reference populations
ceu <- read.table("samples_ceu.txt", header = F, sep = ' ') # european ancestry
yri <- read.table("samples_yri.txt", header = F, sep = ' ') # african ancestry
nam <- read.table("samples_nam.txt", header = F, sep = ' ') # native american ancestry

# make popfile
merge$pop = ifelse(merge$X1 %in% ceu$V1, 'CEU', ifelse(merge$X1 %in% yri$V1, 'YRI', '-')) 
    # if IID is from CEU ref, write CEU; 
    # if IID is from YRI ref, write YRI; 
    # if IID is from NAM ref, write NAM;
    # if IID is in neither ref, it is a SAGE individual with unknwon ancestry; write - as placeholder
popfile <- as.data.frame(merge$pop)

# write out
write.table(popfile, file=paste(mydat, date, ".pop", sep=""), row.names = F, quote = F)
    # check that same prefix as $myadmix input

ADMIXTURE: calculate global ancestry

# ADMIXTURE variables

## required
myadmix="genotype_ancestry_merge_${date}"
popfile=${myadmix}.pop
npop="2" # must reflect number of populations in reference panel

## optional
nthreads="64"
accelnum="qn3"
seed="2017"
#bootnum="200" # optional


# admixture script
admixture $myadmix.ped $npop --supervised --seed=$seed -j$nthreads

Parsing ADMIXTURE results

Admixture output:

myadmix.2.P - allele frequencies of the inferred ancestral populations (SNP x ancestry) order: CEU YRI

wc -l $myadmix.2.P
    # 801660 lines
head $myadmix.2.P
    # 0.006379 0.000010
    # 0.827382 0.404886
    # 0.048119 0.021815
    # 0.874641 0.306484

myadmix.2.Q - ancestry fractions for each individual (sample x ancestry) order: CEU YRI

wc -l $myadmix.2.Q
    # 2165 lines
head $myadmix.2.Q
    # 0.142886 0.857114
    # 0.123782 0.876218
    # 0.208400 0.791600
    # 0.109842 0.890158

Reformat results

The following output can be used with programs such as REAP to calculate genetic relatedness matrices

# REAP variables
reapinput="genotype_transposed_${date}"
myID="${reapinput}_IID_only.txt"
admixprop="admixporp_${date}"
admixprop_sorted="admixprop_sorted_${date}"

# transpose genotype files
plink --bfile $sagegenofile --recode 12 transpose --output-missing-genotype 0 --out $reapinput

# reformat
cut -d' ' -f1 $reapinput.tfam > $myID # SAGE ID list
cut -d' ' -f1-2 merged_trial.fam | paste -d' ' - $myadmix.2.Q > admix_out_IDs_${date} # paste global ancestries to IDs; FID IID CEU YRI
    # -d' ' reads space as delimiter
    # -f1-2 selects fields 1 through 2 to extract with cut
    # - directs paste to use the standard input form the pipe instead of a file
grep -Fwf $myID admix_out_IDs_${date} > $admixprop # extract sage only
    # -f read patterns from file1
    # -F read patterns as plain strings
    # -w match patterns as whole word
awk 'FNR==NR {x2[$1] = $0; next} $1 in x2 {print x2[$1]}' $admixprop $reapinput.tfam > $admixprop_sorted
    # sorts $admixprop to match SAGE fam file
    # FNR==NR holds true for first named file
    # current line $0 is stored in associative array named x2 indexed by the first field [$1]
    # $1 in x2 will only start after first name file has been read completely
    # looks at the first field of line in second file, and prints the corresponding line from first

output order: FID IID EUR AFR

head $admixprop_sorted
    # CH30380 CH30380 0.153140 0.846860
    # VA30171 VA30171 0.120529 0.879471
    # VA30167 VA30167 0.180591 0.819409
    # CH30357 CH30357 0.134801 0.865199

Plotting results

#### files must be formatted with: race, eur, afr, natam
  # no subjectID, race column must come first. check example inputs

### Call libraries
require(truncnorm)
require(ggplot2)
require(gridExtra)
require(reshape2)

# set up space
setwd("PATH_TO/wkdir")
date <- "171231" # update current yymmdd
sage <- read.delim(print("$myadmix.2.Q"))
colnames(sage) <- c("EUR","AFR")
cbind(race = "AA", sage)
head(sage)
    #  race      EUR      AFR
    #  AA 0.142886 0.857114
    #  AA 0.123782 0.876218
    #  AA 0.208400 0.791600
    #  AA 0.109842 0.890158

# order data by increasing european ancestry
eur.sort = sage[order(sage$EUR),]
eur.sort$rank = 1:length(eur.sort$EUR)

# format data into geom_bar input
eur.sort.long = melt(eur.sort, id = c('rank','race'), variable.name = 'Ancestry')
head(eur.sort.long)
    #  race rank  Ancestry    value
    #    AA 1      EUR 0.088553
    #    AA 2      EUR 0.100432
    #    AA 3      EUR 0.134868
    #    AA 4      EUR 0.181657

# create plots
afr.plot <- ggplot(eur.sort.long, aes(x=rank, y=value, fill=Ancestry))
afr.plot <- afr.plot + geom_bar(width=1, stat='identity')           # play with width to see what it does
afr.plot <- afr.plot +  theme(legend.position="bottom")                 # move legend
afr.plot <- afr.plot +  xlab('Individual') + ylab('Genetic ancestry')# label the axes
afr.plot <- afr.plot +  ggtitle('SAGE 2')                 # create a title
afr.plot <- afr.plot +  scale_fill_manual(labels=c("Eur","Afr"),values=c("#cc9999","#660099"))
afr.plot <- afr.plot + theme(axis.text.x=element_blank(),
                             axis.ticks.x=element_blank())

afr.plot

resulting plot

# QED #

Thinning files

Documentation SEC 2.3

speeds up ADMIXTURE

SAGE 2

# PLINK variables
datadir="path/to/your/genotype.data.dir"
genofile=$myadmix
slidingwindowsize=50
slidewidth=10
r2val="0.1"
thingenofile="${myadmix}_thinned_indeppairwise_${slidingwindowsize}_${slidewidth}_${r2val}"

# run PLINK
plink --bfile $genofile --indep-pairwise $slidingwindowsize $slidewidth $r2val --make-bed --out $thingenofile

You can then continue through the popfile and admixture steps with the thinned files. NB: Remember to test the order of thingenofile before making popfile