The Reference Allele Frequency File - GenomicRisk/aeon GitHub Wiki

Genetic ancestry estimation can be performed using either supervised or unsupervised approaches. AEon is a supervised method; its underlying model relies on the data provided in its reference allele frequency file. The benefit of a supervised approach is that it provides consistent ancestry estimates which allow the user to interpret results within the context of established reference populations and associated population health data - a distinct advantage in clinical genomics applications.

However, a supervised method can only identify ancestry from populations present in its reference database, which limits its utility for unrepresented populations. To address this known limitation, AEon accepts user-supplied reference allele frequency files, so it can be extended to further populations when relevant allele frequency information is available.

You may also find that the default reference AF file contains loci outside the regions covered by your input sample VCF (e.g. if you only have exome data). By default, AEon assumes you have full genome coverage and imputes missing genotypes as homozygous reference. If this is NOT the case, you MUST create a subset AF file that reflects your input data to avoid major confounding.

Below you will find information on the default reference AF file as well as instructions on how to modify this reference to suit your needs.

The Default Reference Allele Frequency File

Reference Populations

Allele frequencies from 26 reference populations were obtained using data from 1000 Genomes. These reference populations are grouped into 5 superpopulations: African (AFR), American (AMR), East Asian (EAS), European (EUR) and South Asian (SAS). Due to the dynamic nature of human populations across their historical and geographical contexts, many genetic markers are shared between populations - particularly within the same superpopulation. As such, examining output at the superpopulation level can provide a clearer view of the results.
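As a rough illustration of viewing results at the superpopulation level, the sketch below sums per-population ancestry fractions within each superpopulation. It assumes the fractions are already available as a plain dictionary; the function name and the small mapping are hypothetical, and the full mapping would come from pop2super.txt.

```python
from collections import defaultdict

# Hypothetical subset of the population -> superpopulation mapping.
# The complete mapping is the one stored in refs/pop2super.txt.
POP2SUPER = {"CEU": "EUR", "GBR": "EUR", "CLM": "AMR", "PUR": "AMR", "YRI": "AFR"}


def to_superpopulations(fractions, pop2super):
    """Sum per-population ancestry fractions within each superpopulation.

    `fractions` maps population identifiers to ancestry fractions.
    How these are read from AEon's output is left out of this sketch.
    """
    totals = defaultdict(float)
    for pop, frac in fractions.items():
        totals[pop2super[pop]] += frac
    return dict(totals)
```

Calling `to_superpopulations({"CEU": 0.55, "GBR": 0.40, "CLM": 0.03, "PUR": 0.02}, POP2SUPER)` would collapse the four population estimates into one EUR total and one AMR total.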

It is worth noting that the American populations sampled by the 1000 Genomes Project demonstrate some European admixture, in particular the Colombian (CLM) and Puerto Rican (PUR) populations. This is an accurate representation of these present-day population groups, but it does impact their use as reference samples. Since these admixed population samples were used to generate reference allele frequencies, AEon may assign small fractions of American ancestry to individuals with European genetic ancestry when using the default reference file. If your sample scores highly for EUR populations and has a very small fraction of ancestry (e.g. <=0.03) assigned to either CLM or PUR, treat the AMR estimate with caution. Visualisation can help in these cases.
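The rule of thumb above could be applied programmatically along the following lines. This is only a sketch: the 0.03 ceiling comes from the example in the text, while the 0.9 cut-off for "scores highly for EUR", the function name and the dictionary input are assumptions made for illustration.

```python
# The five European populations in 1000 Genomes.
EUR_POPS = {"CEU", "FIN", "GBR", "IBS", "TSI"}
# The admixed American populations singled out above.
ADMIXED_AMR_POPS = {"CLM", "PUR"}


def needs_amr_caution(fractions, eur_min=0.9, amr_max=0.03):
    """Return True when a mostly-EUR sample carries a very small CLM/PUR fraction.

    `eur_min` is an assumed threshold for "scores highly for EUR";
    `amr_max` mirrors the <=0.03 example given in the text.
    """
    eur_total = sum(frac for pop, frac in fractions.items() if pop in EUR_POPS)
    small_amr = any(
        0 < fractions.get(pop, 0.0) <= amr_max for pop in ADMIXED_AMR_POPS
    )
    return eur_total >= eur_min and small_amr
```

A flagged sample is not necessarily wrong; the flag only marks estimates where the AMR fraction deserves a closer look.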

The 128,097 ancestry-informative loci

An initial list of variants of interest (in GRCh37 coordinates) was retrieved from mpinese/mgrb-manuscript. This list contained 133,872 variants, selected using several criteria:

  • Easy to sequence: to reduce the number of missing or poor-quality values in both training and test sets
  • Ancestry informative: loci that demonstrated Hardy-Weinberg Equilibrium within 1000 Genomes populations, but not across all populations
  • LD-pruned: variants were pruned based on Linkage Disequilibrium in order to discard loci correlated through co-inheritance patterns
  • Excluded highly conserved regions: since homogeneity across all populations would give no discriminative power

For further info, see Pinese et al. (2020).

This initial list was then refined to suit the 1000 Genomes reference set. Variants whose GRCh37 alternate allele had become the reference allele in GRCh38 were excluded, as were variants with a missing genotype in any of the 2,504 reference samples, resulting in the final list of 128,097 variants.

Subsetting the Reference AF File

If your VCF does not cover the whole genome, e.g. the data was generated with Whole Exome Sequencing, or you're using a subset VCF for testing that only covers one chromosome, you will need to subset the reference AF file to your regions of coverage. Otherwise, all the loci present in the reference AF file but missing from your VCF will be imputed as homozygous reference, leading to erroneous results.

When you have a regions.bed file

The easiest way to create a reference AF file containing only your regions of interest is to use bedtools (tested with bedtools v2.31.0; see the bedtools documentation for installation instructions). Given a .bed file containing your regions of interest, use the following command:

bedtools intersect -a refs/g1k_allele_freqs.txt -b regions.bed -header > subset_AFs.txt

You can then run AEon with your subset reference AF file using the -a flag:

poetry run aeon sample_variants.bcf -o my_output -a subset_AFs.txt
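If bedtools is not available, the same subsetting can be approximated in plain Python. The sketch below is not part of AEon: the function names are hypothetical, and it assumes the AF file has a single header line, that its START and STOP columns are 0-based half-open coordinates like BED, and that chromosome names match between the two files. Unlike the default bedtools intersect behaviour, it writes each overlapping AF row exactly once and unmodified.

```python
import bisect


def load_regions(bed_path):
    """Read a BED file into per-chromosome sorted, merged (starts, ends) lists."""
    by_chrom = {}
    with open(bed_path) as fh:
        for line in fh:
            if not line.strip() or line.startswith(("#", "track", "browser")):
                continue
            chrom, start, end = line.split()[:3]
            by_chrom.setdefault(chrom, []).append((int(start), int(end)))
    merged = {}
    for chrom, intervals in by_chrom.items():
        intervals.sort()
        out = [list(intervals[0])]
        for start, end in intervals[1:]:
            if start <= out[-1][1]:  # overlapping or touching: extend
                out[-1][1] = max(out[-1][1], end)
            else:
                out.append([start, end])
        merged[chrom] = ([s for s, _ in out], [e for _, e in out])
    return merged


def overlaps(regions, chrom, start, stop):
    """True if half-open [start, stop) overlaps any region on `chrom`."""
    if chrom not in regions:
        return False
    starts, ends = regions[chrom]
    i = bisect.bisect_left(starts, stop) - 1  # last region starting before `stop`
    return i >= 0 and ends[i] > start


def subset_af_file(af_path, bed_path, out_path):
    """Copy the header plus every AF row overlapping a BED region; return rows kept."""
    regions = load_regions(bed_path)
    kept = 0
    with open(af_path) as src, open(out_path, "w") as dst:
        dst.write(src.readline())  # assumed single header line
        for line in src:
            if not line.strip():
                continue
            chrom, start, stop = line.split("\t")[:3]
            if overlaps(regions, chrom, int(start), int(stop)):
                dst.write(line)
                kept += 1
    return kept
```

For example, `subset_af_file("refs/g1k_allele_freqs.txt", "regions.bed", "subset_AFs.txt")` would write a file that can then be passed to AEon with the -a flag.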

Getting loci directly from your input VCF/BCF

If you don't already have a .bed file containing the regions covered by your VCF, you can still create a subset reference AF file. It takes a few more steps and requires both bcftools and bedtools to be installed.

# 1. Remove non-SNPs from VCF if present (as AEon does not use other variant types), and re-index:
bcftools filter -e 'TYPE!="snp"' Aeon_example_phase3_1kg.bcf -Ob -o Aeon_example_phase3_1kg_snps.bcf
bcftools index Aeon_example_phase3_1kg_snps.bcf

# 2. Create a bed file from this VCF
bcftools query -f '%CHROM\t%POS0\t%END\n' Aeon_example_phase3_1kg_snps.bcf > subset_loci.bed

# 3. Perform an intersect with the default reference AF file to generate your new subset AF file:
bedtools intersect -a refs/g1k_allele_freqs.txt -b subset_loci.bed -header > subset_AFs.txt

You can then run AEon with your subset reference AF file using the -a flag:

poetry run aeon Aeon_example_phase3_1kg_snps.bcf -o my_output -a subset_AFs.txt
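Before running AEon it can be worth checking how many reference loci survived the intersect. The sketch below is a simple sanity check rather than anything AEon provides; the function names are hypothetical, and it assumes both files start with exactly one header line.

```python
def count_loci(af_path):
    """Count non-empty data rows in an AF file, skipping one header line."""
    with open(af_path) as fh:
        next(fh, None)  # assumed single header line
        return sum(1 for line in fh if line.strip())


def retained_fraction(full_af_path, subset_af_path):
    """Fraction of loci in the full AF file that remain in the subset."""
    total = count_loci(full_af_path)
    return count_loci(subset_af_path) / total if total else 0.0
```

A call such as `retained_fraction("refs/g1k_allele_freqs.txt", "subset_AFs.txt")` gives a quick indication of how much of the default marker set your data covers.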

Extending to new populations

Adding more populations to the reference file

If you have access to allele frequencies from new populations, you can easily add them to the reference file by pasting an additional column at the end of the existing tab-delimited file. The first row of your column should contain a population identifier (e.g. 'FIN' for Finnish), while the following rows contain the allele frequency in that population for the corresponding SNP. You can find the exact locus and variant referred to in each row from the first 4 columns in the reference file: CHROM START STOP VAR_ID. See g1k_allele_freqs.txt in the refs directory for further clarification.
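Pasting a column by hand relies on the rows being in exactly the same order, so matching on VAR_ID can be safer. The following sketch appends a population column that way. The function name is hypothetical, and it assumes your new frequencies are held in a dictionary keyed by the same VAR_ID strings used in the reference file. It stops with an error rather than guessing a value when a locus has no frequency.

```python
def add_population(af_path, out_path, pop_id, new_freqs):
    """Write a copy of the AF file with one extra population column.

    `new_freqs` maps VAR_ID -> allele frequency for the new population.
    """
    with open(af_path) as src, open(out_path, "w") as dst:
        header = src.readline().rstrip("\n").split("\t")
        if pop_id in header:
            raise ValueError(f"{pop_id} is already a column in {af_path}")
        var_col = header.index("VAR_ID")
        dst.write("\t".join(header + [pop_id]) + "\n")
        for line in src:
            if not line.strip():
                continue
            fields = line.rstrip("\n").split("\t")
            var_id = fields[var_col]
            if var_id not in new_freqs:
                # Do not invent a frequency for an unobserved locus.
                raise KeyError(f"no {pop_id} frequency supplied for {var_id}")
            dst.write("\t".join(fields + [str(new_freqs[var_id])]) + "\n")
```

If the function raises partway through, the output file is incomplete and should be discarded.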

Once you have added a population to your reference AF file, you will also need to add this population to the tabular 'population labels' file. This file contains the superpopulation assignment for each population. If the population you are adding is itself a superpopulation, simply use the same population identifier in both the population and superpopulation column. See pop2super.txt in the refs directory for format.
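A quick way to confirm that the two files agree is to check that every population column in the AF file also appears in the population labels file. The sketch below assumes the labels file is tab-delimited with the population identifier in its first column, and that population columns in the AF file start at the fifth column; the function name is hypothetical, and pop2super.txt remains the authoritative reference for the format.

```python
def unlabelled_populations(af_path, labels_path):
    """Return AF-file population columns that have no row in the labels file."""
    with open(af_path) as fh:
        header = fh.readline().rstrip("\n").split("\t")
    populations = header[4:]  # columns after CHROM, START, STOP, VAR_ID
    with open(labels_path) as fh:
        # First column of every non-empty row; a header row, if present,
        # just adds one harmless extra entry to the set.
        labelled = {line.split("\t")[0].strip() for line in fh if line.strip()}
    return [pop for pop in populations if pop not in labelled]
```

An empty list means every population in the AF file has a superpopulation assignment.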

Once you have done this, you can now run AEon using your new reference AF file with the following command:

poetry run aeon sample_variants.bcf -a new_reference_afs.txt --population_labels pops_table.txt -o my_output

Note that the ancestry-informative markers in the default reference AF file were selected by Hardy-Weinberg analysis of the 26 1000 Genomes Project populations, so they may not be an ideal set for characterising new populations. However, preliminary results from testing with Oceanian populations suggest that these loci remain effective markers when extended to previously unseen populations. If particular alleles are known to be associated with your population of interest but are not present in the default AF file, you can insert a new row for each such variant into your reference AF file. The reference AF file does not need to be ordered by chromosome and position, but keeping it in order improves runtime when processing input.
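After inserting new rows, the file can be put back into chromosome and position order with something like the sketch below. The helper names are hypothetical, and the ordering used here (numeric chromosomes first, then others alphabetically, then START) is an assumption; the order that suits AEon best may instead be the contig order of your input VCF.

```python
def chrom_key(chrom):
    """Sort key: numeric chromosomes in numeric order, then others by name."""
    name = chrom[3:] if chrom.lower().startswith("chr") else chrom
    return (0, int(name), "") if name.isdigit() else (1, 0, name)


def sort_af_rows(rows):
    """Sort AF data rows (lists of fields) by chromosome, then START."""
    return sorted(rows, key=lambda row: (chrom_key(row[0]), int(row[1])))
```

The header line should be set aside before sorting and written back first.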

#TODO: step-by-step tutorial of going from population sample VCF to inclusion in reference file.

Making a new reference file

You can also make your own allele frequency reference file from scratch if you want to use entirely different population allele frequencies. Take care to select appropriate ancestry-informative loci if you choose to do so (see "The 128,097 ancestry-informative loci" above). Make sure your allele frequency file is tab-delimited with the header CHROM START STOP VAR_ID POP1 [POP2 ...], and format your variant IDs as chrN_pos_REF_ALT. You will also need to supply a tab-delimited file of population labels and their corresponding superpopulations. See g1k_allele_freqs.txt and pop2super.txt in the refs directory for examples.
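A new reference file can be checked against the format described above before it is passed to AEon. The sketch below is an illustrative validator, not AEon's own parser: the function name and the regular expression are assumptions based on the stated header and the chrN_pos_REF_ALT pattern, and the check that frequencies lie between 0 and 1 is an added sanity test.

```python
import re

REQUIRED_COLUMNS = ["CHROM", "START", "STOP", "VAR_ID"]
# Assumed reading of chrN_pos_REF_ALT: chromosome, position, REF and ALT bases.
VAR_ID_RE = re.compile(r"^chr[0-9A-Za-z]+_\d+_[ACGT]+_[ACGT]+$")


def validate_af_file(path):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    with open(path) as fh:
        header = fh.readline().rstrip("\n").split("\t")
        if header[:4] != REQUIRED_COLUMNS:
            problems.append("header must start with CHROM, START, STOP, VAR_ID")
        if len(header) < 5:
            problems.append("header has no population columns")
        for number, line in enumerate(fh, start=2):
            if not line.strip():
                continue
            fields = line.rstrip("\n").split("\t")
            if len(fields) != len(header):
                problems.append(f"line {number}: expected {len(header)} fields")
                continue
            if not VAR_ID_RE.match(fields[3]):
                problems.append(f"line {number}: unexpected VAR_ID {fields[3]!r}")
            for value in fields[4:]:
                try:
                    in_range = 0.0 <= float(value) <= 1.0
                except ValueError:
                    in_range = False
                if not in_range:
                    problems.append(f"line {number}: bad frequency {value!r}")
                    break
    return problems
```

Running `validate_af_file("new_reference_afs.txt")` and printing the returned list gives a quick view of any formatting issues.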