Tutorial: Adding a new reference population from VCF reference data - GenomicRisk/aeon GitHub Wiki

So you have access to your own reference VCF and want to extract population allele frequencies (AFs) from it? No problem!

Here I'll show you how to go from VCF -> your own reference AF file. This involves a combination of default CLI tools, as well as bcftools and python, so make sure you have those installed before you go ahead.

The general outline is:

Subset your VCF for ancestry-informative loci and samples you're interested in
Generate allele frequencies
Merge results with existing reference AF
Update the population labels file

1. Subset your VCF

1. a) Subset for AEon's loci

This is required if you are adding new populations to the existing reference file. If you are creating an entirely new reference file without any of the default reference populations, then you can subset for your own marker alleles rather than AEon's list. However, if you are selecting marker alleles apart from those listed in AEon's default reference AF file, then it is your responsibility to ensure they are appropriate and informative for your populations of interest (See 'The 128,097 ancestry-informative loci' for selection criteria).

# Create a 'regions' file containing the chromosome/position for AEon's ancestry-informative loci
cut -f1,3 refs/g1k_allele_freqs.txt | tail -n128097 > aeon_loci.txt

# Subset your VCF for relevant loci using this file
bcftools view my_new_population_samples.vcf.gz \
	-R aeon_loci.txt \
	-o my_new_population_samples.ae_loci.bcf

# Expand multi-allelic loci to be bi-allelic (simplifies later steps)
bcftools norm \
	-a --atom-overlaps . \
	my_new_population_samples.ae_loci.bcf \
	-o my_new_population_samples.ae_loci.norm.bcf

1. b) Split or subset for specific samples [OPTIONAL]

You might have samples in your VCF belonging to one or multiple new populations. If this is the case, you'll need to make a VCF per population group.

b i) To extract just one population:

Create a file my_samples.txt listing all the samples from your VCF that belong to your population of interest, with one sample ID per line. Then subset:

bcftools view my_new_population_samples.ae_loci.norm.bcf \
	-S my_samples.txt \
	-o my_subset_population_samples.bcf

b ii) To split into multiple VCFs, with one VCF per population:

Create a file sample_pop_groups.txt where each line follows the format sampleID\t-\tPOP, e.g.

sample1	-	ROHAN
sample2	-	GONDOR
sample3	-	GONDOR
sample4	-	NARNIA
...

Then split your VCF using the bcftools +split plugin:

bcftools +split my_new_population_samples.ae_loci.norm.bcf \
	-o split_by_pop \
	-G sample_pop_groups.txt

All split population files will be located in the directory split_by_pop and named according to your population groups (e.g. ROHAN.vcf). Note that bcftools +split will re-calculate AC and AN for all output files, but it won't re-calculate AF - hence the next step.

2. Generate allele frequencies

For each population VCF you have now created, you can generate allele frequencies using the bcftools +fill-tags plugin:

bcftools +fill-tags ROHAN.vcf -o ROHAN_AFs.vcf -- -t AF

3. Merge with existing reference AF

Finally, extract the AF data for each variant and merge it with the existing reference AF data. I like to use python for this because you can easily see which records are being kept or discarded. You'll need the pysam library to read in the vcf input.

import pandas as pd
from pysam import VariantFile

# Load in the base dataframe
base_af_df = pd.read_table("refs/g1k_allele_freqs.txt")
# **If you want to drop any existing populations, do so here, e.g.
# base_af_df = base_af_df.drop(columns=['PUR','CLM']

# For each new population, make a dataframe containing AF information
new_pops = ['ROHAN','GONDOR','NARNIA']
for pop in new_pops:

    # Create a base dict with fields for variant position information
    af_dict = {
        'CHROM': [],
        'START': [],
        'STOP': [],
        'VAR_ID': []
        pop: []
    }

    # Open and read the population VCF file
    af_vcf = VariantFile(f"{pop}.vcf")
    for rec in af_vcf.fetch():
        af_dict['CHROM'].append(rec.chrom)
        af_dict['START'].append(rec.pos - 1)
        af_dict['STOP'].append(rec.pos)
        af_dict['VAR_ID'].append(f"{rec.chrom}_{rec.pos}_{rec.ref}_{rec.alts[0]}")
        af_dict[pop].append(rec.info['AF'][0]

    # Convert to dataframe
    af_df = pd.DataFrame(af_dict)

    # Print some info
    non_dup = len(af_df['CHROM','START'](/GenomicRisk/aeon/wiki/'CHROM','START').drop_duplicates())
    print(f"Number of records for {pop}: {len(af_df)} ({len(af_df)-non_dup} multi-allelic; {128097-non_dup} AEon vars missing)")

    # Merge with existing AF df. Note that this is an inner merge:
    #   - variants from the new VCF that aren't in AEon's ancestry-informative list will be excluded
    #   - variants in AEon's list that aren't called in the new VCF will be excluded
    base_af_df = base_af_df.merge(af_df)
    
# Save new AF file
base_af_df.to_csv("new_reference_afs.txt", sep='\t', index=False)

Note 1: If you want to make an entirely new AF file without the default reference populations, you can follow the same method. Simply create an initial base_af_df from your first population VCF instead of reading in the default AF file, then use a for loop to merge subsequent populations.

Note 2: If you want to remove some populations from the default reference (for example, to replace the 1000 Genomes American populations with American populations from HGDP), you can also do this at the point marked in the code above.

4. Update the population labels file

Create a population labels file based on the tab-separated refs/pop2super.txt file by adding your new populations and their corresponding superpopulation at the bottom:

Population	Superpopulation
ACB	AFR
ASW	AFR
...
TSI	EUR
YRI	AFR
ROHAN	MIDDLE_EARTH
GONDOR	MIDDLE_EARTH
NARNIA	NARNIA

Congrats! You can now run AEon with your new-and-improved allele frequency file using

poetry run aeon sample_variants.bcf -a new_reference_afs.txt --population_labels pops_table.txt -o my_output