Tutorial: Adding a new reference population from VCF reference data - GenomicRisk/aeon GitHub Wiki
So you have access to your own reference VCF and want to extract population allele frequencies (AFs) from it? No problem!
Here I'll show you how to go from VCF -> your own reference AF file. This involves a combination of default CLI tools, as well as bcftools
and python
, so make sure you have those installed before you go ahead.
The general outline is:
- Subset your VCF for ancestry-informative loci and samples you're interested in
- Generate allele frequencies
- Merge results with existing reference AF
- Update the population labels file
1. Subset your VCF
1. a) Subset for AEon's loci
This is required if you are adding new populations to the existing reference file. If you are creating an entirely new reference file without any of the default reference populations, then you can subset for your own marker alleles rather than AEon's list. However, if you are selecting marker alleles apart from those listed in AEon's default reference AF file, then it is your responsibility to ensure they are appropriate and informative for your populations of interest (See 'The 128,097 ancestry-informative loci' for selection criteria).
# Create a 'regions' file containing the chromosome/position for AEon's ancestry-informative loci
cut -f1,3 refs/g1k_allele_freqs.txt | tail -n128097 > aeon_loci.txt
# Subset your VCF for relevant loci using this file
bcftools view my_new_population_samples.vcf.gz \
-R aeon_loci.txt \
-o my_new_population_samples.ae_loci.bcf
# Expand multi-allelic loci to be bi-allelic (simplifies later steps)
bcftools norm \
-a --atom-overlaps . \
my_new_population_samples.ae_loci.bcf \
-o my_new_population_samples.ae_loci.norm.bcf
1. b) Split or subset for specific samples [OPTIONAL]
You might have samples in your VCF belonging to one or multiple new populations. If this is the case, you'll need to make a VCF per population group.
b i) To extract just one population:
Create a file my_samples.txt
listing all the samples from your VCF that belong to your population of interest, with one sample ID per line. Then subset:
bcftools view my_new_population_samples.ae_loci.norm.bcf \
-S my_samples.txt \
-o my_subset_population_samples.bcf
b ii) To split into multiple VCFs, with one VCF per population:
Create a file sample_pop_groups.txt
where each line follows the format sampleID\t-\tPOP
, e.g.
sample1 - ROHAN
sample2 - GONDOR
sample3 - GONDOR
sample4 - NARNIA
...
Then split your VCF using the bcftools +split
plugin:
bcftools +split my_new_population_samples.ae_loci.norm.bcf \
-o split_by_pop \
-G sample_pop_groups.txt
All split population files will be located in the directory split_by_pop
and named according to your population groups (e.g. ROHAN.vcf
). Note that bcftools +split
will re-calculate AC and AN for all output files, but it won't re-calculate AF - hence the next step.
2. Generate allele frequencies
For each population VCF you have now created, you can generate allele frequencies using the bcftools +fill-tags
plugin:
bcftools +fill-tags ROHAN.vcf -o ROHAN_AFs.vcf -- -t AF
3. Merge with existing reference AF
Finally, extract the AF data for each variant and merge it with the existing reference AF data. I like to use python
for this because you can easily see which records are being kept or discarded. You'll need the pysam
library to read in the vcf
input.
import pandas as pd
from pysam import VariantFile
# Load in the base dataframe
base_af_df = pd.read_table("refs/g1k_allele_freqs.txt")
# **If you want to drop any existing populations, do so here, e.g.
# base_af_df = base_af_df.drop(columns=['PUR','CLM']
# For each new population, make a dataframe containing AF information
new_pops = ['ROHAN','GONDOR','NARNIA']
for pop in new_pops:
# Create a base dict with fields for variant position information
af_dict = {
'CHROM': [],
'START': [],
'STOP': [],
'VAR_ID': []
pop: []
}
# Open and read the population VCF file
af_vcf = VariantFile(f"{pop}.vcf")
for rec in af_vcf.fetch():
af_dict['CHROM'].append(rec.chrom)
af_dict['START'].append(rec.pos - 1)
af_dict['STOP'].append(rec.pos)
af_dict['VAR_ID'].append(f"{rec.chrom}_{rec.pos}_{rec.ref}_{rec.alts[0]}")
af_dict[pop].append(rec.info['AF'][0]
# Convert to dataframe
af_df = pd.DataFrame(af_dict)
# Print some info
non_dup = len(af_df['CHROM','START'](/GenomicRisk/aeon/wiki/'CHROM','START').drop_duplicates())
print(f"Number of records for {pop}: {len(af_df)} ({len(af_df)-non_dup} multi-allelic; {128097-non_dup} AEon vars missing)")
# Merge with existing AF df. Note that this is an inner merge:
# - variants from the new VCF that aren't in AEon's ancestry-informative list will be excluded
# - variants in AEon's list that aren't called in the new VCF will be excluded
base_af_df = base_af_df.merge(af_df)
# Save new AF file
base_af_df.to_csv("new_reference_afs.txt", sep='\t', index=False)
Note 1: If you want to make an entirely new AF file without the default reference populations, you can follow the same method. Simply create an initial base_af_df
from your first population VCF instead of reading in the default AF file, then use a for loop to merge subsequent populations.
Note 2: If you want to remove some populations from the default reference (for example, to replace the 1000 Genomes American populations with American populations from HGDP), you can also do this at the point marked in the code above.
4. Update the population labels file
Create a population labels file based on the tab-separated refs/pop2super.txt
file by adding your new populations and their corresponding superpopulation at the bottom:
Population Superpopulation
ACB AFR
ASW AFR
...
TSI EUR
YRI AFR
ROHAN MIDDLE_EARTH
GONDOR MIDDLE_EARTH
NARNIA NARNIA
Congrats! You can now run AEon with your new-and-improved allele frequency file using
poetry run aeon sample_variants.bcf -a new_reference_afs.txt --population_labels pops_table.txt -o my_output