Usage - GenomicRisk/aeon GitHub Wiki

Inputs

Required input

AEon estimates fractional population membership for individuals based on their genotype calls and a population-specific allele frequency (AF) reference. As such, the tool at its most basic requires 2 things:

  • An indexed BCF/VCF.GZ file containing genotypes for all samples you wish to test
  • A reference AF file to compare the samples against.

AEon uses a default reference AF file based on the 1000 Genomes Project if no other reference is specified, so the only input you must supply is your indexed sample BCF/VCF.GZ file. However, you may need to adjust the reference AF file to suit your data:

  • If your VCF does not cover all regions of the genome (e.g. you only have exome data, or you are testing on a subset of loci), you will need to subset the reference AF file to your regions of coverage. See Subsetting the reference AF file for details.
  • If you want to estimate genetic ancestry derived from populations outside the 1000 Genomes Project (and have the data available), you can augment the default reference AF file or replace it with your own file in the required format. See Extending to New Populations for details.

Optional variables

  • -o PREFIX: Prefix for output files. The default is all characters of the input filename before the first underscore, e.g. 500-x01_variants_file.vcf -> 500-x01.
  • -t THREADS: Number of threads for estimation step. Default is 3.
  • -v, --verbose: Prints INFO level logs to stderr as well as WARNINGs. This will tell you if any of your samples had genotypes imputed as homozygous reference due to missing values. (This information is also available in the ae_stats.csv output file).
  • --visualisation: Output visualisation files, which plot each sample in 3-dimensional PC space against the reference 1000 Genomes Project samples for comparison.
  • --inheritance: Run in inheritance mode - all samples from the input VCF will be plotted/visualised together (assuming flag --visualisation is also used). Note that this only works for <=10 samples.
  • -a ALLELE_FREQS: Provide your own reference allele frequency file (as mentioned above).
  • --population_labels POP_LABEL_FILE: Provide a file listing the corresponding superpopulation for each population in the AF file. You only need to touch this if you are using your own reference AF file containing new populations - AEon will use refs/pop2super.txt by default.
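
The default prefix rule described for -o can be illustrated with a short sketch. This is not AEon's own code: the function name default_prefix is hypothetical, and stripping the directory part of the path, as well as the behaviour when the filename contains no underscore, are assumptions.

```python
from pathlib import Path


def default_prefix(input_path: str) -> str:
    """Return the characters of the file name before the first underscore.

    Assumption: any directory component is ignored. If the name contains
    no underscore, this sketch returns the whole file name; AEon's actual
    behaviour in that case is not documented here.
    """
    name = Path(input_path).name
    return name.split("_", 1)[0]


print(default_prefix("500-x01_variants_file.vcf"))  # 500-x01
```

If the default would give several input files the same prefix, passing -o explicitly avoids ambiguous output names.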

Outputs

ae.csv file

AEon provides estimated ancestry fractions per population, rounded to 2 decimal places. For each sample, the fractions across all populations should sum to 1, although the sum may be slightly off due to rounding error. This fractional output enables AEon to model admixture. Based on analysis of trio (mother/father/child) data, scores can be interpreted roughly as follows:

  • score > 0.1 -> significant
  • 0.05 < score < 0.1 -> likely significant
  • 0.02 < score < 0.05 -> likely insignificant
  • score < 0.02 -> noise
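
As a rough sketch, the guideline thresholds above can be written as a small helper. The function name interpret_score is hypothetical, and the handling of scores that fall exactly on a boundary (0.02, 0.05 or 0.10) is an assumption, since the guidelines do not specify it. Because ae.csv values are rounded to 2 decimal places, such boundary values will occur in practice.

```python
def interpret_score(score: float) -> str:
    """Map an ancestry fraction to the guideline categories.

    Assumption: a score exactly on a boundary falls into the lower
    category, except 0.02, which is treated as 'likely insignificant'
    because the 'noise' band is defined as strictly below 0.02.
    """
    if score > 0.1:
        return "significant"
    if score > 0.05:
        return "likely significant"
    if score >= 0.02:
        return "likely insignificant"
    return "noise"


for s in (0.46, 0.08, 0.03, 0.01):
    print(s, interpret_score(s))
```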

It is important to bear in mind that ancestral populations are not completely isolated from each other - some populations are more closely related to each other than others due to their historical geographic context. This population structure is partially modelled by grouping into superpopulations. The ae.csv file records which superpopulation each population belongs to, to assist interpretation. Some guidelines:

  • Superpopulation estimates are more robust, as superpopulations are more easily separated than populations. A superpopulation score > 0.1 is likely significant, even if population-level resolution is ambiguous.
  • If your sample scores highly for EUR populations and has a very small fraction of ancestry assigned to either CLM or PUR, treat the AMR estimate with caution. This is due to the admixed nature of Latino/Admixed-American populations in the 1000 Genomes reference (see reference populations).
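
To apply these guidelines it can help to add up population fractions within each superpopulation. The sketch below assumes the fractions for one sample have already been read from ae.csv into a dictionary; the values are made up for illustration, the function names are hypothetical, and the mapping lists only a few 1000 Genomes populations rather than the contents of refs/pop2super.txt.

```python
from collections import defaultdict

# Illustrative values only, not real AEon output.
fractions = {"GBR": 0.46, "CEU": 0.41, "TSI": 0.08, "CLM": 0.03, "PUR": 0.02}

# Partial population -> superpopulation mapping, written out by hand here.
pop2super = {"GBR": "EUR", "CEU": "EUR", "TSI": "EUR", "CLM": "AMR", "PUR": "AMR"}


def superpopulation_totals(fractions, pop2super):
    """Sum per-population fractions within each superpopulation."""
    totals = defaultdict(float)
    for pop, frac in fractions.items():
        totals[pop2super[pop]] += frac
    # Round to 2 decimal places to match the precision of the input values.
    return {sp: round(total, 2) for sp, total in totals.items()}


def sums_to_one(fractions, per_value_error=0.005):
    """Check the total is 1 within the worst-case 2-decimal rounding error."""
    tolerance = per_value_error * len(fractions)
    return abs(sum(fractions.values()) - 1.0) <= tolerance + 1e-9


print(superpopulation_totals(fractions, pop2super))
print(sums_to_one(fractions))
```

In this made-up example the sample is overwhelmingly EUR with a small AMR total coming only from CLM and PUR, which is the situation where the AMR estimate should be treated with caution.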

ae_sample_stats.csv file

This output file provides some basic information about how samples were processed. For each sample, it provides the following information:

  • The PC1, PC2 and PC3 values calculated from the genotype vector, used to plot the sample against reference populations
  • FractionLociImputed: Number of ancestry-informative loci without a record present in the input VCF, divided by 128097 (the total number of ancestry-informative loci). All such loci are imputed as homozygous reference. This fraction should be the same for all samples in a VCF.
  • FractionLociNonCalled: Number of ancestry-informative alleles with a record present in the input VCF but a non-called genotype, divided by 2*128097 (two alleles per locus). This often happens if you have merged multiple VCFs where one file contains a variant absent in the other file. All such alleles are imputed as reference. This fraction will likely differ across samples in a VCF.
  • NumberMultiAllelic: Number of ancestry-informative loci with a different variant allele from the one listed in the allele frequency file. All non-reference values are compressed to 1.
  • LossPerLoci: Measure of model 'loss' (as returned by pyro.infer.SVI().step()) divided by number of loci.
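
The arithmetic behind these fields can be sketched as follows. This is an illustration under stated assumptions rather than AEon's implementation: the function names are hypothetical, and genotypes are assumed to be lists of allele indices in which 0 is the reference allele and None marks a non-called allele.

```python
N_LOCI = 128097  # divisor quoted in the field descriptions above


def fraction_loci_imputed(n_missing_loci: int) -> float:
    """Loci with no record in the input VCF, as a fraction of all loci."""
    return n_missing_loci / N_LOCI


def fraction_loci_non_called(n_non_called_alleles: int) -> float:
    """Non-called alleles as a fraction of all alleles (two per locus)."""
    return n_non_called_alleles / (2 * N_LOCI)


def compress_genotype(allele_indices):
    """Impute non-called alleles as reference (0) and map any
    non-reference allele index to 1, as described for multi-allelic loci."""
    return [0 if allele in (None, 0) else 1 for allele in allele_indices]


print(compress_genotype([0, 2]))     # [0, 1]
print(compress_genotype([None, 1]))  # [0, 1]
```

High values in either fraction mean that many genotypes were filled in as reference rather than observed, which is worth checking before interpreting the ancestry fractions.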

Visualisation

To help visualise the distribution of ancestral populations, Principal Component Analysis was performed on the variant allele frequencies provided in refs/g1k_allele_freqs.txt. If you use the flag --visualisation, AEon will return one .png file per sample in your input VCF, plotting each sample against a backdrop of all the reference samples projected onto the top 3 principal components (PCs). The figure is composed of 3 plots, showing 2 dimensions at a time - PC1 vs PC2 in the top left, PC3 vs PC2 in the top right, and PC1 vs PC3 in the bottom left. The output for each sample will be saved to sampleName_PCA_plot.png.

If your VCF file contains related samples (e.g. parents and child), you can optionally add the --inheritance flag, which will plot all input samples on the same figure. In this case, the output file will be named according to the specified output file prefix. The --inheritance flag can be used with up to 10 samples in the input file, beyond which the tool will revert to a single plot per sample to avoid cluttered, difficult-to-read plots.
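
The choice between one combined plot and one plot per sample can be summarised in a short sketch. The function name plot_files is hypothetical, and so is the exact name of the combined file: the text only says it is named according to the output prefix, so the "_PCA_plot.png" suffix used for it here is an assumption.

```python
MAX_INHERITANCE_SAMPLES = 10  # limit stated for --inheritance


def plot_files(sample_names, prefix, inheritance=False):
    """Return the plot file names expected for a run with --visualisation."""
    if inheritance and len(sample_names) <= MAX_INHERITANCE_SAMPLES:
        # One figure containing every input sample (file name is assumed).
        return [f"{prefix}_PCA_plot.png"]
    # Otherwise, including --inheritance with more than 10 samples,
    # one figure is produced per sample.
    return [f"{name}_PCA_plot.png" for name in sample_names]


print(plot_files(["mother", "father", "child"], "trio01", inheritance=True))
print(plot_files(["mother", "father", "child"], "trio01"))
```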

This visualisation can assist with interpretation of unusual cases. In particular, admixed samples are poorly characterised by PCA: they often appear outside the known population clusters and closest to an unrelated 'in-between' cluster.

Note that if you use a modified allele frequency file that contains different populations from the default file, this will NOT be reflected in the output plots. Each circular point in the output visualisation represents a sample from the 1000 Genomes dataset, whose coordinates have been pre-computed based on their genotype, NOT generated from the input allele frequency file.
