Reference Data Set Compilation - hakyimlab/summary-gwas-imputation GitHub Wiki

Purpose

To convert input genotype data into a compiled data set to be used as a reference panel for the rest of the tools in this repository. Some tools require actual genotypes; some only require variant metadata; some need both.

Prerequisites

We assume the code in this repository has been downloaded somewhere, and that a working Python 3.5 installation is available with its dependencies (numpy, pandas, pyarrow, etc.; see the dedicated wiki entry). We also assume a UNIX+Bash environment with the following environment variables:

REPO=/path/to/your/repo
DATA=/path/to/your/data

In the following, feel free to replace $REPO and $DATA with the actual paths on your system. The application described below uses GTEx data as the genotype reference panel.

Variant selection/Variant metadata

A list containing all the variants to be used and their metadata (name, chromosomal coordinates, rsid, etc) must be built first from a reference data set. You can do so using the following command:

python3 $REPO/get_reference_metadata.py \
-genotype $DATA/gtex_v8_eur_filtered.txt.gz \
-annotation $DATA/GTEx_Analysis_2017-06-05_v8_WholeGenomeSeq_838Indiv_Analysis_Freeze.lookup_table.txt.gz \
-filter MAF 0.01 \
-filter TOP_CHR_POS_BY_FREQ \
-output gtex_v8_eur_filtered_maf0.01_monoallelic_variants.txt.gz

The input file formats are explained at the end of this section.

Two optional filters are supported:

  • MAF: variants with allele frequency outside the range MAF < frequency < 1 - MAF will be discarded
  • TOP_CHR_POS_BY_FREQ: in data sets like GTEx, multiallelic variants are split into multiple entries sharing the same chromosomal position. This filter keeps only the highest-frequency variant at each position with multiple entries.
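The effect of the two filters can be sketched with pandas (a hypothetical illustration, not the script's actual code; column names follow the metadata format shown below):

```python
import pandas as pd

def apply_filters(variants: pd.DataFrame, maf: float = 0.01) -> pd.DataFrame:
    # MAF filter: keep variants with maf < frequency < 1 - maf
    f = variants["allele_1_frequency"]
    kept = variants[(f > maf) & (f < 1 - maf)]
    # TOP_CHR_POS_BY_FREQ: for positions with multiple entries
    # (split multiallelic variants), keep the highest-frequency one
    kept = (kept.sort_values("allele_1_frequency", ascending=False)
                .drop_duplicates(subset=["chromosome", "position"], keep="first")
                .sort_values(["chromosome", "position"]))
    return kept

# Toy data: two entries share position 13550 (a split multiallelic variant)
variants = pd.DataFrame({
    "chromosome": [1, 1, 1],
    "position": [13550, 13550, 14000],
    "id": ["chr1_13550_G_A_b38", "chr1_13550_G_C_b38", "chr1_14000_T_C_b38"],
    "allele_1_frequency": [0.017, 0.005, 0.30],
})
print(apply_filters(variants))
```

Here the second entry at position 13550 is dropped by the MAF filter (0.005 < 0.01), and only one variant survives per position.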

This will generate a text file listing all selected variants. The output format should look like:

chromosome	position	id	allele_0	allele_1	allele_1_frequency	rsid
1	13550	chr1_13550_G_A_b38	G	A	0.017316017299890518	rs554008981
. . .

We refer to this as reference metadata compilation.
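The compiled metadata file is a plain tab-separated table, so it can be loaded with pandas; a common use downstream is mapping reference ids to rsids (a hypothetical sketch, using an inline stand-in for the real gzipped file):

```python
import io
import pandas as pd

# Inline stand-in for the compiled metadata file shown above
metadata_text = (
    "chromosome\tposition\tid\tallele_0\tallele_1\tallele_1_frequency\trsid\n"
    "1\t13550\tchr1_13550_G_A_b38\tG\tA\t0.017316017299890518\trs554008981\n"
)
metadata = pd.read_table(io.StringIO(metadata_text))

# Build an id -> rsid mapping from the two relevant columns
id_to_rsid = dict(zip(metadata["id"], metadata["rsid"]))
print(id_to_rsid["chr1_13550_G_A_b38"])
```

For the real file, `pd.read_table("gtex_v8_eur_filtered_maf0.01_monoallelic_variants.txt.gz")` works directly; pandas infers gzip compression from the extension.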

Genotype compilation

The script model_training_genotype_to_parquet.py takes input text files containing genotype information and assembles files in Apache Parquet format for fast access in other tools.

You can run it as follows:

python3 $REPO/model_training_genotype_to_parquet.py \
-input_genotype_file $DATA/gtex_v8_eur_filtered.txt.gz \
-snp_annotation_file $DATA/gtex_v8_eur_filtered_maf0.01_monoallelic_variants.txt.gz METADATA \
-parsimony 9 \
--impute_to_mean \
--split_by_chromosome \
--only_in_key \
-output_prefix genotype/gtex_v8_eur_filtered_maf0.01_monoallelic_variants

  • --impute_to_mean instructs the script to replace missing genotype values with the sample mean. This matters because many downstream tools can't handle NAs (COLOC, I'm looking at you).
  • --split_by_chromosome will generate one output file per chromosome. Although this is optional, it is recommended: it greatly reduces peak memory usage.
  • --only_in_key instructs the script to discard any variants absent from the SNP annotation file, so this is an easy place to exclude variants.
  • -snp_annotation_file: this file is mostly used to map from reference ids to rsids (e.g. GTEx names variants like chr1_10407_C_g_b38). Passing the METADATA flag specifies that the file is an output of the reference metadata compilation described earlier.

This script requires a large amount of memory, particularly when writing the final output. Splitting by chromosome helps alleviate this problem.
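The mean-imputation behavior of --impute_to_mean can be sketched with numpy (a hypothetical illustration, not the script's actual code):

```python
import numpy as np

def impute_to_mean(dosages):
    """Replace NaNs in a 1-D array of dosages with the mean of observed values."""
    d = np.asarray(dosages, dtype=float)
    mean = np.nanmean(d)  # mean over non-missing entries only
    d[np.isnan(d)] = mean
    return d

# The missing value becomes the mean of 0, 1 and 2, i.e. 1.0
print(impute_to_mean([0, 1, np.nan, 2]))
```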

Input file formats

GTEx snp annotation

The script in charge of collecting the metadata uses a SNP annotation file as downloaded from the GTEx portal: GTEx_Analysis_2017-06-05_v8_WholeGenomeSeq_838Indiv_Analysis_Freeze.lookup_table.txt.gz

chromosome	position	id	allele_0	allele_1	allele_1_frequency	rsid
1	13550	chr1_13550_G_A_b38	G	A	0.017316017299890518	rs554008981
. . .

Genotype dosage

Both scripts covered in this section use a genotype file as input. The format is a (possibly gzipped) text file:

varID   ID1      ID2      ID3 ...
chr1_13526_C_T_b38      0       1       0  ...
...

i.e., a file where the first column is the variant id and the following columns are the dosages for the individuals in the sample. Each row is one variant.
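This format can be read directly with pandas (a hypothetical sketch, using an inline stand-in for the real gzipped file; for a .gz path, pandas decompresses automatically):

```python
import io
import pandas as pd

# Inline stand-in for a gzipped dosage file in the format shown above
dosage_text = (
    "varID\tID1\tID2\tID3\n"
    "chr1_13526_C_T_b38\t0\t1\t0\n"
)
dosages = pd.read_table(io.StringIO(dosage_text), index_col="varID")

# Dosage of variant chr1_13526_C_T_b38 for individual ID2
print(dosages.loc["chr1_13526_C_T_b38", "ID2"])
```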

The file used in our main application was generated from the GTEx phased VCF file using command line tools (bcftools, awk):

filter_and_convert ()
{
    # Write the header line: "varID" followed by the sample ids
    echo -ne "varID\t" | gzip > "$3"
    bcftools view "$1" -S "$2" --force-samples -Ou | bcftools query -l | tr '\n' '\t' | sed 's/\t$/\n/' | gzip >> "$3"

    # The awk script below converts each phased GT field (e.g. "0|1") into an
    # allele dosage; missing genotypes (".") become NA.
    NOW=$(date +%Y-%m-%d/%H:%M:%S)
    echo "Starting at $NOW"
    bcftools view -S "$2" --force-samples "$1" -Ou | \
    bcftools query -f '%ID[\t%GT]\n' | \
    awk '
    {
        for (i = 1; i <= NF; i++) {
            if (substr($i, 1, 1) == "c") {
                # Variant id column (e.g. chr1_13526_C_T_b38)
                printf("%s", $i)
            } else if (substr($i, 1, 1) == ".") {
                # Missing genotype
                printf("\tNA")
            } else if ($i ~ /[0-9]\|[0-9]/) {
                # Phased genotype such as 0|1: dosage is the sum of alleles
                n = split($i, array, "|")
                printf("\t%d", array[1] + array[2])
            } else {
                printf("Unexpected: %s\n", $i) > "/dev/stderr"
                exit 1
            }
        }
        printf("\n")
    }
    ' | gzip >> "$3"

    NOW=$(date +%Y-%m-%d/%H:%M:%S)
    echo "Ending at $NOW"
}

filter_and_convert $DATA/GTEx_Analysis_2017-06-05_v8_WholeGenomeSeq_838Indiv_Analysis_Freeze.SHAPEIT2_phased.vcf.gz $DATA/samples.txt  $DATA/gtex_v8_eur_filtered.txt.gz

where samples.txt is a simple text file containing a whitelist of individuals (one id per line) to include in the final product. We used this whitelist to restrict the data to European-ancestry individuals, thus bounding variability in the sample.
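The genotype-to-dosage conversion performed by the awk script above can be sketched in Python (a hypothetical equivalent, not part of the repository):

```python
def gt_to_dosage(gt: str):
    """Convert a phased GT field (e.g. '0|1') to an allele dosage.

    Missing genotypes ('.') yield None, which stands in for the NA
    emitted by the awk script.
    """
    if gt.startswith("."):
        return None  # missing genotype -> NA
    a, b = gt.split("|")  # phased genotypes separate alleles with '|'
    return int(a) + int(b)

print([gt_to_dosage(g) for g in ["0|0", "0|1", "1|1", "."]])
```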