Variant Reorientation and Realignment: grievous realign - jvtalwar/GRIEVOUS GitHub Wiki

Before we jump into reorientation and realignment with grievous realign, let's revisit our genomic datasets of interest:

Summary Statistics (MuSSFtafar): ~/GRIEVOUS_Tutorial/Datasets/Sarcoidosis/MuSSFtafar
Genotype Dataset 1 (Tatooine): ~/GRIEVOUS_Tutorial/Datasets/Sarcoidosis/Tatooine
Genotype Dataset 2 (Utapau): ~/GRIEVOUS_Tutorial/Datasets/Sarcoidosis/Utapau

Given that we have three genomic datasets of interest to reorient and realign, we must run grievous realign three times (once for each dataset of interest) with the same GRIEVOUS database. Before we actually run grievous realign though let's take a look at the command's general usage and break down the flags one by one: grievous realign: Usage and Arguments Explained.

Summary Statistics:

Let's begin with our summary statistics file (SSF). By starting with our SSF, this will ensure that all variants common across all datasets will be oriented in the direction of our summary statistics. For those attempting to validate or verify a reported PRS, this provides the advantage of consistency with reported PRS bounds (i.e., max and min). Note, it is not imperative that we start grievous realignment with summary statistics, as GREIVOUS handles effect size correction for all dataset variants that exist in the reverse orientation from a defined GRIEVOUS database.

Okay, now with our reasoning for starting with SSFs out of the way, we need to verify that our SSFs align with GRIEVOUS' columnar expectations. Let's take a look at our SSF again:

Hmmm, close but no cigar. We have all the necessary information in our chromosome-level SSFs, but one of our columns (SNP_ID) diverges from the GRIEVOUS formatting standard. There are two general solutions to the problem of discrepant column names: 1) We change column names from the current format to the GRIEVOUS format for all our chromosome-level SSFs (this seems inconvenient) or 2) we provide GRIEVOUS a mapping file to translate divergent columns. Let's go with option 2.

A GRIEVOUS mapping file is a white-space delimited file with two elements per line: 1) The original column name that diverges from the GRIEVOUS expected format 2) the corresponding GRIEVOUS formatted column name. Let's create the following GRIEVOUS mapping file for our SSF at ~/GRIEVOUS_Tutorial/ssfMapping.txt.

SNP_ID   ID

Now with our mapping file in hand we can go ahead and realign our SSF with GRIEVOUS!

#! /bin/bash
# ADD JOB SCHEDULER (e.g., Slurm, SGE) PARAMS HERE IF NEEDED

source activate ~/miniconda3/envs/GRIEVOUS_ENV # <-- activate the environment in which grievous is installed 

database=Tutorial_Sarcoidosis # <-- define our grievous database

start_time=$SECONDS # track script run time (optional)

#Parallelization (Slurm); Modify according to workflow of choice
chroms=(1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X)
currentChrom=${chroms[$SLURM_ARRAY_TASK_ID - 1]}

currentFile=~/GRIEVOUS_Tutorial/Datasets/Sarcoidosis/MuSSFtafar/chr$currentChrom.ssf # <-- define chromosomal-level SSF to be grievous realigned  
writePath=~/GRIEVOUS_Tutorial/grievousRealigned/MuSSFtafar # <-- define write path directory (ensure this directory has been created first) 
mappingFile=~/GRIEVOUS_Tutorial/ssfMapping.txt # <-- define our mapping file that 'translates' current divergent columns to the grievous standard (if needed)
datasetAssembly=hg_imaginary # <-- define our dataset's genome assembly

grievous realign --file $currentFile --assembly $datasetAssembly --database $database -w $writePath --mapping $mappingFile --verbose --return_all

elapsed=$(( SECONDS - start_time ))
eval "echo Total Run Time/Elapsed time for CHR $currentChrom\: $(date -ud "@$elapsed" +'$((%s/3600/24)) days %H hr %M min %S sec')"

Some key points to note:

We defined a GRIEVOUS database with the name Tutorial_Sarcoidosis. As mentioned previously, GRIEVOUS allows for the creation of multiple databases, and to do so we simply provide a new name (i.e., an alias not currently registered in the GRIEVOUS library of databases) with the --database flag. After creating a database with a dataset's grievous realign run, all future datasets, for which we would like to ensure variant indexing and orientation consistency, should employ the same database. In the case of our tutorial, that means both of our genotype datasets will employ the same database as defined here.
We specified our tutorial dataset's genome assembly as hg_imaginary. Recall from our preprocessing section that all datasets of interest must employ the same reference assembly.
- If you are following along with our tutorial, using the tutorial datasets, all datasets can proceed with the specified imaginary assembly of hg_imaginary.
- If you are following along with the tutorial, but are using own data, replace hg_imaginary with the actual genome assembly for each dataset of interest (e.g., hg19 or GRCh38), including in following grievous realign steps.
Observe that we didn't pass in --file-type ssf here (though we could have if we desired). As our file ended with a .ssf extension, GRIEVOUS was able to automatically proceed. However, if your SSF files do not have this extension, ensure then that you pass in --file-type ssf when running grievous realign.
We are running grievous realign in verbose mode. This is recommended as it will provide a detailed log of your grievous realign run.
We are also employing the --return_all flag. This will ensure that the full SSF is returned, as opposed to the GRIEVOUS dataset standard subset. Note that any column information outside of the GRIEVOUS dataset standard subset (e.g., MAF) dependent on allele orientation may be incorrect upon realignment for flipped alleles and is on the user to correct upon completion of grievous realign.
- As this is the first dataset we are aligning though to a new GRIEVOUS database, we do not need to worry about this. All variants will be added in their dataset orientation.

Genotype Datasets:

Upon completion of grievous realign for our SSFs, we can proceed to grievous realign-ing our genotype datasets. Similarly to our SSF, we need to ensure that our pvars align with GRIEVOUS' columnar expectations.

Like our SSF, we have all the necessary information in our chromosome-level pvars, but one of our columns (#CHROM) diverges from the GRIEVOUS formatting standard. Let's go ahead and create a GRIEVOUS mapping file for our pvars at ~/GRIEVOUS_Tutorial/pvarMapping.txt.

#CHROM  CHR

Great! Now with our pvar mapping file in hand we can go ahead and realign our genotype datasets with GRIEVOUS. However, before we do, notice the pesky header/leading lines. Classically, these can often cause issues upon file loading given their variability in length and format between files. Rather than manually strip these headers from all of our pvar files, we can pass in a set of --comment_characters, -c to GRIEVOUS. Given these characters GRIEVOUS will interpret all continuous leading lines starting with these characters as comments and proceed accordingly. Okay now with that critical piece of information in hand, now we can go ahead and grievous realign our genotype datasets!

Genotype Dataset 1 (Tatooine)

#! /bin/bash
# ADD JOB SCHEDULER (e.g., Slurm, SGE) PARAMS HERE IF NEEDED

source activate ~/miniconda3/envs/GRIEVOUS_ENV # <-- activate the environment in which grievous is installed

database=Tutorial_Sarcoidosis # <-- specify our grievous database; note this matches the defined database used for SSF grievous realign-ment!

start_time=$SECONDS # track script run time (optional)

#Parallelization (Slurm); Modify according to workflow of choice
chroms=(1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X)
currentChrom=${chroms[$SLURM_ARRAY_TASK_ID - 1]}

currentFile=~/GRIEVOUS_Tutorial/Datasets/Sarcoidosis/Tatooine/chr$currentChrom.pvar # <-- define chromosomal-level pvar to be grievous realigned 
writePath=~/GRIEVOUS_Tutorial/grievousRealigned/Tatooine # <-- define write path directory (ensure this directory has been created first) 
mappingFile=~/GRIEVOUS_Tutorial/pvarMapping.txt # <-- define a mapping file that 'translates' current divergent columns to the grievous standard (if needed)
datasetAssembly=hg_imaginary # <-- define our dataset's genome assembly

grievous realign --file $currentFile --assembly $datasetAssembly --database $database -w $writePath --mapping $mappingFile --verbose --return_all -c "##"

elapsed=$(( SECONDS - start_time ))
eval "echo Total Run Time/Elapsed time for CHR $currentChrom\: $(date -ud "@$elapsed" +'$((%s/3600/24)) days %H hr %M min %S sec')"

Similar to our SSFs, observe that we didn't pass in --file-type pvar here (though we could have if we desired). As our file ended with a .pvar extension, GRIEVOUS was able to automatically proceed. However, if your pvar files do not have this extension, ensure then that you pass in --file-type pvar when running grievous realign.

Genotype Dataset 2 (Utapau)

Upon completion of grievous realign we can proceed with our final (genotype) dataset. Our sample/template script is more or less the same as it was for genotype dataset 1, but no one likes a quitter. Let's be comprehensive and peer one last time into GRIEVOUS' realignment galaxy:

#! /bin/bash
# ADD JOB SCHEDULER (e.g., Slurm, SGE) PARAMS HERE IF NEEDED

source activate ~/miniconda3/envs/GRIEVOUS_ENV # <-- activate the environment in which grievous is installed

database=Tutorial_Sarcoidosis # <-- specify our grievous database; note this matches the defined database used for both SSF and genotype dataset 1 (Tatooine) grievous realign-ment!

start_time=$SECONDS # track script run time (optional)

#Parallelization (Slurm); Modify according to workflow of choice
chroms=(1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X)
currentChrom=${chroms[$SLURM_ARRAY_TASK_ID - 1]}

currentFile=~/GRIEVOUS_Tutorial/Datasets/Sarcoidosis/Utapau/chr$currentChrom.pvar # <-- define chromosomal-level pvar to be grievous realigned 
writePath=~/GRIEVOUS_Tutorial/grievousRealigned/Utapau # <-- define write path directory (ensure this directory has been created first) 
mappingFile=~/GRIEVOUS_Tutorial/pvarMapping.txt # <-- define a mapping file that 'translates' current divergent columns to the grievous standard (if needed)
datasetAssembly=hg_imaginary # <-- define our dataset's genome assembly

grievous realign --file $currentFile --assembly $datasetAssembly --database $database -w $writePath --mapping $mappingFile --verbose --return_all -c "##" 

elapsed=$(( SECONDS - start_time ))
eval "echo Total Run Time/Elapsed time for CHR $currentChrom\: $(date -ud "@$elapsed" +'$((%s/3600/24)) days %H hr %M min %S sec')"

The GRIEVOUS Bridge: Binary Representation Realignment

Nearly there, nearly there! Upon completion of grievous realign, we need to bridge the gap between the outputs of our grievous realignment with the underlying binary representation of the genetic data (stored in the PLINK2 .pgen files), for each genotype dataset grievous realigned (i.e., genotype datasets 1 and 2 here). To do that we can simply pass the reorientation files generated by grievous realign to PLINK2!

#! /bin/bash
# ADD JOB SCHEDULER (e.g., Slurm, SGE) PARAMS HERE IF NEEDED

start_time=$SECONDS # track script run time (optional)

#Parallelization (Slurm); Modify according to workflow of choice
chroms=(1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X)
currentChrom=${chroms[$SLURM_ARRAY_TASK_ID-1]}

basePath=~/GRIEVOUS_Tutorial/Datasets/Sarcoidosis/Tatooine #<-- path to where unaligned files are found
plink2=YOUR_PATH_TO_PLINK2 #<-- replace this with the corresponding pointer to your installation of PLINK2
grievousReorient=~/GRIEVOUS_Tutorial/grievousRealigned/Tatooine/Reorientation #<-- The path to the reorientation files generated by grievous realign for the genotype dataset found at $basePath
grievousReorientAllele=$grievousReorient/NoDuplicates_ReorientRefAlleleThisWay_CHR$currentChrom.tsv #<-- CHR-level GRIEVOUS file for variant reorientation
grievousReorientIndex=$grievousReorient/NoDuplicates_ReorientIndex_CHR$currentChrom.tsv #<-- CHR-level GRIEVOUS file for variant re-indexing

echo "Removing duplicates from unaligned p-files chr${chrom} in $basePath."

$plink2 --pfile $basePath/chr$currentChrom --make-pgen --rm-dup force-first --out $basePath/chr$currentChrom.noDuplicates

echo "Reorienting alleles as specified by GRIEVOUS realign..."

$plink2 --pfile $basePath/chr$currentChrom.noDuplicates --make-pgen --alt1-allele 'force' $grievousReorientAllele --out $basePath/chr$currentChrom.grievousRealigned #<-- #'force' prevents PLINK2 from erroring out for changing a 'known' reference allele: https://www.cog-genomics.org/plink/2.0/data#ref_allele

echo "Renaming variants as specified by GRIEVOUS realign..." 

$plink2 --pfile $basePath/chr$currentChrom.grievousRealigned --make-pgen --update-name $grievousReorientIndex --max-alleles 2 --out $basePath/chr$currentChrom.grievousRealigned.grievousReindexed

echo "GRIEVOUS realignment of underlying pfile data complete! Do what you must..."

elapsed=$(( SECONDS - start_time ))
eval "echo Total Run Time/Elapsed time for CHR $currentChrom\: $(date -ud "@$elapsed" +'$((%s/3600/24)) days %H hr %M min %S sec')"

The above script can be broken down into 3 steps:

Managing Duplicates: As briefly mentioned in the preprocessing section, we need to manage duplicate IDs in the original file. Without this step, PLINK2 will error out come variant reorientation time (unless you are lucky enough that there are no duplicate IDs in any of your original genotype datasets).

If you skipped the Reformatting Genotype Dataset IDs preprocessing step and duplicate non-unique IDs exist in your dataset, you cannot use '--rmdup force-first'. The reason for this is that if a GRIEVOUS identified valid biallelic variant is not the first ID instance in the original file, the wrong variant will be selected which will lead to errors during variant reorientation of the PLINK2 binary files. To fix this you can either go back and Reformat Genotype Dataset IDs, and re-run grievous realign on the genotype dataset of interest, or less desirably, exclude all non-unique IDs with --rmdup exclude-mismatch. If desired, PLINK2's documentation for managing duplicate variants can be found here.
Variant Reorientation: Upon removal of duplicates, we need to tell PLINK2 which alleles GRIEVOUS has marked as requiring reorientation (i.e., those variants found in the dataset that exist in the reverse orientation from the defined GRIEVOUS database). We can do this with the --alt1-allele 'force' command. If you're curious, the full PLINK2 documentation for setting REF/ALT alleles can be found here.
- Curious as to why we're using --alt1-allele 'force' instead of --ref-allele 'force'? This has to do with how PLINK2 tracks alleles. Specifically, PLINK2 counts REF alleles, so to orient the our alleles consistently across all datasets (i.e., ensuring consistency with the effect alleles in our SSF), we need to ensure the ALT allele is the one counted!
Variant Re-indexing: Finally, after having reoriented our variants, we need to convert the dataset index/ID to the GRIEVOUS indexing standard. This can be done with --update-name. In this step we also included --max-alleles 2 to remove multiallelic variants from the final set of PLINK2 binary files (though this is not necessary). Upon completion, note that all grievous realigned PLINK2 files will have the extension .grievousRealigned.grievousReindexed.{pvar, pgen, psam} (unless of course you modified the above template script, in which case then your grievous realigned PLINK2 files will be whatever you called them).

With that we can move on to the next step in the GRIEVOUS pipeline: grievous merge.