Identifying the Maximal Common SNP Subset: grievous intersect - jvtalwar/GRIEVOUS GitHub Wiki

So you want to find out which (biallelic) SNPs exist across all of your datasets? Well, you've come to the right place. The grievous intersect command is designed specifically for this task.

Upon completion of GRIEVOUS realignment (to the same database) and merging for all your genomic datasets of interest, grievous intersect can be run to extract the set of biallelic SNPs common to all datasets. The intersect command comes equipped with specific filters users may wish to enact before extraction and proceeding to their downstream task of choice. Let's take a gander at the command's usage to get a better sense of how we might use grievous intersect to identify all the biallelic SNPs that exist across our tutorial's genomic datasets.

grievous intersect: Usage and Arguments

grievous intersect --cohorts DATASET_1_PATH DATASET_2_PATH ... DATASET_N_PATH 
--write_path  PATH_TO_WHERE_TO_WRITE_INTERSECTING_SNPs 
[--exclude_biallelic_duplicates --exclude_palindromic --out_name FILE_NAME]
  • The --cohorts, -c flag specifies the paths to all the datasets of interest for which you wish to identify all biallelic SNPs common to all datasets. Dataset paths (e.g., DATASET_1_PATH) correspond to completed GRIEVOUS realigned and merged paths, not the original dataset paths. Users can provide any number of (valid) dataset paths to grievous intersect in a space separated manner.
    • This flag is mandatory when running grievous intersect (otherwise how would our esteemed general know where to look and which GRIEVOUS realigned files you would like intersected).
  • --write_path, -w specifies the directory to which GRIEVOUS will write the output file of dataset-intersecting biallelic SNPs. This is also mandatory when running grievous intersect.
  • exclude_biallelic_duplicates, -xd is an optional filter that can be employed when running grievous intersect. When employed, grievous intersect will omit any SNPs that were duplicated in any given cohort from the output file of dataset-intersecting biallelic SNPs.
  • --exclude_palindromic, -xp is another optional filter that can be passed in to grievous intersect. When employed, grievous intersect will omit any palindromic SNPs from the output file of dataset-intersecting biallelic SNPs.
  • --out_name, -n allows you to specify the name of the output file of dataset-intersecting biallelic SNPs. In the absence of this flag, GRIEVOUS will default to naming the intersection output file GRIEVOUS_Intersecting_SNPs.tsv

Intersecting Galaxies: grievous intersect Unleashed

Okay with all of our datasets of interest having been grievous realigned and grievous merged (and the technical mumbo-jumbo of grievous intersect out of the way) we can go ahead and unleash grievous intersect to extract the set of biallelic SNPs found in all of our datasets.

#! /bin/bash
# ADD JOB SCHEDULER (e.g., Slurm, SGE) PARAMS HERE IF NEEDED

source activate ~/miniconda3/envs/GRIEVOUS_ENV # <-- activate environment in which grievous is installed

start_time=$SECONDS # track script run time (optional)

basePath=~/GRIEVOUS_Tutorial/grievousRealigned 
intersectWriteDir=~/GRIEVOUS_Tutorial/IntersectingVariants # <-- define directory where want to write intersecting variants (ensure this directory has been created first)  

#Unleash GRIEVOUS Intersect:
grievous intersect --write_path $intersectWriteDir --cohorts $basePath/Utapau $basePath/MuSSFtafar $basePath/Tatooine -n AllIntersectingSNPs

elapsed=$(( SECONDS - start_time ))
eval "echo Total Run Time/Elapsed time for GRIEVOUS Intersect: $(date -ud "@$elapsed" +'$((%s/3600/24)) days %H hr %M min %S sec')"

In the above sample script, we provide the paths to each dataset's GRIEVOUS realigned and merged directory to --cohorts and define a location as to where GRIEVOUS should write our intersecting SNP set file with --write_path. We also provided an (optional) name for our file "AllIntersectingSNPs" with the -n flag. We did not pass in any filters with our intersection run, though if desired for your purposes, do go ahead and enact those. Upon completion you should see an AllIntersectingSNPs.tsv file (or another file name - e.g., FILE_NAME.tsv, if you provided a different name with -n or omitted the flag entirely) in your provided --write_path.

  • Note 1: If you followed along with the provided tutorial data, and want to confirm the accuracy your grievous run, go ahead and run the validate_tutorial.py evaluation script provided here.
  • Note 2: The intersecting SNP set file generated with grievous intersect is compatible with PLINK2's --extract, if desired for your downstream task(s) of interest.

And just like that we have restored order to the variant galaxy! Across all of our datasets, we have not only guaranteed consistent indexing and orientation for our variants, but also have identified the maximal set of all biallelic SNPs shared by these datasets. Having ensured the meticulous standardization of our data with grievous, you can seamlessly and confidently transition to your desired downstream tasks (e.g., GWAS, PRS application).

More succinctly - We've done all we can. Now the rest is up to you. May the Force be with you.