Identifying the Maximal Common SNP Subset: grievous intersect - jvtalwar/GRIEVOUS GitHub Wiki
So you want to find out which (biallelic) SNPs exist across all of your datasets? Well, you've come to the right place. The grievous intersect command is designed specifically for this task.
Upon completion of GRIEVOUS realignment (to the same database) and merging for all your genomic datasets of interest, grievous intersect can be run to extract the set of biallelic SNPs common to all datasets. The intersect command comes equipped with specific filters users may wish to enact before extraction and proceeding to their downstream task of choice. Let's take a gander at the command's usage to get a better sense of how we might use grievous intersect to identify all the biallelic SNPs that exist across our tutorial's genomic datasets.
grievous intersect: Usage and Arguments
grievous intersect --cohorts DATASET_1_PATH DATASET_2_PATH ... DATASET_N_PATH
--write_path PATH_TO_WHERE_TO_WRITE_INTERSECTING_SNPs
[--exclude_biallelic_duplicates --exclude_palindromic --out_name FILE_NAME]
- The
--cohorts, -cflag specifies the paths to all the datasets of interest for which you wish to identify all biallelic SNPs common to all datasets. Dataset paths (e.g., DATASET_1_PATH) correspond to completed GRIEVOUS realigned and merged paths, not the original dataset paths. Users can provide any number of (valid) dataset paths togrievous intersectin a space separated manner.- This flag is mandatory when running
grievous intersect(otherwise how would our esteemed general know where to look and which GRIEVOUS realigned files you would like intersected).
- This flag is mandatory when running
--write_path, -wspecifies the directory to which GRIEVOUS will write the output file of dataset-intersecting biallelic SNPs. This is also mandatory when runninggrievous intersect.exclude_biallelic_duplicates, -xdis an optional filter that can be employed when runninggrievous intersect. When employed,grievous intersectwill omit any SNPs that were duplicated in any given cohort from the output file of dataset-intersecting biallelic SNPs.--exclude_palindromic, -xpis another optional filter that can be passed in togrievous intersect. When employed,grievous intersectwill omit any palindromic SNPs from the output file of dataset-intersecting biallelic SNPs.--out_name, -nallows you to specify the name of the output file of dataset-intersecting biallelic SNPs. In the absence of this flag, GRIEVOUS will default to naming the intersection output file GRIEVOUS_Intersecting_SNPs.tsv
Intersecting Galaxies: grievous intersect Unleashed
Okay with all of our datasets of interest having been grievous realigned and grievous merged (and the technical mumbo-jumbo of grievous intersect out of the way) we can go ahead and unleash grievous intersect to extract the set of biallelic SNPs found in all of our datasets.
#! /bin/bash
# ADD JOB SCHEDULER (e.g., Slurm, SGE) PARAMS HERE IF NEEDED
source activate ~/miniconda3/envs/GRIEVOUS_ENV # <-- activate environment in which grievous is installed
start_time=$SECONDS # track script run time (optional)
basePath=~/GRIEVOUS_Tutorial/grievousRealigned
intersectWriteDir=~/GRIEVOUS_Tutorial/IntersectingVariants # <-- define directory where want to write intersecting variants (ensure this directory has been created first)
#Unleash GRIEVOUS Intersect:
grievous intersect --write_path $intersectWriteDir --cohorts $basePath/Utapau $basePath/MuSSFtafar $basePath/Tatooine -n AllIntersectingSNPs
elapsed=$(( SECONDS - start_time ))
eval "echo Total Run Time/Elapsed time for GRIEVOUS Intersect: $(date -ud "@$elapsed" +'$((%s/3600/24)) days %H hr %M min %S sec')"
In the above sample script, we provide the paths to each dataset's GRIEVOUS realigned and merged directory to --cohorts and define a location as to where GRIEVOUS should write our intersecting SNP set file with --write_path. We also provided an (optional) name for our file "AllIntersectingSNPs" with the -n flag. We did not pass in any filters with our intersection run, though if desired for your purposes, do go ahead and enact those. Upon completion you should see an AllIntersectingSNPs.tsv file (or another file name - e.g., FILE_NAME.tsv, if you provided a different name with -n or omitted the flag entirely) in your provided --write_path.
- Note 1: If you followed along with the provided tutorial data, and want to confirm the accuracy your
grievousrun, go ahead and run thevalidate_tutorial.pyevaluation script provided here. - Note 2: The intersecting SNP set file generated with
grievous intersectis compatible with PLINK2's --extract, if desired for your downstream task(s) of interest.
And just like that we have restored order to the variant galaxy! Across all of our datasets, we have not only guaranteed consistent indexing and orientation for our variants, but also have identified the maximal set of all biallelic SNPs shared by these datasets. Having ensured the meticulous standardization of our data with grievous, you can seamlessly and confidently transition to your desired downstream tasks (e.g., GWAS, PRS application).
More succinctly - We've done all we can. Now the rest is up to you. May the Force be with you.