Updated Readme.d - USF-HII/snptk GitHub Wiki

Processing GRCh37 dbsnp VCF data

GRCh37 differs slightly from GRCh38 because NCBI does not provide JSON-formatted files for it. Instead, NCBI provides a GRCh37 VCF file at https://ftp.ncbi.nih.gov/snp/archive/b153/VCF/GCF_000001405.25.gz. To process this file, we provide a script in bin called snptk-parse-dbsnp-vcf, which requires bcftools and plink.

Usage of snptk-parse-dbsnp-vcf:

   snptk-parse-dbsnp-vcf <input_vcf> <output_file>

Although snptk-parse-dbsnp-vcf uses bcftools to extract SNP information, the output does not contain correct chromosome assignments. To fix this, we need to map SNP IDs to GRCh38 and assign the correct chromosome value. For this, we provide a script in bin called snptk-map-grch37-chromosomes.py.
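The idea behind the chromosome fix can be sketched in a few lines of Python. This is only an illustration of a lookup by rs ID, not the code of snptk-map-grch37-chromosomes.py. The function names, the (rs_id, chromosome, position) row layout, and the choice to skip rs IDs that are missing from GRCh38 are all assumptions made for the example.

```python
def build_chromosome_lookup(grch38_rows):
    """Map each rs ID to its GRCh38 chromosome.

    Assumes rows are (rs_id, chromosome, position) tuples; this layout is
    hypothetical and chosen only for the illustration.
    """
    return {rs_id: chrom for rs_id, chrom, _pos in grch38_rows}


def assign_chromosomes(grch37_rows, lookup):
    """Yield GRCh37 rows with the chromosome taken from the GRCh38 lookup.

    The GRCh37 position is kept as-is. Rows whose rs ID is not in the
    lookup are skipped (an assumption, the real script may differ).
    """
    for rs_id, _chrom, pos in grch37_rows:
        chrom = lookup.get(rs_id)
        if chrom is not None:
            yield (rs_id, chrom, pos)


if __name__ == "__main__":
    # Invented sample rows, not real dbSNP data.
    grch38 = [("rs1", "1", 100), ("rs2", "X", 200)]
    grch37 = [("rs1", "unknown", 90), ("rs3", "unknown", 5)]
    for row in assign_chromosomes(grch37, build_chromosome_lookup(grch38)):
        print(row)
```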

Usage of snptk-map-grch37-chromosomes.py:

   python3 snptk-map-grch37-chromosomes.py --grch37_dbsnp grch37.gz --grch38_dbsnp grch38.gz --outfile <outfile>

This produces two files:

 <outfile>.gz
 <outfile>_multi_entries.gz

<outfile>_multi_entries.gz contains SNPs that have multiple entries in the VCF file, that is, two or more valid chromosome and position pairs. This file can be passed to snptk map-using-rs-id through the --include-file argument.
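To make the distinction between the two outputs concrete, here is a minimal Python sketch that separates rs IDs with a single location from rs IDs with several. It is not taken from the snptk source. The function name and the (rs_id, chromosome, position) row layout are hypothetical.

```python
from collections import defaultdict


def split_multi_entries(rows):
    """Separate rs IDs with one location from rs IDs with several.

    Assumes rows are (rs_id, chromosome, position) tuples (a hypothetical
    layout). Exact duplicate locations for the same rs ID are counted once.
    """
    locations = defaultdict(list)
    for rs_id, chrom, pos in rows:
        if (chrom, pos) not in locations[rs_id]:
            locations[rs_id].append((chrom, pos))

    single = {rs: locs[0] for rs, locs in locations.items() if len(locs) == 1}
    multi = {rs: locs for rs, locs in locations.items() if len(locs) > 1}
    return single, multi


if __name__ == "__main__":
    # Invented sample rows, not real dbSNP data.
    rows = [("rs1", "1", 100), ("rs2", "X", 200), ("rs2", "Y", 150)]
    single, multi = split_multi_entries(rows)
    print("single:", single)
    print("multi:", multi)
```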

Concurrency

Because the reference files snptk works with contain a very large number of records, we have included a split utility that reads the original file and splits it into chunks within a directory.

If the input path is a directory rather than a single file, the utility uses concurrent.futures.ProcessPoolExecutor() to parse all of the files in the directory in parallel, which increases speed. It uses as many processes as there are files in the directory; 32 is currently a good guideline (the most expected on any node).
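The pattern described above can be sketched as follows. This is a simplified illustration, not snptk's implementation. The names parse_path and count_records are hypothetical, counting lines stands in for the real parsing, the chunks are assumed to be gzip text files, and the cap of 32 workers simply mirrors the guideline above.

```python
import concurrent.futures
import gzip
import os


def count_records(path):
    """Stand-in for real parsing: count the lines in one gzip text file."""
    with gzip.open(path, "rt") as handle:
        return sum(1 for _ in handle)


def parse_path(path, max_workers=32,
               executor_cls=concurrent.futures.ProcessPoolExecutor):
    """Parse a single file, or every chunk in a directory concurrently.

    One worker is used per chunk file, up to max_workers. executor_cls is
    a parameter only so that a thread pool can be swapped in for testing.
    """
    if not os.path.isdir(path):
        return count_records(path)

    chunks = sorted(os.path.join(path, name) for name in os.listdir(path))
    workers = max(1, min(len(chunks), max_workers))
    with executor_cls(max_workers=workers) as pool:
        return sum(pool.map(count_records, chunks))


if __name__ == "__main__":
    import sys

    # Usage: python this_script.py <file-or-directory>
    print(parse_path(sys.argv[1]))
```

The `if __name__ == "__main__":` guard matters when processes are started with the spawn method, because each worker process imports the main module again.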

The recommended directory structure for a split file is <file_path>.d/01, <file_path>.d/02, etc.

For example:

snptk-split \
  /shares/hii/bioinfo/ref/ncbi/human_9606_b151_GRCh38p7/b151_SNPChrPosOnRef_108.bcp.gz \
  tmp/data/grch38p7/dbsnp.d/ \
  32
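For readers who want a feel for what such a split involves, here is a rough Python sketch that distributes the lines of a gzip file across numbered chunk files (01, 02, and so on). It is not the snptk-split source. The round-robin distribution, the gzip-compressed chunks, and the function name split_file are all assumptions made for the illustration.

```python
import gzip
import os


def split_file(src, out_dir, n_chunks):
    """Distribute the lines of a gzip text file across n_chunks gzip files.

    Chunks are named 01, 02, ... inside out_dir. Lines are assigned
    round-robin, which is an assumption; the real utility may split
    differently.
    """
    os.makedirs(out_dir, exist_ok=True)
    width = max(2, len(str(n_chunks)))
    paths = [os.path.join(out_dir, f"{i:0{width}d}")
             for i in range(1, n_chunks + 1)]

    outputs = [gzip.open(path, "wt") for path in paths]
    try:
        with gzip.open(src, "rt") as handle:
            for index, line in enumerate(handle):
                outputs[index % n_chunks].write(line)
    finally:
        # Close every chunk even if reading the source fails part-way.
        for out in outputs:
            out.close()
    return paths


if __name__ == "__main__":
    import sys

    # Usage: python this_script.py <input.gz> <output_dir> <n_chunks>
    split_file(sys.argv[1], sys.argv[2], int(sys.argv[3]))
```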