dbSNPJson - USF-HII/snptk GitHub Wiki
dbSNPJson
Variation Services API
RefSNP API
- https://api.ncbi.nlm.nih.gov/variation/v0/#/RefSNP/get_refsnp__rsid_
- Click to expand and select Schema under Responses Section to view API Documentation
API Call Example
refsnp_id=268
curl -s https://api.ncbi.nlm.nih.gov/variation/v0/refsnp/${refsnp_id} -H "accept: application/json"
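The same request can be sketched in Python with only the standard library. The helper names refsnp_url and fetch_refsnp are hypothetical (they are not part of snptk or of any NCBI client); only the endpoint and the accept header come from the curl call above.

```python
import json
import urllib.request

# Endpoint root taken from the curl example above.
API_ROOT = "https://api.ncbi.nlm.nih.gov/variation/v0"


def refsnp_url(refsnp_id):
    """Build the RefSNP endpoint URL for a numeric rs ID (no 'rs' prefix)."""
    return f"{API_ROOT}/refsnp/{int(refsnp_id)}"


def fetch_refsnp(refsnp_id, timeout=30):
    """Fetch one RefSNP record as a dict. This performs a network call."""
    request = urllib.request.Request(
        refsnp_url(refsnp_id), headers={"accept": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Usage (hypothetical, needs network access):
#   rs_obj = fetch_refsnp(268)
#   print(rs_obj["refsnp_id"])
```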
SPDI - NCBI Variation Notation for Variants with Known Breakpoints
NCBI Variation Services describe variants in a newer notation called Sequence Position Deletion Insertion (SPDI).
New SPDI Format Documentation: https://www.ncbi.nlm.nih.gov/variation/notation/
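As a rough illustration of the notation, the sketch below applies one SPDI to a toy reference string. It assumes the colon-separated form SEQ_ID:POSITION:DELETED:INSERTED with the deleted sequence spelled out; parse_spdi and apply_spdi are hypothetical helper names, and the sequences are made up.

```python
def parse_spdi(text):
    """Split 'SEQ_ID:POSITION:DELETED:INSERTED' into its four fields."""
    seq_id, position, deleted, inserted = text.split(":")
    return seq_id, int(position), deleted, inserted


def apply_spdi(reference, position, deleted_sequence, inserted_sequence):
    """Return `reference` with one SPDI allele applied.

    `position` is the 0-based boundary where the deletion starts, so
    position 0 sits immediately before the first residue.
    """
    end = position + len(deleted_sequence)
    if reference[position:end] != deleted_sequence:
        raise ValueError("deleted_sequence does not match the reference here")
    return reference[:position] + inserted_sequence + reference[end:]


# Toy examples on the made-up reference "ACGT":
substitution = apply_spdi("ACGT", 1, "C", "T")   # replace C with T
insertion = apply_spdi("ACGT", 2, "", "TT")      # empty deletion = pure insertion
```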
RefSNP JSON
- Download Location for JSON
Scripts
- Internal Script to Generate Table below: dbSNPJson.py
- This script example uses caching of the subset below to speed up subsequent runs
- Generates a frequency table based on entries and their attributes to identify candidates
- Examples from others:
  - https://ftp.ncbi.nih.gov/snp/latest_release/JSON/rsjson_demo.py
    - This one enforced is_ptlp only
  - https://github.com/ncbi/dbsnp/blob/4593a8b8f47c82e3abebb5ce6f0073ba4f3df1cc/lib/python/rsatt.py#L190-L207
    - This one enforced is_ptlp and if trait['is_top_level'] and trait['is_chromosome']
The https://ftp.ncbi.nih.gov/snp/latest_release/JSON/rsjson_demo.py limits
Subset we are interested in:
rs_obj
  refsnp_id
  last_update_build_id
  primary_snapshot_data
    variant_type
    placements_with_allele
      - alleles
          - allele
              spdi
                deleted_sequence
                inserted_sequence
                position
                seq_id
        is_ptlp
        placement_annot
          is_aln_opposite_orientation
          is_mismatch
          mol_type
          seq_id_traits_by_assembly
            - assembly_accession
              assembly_name
              is_alt
              is_chromosome
              is_patch
              is_top_level
            - ...
      - ...
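The subset outlined above could be pulled out of a parsed record roughly as follows. This is a sketch, not the internal dbSNPJson.py: extract_subset is a hypothetical name, and it assumes the record still has primary_snapshot_data (records without it are not handled here).

```python
def extract_subset(rs_obj):
    """Reduce one parsed RefSNP record to the fields listed in the outline."""
    snapshot = rs_obj["primary_snapshot_data"]
    placements = []
    for placement in snapshot["placements_with_allele"]:
        annot = placement["placement_annot"]
        placements.append({
            "is_ptlp": placement["is_ptlp"],
            "is_aln_opposite_orientation": annot["is_aln_opposite_orientation"],
            "is_mismatch": annot["is_mismatch"],
            "mol_type": annot["mol_type"],
            "seq_id_traits_by_assembly": annot["seq_id_traits_by_assembly"],
            # Alleles without an spdi object (protein frameshifts) are skipped.
            "spdis": [
                entry["allele"]["spdi"]
                for entry in placement["alleles"]
                if "spdi" in entry["allele"]
            ],
        })
    return {
        "refsnp_id": rs_obj["refsnp_id"],
        "last_update_build_id": rs_obj["last_update_build_id"],
        "variant_type": snapshot["variant_type"],
        "placements": placements,
    }
```

Keeping only these fields is also what makes caching the subset cheap: the reduced dict can be serialized once and reloaded on later runs.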
placements_with_allele
- List of all placements, each containing a list of the alleles in the context of that reference region. Includes both nucleotide and protein placements. For all placements, the list of alleles is the same length and in the same order. They represent the alleles on the preferred top level placement (PTLP) in sorted order. So, on a given placement, the first allele is how the first allele on the PTLP appears upon mapping to the given placement. Similarly for the second allele, and so on.
placement_with_allele
- For nucleotide placements, each allele in this placement is stored as a SPDI. Each SPDI contains a reference sequence identifier. All alleles in SPDI syntax must have the same reference sequence identifier, which matches the seq_id attribute. For protein placements, the allele is either in SPDI syntax or gives a general description of a frameshift.
alleles
- A RefSnp can describe 1 to N alleles. On the PTLP, all alleles (in SPDI syntax) have the same type and length. On non-PTLP placements (i.e. all other placements), the alleles in SPDI syntax may have different types, lengths and start positions, and in some rare cases may not even overlap. But they are all on this placement's sequence.
allele
- An allele object describes the sequence change in either nucleotide or amino acid sequence at a particular position on a particular sequence. Exactly one of the spdi and frameshift fields will be present. Most alleles can be modeled as defined breakpoint changes (SPDI data model). However, indel nucleotide changes can create frameshift changes in proteins, which are difficult to model this way. Therefore this allele object is a choice between either a SPDI representation (when the breakpoints are known) or protein frameshifts (where the final breakpoint is not known).
spdi
- A single contextual allele in SPDI notation. Contextual allele means that applying the Blossom Precision Correction Algorithm would leave the fields unchanged.
seq_id
- The RefSeq/Genbank Accession.Version for the reference sequence.
position
- The 0-based boundary position where the deletion starts. That is, position 0 starts the deletion immediately before the first nucleotide and position 1 starts the deletion between the first and second nucleotides.
deleted_sequence
- The IUPAC sequence of nucleotides/amino-acids to delete from the reference. This can be empty, which is how a pure insertion is represented.
inserted_sequence
- The IUPAC sequence of nucleotides/amino-acids to insert after performing the deletion. Amino-acids use the single character encoding. All nucleotide codes including the ones for ambiguity are allowed.
warnings (list)
- Text intended for human consumption listing all warnings associated with generating this object. Absent if no warnings were generated.
is_ptlp
- True if this placement is the preferred top level placement (PTLP) under the alignment data set which generated this RefSnp cluster.
placement_annot
- Annotation about this sequence.
is_aln_opposite_orientation
- True if this sequence is aligned reverse to the PTLP sequence. Thus, the PTLP sequence's is_aln_opposite_orientation attribute is always false.
is_mismatch
- True if this sequence's residues are different than the PTLP sequence at this locus. Thus, the PTLP sequence's is_mismatch attribute is always false.
seq_id_traits_by_assembly
- The relationships between this sequence and the genomic assemblies in which it participates (if any).
assembly_name
- The name of the assembly these traits reference.
assembly_accession
- The Genomic Collections accession for this assembly. For more information, see, for example, http://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.38/
is_top_level
- True if the sequence is top-level (the most highly assembled sequences in a genome assembly).
is_alt
- True if this placement's sequence is an alternate locus (a sequence that provides an alternate representation of a locus found in a largely haploid assembly).
is_patch
- True if this placement's sequence is a patch sequence (a contig sequence that is released outside of the full assembly release cycle. These sequences are meant to add information to the assembly without disrupting the stable coordinate system).
is_chromosome
- True if this placement's sequence is a chromosome sequence (a relatively complete pseudo-molecule assembled from smaller sequences (components) that represent a biological chromosome).
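The two filters mentioned under Scripts (is_ptlp only, or is_ptlp combined with the is_top_level and is_chromosome traits) might look like the sketch below. The function names are hypothetical and the code is not copied from rsjson_demo.py or rsatt.py; it only uses the fields described above.

```python
def ptlp_placements(rs_obj):
    """Return the placements flagged as preferred top level placement."""
    placements = rs_obj["primary_snapshot_data"]["placements_with_allele"]
    return [p for p in placements if p["is_ptlp"]]


def top_level_chromosome_assemblies(placement):
    """Return assembly names for which this placement's sequence is a
    top-level chromosome."""
    traits = placement["placement_annot"]["seq_id_traits_by_assembly"]
    return [
        trait["assembly_name"]
        for trait in traits
        if trait["is_top_level"] and trait["is_chromosome"]
    ]


def strict_ptlp_placements(rs_obj):
    """PTLP placements that are also on a top-level chromosome sequence."""
    return [
        p for p in ptlp_placements(rs_obj)
        if top_level_chromosome_assemblies(p)
    ]
```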
Breakdown of Build 153 via Variation Services RefSNP JSON
Better format to view here: https://gist.github.com/countdigi/4a467709da85caa51d54fe10c64dd4c9
Chr22
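Columns: count = number of placements with this combination, name = assembly_name (abbreviated), ptlp = is_ptlp, opp = is_aln_opposite_orientation, mism = is_mismatch, mol = mol_type, alt = is_alt, chrom = is_chromosome, patch = is_patch, top = is_top_level, rsids = example refsnp_ids.
idx count name ptlp opp mism mol alt chrom patch top rsids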
1 9139281 38.p12 True False False genomic False True False True 782,783,795
2 8315064 37.p13 False False False genomic False True False True 782,783,800
3 310452 38.p12 False False False genomic True False False True 1964,2719,3484
4 206704 37.p13 False False False genomic False False True True 7245,10248,10794
5 198592 37.p13 False True False genomic False True False True 10083,117222,131488
6 120216 37.p13 False False False genomic False False False True 366279,366777,372180
7 73063 38.p12 False False False genomic False False False False 504728,545765,656963
8 72307 38.p12 False False False genomic False False True True 10451,16947,131198
9 66116 37.p13 False True False genomic False False False True 79883,364621,369381
10 7792 38.p12 False True False genomic False False False True 408896,410993,3895592
11 5241 37.p13 False False True genomic False True False True 7104,7117,11109
12 2493 38.p12 False True False genomic False True False True 1946100,34109265,34453600
13 2455 37.p13 False True False genomic False False True True 1946100,34109265,34453600
14 1489 37.p13 False False True genomic False False True True 25274,78489,131515
15 1033 38.p12 False False True genomic True False False True 16947,23669,28557
16 809 37.p13 False True True genomic False True False True 738829,738830,915675
17 751 37.p13 True False False genomic False True False True 695520,2818600,3047462
18 536 38.p12 False False True genomic False False True True 7245,713811,714002
19 254 38.p12 False True True genomic False True False True 1894501,1946101,1946102
20 217 37.p13 False True True genomic False False True True 1894501,1946101,1946102
21 163 37.p13 False True True genomic False False False True 1818579,1962473,2007219
22 27 38.p12 False True True genomic False False False True 1807482,2379834,5748858
23 7 38.p12 False False False genomic False True False True 375611631,782043702,782502812
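A frequency table like the one above can be built by counting each combination of placement attributes. The sketch below is a guess at the general approach, not the actual dbSNPJson.py; the function names are hypothetical, and it keeps up to three example rs IDs per combination, as the table does.

```python
from collections import Counter


def placement_signatures(rs_obj):
    """Yield one attribute tuple per (placement, assembly trait) pair,
    in the column order name, ptlp, opp, mism, mol, alt, chrom, patch, top."""
    for placement in rs_obj["primary_snapshot_data"]["placements_with_allele"]:
        annot = placement["placement_annot"]
        for trait in annot["seq_id_traits_by_assembly"]:
            yield (
                trait["assembly_name"],
                placement["is_ptlp"],
                annot["is_aln_opposite_orientation"],
                annot["is_mismatch"],
                annot["mol_type"],
                trait["is_alt"],
                trait["is_chromosome"],
                trait["is_patch"],
                trait["is_top_level"],
            )


def frequency_table(rs_objs, max_examples=3):
    """Return (count, signature, example_rsids) rows, most frequent first."""
    counts = Counter()
    examples = {}
    for rs_obj in rs_objs:
        for signature in placement_signatures(rs_obj):
            counts[signature] += 1
            seen = examples.setdefault(signature, [])
            if len(seen) < max_examples and rs_obj["refsnp_id"] not in seen:
                seen.append(rs_obj["refsnp_id"])
    return [(n, sig, examples[sig]) for sig, n in counts.most_common()]
```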
Breakdown of Build 151 SNPChrPosOnRef
GRCh38.p7 Build 151
r38=/shares/hii/bioinfo/ref/ncbi/human_9606_b151_GRCh38p7/b151_SNPChrPosOnRef_108.bcp.gz
$ zcat $r38 | awk -F'\t' '{print $2}' | sort | uniq -c
51705524 1
55501526 2
45423397 3
43701131 4
40977338 5
38316772 6
36568839 7
34770927 8
28775453 9
30525979 10
31276171 11
30325840 12
22389429 13
20442817 14
19101963 15
21011084 16
18601096 17
17708551 18
14227849 19
14550654 20
8717952 21
9066760 22
25367046 X
473119 Y
2297 MT
800312 PAR
39146 AltOnly
331710 NotOn
72445 Un
GRCh37.p13 Build 151
r37=/shares/hii/bioinfo/ref/ncbi/human_9606_b151_GRCh37p13/b151_SNPChrPosOnRef_105.bcp.gz
$ zcat $r37 | awk -F'\t' '{print $2}' | sort | uniq -c
49190997 1
54502333 2
44571425 3
43360216 4
40518121 5
37745012 6
35113619 7
34393676 8
28074089 9
29126303 10
30019009 11
29475839 12
21548359 13
19918518 14
18833397 15
20626171 16
17445536 17
17008724 18
13756018 19
13664413 20
8348337 21
8374360 22
22345464 X
468465 Y
2296 MT
234386 PAR
10064120 AltOnly
11780576 NotOn
263303 Un
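For reference, the zcat/awk/uniq pipelines above can be reproduced in Python. This is a minimal sketch assuming, as in the awk commands, that the second tab-separated column holds the chromosome label; count_by_chromosome and count_bcp_file are hypothetical names.

```python
import gzip
from collections import Counter


def count_by_chromosome(lines):
    """Count values of the second tab-separated column of each line."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Like awk, treat a missing second field as an empty string.
        counts[fields[1] if len(fields) > 1 else ""] += 1
    return counts


def count_bcp_file(path):
    """Stream a gzipped .bcp file and count rows per chromosome label."""
    with gzip.open(path, "rt") as handle:
        return count_by_chromosome(handle)
```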
Notes
A support section in the JSON can help us determine when a SNP was added. For example, this one (rs1569537923) was added in Build 153, although we may have to iterate over multiple support entries to find the earliest Build value.
"support": [
{
"id": {
"type": "subsnp",
"value": "ss3759385361"
},
"revision_added": "153",
"create_date": "2019-07-14T03:23Z",
"submitter_handle": "EVA"
}
],
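Finding the earliest build across several support entries could be done as in the sketch below. earliest_revision_added is a hypothetical helper that takes the support list itself; where that list sits inside a full record is left to the caller, and entries whose revision_added is missing or non-numeric are ignored.

```python
def earliest_revision_added(support):
    """Return the smallest numeric revision_added in a support list,
    or None if no entry carries a numeric value."""
    revisions = [
        int(entry["revision_added"])
        for entry in support
        if str(entry.get("revision_added", "")).isdigit()
    ]
    return min(revisions) if revisions else None


# Example using the entry shown above plus one made-up older entry:
example_support = [
    {"id": {"type": "subsnp", "value": "ss3759385361"}, "revision_added": "153"},
    {"id": {"type": "subsnp", "value": "ss0000000000"}, "revision_added": "151"},  # made up
]
earliest = earliest_revision_added(example_support)
```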