dbSNPJson - USF-HII/snptk GitHub Wiki

dbSNPJson

Variation Services API

RefSNP API

https://api.ncbi.nlm.nih.gov/variation/v0/#/RefSNP/get_refsnp__rsid_
Expand the endpoint and select Schema under the Responses section to view the API documentation.

API Call Example

refsnp_id=268

curl -s https://api.ncbi.nlm.nih.gov/variation/v0/refsnp/${refsnp_id} -H "accept: application/json"
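The same request can be made from Python. This is a minimal sketch using only the standard library; the helper names `refsnp_url` and `fetch_refsnp` are made up for illustration and are not part of snptk or of any NCBI client.

```python
import json
import urllib.request

# Endpoint taken from the curl example above.
BASE_URL = "https://api.ncbi.nlm.nih.gov/variation/v0/refsnp"


def refsnp_url(refsnp_id):
    """Build the RefSNP URL for a numeric rs ID (no 'rs' prefix)."""
    return f"{BASE_URL}/{int(refsnp_id)}"


def fetch_refsnp(refsnp_id, timeout=30):
    """Fetch one RefSNP record as a dict (hypothetical helper, needs network)."""
    request = urllib.request.Request(
        refsnp_url(refsnp_id), headers={"accept": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Calling `fetch_refsnp(268)` should return the same JSON document as the curl command, parsed into a Python dict.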

SPDI - NCBI Variation Notation for Variants with Known Breakpoints

NCBI Variation Services use a new notation for variations called Sequence Position Deletion Insertion (SPDI).

New SPDI Format Documentation: https://www.ncbi.nlm.nih.gov/variation/notation/
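To make the notation concrete, here is a small sketch that parses a colon-separated SPDI string and applies it to a reference sequence. It assumes the deleted sequence is spelled out in full (not abbreviated to a length), and the sequence ID `toy.1` and the helper names are invented for illustration.

```python
from typing import NamedTuple


class Spdi(NamedTuple):
    seq_id: str
    position: int            # 0-based boundary where the deletion starts
    deleted_sequence: str    # may be empty (pure insertion)
    inserted_sequence: str   # may be empty (pure deletion)


def parse_spdi(text):
    """Parse 'seq_id:position:deleted:inserted' into an Spdi tuple."""
    seq_id, position, deleted, inserted = text.split(":")
    return Spdi(seq_id, int(position), deleted, inserted)


def apply_spdi(reference, spdi):
    """Return the sequence obtained by applying the SPDI to `reference`."""
    start = spdi.position
    end = start + len(spdi.deleted_sequence)
    if reference[start:end] != spdi.deleted_sequence:
        raise ValueError("deleted_sequence does not match the reference")
    return reference[:start] + spdi.inserted_sequence + reference[end:]


# Substitution of the T at boundary position 3, then a pure insertion.
snv = apply_spdi("ACGTACGT", parse_spdi("toy.1:3:T:C"))    # "ACGCACGT"
ins = apply_spdi("ACGTACGT", parse_spdi("toy.1:1::GG"))    # "AGGCGTACGT"
```

Because the position is a boundary rather than a base, the same four fields cover substitutions, deletions, and insertions without special cases.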

RefSNP JSON

Download Location for JSON
- Build 153 https://ftp.ncbi.nih.gov/snp/archive/b153/JSON/
- Latest https://ftp.ncbi.nih.gov/snp/latest_release/JSON/
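A downloaded file can be streamed one record at a time. The sketch below assumes each file is bzip2-compressed with one JSON object per line, which is how rsjson_demo.py reads them; the function name `iter_refsnp` and the example file name in the usage note are assumptions.

```python
import bz2
import json


def iter_refsnp(path):
    """Yield one parsed RefSNP record per non-empty line of a .json.bz2 file."""
    with bz2.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)
```

For example, `for rs_obj in iter_refsnp("refsnp-chr22.json.bz2"): ...` would walk a per-chromosome file without loading it all into memory.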

Scripts

Internal Script to Generate Table below: dbSNPJson.py
- This script example uses caching of the subset below to speed up subsequent runs
- Generates a frequency table based on entries and their attributes to identify candidates
Examples from others:
- https://ftp.ncbi.nih.gov/snp/latest_release/JSON/rsjson_demo.py
  - This one filters on is_ptlp only
- https://github.com/ncbi/dbsnp/blob/4593a8b8f47c82e3abebb5ce6f0073ba4f3df1cc/lib/python/rsatt.py#L190-L207
  - This one requires is_ptlp and also trait['is_top_level'] and trait['is_chromosome']
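The two filtering rules described above can be sketched side by side. This is an illustration written against the field names in the RefSNP JSON, not a copy of either script; the function names are hypothetical.

```python
def ptlp_placements(rs_obj):
    """Placements flagged is_ptlp (the looser rule)."""
    placements = rs_obj["primary_snapshot_data"]["placements_with_allele"]
    return [p for p in placements if p["is_ptlp"]]


def is_top_level_chromosome(placement):
    """True if any assembly trait is both top-level and a chromosome."""
    traits = placement["placement_annot"]["seq_id_traits_by_assembly"]
    return any(t["is_top_level"] and t["is_chromosome"] for t in traits)


def strict_ptlp_placements(rs_obj):
    """Placements flagged is_ptlp that also sit on a top-level chromosome."""
    return [p for p in ptlp_placements(rs_obj) if is_top_level_chromosome(p)]
```

The strict variant can return fewer placements than the loose one, for instance when the PTLP sits on an unplaced scaffold.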

The https://ftp.ncbi.nih.gov/snp/latest_release/JSON/rsjson_demo.py limits

Subset we are interested in:

rs_obj
  refsnp_id
  last_update_build_id
  primary_snapshot_data
    variant_type
    placements_with_allele
      - alleles
          - allele
              spdi
                deleted_sequence
                inserted_sequence
                position
                seq_id
        is_ptlp
        placement_annot
          is_aln_opposite_orientation
          is_mismatch
          mol_type
          seq_id_traits_by_assembly
            - assembly_accession
              assembly_name
              is_alt
              is_chromosome
              is_patch
              is_top_level
            - ...
      - ...

placements_with_allele - List of all placements, each containing a list of the alleles in the context of that reference region. Includes both nucleotide and protein placements. For all placements, the list of alleles is the same length and in the same order. They represent the alleles on the preferred top level placement (PTLP) in sorted order. So, on a given placement, the first allele is how the first allele on the PTLP appears upon mapping to the given placement. Similarly for the second allele, and so on.
- placement_with_allele - For nucleotide placements, each allele in this placement is stored as a SPDI. Each SPDI contains a reference sequence identifier. All alleles in SPDI syntax must share the same reference sequence identifier, which matches the seq_id attribute. For protein placements, the allele is either in SPDI syntax, or gives a general description of a frameshift.
  - alleles - A RefSnp can describe 1 to N alleles. On the PTLP, all alleles (in SPDI syntax) are the same type and length. On non-PTLP placements (i.e. all other placements), the alleles in SPDI syntax may have different types, lengths and start positions, and in some rare cases may not even overlap. But they are all on this placement's sequence.
    - allele - An allele object describes the sequence change in either nucleotide or amino acid sequence at a particular position on a particular sequence. Exactly one of the spdi and frameshift fields will be present. Most alleles can be modeled as defined breakpoint changes (SPDI data model). However, indel nucleotide changes can create frameshift changes in proteins, which are difficult to model this way. Therefore this allele object is a choice between either a SPDI representation (when the breakpoints are known) or protein frameshifts (where the final breakpoint is not known).
      - spdi - A single contextual allele in SPDI notation. Contextual allele means that applying the Blossom Precision Correction Algorithm would leave the fields unchanged.
        
        seq_id - The RefSeq/Genbank Accession.Version for the reference sequence
        
        position - The 0-based boundary position where the deletion starts. That is, position 0 starts the deletion immediately before the first nucleotide and position 1 starts the deletion between the first and second nucleotides.
        
        deleted_sequence - The IUPAC sequence of nucleotides/amino-acids to delete from the reference. This can be empty, which is how a pure insertion is represented.
        
        inserted_sequence - The IUPAC sequence of nucleotides/amino-acids to insert after performing the deletion. Amino-acids use the single character encoding. All nucleotide codes including the ones for ambiguity are allowed.
        
        warnings (list) - Text intended for human consumption listing all warnings associated with generating this object. Absent if no warnings were generated.
  - is_ptlp - True if this placement is the preferred top level placement (PTLP) under the alignment data set which generated this RefSnp cluster
  - placement_annot - Annotation about this sequence
    - is_aln_opposite_orientation - True if this sequence is aligned reverse to the PTLP sequence. Thus, the PTLP sequence's is_aln_opposite_orientation attribute is always false.
    - is_mismatch - True if this sequence's residues are different than the PTLP sequence at this locus. Thus, the PTLP sequence's is_mismatch attribute is always false.
    - seq_id_traits_by_assembly - The relationships between this sequence and the genomic assemblies in which it participates (if any)
      - assembly_name - The name of the assembly these traits reference
      - assembly_accession - The Genomic Collections accession for this assembly. For more information, see, for example, http://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.38/
      - is_top_level - True if the sequence is top-level (the most highly assembled sequences in a genome assembly)
      - is_alt - True if this placement's sequence is an alternate locus (a sequence that provides an alternate representation of a locus found in a largely haploid assembly)
      - is_patch - True if this placement's sequence is a patch sequence (a contig sequence that is released outside of the full assembly release cycle. These sequences are meant to add information to the assembly without disrupting the stable coordinate system)
      - is_chromosome - True if this placement's sequence is a chromosome sequence (a relatively complete pseudo-molecule assembled from smaller sequences (components) that represent a biological chromosome)
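The field descriptions above translate into a short traversal. The sketch below renders each allele of a placement as a SPDI string and returns None where the allele is a frameshift instead of a SPDI; because allele lists share the PTLP order, index i refers to the same allele on every placement. The function names are hypothetical.

```python
def spdi_strings(placement):
    """One entry per allele: 'seq_id:position:deleted:inserted', or None."""
    result = []
    for entry in placement["alleles"]:
        spdi = entry["allele"].get("spdi")
        if spdi is None:
            # Exactly one of spdi/frameshift is present; this is a frameshift.
            result.append(None)
        else:
            result.append(
                "{seq_id}:{position}:{deleted_sequence}:{inserted_sequence}"
                .format(**spdi)
            )
    return result


def alleles_by_placement(rs_obj):
    """List of (is_ptlp, alleles) pairs; allele order matches the PTLP."""
    placements = rs_obj["primary_snapshot_data"]["placements_with_allele"]
    return [(p["is_ptlp"], spdi_strings(p)) for p in placements]
```

Comparing position i across the returned lists shows how a single PTLP allele appears on each of the other placements.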

Breakdown of B153 via Variant Services RefSNP JSON

Better format to view here: https://gist.github.com/countdigi/4a467709da85caa51d54fe10c64dd4c9

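Chr22

Columns: count = number of entries with this attribute combination; name = assembly_name (GRCh prefix omitted); ptlp = is_ptlp; opp = is_aln_opposite_orientation; mism = is_mismatch; mol = mol_type; alt = is_alt; chrom = is_chromosome; patch = is_patch; top = is_top_level; rsids = example refsnp_id values.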

idx     count   name    ptlp    opp     mism    mol     alt     chrom   patch   top     rsids
1       9139281 38.p12  True    False   False   genomic False   True    False   True    782,783,795
2       8315064 37.p13  False   False   False   genomic False   True    False   True    782,783,800
3       310452  38.p12  False   False   False   genomic True    False   False   True    1964,2719,3484
4       206704  37.p13  False   False   False   genomic False   False   True    True    7245,10248,10794
5       198592  37.p13  False   True    False   genomic False   True    False   True    10083,117222,131488
6       120216  37.p13  False   False   False   genomic False   False   False   True    366279,366777,372180
7       73063   38.p12  False   False   False   genomic False   False   False   False   504728,545765,656963
8       72307   38.p12  False   False   False   genomic False   False   True    True    10451,16947,131198
9       66116   37.p13  False   True    False   genomic False   False   False   True    79883,364621,369381
10      7792    38.p12  False   True    False   genomic False   False   False   True    408896,410993,3895592
11      5241    37.p13  False   False   True    genomic False   True    False   True    7104,7117,11109
12      2493    38.p12  False   True    False   genomic False   True    False   True    1946100,34109265,34453600
13      2455    37.p13  False   True    False   genomic False   False   True    True    1946100,34109265,34453600
14      1489    37.p13  False   False   True    genomic False   False   True    True    25274,78489,131515
15      1033    38.p12  False   False   True    genomic True    False   False   True    16947,23669,28557
16      809     37.p13  False   True    True    genomic False   True    False   True    738829,738830,915675
17      751     37.p13  True    False   False   genomic False   True    False   True    695520,2818600,3047462
18      536     38.p12  False   False   True    genomic False   False   True    True    7245,713811,714002
19      254     38.p12  False   True    True    genomic False   True    False   True    1894501,1946101,1946102
20      217     37.p13  False   True    True    genomic False   False   True    True    1894501,1946101,1946102
21      163     37.p13  False   True    True    genomic False   False   False   True    1818579,1962473,2007219
22      27      38.p12  False   True    True    genomic False   False   False   True    1807482,2379834,5748858
23      7       38.p12  False   False   False   genomic False   True    False   True    375611631,782043702,782502812
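A table of this shape can be produced by counting attribute combinations. The sketch below is not dbSNPJson.py; it assumes one row per (placement, assembly trait) pair and keeps up to three example rs IDs per combination. The function names are hypothetical.

```python
from collections import Counter


def attribute_rows(rs_obj):
    """Yield one attribute tuple per (placement, assembly trait) pair."""
    for placement in rs_obj["primary_snapshot_data"]["placements_with_allele"]:
        annot = placement["placement_annot"]
        for trait in annot["seq_id_traits_by_assembly"]:
            yield (
                trait["assembly_name"],
                placement["is_ptlp"],
                annot["is_aln_opposite_orientation"],
                annot["is_mismatch"],
                annot["mol_type"],
                trait["is_alt"],
                trait["is_chromosome"],
                trait["is_patch"],
                trait["is_top_level"],
            )


def frequency_table(rs_objs, examples=3):
    """Return (count, row, example_ids) tuples, most frequent first."""
    counts = Counter()
    example_ids = {}
    for rs_obj in rs_objs:
        for row in attribute_rows(rs_obj):
            counts[row] += 1
            ids = example_ids.setdefault(row, [])
            rsid = rs_obj["refsnp_id"]
            if len(ids) < examples and rsid not in ids:
                ids.append(rsid)
    return [(n, row, example_ids[row]) for row, n in counts.most_common()]
```

Here `rs_objs` can be any iterable of parsed RefSNP records, so the function works on a streamed file as well as on an in-memory list.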

Breakdown of Build 151 SNPChrPosOnRef

GRCh38.p7 Build 151

r38=/shares/hii/bioinfo/ref/ncbi/human_9606_b151_GRCh38p7/b151_SNPChrPosOnRef_108.bcp.gz

$ zcat $r38 | awk -F'\t' '{print $2}' | sort | uniq -c

51705524 1
55501526 2
45423397 3
43701131 4
40977338 5
38316772 6
36568839 7
34770927 8
28775453 9
30525979 10
31276171 11
30325840 12
22389429 13
20442817 14
19101963 15
21011084 16
18601096 17
17708551 18
14227849 19
14550654 20
8717952  21
9066760  22
25367046 X
 473119  Y
   2297  MT
 800312  PAR
  39146  AltOnly
 331710  NotOn
  72445  Un

GRCh37.p13 Build 151

r37=/shares/hii/bioinfo/ref/ncbi/human_9606_b151_GRCh37p13/b151_SNPChrPosOnRef_105.bcp.gz

$ zcat $r37 | awk -F'\t' '{print $2}' | sort | uniq -c

49190997 1
54502333 2
44571425 3
43360216 4
40518121 5
37745012 6
35113619 7
34393676 8
28074089 9
29126303 10
30019009 11
29475839 12
21548359 13
19918518 14
18833397 15
20626171 16
17445536 17
17008724 18
13756018 19
13664413 20
8348337  21
8374360  22
22345464 X
 468465  Y
   2296  MT
 234386  PAR
10064120 AltOnly
11780576 NotOn
 263303  Un
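The zcat/awk/sort/uniq pipelines above can also be written in Python, which helps when the counts feed further processing. This is a minimal sketch that assumes the same layout as the awk commands (tab-separated, chromosome in the second column); the function name is made up.

```python
import gzip
from collections import Counter


def count_by_chromosome(path, column=1):
    """Count values in one tab-separated column of a gzip file (0-based index)."""
    counts = Counter()
    with gzip.open(path, "rt") as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            # Like awk's $2, a missing column is counted as the empty string.
            counts[fields[column] if len(fields) > column else ""] += 1
    return counts
```

Unlike `sort | uniq -c`, the Counter holds only one entry per distinct value, so memory use stays small even for the full file.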

Notes

A support section in the JSON can help us determine when a SNP was added. For example, rs1569537923 was added in Build 153 (although we may have to iterate over multiple support entries to find the earliest build value).

    "support": [
      {
        "id": {
          "type": "subsnp",
          "value": "ss3759385361"
        },
        "revision_added": "153",
        "create_date": "2019-07-14T03:23Z",
        "submitter_handle": "EVA"
      }
    ],
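Finding the earliest build across several support entries could look like the sketch below. It assumes `revision_added` is a numeric string, as in the example above, and skips entries where it is missing or non-numeric; the function name is hypothetical.

```python
def earliest_revision(support):
    """Return the smallest revision_added as an int, or None if none is usable."""
    revisions = [
        int(entry["revision_added"])
        for entry in support
        if str(entry.get("revision_added", "")).isdigit()
    ]
    return min(revisions) if revisions else None
```

The function takes the support list itself, so it does not depend on where that list sits inside the record.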