DASH VRS - nmdp-bioinformatics/dash GitHub Wiki

DASH VRS

VRS provides much of what earlier standards (HGVS, VCF) do not. Chief among these is the ability to produce a single, canonical (though not necessarily parsimonious) representation of a genetic variation. It is a graph representation that lends itself to computation better than previous standards do. The result of the hackathon was a demonstration that I would like to expand on and write up. The VRS developers also have something similar to feature-service that we would not have to maintain ourselves.

Background

We lack a bridge between HLA and genomics. HLA has its own nomenclature and approach to characterizing variation, which has become pervasive within the transplant community but is opaque to the broader genomics and healthcare communities. Even though the term “haplotype” was coined in the context of HLA (by Ruggero Ceppellini at the 3rd IHIW in Turin) [1], both it and the word “allele” have taken on meanings that diverge between the communities.

Why do we want a bridge?

  1. Genomics is underutilized within transplantation. The ability to integrate HLA with other genomic data will increase the power of studies leading to better matching and better outcomes for patients.
  2. HLA is underutilized within Genomics and Health. HLA variant names find their way onto prescription labels [3] and into requirements for new cancer therapies such as “anti-HLA-A2/NY-ESO-1 TCR-transduced autologous T lymphocytes” [4] without a clear path for linking these names to the underlying genomic polymorphism. There is interest in “putting HLA in ClinGen”. This hackathon seeks to develop a path to doing just that.

What is VRS?

(Abstract from [2]) Maximizing the personal, public, research, and clinical value of genomic information will require the reliable exchange of genetic variation data. We report here the Variation Representation Specification (VRS, pronounced “verse”), an extensible framework for the computable representation of variation that complements contemporary human-readable and flat file standards for genomic variation representation. VRS provides semantically precise representations of variation and leverages this design to enable federated identification of biomolecular variation with globally consistent and unique computed identifiers. The VRS framework includes a terminology and information model, machine-readable schema, data sharing conventions, and a reference implementation, each of which is intended to be broadly useful and freely available for community use. VRS was developed by a partnership among national information resource providers, public initiatives, and diagnostic testing laboratories under the auspices of the Global Alliance for Genomics and Health (GA4GH).
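The “computed identifiers” mentioned in the abstract rest on a truncated digest that GA4GH calls sha512t24u: SHA-512, truncated to the first 24 bytes, then base64url-encoded. The following is a minimal sketch of that digest applied to a raw sequence. The helper name `sequence_identifier` and the uppercasing step are assumptions made for illustration. Identifiers for full VRS objects such as Alleles are computed over a canonical serialization of the object, which is not shown here.

```python
import base64
import hashlib


def sha512t24u(blob: bytes) -> str:
    """GA4GH truncated digest: SHA-512, first 24 bytes, base64url-encoded."""
    digest = hashlib.sha512(blob).digest()
    # 24 bytes encode to exactly 32 base64 characters, so no '=' padding appears.
    return base64.urlsafe_b64encode(digest[:24]).decode("ascii")


def sequence_identifier(sequence: str) -> str:
    """Hypothetical helper: build a 'ga4gh:SQ.' style identifier from sequence text.

    Uppercasing before hashing is an assumption of this sketch.
    """
    return "ga4gh:SQ." + sha512t24u(sequence.upper().encode("ascii"))


print(sequence_identifier("ACGT"))
```

Because the identifier depends only on the sequence content, two groups that hold the same HLA reference sequence would compute the same identifier without coordinating with each other.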

Previous work

On Aug 12, 2022, a VRS hackathon organized by GA4GH at the ISMB conference in Madison, WI took on the topic of “translating HLA variation to/from VRS”. This hackathon resulted in a procedure for converting HLA variation into VRS and registering the polymorphism in ClinGen, which was demonstrated with one allele. The resulting repo [6] and discussion [7] are available. That successful demonstration led to the following next steps:

  · Encode all HLA alleles in VRS
  · Use workshop full references to make more concise variant lists
  · Explore whether versioned accessions are usable
  · Explore translating back (from VRS to HLA nomenclature)
  · Look into set operations (using SeqRepo)

Challenges

IPD-IMGT/HLA provides the variants in HGVS format (https://www.ebi.ac.uk/cgi-bin/ipd/pl/hla/get_allele_hgvs.cgi?A*01:01:01:01), but they are described relative to “versioned” accession numbers (e.g. HLA00123.2). Versioned accession numbers don’t work as a genomic reference because the version numbers only reflect changes in the CDS; non-coding variation is not reflected in the accession number. A second challenge is that the only way to look up a versioned accession number is to scan through every quarterly release in which it is found. A comprehensive table of all versioned accession numbers and their corresponding sequences could be generated and maintained, but someone would need to take on that task. In any case, the first issue is already a showstopper.
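To make the two identifiers in this section concrete, here is a small sketch that splits a versioned accession into its accession and version parts and builds the HGVS lookup URL for an allele name. The function names are hypothetical, and the accession pattern (“HLA” followed by digits, a dot, and an integer version) is an assumption inferred from the single example HLA00123.2.

```python
import re
from typing import NamedTuple

# Endpoint copied from the URL quoted in the text above.
HGVS_ENDPOINT = "https://www.ebi.ac.uk/cgi-bin/ipd/pl/hla/get_allele_hgvs.cgi"

# Assumed pattern, based only on the example "HLA00123.2".
VERSIONED_ACCESSION = re.compile(r"(HLA\d+)\.(\d+)")


class VersionedAccession(NamedTuple):
    accession: str
    version: int


def parse_versioned_accession(text: str) -> VersionedAccession:
    """Split e.g. 'HLA00123.2' into the accession and its integer version."""
    match = VERSIONED_ACCESSION.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"not a versioned accession: {text!r}")
    return VersionedAccession(match.group(1), int(match.group(2)))


def hgvs_url(allele_name: str) -> str:
    """Build the HGVS lookup URL; the allele name is the whole query string."""
    return f"{HGVS_ENDPOINT}?{allele_name}"


print(parse_versioned_accession("HLA00123.2"))
print(hgvs_url("A*01:01:01:01"))
```

Note that the parsed version number says nothing about non-coding changes, which is exactly why it cannot serve as a genomic reference.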

Given the need for a new approach, it has been proposed that we base the HLA VRS encoding on GenBank references corresponding to the full gene sequences of the lineage-specific reference alleles (Table 3 in the 17th IHIW database paper [5]). The HLA.xml and HLA.dat files contain lists of GenBank IDs corresponding to each allele.

The repo https://github.com/nmdp-bioinformatics/imgt_biosqldb needs a bit of work, but it converts HLA.dat to a BioSQL database as a one-liner. From there, one could select a GenBank reference for each of the reference alleles above. This list would only need to be built once; then, on a quarterly basis, all HLA alleles could be registered with ClinGen (maybe as part of the GFEDB load process).

Hackathon Goals

  1. Review and confirm the approach (proposed above)
  2. Develop a table of genomic references corresponding to the 17th IHIW reference alleles (static)
     a. This may benefit from using the imgt_biosqldb repo.
  3. Develop a method for converting HLA alleles (from GFEDB) to VRS.
     a. The reason to use GFE as the source is that there is a 1:1 correspondence between GFE alleles and VRS and GFE already handles the relationship to the sequence across IPD-IMGT/HLA database releases. GFEDB also contains a superset of alleles in IPD-IMGT/HLA.
  4. Register these alleles in ClinGen
  5. Annotate GFE with the ClinGen entry
  6. Annotate ClinGen with the IPD-IMGT/HLA database information (version, name) and GFE
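Goal 3 can be illustrated with a deliberately simplified sketch: given a reference sequence and an allele sequence that are already aligned and of equal length, emit one VRS-style Allele object per substituted base, using interbase (0-based, half-open) coordinates. The function name, the toy sequences, and the placeholder `sequence_id` are hypothetical. The field names follow the VRS 1.x Allele schema as I understand it and should be checked against the specification version in use. Real HLA alleles also involve insertions and deletions, which need alignment and normalization that this sketch does not attempt.

```python
from typing import Dict, List


def substitutions_to_vrs(reference: str, allele: str, sequence_id: str) -> List[Dict]:
    """Return one VRS-style Allele dict per single-base difference.

    Assumes the two sequences are pre-aligned and of equal length.
    """
    if len(reference) != len(allele):
        raise ValueError("this sketch only handles equal-length, pre-aligned sequences")

    alleles = []
    for position, (ref_base, alt_base) in enumerate(zip(reference, allele)):
        if ref_base == alt_base:
            continue
        alleles.append({
            "type": "Allele",
            "location": {
                "type": "SequenceLocation",
                "sequence_id": sequence_id,
                # Interbase coordinates: the base at 0-based index i spans [i, i + 1).
                "interval": {
                    "type": "SequenceInterval",
                    "start": {"type": "Number", "value": position},
                    "end": {"type": "Number", "value": position + 1},
                },
            },
            "state": {"type": "LiteralSequenceExpression", "sequence": alt_base},
        })
    return alleles


# Toy sequences and a placeholder identifier, not real HLA data.
for variant in substitutions_to_vrs("ACGTAC", "ACTTAG", "ga4gh:SQ.placeholder"):
    print(variant)
```

Describing every allele against one fixed genomic reference per locus is what would let the resulting objects be registered consistently across quarterly releases.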

References

[1] https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5267337

[2] https://www.ncbi.nlm.nih.gov/books/NBK321445/

[3] https://www.ncbi.nlm.nih.gov/books/NBK315783/

[4] https://www.cancer.gov/publications/dictionaries/cancer-drug/def/anti-hla-a2-ny-eso-1-tcr-transduced-autologous-t-lymphocytes

[5] https://doi.org/10.1016/j.humimm.2017.12.004

[6] https://github.com/ga4gh/vrs-hackathons/tree/main/session-products

[7] https://github.com/ga4gh/vrs-hackathons/issues/9