Gene Feature Enumeration - nmdp-bioinformatics/dash GitHub Wiki

Gene Feature Enumeration (GFE)

A proposal has been made for a system of enumerated gene features (untranslated regions [UTRs], exons and introns) as an extension of the HLA allele nomenclature (http://biorxiv.org/content/early/2015/02/15/015222).

We expanded and refined the elements of the original GFE proposal, as summarized in GFE_update_02202015.pdf.

DaSH II Revisions

Change GFE notation for partial sequences from a decimal (e.g., 8.443) enumeration to a separate enumeration of partial sequences denoted with p, for 'partial' (e.g., p1, p2, p3). A partial sequence is defined as a sequence that is not full-length for a given feature due to a limitation of the typing methodology (e.g., different primer locations). Since a partial sequence can potentially match multiple full-length feature sequences, it may not be valid to identify a given partial sequence as a short version of a particular full-length feature.
Treat unavailable/untyped/untested sequence for a feature as a partial sequence, and denote it as p0. Essentially, an unavailable sequence is a potential match to all full-length feature sequences.
Treat indels as sequence variants and enumerate them as full sequences; these sequences are not full length for a given feature because of biological variation, not because of a limitation of the typing methodology.
Similarly treat deleted features as legitimate sequence variants and enumerate them as full sequences.
Treat duplications of sequence features (e.g., two intron 1 (i1) and exon 2 (e2) sequences) in a single gene as nucleotide variants of the second duplicated feature; see GFE_update_02202015.pdf. If i1 and e2 are duplicated (e.g., 5'UTRe1i1e2i1e23'UTR), treat the second i1~e2 as part of the sequence of the first e2. This maintains the field structure for each gene.
Change the delimiter from colons (:) to semi-colons (;) to further distinguish GFE notation from allele names.
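
The notation rules above can be illustrated with a minimal Python sketch. It is not the DaSH implementation: the function names, the (number, is_partial) input shape, and the omission of any locus prefix are assumptions made for illustration. It only shows how full-length enumerations, partial enumerations (p1, p2, ...), the p0 placeholder for untyped features, and the semi-colon delimiter might fit together.

```python
from typing import Iterable, Optional, Tuple

DELIMITER = ";"  # DaSH II revision: semi-colons instead of colons


def feature_field(number: Optional[int], is_partial: bool = False) -> str:
    """Format one feature field (hypothetical helper).

    None means the feature was unavailable/untyped/untested -> "p0".
    Partial sequences have their own numbering, prefixed with "p".
    Full-length sequences (including indel and deletion variants) are
    written as a plain number.
    """
    if number is None:
        return "p0"
    if number < 1:
        raise ValueError("enumerations start at 1; use None for untyped features")
    return f"p{number}" if is_partial else str(number)


def gfe_fields(features: Iterable[Tuple[Optional[int], bool]]) -> str:
    """Join the fields of one gene, in feature order, with the delimiter."""
    return DELIMITER.join(feature_field(n, p) for n, p in features)


# Five features in order: two full-length, one untyped, one partial, one full-length.
example = gfe_fields([(1, False), (3, False), (None, False), (2, True), (1, False)])
print(example)  # 1;3;p0;p2;1
```

Because every feature always yields exactly one field, the field structure stays the same for each gene regardless of how much of it was typed.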

Considerations for a GFE Service

We also discussed ways to implement an effective GFE service, and apparent obstacles to an effective service.

It is not clear how the respective 5' and 3' ends of the 5' and 3' UTRs are defined in the IMGT/HLA Database. The basis of these definitions needs to be clarified in order to define a full-length UTR sequence.
To distinguish short feature sequences that are legitimate length variants from partial sequences, the service will need to inspect short sequences for indels via comparison to a reference sequence.
To persist enumerations (and therefore GFE notations) across IMGT/HLA Database release updates, all numbered full-length and partial GFEs should first be re-evaluated against the new database annotations. New, extended or deleted sequences in that database release are evaluated only after all extant enumerations have been evaluated, and new (higher-numbered) full-length and partial enumerations are then assigned.
It would be effective to hash each feature sequence, and then enumerate each unique hashed sequence.
Each hashed sequence feature, and its associated enumeration, should be maintained in the GFE service, even if it appears to have been superseded by a change in the reference database.
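
The hashing and persistence points above could be sketched as follows. This is a hedged illustration only: the class name FeatureRegistry, the choice of SHA-256, the upper-casing of sequences before hashing, and the locus and feature labels are all assumptions, not part of the proposal. Each unique hashed sequence receives the next number for its locus and feature, full-length and partial sequences are numbered separately, and entries are never removed.

```python
import hashlib
from typing import Dict, Tuple


class FeatureRegistry:
    """Hypothetical in-memory store mapping hashed feature sequences to enumerations."""

    def __init__(self) -> None:
        # (locus, feature, is_partial, digest) -> assigned number
        self._numbers: Dict[Tuple[str, str, bool, str], int] = {}
        # (locus, feature, is_partial) -> highest number assigned so far
        self._highest: Dict[Tuple[str, str, bool], int] = {}

    @staticmethod
    def digest(sequence: str) -> str:
        # Assumption: case differences are not meaningful, so normalise first.
        return hashlib.sha256(sequence.upper().encode("ascii")).hexdigest()

    def assign(self, locus: str, feature: str, sequence: str,
               is_partial: bool = False) -> str:
        """Return the existing enumeration for this sequence, or assign the next one."""
        key = (locus, feature, is_partial, self.digest(sequence))
        if key not in self._numbers:
            counter = (locus, feature, is_partial)
            self._highest[counter] = self._highest.get(counter, 0) + 1
            self._numbers[key] = self._highest[counter]
        number = self._numbers[key]
        return f"p{number}" if is_partial else str(number)


registry = FeatureRegistry()
print(registry.assign("HLA-A", "exon2", "ACGT"))                  # 1
print(registry.assign("HLA-A", "exon2", "ACGG"))                  # 2
print(registry.assign("HLA-A", "exon2", "acgt"))                  # 1 (same sequence as the first)
print(registry.assign("HLA-A", "exon2", "ACG", is_partial=True))  # p1
```

Since nothing is ever deleted from the registry, re-submitting extant sequences after a database release returns their original numbers, and only sequences that are new to the release receive higher numbers.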