Gene Feature Enumeration - nmdp-bioinformatics/dash GitHub Wiki
Gene Feature Enumeration (GFE)
A proposal has been made for a system of enumerated gene features (untranslated regions [UTRs], exons and introns) as an extension of the HLA allele nomenclature (http://biorxiv.org/content/early/2015/02/15/015222).
We expanded and refined the elements of the original GFE proposal, as summarized in GFE_update_02202015.pdf.
DaSH II Revisions
- Change GFE notation for partial sequences from a decimal (e.g., 8.443) enumeration to a separate enumeration of partial sequences denoted with p, for 'partial' (e.g., p1, p2, p3). A partial sequence is defined as a sequence that is not full-length for a given feature due to a limitation of the typing methodology (e.g., different primer locations). Since a partial sequence can potentially match multiple full-length feature sequences, it may not be valid to identify a given partial sequence as a short version of a particular full-length feature.
- Treat unavailable/untyped/untested sequence for a feature as a partial sequence, and denote it as p0. Essentially, an unavailable sequence is a potential match to all full-length feature sequences.
- Treat indels as sequence variants and enumerate them as full sequences; these sequences are not full length for a given feature due to biological variation.
- Similarly treat deleted features as legitimate sequence variants and enumerate them as full sequences.
- Treat duplications of sequence features (e.g., two intron 1 (i1) and exon 2 (e2) sequences) in a single gene as nucleotide variants of the second duplicated feature; see GFE_update_02202015.pdf. If i1 and e2 are duplicated (e.g., 5'UTR e1 i1 e2 i1 e2 3'UTR), treat the second i1~e2 as part of the sequence of the first e2. This maintains the field structure for each gene.
- Change the delimiter from colons (:) to semi-colons (;) to further distinguish GFE notation from allele names.
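As a rough illustration of the revised notation, the sketch below renders a list of per-feature enumerations using the semi-colon delimiter, the p prefix for partial sequences, and p0 for unavailable sequence. The function names, the input shape (a list of `(enumeration, is_partial)` pairs) and the omission of any locus prefix are assumptions made for this example; they are not part of the proposal.

```python
from typing import List, Optional, Tuple

# Hypothetical input shape: one (enumeration, is_partial) pair per feature,
# in gene order (5' UTR, exon 1, intron 1, ...). None means "not typed".
Field = Tuple[Optional[int], bool]


def format_field(enumeration: Optional[int], is_partial: bool = False) -> str:
    """Render a single feature field of a GFE notation."""
    if enumeration is None:
        # Unavailable/untyped/untested sequence is denoted p0.
        return "p0"
    if enumeration < 1:
        raise ValueError("enumerations of observed sequences start at 1")
    # Partial sequences have their own enumeration, prefixed with 'p'.
    return f"p{enumeration}" if is_partial else str(enumeration)


def format_gfe(fields: List[Field]) -> str:
    """Join the feature fields with semi-colons, per the DaSH II revision."""
    return ";".join(format_field(e, p) for e, p in fields)


if __name__ == "__main__":
    # Full-length 5' UTR, untyped exon 1, partial intron 1, full-length exon 2.
    print(format_gfe([(1, False), (None, False), (3, True), (12, False)]))
```

Because a deleted feature or an indel is enumerated as a full sequence, it needs no special case here: it is simply another integer in its field.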
Considerations for a GFE Service
We also discussed ways to implement an effective GFE service, and apparent obstacles to an effective service.
- It is not clear how the 5' end of the 5' UTR and the 3' end of the 3' UTR are defined in the IMGT/HLA Database. The basis of these definitions needs to be clarified for the purpose of defining a full-length UTR sequence.
- To distinguish short feature sequences that are legitimate length variants from those that are partial sequences, the service will need to inspect short sequences for indels via comparison to a reference sequence.
- To persist enumerations (and therefore GFE notations) across IMGT/HLA Database releases, all numbered full-length and partial GFEs should first be re-evaluated against the new database annotations. New, extended or deleted sequences in that database release are evaluated only after all extant enumerations have been re-evaluated, and new (higher-numbered) full-length and partial enumerations are then assigned.
- It would be effective to hash each feature sequence, and then enumerate each unique hashed sequence.
- Each hashed sequence feature, and its associated enumeration, should be maintained in the GFE service, even if it appears to have been superseded by a change in the reference database.
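The hashing and persistence points above could be sketched as follows. This is a minimal in-memory illustration, not a description of any existing GFE service: the class and method names, the choice of SHA-256, the upper-casing of sequences before hashing, and the `(locus, feature)` keying are all assumptions made for the example.

```python
import hashlib
from typing import Dict, Optional, Tuple


class FeatureRegistry:
    """Hypothetical registry that enumerates unique feature sequences by hash."""

    def __init__(self) -> None:
        # (locus, feature, is_partial, digest) -> assigned label.
        # Entries are never removed, so superseded sequences keep their label.
        self._labels: Dict[Tuple[str, str, bool, str], str] = {}
        # (locus, feature, is_partial) -> next unused number.
        # Full-length and partial sequences are counted separately.
        self._next: Dict[Tuple[str, str, bool], int] = {}

    @staticmethod
    def _digest(sequence: str) -> str:
        # Assumption: case is not significant, so normalise before hashing.
        return hashlib.sha256(sequence.upper().encode("ascii")).hexdigest()

    def enumerate(self, locus: str, feature: str,
                  sequence: Optional[str], partial: bool = False) -> str:
        """Return the existing label for a sequence, or assign the next one."""
        if sequence is None:
            return "p0"  # unavailable sequence
        key = (locus, feature, partial, self._digest(sequence))
        if key in self._labels:
            return self._labels[key]
        counter = (locus, feature, partial)
        number = self._next.get(counter, 1)
        self._next[counter] = number + 1
        label = f"p{number}" if partial else str(number)
        self._labels[key] = label
        return label


if __name__ == "__main__":
    registry = FeatureRegistry()
    print(registry.enumerate("HLA-A", "exon2", "ACGT"))                # first full-length
    print(registry.enumerate("HLA-A", "exon2", "ACGA"))                # second full-length
    print(registry.enumerate("HLA-A", "exon2", "ACG", partial=True))   # first partial
```

Since existing labels are looked up before any new number is issued, replaying all previously enumerated sequences ahead of the sequences from a new database release keeps the old numbers stable, and new sequences receive higher numbers.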