Locatables - WormBase/db-prototypes GitHub Wiki
(schema locatable
(fields
;;
;; core attributes, used for all features
;;
[parent :ref
"An entity (e.g. sequence or protein) which defines the coordinate system for this locatable."]
[min :long :indexed
"The lower bound of a half-open (UCSC-style) interval defining the location."]
[max :long :indexed
"The upper bound of a half-open (UCSC-style) interval defining the location."]
[strand :enum [positive negative]
"Token designating the strand or orientation of this feature. Omit if unknown or irrelevant."]
[method :ref
"Method entity defining the meaning of this feature. Required for lightweight features."]
;;
;; Attributes from ?Feature_data and #Feature_info -- used for lightweight features
;;
[score :float
"Feature score, as used in ?Feature_data."]
[note :string :many
"Human-readable note associated with a lightweight feature."]
;;
;; Binning system
;;
[murmur-bin :long :indexed
"Bottom 20 bits contain a UCSC/BAM-style bin number. High bits contain a Murmur3 hash code
for the parent sequence. Only used for locatables attached to a parent with a :sequence/id."]
;;
;; Assembly support
;;
[assembly-parent :ref
"The parent sequence in a genome assembly."]
))
Objects which are currently referenced via S_child tags will have, at minimum, :locatable/parent, :locatable/min, and :locatable/max attributes, e.g.:
{:gene/id "WBGene00001488"
:gene/public-name "frm-1"
:gene/version 1
:locatable/parent [:sequence/id "CHROMOSOME_I"]
:locatable/min 14875783
:locatable/max 14900330
:locatable/strand :locatable.strand/negative
:locatable/method [:method/id "Gene"]
:gene/reference [:paper/id "WBPaper00024261"]
;; etc.
}
Note that a locatable has exactly one location (like an ACeDB S_child). It also has a contiguous span (like a single GFF record). Any substructure must be represented separately. For now, I propose that we leave the current structure of Gene/Transcript/CDS/etc. alone, and represent exon structures as they currently are, using the :transcript/source-exons attribute (see transcript/ZC247.3 for an example). Per discussions about representing complex loci and outrons, we might want to change this a bit in the future, but I'd like to concentrate on one thing at a time!
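To make the half-open convention concrete, here is a minimal sketch of converting 1-based, inclusive GFF coordinates into :locatable/min and :locatable/max, plus an overlap test on two locatables. The function names are made up for illustration and are not part of the prototype code.

```clojure
;; Hypothetical helpers, not taken from the db-prototypes code base.

(defn gff->span
  "Convert a 1-based, inclusive GFF start/end pair into the 0-based,
  half-open (UCSC-style) span used by the locatable attributes."
  [gff-start gff-end]
  {:locatable/min (dec gff-start)   ; shift the start down by one
   :locatable/max gff-end})         ; the end is unchanged

(defn span-length
  "Length in bases; with half-open coordinates this is simply max - min."
  [loc]
  (- (:locatable/max loc) (:locatable/min loc)))

(defn overlaps?
  "True if two locatables share a parent and at least one base."
  [a b]
  (and (= (:locatable/parent a) (:locatable/parent b))
       (< (:locatable/min a) (:locatable/max b))
       (< (:locatable/min b) (:locatable/max a))))
```

One consequence of the half-open convention is that two features which merely abut (one's max equals the other's min) do not count as overlapping.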
Binning
As with most non-geospatial databases, Datomic doesn't have a natural way to do interval queries (e.g. "give me the genes between 1Mb and 2Mb along CHROMOSOME_I"). To support this kind of query, every locatable attached to a Sequence object has a :locatable/murmur-bin attribute. This consists of:
- A sequence location bin, calculated as in the BAM or Tabix specifications and stored in the bottom 20 bits, bitwise-ORed with
- The Murmur3 hash code of the parent sequence's name, shifted left by 20 bits.
See also the wb.binning namespace.
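The scheme can be sketched roughly as follows. This is not the wb.binning code. It is an independent sketch of the standard BAM binning functions (reg2bin and reg2bins from the SAM/BAM specification) combined with a hash as described above. The Murmur3 variant and string encoding used for the sequence name are not specified on this page, so the hash is passed in as a plain argument (seq-hash). The names murmur-bin, candidate-murmur-bins and overlap-query are assumptions, and the query is shown only as data and has not been run against a real database.

```clojure
(def bin-levels
  "[offset shift] pairs for the five non-root levels of the BAM scheme,
  finest level (16 kb bins) first."
  [[4681 14] [585 17] [73 20] [9 23] [1 26]])

(defn reg2bin
  "Smallest bin that fully contains the 0-based, half-open interval [beg, end)."
  [beg end]
  (let [last-base (dec end)]
    (or (some (fn [[offset shift]]
                (when (= (bit-shift-right beg shift)
                         (bit-shift-right last-base shift))
                  (+ offset (bit-shift-right beg shift))))
              bin-levels)
        0)))                                   ; fall back to the root bin

(defn reg2bins
  "All bins that could hold a feature overlapping [beg, end)."
  [beg end]
  (let [last-base (dec end)]
    (into [0]
          (mapcat (fn [[offset shift]]
                    (range (+ offset (bit-shift-right beg shift))
                           (+ offset (bit-shift-right last-base shift) 1)))
                  bin-levels))))

(defn murmur-bin
  "Combine a sequence hash (high bits) with a bin number (bottom 20 bits)."
  [seq-hash bin]
  (bit-or (bit-shift-left (long seq-hash) 20) bin))

(defn candidate-murmur-bins
  "murmur-bin values to look up when querying [beg, end) on one sequence."
  [seq-hash beg end]
  (map #(murmur-bin seq-hash %) (reg2bins beg end)))

;; A possible Datomic query, kept as quoted data so this sketch has no
;; Datomic dependency. The parent check guards against hash collisions;
;; the min/max checks discard features that share a bin but do not overlap.
(def overlap-query
  '[:find [?f ...]
    :in $ ?seq [?mbin ...] ?qmin ?qmax
    :where
    [?f :locatable/murmur-bin ?mbin]
    [?f :locatable/parent ?seq]
    [?f :locatable/min ?fmin]
    [?f :locatable/max ?fmax]
    [(< ?fmin ?qmax)]
    [(> ?fmax ?qmin)]])
```

The point of the design is that an indexed lookup on a modest number of :locatable/murmur-bin values narrows the search to features in the right neighbourhood of the right sequence, after which an exact comparison on min and max is cheap.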
Lightweight features
"Lightweight features" are currently stored as rows of ?Feature_data objects. In the Datomic world, I don't think there's a great deal of benefit in boxcarring features into big feature-data containers -- instead, I suggest that each one should be a standalone entity using attributes from the locatable namespace. Note that Datomic is a lot more comfortable with the idea of "anonymous" objects than ACeDB. All the ACeDB-derived objects have primary identifiers stored in attributes like :feature/id or :gene/id, but this isn't required. Lightweight features just have an internal Datomic entity ID (an opaque 64-bit integer), which will hopefully never be seen outside the database.
{:locatable/parent [:sequence/id "2L52"]
:locatable/min 1362
:locatable/max 1475
:locatable/method [:method/id "modENCODE_2431_TF_HLH-1_Stage_embryo"]
:locatable/score 74.4599
:locatable/note ["ChIP-Seq TF binding region for HLH-1"
"modENCODE ID 2431 (WBPaper00045904)"]}
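As an illustration of how an importer might build such entities, here is a small sketch that turns one parsed feature row into an entity map of the shape shown above. The function name and the input keys (:seq-id, :start, :end and so on) are invented for this example. It also assumes that the coordinates have already been converted to the half-open convention and that min never exceeds max.

```clojure
;; Hypothetical importer helper; input keys are assumptions, not an existing API.
(defn lightweight-feature
  "Build an anonymous locatable entity map from one parsed feature row.
  Parent, min, max and method are required; score, notes and strand are
  only added when present."
  [{:keys [seq-id start end method-id score notes strand]}]
  {:pre [(string? seq-id) (string? method-id) (<= start end)]}
  (cond-> {:locatable/parent [:sequence/id seq-id]
           :locatable/min    start
           :locatable/max    end
           :locatable/method [:method/id method-id]}
    score       (assoc :locatable/score score)
    (seq notes) (assoc :locatable/note (vec notes))
    strand      (assoc :locatable/strand
                       (keyword "locatable.strand" (name strand)))))

;; Rebuilding the example entity shown above:
(lightweight-feature
 {:seq-id    "2L52"
  :start     1362
  :end       1475
  :method-id "modENCODE_2431_TF_HLH-1_Stage_embryo"
  :score     74.4599
  :notes     ["ChIP-Seq TF binding region for HLH-1"
              "modENCODE ID 2431 (WBPaper00045904)"]})
```

Note that the map carries no identifier attribute of its own, which is what makes the feature "anonymous".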
Splice site confirmations ("Splice" records within ?Feature_data) are a special case of lightweight features. They can be represented using core locatable attributes, plus:
(schema splice-confirm
(fields
[cdna :ref
"cdna entity which supports this intron."]
[est :ref
"sequence entity of an EST which supports this intron."]
[ost :ref
"sequence entity of an OST which supports this intron."]
[rst :ref
"sequence entity of an RST which supports this intron."]
[mrna :ref
"sequence entity of an mRNA which supports this intron."]
[utr :ref
"sequence entity of a UTR which supports this intron."]
[rnaseq :ref :component
"Details of RNA-seq data supporting this intron (uses splice-confirm.rnaseq namespace)."]
[mass-spec :ref
"mass-spec-peptide entity which supports this intron."]
[homology :string
"accession number of an external database record which supports this intron (is this used?)."]
[false-splice :ref
"sequence entity providing evidence for a false splice site call."]
[inconsistent :ref
"sequence entity providing evidence for an inconsistent splice site call."]))
(schema splice-confirm.rnaseq
(fields
[analysis :ref
"Analysis entity describing the RNA-seq dataset."]
[count :long
"Number of reads supporting the intron."]))
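For illustration, a splice confirmation entity using these attributes might look something like the map below. Every identifier and number in it is a made-up placeholder rather than real WormBase data, and the :analysis/id attribute is an assumption modelled on :gene/id and :sequence/id. Because :splice-confirm/rnaseq is a component reference, its details can be written as a nested map in a Datomic transaction.

```clojure
;; All values below are placeholders, not real records.
(def example-splice-confirmation
  {;; core locatable attributes: the intron's span on its parent sequence
   :locatable/parent [:sequence/id "EXAMPLE_SEQUENCE"]
   :locatable/min    1000
   :locatable/max    1250
   :locatable/strand :locatable.strand/positive
   :locatable/method [:method/id "EXAMPLE_SPLICE_METHOD"]

   ;; evidence: one supporting EST plus RNA-seq read support
   :splice-confirm/est    [:sequence/id "EXAMPLE_EST"]
   :splice-confirm/rnaseq {:splice-confirm.rnaseq/analysis
                           [:analysis/id "EXAMPLE_ANALYSIS"] ; assumed attribute
                           :splice-confirm.rnaseq/count 12}})
```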
Homologies
These are a special case of lightweight features. See here.