Locatables - WormBase/db-prototypes GitHub Wiki

(schema locatable
 (fields

   ;;
   ;; core attributes, used for all features
   ;;

   [parent :ref
	  "An entity (e.g. sequence or protein) which defines the coordinate system for this locatable."]
   [min :long :indexed
	  "The lower bound of a half-open (UCSC-style) interval defining the location."]
   [max :long :indexed
	  "The upper bound of a half-open (UCSC-style) interval defining the location."]
   [strand :enum [positive negative]
	  "Token designating the strand or orientation of this feature.  Omit if unknown or irrelevant."]
   [method :ref
	  "Method entity defining the meaning of this feature.  Required for lightweight features."]

   ;;
   ;; Attributes from ?Feature_data and #Feature_info -- used for lightweight features
   ;;

   [score :float
	  "Feature score, as used in ?Feature_data."]
   [note :string :many
	  "Human-readable note associated with a lightweight feature."]


   ;;
   ;; Binning system
   ;;

   [murmur-bin :long :indexed
	  "Bottom 20 bits contain a UCSC/BAM-style bin number.  High bits contain a Murmur3 hash code
	   for the parent sequence.  Only used for locatables attached to a parent with a :sequence/id."]

   ;;
   ;; Assembly support
   ;;

   [assembly-parent :ref
	  "The parent sequence in a genome assembly."]
   ))

Objects which are currently referenced via S_child tags will have, at minimum, :locatable/parent, :locatable/min, and :locatable/max attributes, e.g.:

{:gene/id             "WBGene00001488"
 :gene/public-name    "frm-1"
 :gene/version        1
 :locatable/parent    [:sequence/id "CHROMOSOME_I"]
 :locatable/min       14875783
 :locatable/max       14900330
 :locatable/strand    :locatable.strand/negative
 :locatable/method    [:method/id "Gene"]
 :gene/reference      [:paper/id "WBPaper00024261"]
 ;; etc.
 }

Note that a locatable has exactly one location (like an ACeDB S_child). It also has a contiguous span (like a single GFF record). Any substructure must be represented separately. For now, I propose that we leave the current structure of Gene/Transcript/CDS/etc. alone, and represent exon structures as they currently are, using the :transcript/source-exons attribute (see transcript/ZC247.3 for an example). Per discussions about representing complex loci and outrons, we might want to change this a bit in the future, but I'd like to concentrate on one thing at a time!
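Because each locatable is a single contiguous half-open span, length, overlap, and GFF-style coordinates all reduce to simple arithmetic on :locatable/min and :locatable/max. The helpers below are a minimal sketch, not part of the schema: the function names are hypothetical, and the GFF conversion assumes that min is zero-based, as in the UCSC convention.

```clojure
;; Hypothetical helpers for half-open [min, max) locatables.

(defn span-length
  "Number of positions covered by the half-open interval [min, max)."
  [loc]
  (- (:locatable/max loc) (:locatable/min loc)))

(defn overlaps?
  "True when two locatables share a parent and their half-open intervals intersect.
   Intervals that merely touch (a's max equals b's min) do not overlap."
  [a b]
  (and (= (:locatable/parent a) (:locatable/parent b))
       (< (:locatable/min a) (:locatable/max b))
       (< (:locatable/min b) (:locatable/max a))))

(defn gff-coords
  "Converts to the 1-based, closed [start end] convention used by GFF.
   Assumes :locatable/min is zero-based."
  [loc]
  [(inc (:locatable/min loc)) (:locatable/max loc)])
```

With half-open coordinates the length is just max minus min, and two adjacent features never count as overlapping.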

Binning

As with most non-geospatial databases, Datomic doesn't have a natural way to do interval queries (i.e. "give me the genes between 1Mb and 2Mb along CHROMOSOME_I"). To support this kind of query, every locatable attached to a Sequence object has a :locatable/murmur-bin attribute. This consists of:

A sequence location bin, calculated as in the BAM or Tabix specifications ORed with
The Murmur3 hash code of the parent sequence's name (shifted left 20 bits).
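The two parts above can be sketched as follows. The bin arithmetic follows the reg2bin and reg2bins routines given in the BAM specification; murmur-bin and overlapping-bins are hypothetical names. The page does not say which Murmur3 variant or seed is applied to the sequence name, so the hash is taken as an argument here rather than computed.

```clojure
(defn reg2bin
  "Smallest BAM/Tabix-style bin that fully contains the zero-based,
   half-open interval [beg, end)."
  [beg end]
  (let [end (dec end)]
    (cond
      (= (bit-shift-right beg 14) (bit-shift-right end 14)) (+ 4681 (bit-shift-right beg 14))
      (= (bit-shift-right beg 17) (bit-shift-right end 17)) (+ 585 (bit-shift-right beg 17))
      (= (bit-shift-right beg 20) (bit-shift-right end 20)) (+ 73 (bit-shift-right beg 20))
      (= (bit-shift-right beg 23) (bit-shift-right end 23)) (+ 9 (bit-shift-right beg 23))
      (= (bit-shift-right beg 26) (bit-shift-right end 26)) (+ 1 (bit-shift-right beg 26))
      :else 0)))

(defn murmur-bin
  "Combines a sequence hash (supplied by the caller; the hashing scheme is an
   assumption left open here) with the bin number in the bottom 20 bits."
  [seq-hash beg end]
  (bit-or (bit-shift-left seq-hash 20) (reg2bin beg end)))

(defn overlapping-bins
  "All bins that may hold features overlapping [beg, end), one range per level."
  [beg end]
  (let [end (dec end)]
    (cons 0
          (for [[offset shift] [[1 26] [9 23] [73 20] [585 17] [4681 14]]
                k (range (+ offset (bit-shift-right beg shift))
                         (inc (+ offset (bit-shift-right end shift))))]
            k))))
```

An interval query would then look up each candidate :locatable/murmur-bin value (the sequence hash ORed with every bin from overlapping-bins) and filter the results on :locatable/min and :locatable/max, since a bin only narrows the search and does not guarantee overlap.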

Lightweight features

"Lightweight features" are currently stored as rows of ?Feature_data objects. In the Datomic world, I don't think there's a great deal of benefit in boxcarring features into big feature-data containers -- instead, I suggest that each one should be a standalone entity using attributes from the locatable namespace. Note that Datomic is a lot more comfortable with the idea of "anonymous" objects than ACeDB. All the ACeDB-derived objects have primary identifiers stored in attributes like :feature/id or :gene/id, but this isn't required. Lightweight features just have an internal Datomic entity ID (an opaque 64-bit integer), which will hopefully never be seen outside the database.

{:locatable/parent      [:sequence/id "2L52"]
 :locatable/min         1362
 :locatable/max         1475
 :locatable/method      [:method/id "modENCODE_2431_TF_HLH-1_Stage_embryo"]
 :locatable/score       74.4599
 :locatable/note        ["ChIP-Seq TF binding region for HLH-1"
						 "modENCODE ID 2431 (WBPaper00045904)"]}

Splice site confirmations ("Splice" records within ?Feature_data) are a special case of lightweight features. They can be represented using core locatable attributes, plus:

(schema splice-confirm
 (fields
  [cdna :ref
	 "cdna entity which supports this intron."]
  [est :ref
	 "sequence entity of an EST which supports this intron."]
  [ost :ref
	 "sequence entity of an OST which supports this intron."]
  [rst :ref
	 "sequence entity of an RST which supports this intron."]
  [mrna :ref
	 "sequence entity of an mRNA which supports this intron."]
  [utr :ref
	 "sequence entity of a UTR which supports this intron."]
  [rnaseq :ref :component
	 "Details of RNA-seq data supporting this intron (uses splice-confirm.rnaseq namespace)."]
  [mass-spec :ref
	 "mass-spec-peptide entity which supports this intron."]
  [homology :string
	 "accession number of an external database record which supports this intron (is this used?)."]
  [false-splice :ref
	 "sequence entity providing evidence for a false splice site call."]
  [inconsistent :ref
	 "sequence entity providing evidence for an inconsistent splice site call."]))

(schema splice-confirm.rnaseq
 (fields
   [analysis :ref
	 "Analysis entity describing the RNA-seq dataset."]
   [count :long
	 "Number of reads supporting the intron."]))
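To illustrate how these attributes might sit alongside the core locatable attributes, here is a sketch of a splice confirmation backed by RNA-seq evidence. Every identifier, coordinate, and count is an invented placeholder, and :analysis/id is an assumed attribute name that does not appear in the schemas above.

```clojure
;; Hypothetical example: all values below are placeholders, not real data.
(def example-splice-confirmation
  {:locatable/parent      [:sequence/id "EXAMPLE_SEQUENCE"]
   :locatable/min         1000
   :locatable/max         1250
   :locatable/strand      :locatable.strand/positive
   :locatable/method      [:method/id "EXAMPLE_METHOD"]
   ;; :splice-confirm/rnaseq is a component attribute, so the supporting
   ;; details are written as a nested map owned by this entity.
   :splice-confirm/rnaseq {:splice-confirm.rnaseq/analysis [:analysis/id "EXAMPLE_ANALYSIS"]
                           :splice-confirm.rnaseq/count    12}})
```

Because rnaseq is a component, the nested details have no existence independent of the intron they support.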

Homologies

These are a special case of lightweight features. See here.