Locatables - WormBase/db-prototypes GitHub Wiki

(schema locatable
 (fields

   ;;
   ;; core attributes, used for all features
   ;;

   [parent :ref
	  "An entity (e.g. sequence or protein) which defines the coordinate system for this locatable."]
   [min :long :indexed
	  "The lower bound of a half-open (UCSC-style) interval defining the location."]
   [max :long :indexed
	  "The upper bound of a half-open (UCSC-style) interval defining the location."]
   [strand :enum [positive negative]
	  "Token designating the strand or orientation of this feature.  Omit if unknown or irrelevant."]
   [method :ref
	  "Method entity defining the meaning of this feature.  Required for lightweight features."]

   ;;
   ;; Attributes from ?Feature_data and #Feature_info -- used for lightweight features
   ;;

   [score :float
	  "Feature score, as used in ?Feature_data."]
   [note :string :many
	  "Human-readable note associated with a lightweight feature."]


   ;;
   ;; Binning system
   ;;

   [murmur-bin :long :indexed
	  "Bottom 20 bits contain a UCSC/BAM-style bin number.  High bits contain a Murmur3 hash code
	   for the parent sequence.  Only used for locatables attached to a parent with a :sequence/id."]

   ;;
   ;; Assembly support
   ;;

   [assembly-parent :ref
	  "The parent sequence in a genome assembly."]
   ))
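
To make the half-open convention concrete, here is a minimal Python sketch. It assumes that "UCSC-style" means a 0-based start and an exclusive end, and that incoming coordinates are 1-based and inclusive (as in GFF). The function names are hypothetical and not part of any WormBase code.

```python
def from_one_based_inclusive(start, end):
    """Convert 1-based inclusive coordinates (GFF style) to a
    0-based half-open (min, max) pair, as assumed for :locatable/min and /max."""
    if start > end:
        raise ValueError("start must not exceed end")
    return start - 1, end


def span_length(lo, hi):
    """Length of a half-open interval is simply max - min."""
    return hi - lo


def overlaps(a_min, a_max, b_min, b_max):
    """Two half-open intervals overlap only if each starts before the other ends."""
    return a_min < b_max and b_min < a_max


lo, hi = from_one_based_inclusive(101, 200)
print(lo, hi, span_length(lo, hi))    # 100 200 100
print(overlaps(100, 200, 200, 300))   # False: adjacent intervals do not overlap
```

One convenience of this convention is that adjacent features share a boundary value without overlapping, so no off-by-one adjustment is needed when tiling a sequence.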

Objects which are currently referenced via S_child tags will have, at minimum, :locatable/parent, :locatable/min, and :locatable/max attributes, e.g.:

{:gene/id             "WBGene00001488"
 :gene/public-name    "frm-1"
 :gene/version        1
 :locatable/parent    [:sequence/id "CHROMOSOME_I"]
 :locatable/min       14875783
 :locatable/max       14900330
 :locatable/strand    :locatable.strand/negative
 :locatable/method    [:method/id "Gene"]
 :gene/reference      [:paper/id "WBPaper00024261"]
 ;; etc.
 }

Note that a locatable has exactly one location (like an ACeDB S_child). It also has a contiguous span (like a single GFF record). Any substructure must be represented separately. For now, I propose that we leave the current structure of Gene/Transcript/CDS/etc. alone, and represent exon structures as they currently are, using the :transcript/source-exons attribute (see transcript/ZC247.3 for an example). Per discussions about representing complex loci and outrons, we might want to change this a bit in the future, but I'd like to concentrate on one thing at a time!

Binning

As with most non-geospatial databases, Datomic doesn't have a natural way to do interval queries (e.g. "give me the genes between 1Mb and 2Mb along CHROMOSOME_I"). To support this kind of query, every locatable attached to a Sequence object has a :locatable/murmur-bin attribute. This consists of:

  • A sequence location bin, calculated as in the BAM or Tabix specifications ORed with
  • The Murmur3 hash code of the parent sequence's name (shifted left 20 bits).
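
The two components above can be combined roughly as in the following Python sketch. The `reg2bin` function follows the binning scheme published in the BAM specification. The sequence hash is taken as a parameter, because exactly how the Murmur3 hash is computed (seed, string encoding, signedness) is not stated here; `murmur_bin` and `split_murmur_bin` are hypothetical names, and the real implementation lives in wb.binning.

```python
BIN_BITS = 20                      # low bits reserved for the bin number
BIN_MASK = (1 << BIN_BITS) - 1


def reg2bin(beg, end):
    """Smallest UCSC/BAM bin that fully contains the half-open interval [beg, end)."""
    end -= 1
    if beg >> 14 == end >> 14:
        return ((1 << 15) - 1) // 7 + (beg >> 14)
    if beg >> 17 == end >> 17:
        return ((1 << 12) - 1) // 7 + (beg >> 17)
    if beg >> 20 == end >> 20:
        return ((1 << 9) - 1) // 7 + (beg >> 20)
    if beg >> 23 == end >> 23:
        return ((1 << 6) - 1) // 7 + (beg >> 23)
    if beg >> 26 == end >> 26:
        return ((1 << 3) - 1) // 7 + (beg >> 26)
    return 0


def murmur_bin(seq_hash, lo, hi):
    """OR the bin number into the low 20 bits of the shifted sequence hash.

    seq_hash is assumed to be the Murmur3 hash of the parent sequence's name,
    computed elsewhere.
    """
    return (seq_hash << BIN_BITS) | reg2bin(lo, hi)


def split_murmur_bin(value):
    """Recover (seq_hash, bin) from a combined value."""
    return value >> BIN_BITS, value & BIN_MASK


# The frm-1 gene span from the example above falls into bin 698.
print(reg2bin(14875783, 14900330))   # 698
```

The largest bin number in this scheme is 37448, so it fits comfortably in the 20 bits set aside for it.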

See also the wb.binning namespace.
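
For the query side, the usual approach with this kind of binning is to enumerate every bin that could hold a feature overlapping the query range, look those bins up through the index, and then filter the candidates on their exact min and max. The sketch below follows the `reg2bins` routine from the BAM specification; `candidate_murmur_bins` is a hypothetical helper, and whether wb.binning structures its queries this way is an assumption.

```python
def reg2bins(beg, end):
    """All UCSC/BAM bins that may contain features overlapping [beg, end)."""
    end -= 1
    bins = [0]
    # (shift, offset of the first bin at that level), coarsest level first
    for shift, offset in ((26, 1), (23, 9), (20, 73), (17, 585), (14, 4681)):
        bins.extend(range(offset + (beg >> shift), offset + (end >> shift) + 1))
    return bins


def candidate_murmur_bins(seq_hash, lo, hi):
    """Combined values to look up for one parent sequence.

    seq_hash is assumed to be the Murmur3 hash of the sequence name,
    computed elsewhere.
    """
    return [(seq_hash << 20) | b for b in reg2bins(lo, hi)]


# "Genes between 1Mb and 2Mb" touches 76 bins across all levels.
print(len(reg2bins(1_000_000, 2_000_000)))   # 76
```

A bin match only says that a feature might overlap the range, so each candidate still needs the exact check `min < query_max and query_min < max`.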

Lightweight features

"Lightweight features" are currently stored as rows of ?Feature_data objects. In the Datomic world, I don't think there's a great deal of benefit in boxcarring features into big feature-data containers -- instead, I suggest that each one should be a standalone entity using attributes from the locatable namespace. Note that Datomic is a lot more comfortable with the idea of "anonymous" objects than ACeDB is. All the ACeDB-derived objects have primary identifiers stored in attributes like :feature/id or :gene/id, but this isn't required. Lightweight features can just have an internal Datomic entity ID (an opaque 64-bit integer), which will hopefully never be seen outside the database.

{:locatable/parent      [:sequence/id "2L52"]
 :locatable/min         1362
 :locatable/max         1475
 :locatable/method      [:method/id "modENCODE_2431_TF_HLH-1_Stage_embryo"]
 :locatable/score       74.4599
 :locatable/note        ["ChIP-Seq TF binding region for HLH-1"
                         "modENCODE ID 2431 (WBPaper00045904)"]}

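Since anonymous features have no identifier to hang error messages on, a small sanity check before loading may be useful. The Python sketch below checks the attributes that this page describes as required (parent, min, max, plus method for lightweight features). The function name and the use of plain dicts with string keys are assumptions made for illustration only.

```python
REQUIRED_ATTRS = (
    ":locatable/parent",
    ":locatable/min",
    ":locatable/max",
    ":locatable/method",   # required for lightweight features
)


def check_lightweight_feature(entity):
    """Return a list of problems found in an entity map (empty if none)."""
    problems = [f"missing {attr}" for attr in REQUIRED_ATTRS if attr not in entity]
    lo = entity.get(":locatable/min")
    hi = entity.get(":locatable/max")
    if lo is not None and hi is not None and lo > hi:
        problems.append("min exceeds max")
    return problems


feature = {
    ":locatable/parent": (":sequence/id", "2L52"),
    ":locatable/min": 1362,
    ":locatable/max": 1475,
    ":locatable/method": (":method/id", "modENCODE_2431_TF_HLH-1_Stage_embryo"),
    ":locatable/score": 74.4599,
}
print(check_lightweight_feature(feature))   # []
```
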
Splice site confirmations ("Splice" records within ?Feature_data) are a special case of lightweight features. They can be represented using core locatable attributes, plus:

(schema splice-confirm
 (fields
  [cdna :ref
	 "cdna entity which supports this intron."]
  [est :ref
	 "sequence entity of an EST which supports this intron."]
  [ost :ref
	 "sequence entity of an OST which supports this intron."]
  [rst :ref
	 "sequence entity of an RST which supports this intron."]
  [mrna :ref
	 "sequence entity of an mRNA which supports this intron."]
  [utr :ref
	 "sequence entity of a UTR which supports this intron."]
  [rnaseq :ref :component
	 "Details of RNA-seq data supporting this intron (uses splice-confirm.rnaseq namespace)."]
  [mass-spec :ref
	 "mass-spec-peptide entity which supports this intron."]
  [homology :string
	 "accession number of an external database record which supports this intron (is this used?)."]
  [false-splice :ref
	 "sequence entity providing evidence for a false splice site call."]
  [inconsistent :ref
	 "sequence entity providing evidence for an inconsistent splice site call."]))

(schema splice-confirm.rnaseq
 (fields
   [analysis :ref
	 "Analysis entity describing the RNA-seq dataset."]
   [count :long
	 "Number of reads supporting the intron."]))

Homologies

These are a special case of lightweight features. See here.