Design V1 - project-octopus/octopodes GitHub Wiki

Documents

Provenance information is stored in a single CouchDB database. There are two basic document types, CreativeWork and WebPage. Each webpage may encapsulate one or more MediaObject files that represent ("encode") a work.

It feels backwards to group publications of a work on a web page, rather than having discrete publications, each of which might have a web page as a property, or some other locator/identifier. I'm also skeptical of relating each MediaObject to a work -- there can be many encodings of a single work on a page, and it can be hard to discover all, or sometimes any, of them. It should be possible to associate any number of media objects (>= 0) with a publication. I guess >= 1 can be done in the current schema by having lots of media objects, but the 0 case does not seem possible.

Each document is equivalent to its Schema.org counterpart. For the sake of brevity, important properties such as name, creator, license, etc. are omitted from the example:

{
  "docs": [
    {
      "_id": "document_1",
      "@id": "works/1",
      "@type": "CreativeWork",
      "url": "http://..."
    },
    {
      "_id": "document_2",
      "@id": "webpages/1",
      "@type": "WebPage",
      "hasPart": [
        {
          "@id": "webpages/1/mediaobject/1",
          "@type": "MediaObject",
          "contentUrl" : "http://...",
          "encodesCreativeWork": "works/1"
        }
      ],
      "url": "http://..."
    }
  ]
}

Each document will also carry a timestamp and the identity of the user who made the edit. There are two additional document types (not shown): User, for a user's public profile, and Identity, for a user's private information and password.

Example

The following example is presented in YAML for the sake of readability. It shows two works from the Rijksmuseum, and a single Wikipedia page containing two media objects which encode them.

I don't think CreativeWork should have a url. A URL is just another publication; in this schema, it would be a media object encoding the work in question, under a web page document.

---
  docs:
    -
      doc_id: "document_1"
      id: "works/1"
      type: "CreativeWork"
      name: "The Threatened Swan, Jan Asselijn, c. 1650"
      url: "https://www.rijksmuseum.nl/en/collection/SK-A-4"
    -
      doc_id: "document_2"
      id: "works/2"
      type: "CreativeWork"
      name: "Muleteers beside an Italian Ruin, Jan Asselijn, c. 1650"
      url: "https://www.rijksmuseum.nl/en/collection/SK-C-89"
    -
      doc_id: "document_3"
      id: "webpages/1"
      type: "WebPage"
      name: "Jan Asselijn"
      url: "http://en.wikipedia.org/wiki/Jan_Asselijn"
      hasPart:
        -
          id: "webpages/1/mediaobject/1"
          type: "MediaObject"
          encodesCreativeWork: "works/1"
          name: "The Threatened Swan, one of the top works in the Rijk..."
          contentUrl: "http://upload.wiki...Swan.jpg"
        -
          id: "webpages/1/mediaobject/2"
          type: "MediaObject"
          encodesCreativeWork: "works/2"
          name: "Asselijn, Jan ~ Italian Landscape with the Ruins of a..."
          contentUrl: "http://upload.wiki...Aqueduct.jpg"

Discussion

It may be desirable to wrap each MediaObject in an additional CreativeWork so that another layer of information can be represented.

This shouldn't be done unless there is in fact another CreativeWork (e.g. an adaptation of the first one), in which case there should be another CreativeWork document, and the MediaObject should refer to the second one.

Views

The following views are currently available with this schema:

  • Total number of creative works
  • Total number of web pages
  • Total number of edits (See Versioning below)
  • All domains for works or web pages (ignoring media files for the moment)

What is the domains-for-works query based on? Hopefully not the url field of CreativeWork, which ought to go.

  • All works or webpages on a particular domain
  • Everything on a particular web page
  • All related webpages and media files for a particular work
  • All edits for a particular work or webpage
  • All edits by a particular user

Examples

The view output below is based on the YAML document above.

Total works: 2
Total webpages: 1
Total edits: 3

All Domains:
* key: ["en.wikipedia.org"], value: 1
* key: ["rijksmuseum.nl"], value: 2

Everything from en.wikipedia.org
* doc_id: "document_3", key: ["en.wikipedia.org","webpages/1","Jan Asselijn"], value: 1

Everything from rijksmuseum.nl
* doc_id: "document_1", key: ["rijksmuseum.nl","works/1","The Threatened Swan, Jan Asselijn, c. 1650"], value: 1
* doc_id: "document_2", key: ["rijksmuseum.nl","works/2","Muleteers beside an Italian Ruin, Jan Asselijn, c. 1650"], value: 1

Everything for The Threatened Swan
* doc_id: "document_1", key: ["works/1","works/1","The Threatened Swan, Jan Asselijn, c. 1650"], value: 1
* doc_id: "document_3", key: ["works/1","webpages/1/mediaobject/1","The Threatened Swan, one of the top works in the Rijksmuseum"], value: 1
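
The listings above can be approximated with a small in-memory sketch. It is written in Python purely for illustration (real CouchDB views are normally written as JavaScript map/reduce functions), and the names `map_by_domain`, `map_by_work`, `run_view` and `count_by_domain` are hypothetical. Stripping a leading "www." is an assumption made only so that the keys match the sample output.

```python
from collections import Counter
from urllib.parse import urlparse

# Documents mirroring the YAML example; truncated names are kept as-is.
DOCS = [
    {"_id": "document_1", "@id": "works/1", "@type": "CreativeWork",
     "name": "The Threatened Swan, Jan Asselijn, c. 1650",
     "url": "https://www.rijksmuseum.nl/en/collection/SK-A-4"},
    {"_id": "document_2", "@id": "works/2", "@type": "CreativeWork",
     "name": "Muleteers beside an Italian Ruin, Jan Asselijn, c. 1650",
     "url": "https://www.rijksmuseum.nl/en/collection/SK-C-89"},
    {"_id": "document_3", "@id": "webpages/1", "@type": "WebPage",
     "name": "Jan Asselijn",
     "url": "http://en.wikipedia.org/wiki/Jan_Asselijn",
     "hasPart": [
         {"@id": "webpages/1/mediaobject/1", "@type": "MediaObject",
          "encodesCreativeWork": "works/1",
          "name": "The Threatened Swan, one of the top works in the Rijk..."},
         {"@id": "webpages/1/mediaobject/2", "@type": "MediaObject",
          "encodesCreativeWork": "works/2",
          "name": "Asselijn, Jan ~ Italian Landscape with the Ruins of a..."},
     ]},
]


def domain_of(url):
    """Host name of a URL, minus a leading 'www.' (an assumption, see above)."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def map_by_domain(doc):
    """Emit [domain, id, name] for works and web pages (media files ignored)."""
    if doc.get("@type") in ("CreativeWork", "WebPage") and doc.get("url"):
        yield [domain_of(doc["url"]), doc["@id"], doc["name"]], 1


def map_by_work(doc):
    """Emit [work id, related id, name] for a work and each media object encoding it."""
    if doc.get("@type") == "CreativeWork":
        yield [doc["@id"], doc["@id"], doc["name"]], 1
    elif doc.get("@type") == "WebPage":
        for part in doc.get("hasPart", []):
            work_id = part.get("encodesCreativeWork")
            if work_id:
                yield [work_id, part["@id"], part["name"]], 1


def run_view(map_fn, docs):
    """Apply a map function to every document and sort the rows by key."""
    rows = [(doc["_id"], key, value) for doc in docs for key, value in map_fn(doc)]
    return sorted(rows, key=lambda row: row[1])


def count_by_domain(rows):
    """Stand-in for a counting reduce grouped on the first key element."""
    return dict(Counter(key[0] for _, key, _ in rows))
```

Python's list ordering is only a rough stand-in for CouchDB's key collation, so the order of rows within a group may differ from what the database returns.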

Discussion

Each "view" in CouchDB parlance is coded as a single map/reduce function pair, sacrificing ad-hoc queries in exchange for speed and scalability. We should design the schema and views with the following constraints as goals:

  • Most use cases for data access should require only a single request to the database. Don't force the application to perform multiple requests to look up related information.
  • Make the keys for each view as informative as possible by using complex keys. For lists of results (like all matches for a particular domain), put all relevant information in the keys to avoid transferring entire documents over the wire.
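
As a rough illustration of the second goal, the sketch below shows how a list page could be rendered from complex keys alone, using one range lookup over sorted rows. It is an in-memory Python stand-in, not CouchDB itself; `ROWS` and `query_prefix` are hypothetical names. In CouchDB, the equivalent lookup would typically be a single view request with a `startkey`/`endkey` range on the key prefix.

```python
from bisect import bisect_left

# (key, doc_id) rows as a domain view might return them, sorted by key.
ROWS = sorted([
    (["en.wikipedia.org", "webpages/1", "Jan Asselijn"], "document_3"),
    (["rijksmuseum.nl", "works/1",
      "The Threatened Swan, Jan Asselijn, c. 1650"], "document_1"),
    (["rijksmuseum.nl", "works/2",
      "Muleteers beside an Italian Ruin, Jan Asselijn, c. 1650"], "document_2"),
])


def query_prefix(rows, prefix):
    """Return all rows whose complex key starts with `prefix`."""
    keys = [key for key, _ in rows]
    # A prefix sorts before every longer key that starts with it.
    start = bisect_left(keys, prefix)
    end = start
    while end < len(keys) and keys[end][:len(prefix)] == prefix:
        end += 1
    return rows[start:end]


def listing(rows):
    """Build display entries from keys only; no document fetch is needed."""
    return [{"id": key[1], "name": key[2]} for key, _ in rows]
```

For example, `listing(query_prefix(ROWS, ["rijksmuseum.nl"]))` yields the id and name of both works without loading either document.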

Versioning

When a document is changed, we first archive the current representation by making a copy marked with a version number, and then insert the new document. Most views will simply filter out any documents with the version marking, but history views can be keyed to these markings.

{
  "docs": [
    {
      "_id": "document_1",
      "@id": "works/1",
      "@type": "CreativeWork",
      "name": "NEW TITLE",
      "url": "http://..."
    },
    {
      "_id": "document_1::v1",
      "@id": "works/1",
      "@type": "CreativeWork",
      "name": "OLD TITLE",
      "url": "http://..."
    }
  ]
}
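
A minimal sketch of the archive-then-insert step follows, using a plain Python dict as a stand-in for the database. The function names are hypothetical, and the `::vN` suffix follows the `_id` convention in the example above. A real CouchDB client would also need to handle `_rev` on update, which is ignored here.

```python
import copy

VERSION_MARK = "::v"


def save_with_history(db, new_doc):
    """Archive the current copy (if any) under `<_id>::vN`, then store new_doc."""
    doc_id = new_doc["_id"]
    current = db.get(doc_id)
    if current is not None:
        # Next version number = number of existing archived copies + 1.
        version = sum(1 for key in db if key.startswith(doc_id + VERSION_MARK)) + 1
        archived = copy.deepcopy(current)
        archived["_id"] = f"{doc_id}{VERSION_MARK}{version}"
        db[archived["_id"]] = archived
    db[doc_id] = copy.deepcopy(new_doc)


def is_archived(doc):
    """True for documents carrying the version marking."""
    return VERSION_MARK in doc["_id"]


def current_docs(db):
    """What most views would see: archived copies filtered out."""
    return [doc for doc in db.values() if not is_archived(doc)]


def history(db, doc_id):
    """Archived copies of one document, oldest first."""
    copies = [doc for key, doc in db.items() if key.startswith(doc_id + VERSION_MARK)]
    return sorted(copies, key=lambda doc: int(doc["_id"].rsplit(VERSION_MARK, 1)[1]))
```

Saving "OLD TITLE" and then "NEW TITLE" under `document_1` leaves `document_1` holding the new title and `document_1::v1` holding the old one, matching the example.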

Migrations

When migrating documents to a new schema, we should convert each document to the new format, and save the old format as an attachment.
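
One possible shape for such a migration is sketched below. It assumes the old format is kept as an inline attachment, which CouchDB stores under `_attachments` as base64 `data` with a `content_type`. The `migrate` helper, the `to_v2` conversion, the `schemaVersion` field and the attachment name are all hypothetical and only serve the illustration.

```python
import base64
import json


def migrate(old_doc, convert, old_schema_version):
    """Convert a document and attach its previous form as JSON."""
    # Serialize the old form first so the conversion cannot alter it.
    payload = json.dumps(old_doc, sort_keys=True).encode("utf-8")
    new_doc = convert(old_doc)
    new_doc["_id"] = old_doc["_id"]
    attachments = new_doc.setdefault("_attachments", {})
    attachments[f"schema-v{old_schema_version}.json"] = {
        "content_type": "application/json",
        "data": base64.b64encode(payload).decode("ascii"),
    }
    return new_doc


def to_v2(doc):
    """Hypothetical conversion: copy the document and stamp a schema version."""
    converted = dict(doc)
    converted["schemaVersion"] = 2
    return converted


def attached_original(doc, old_schema_version):
    """Recover the pre-migration document from its attachment."""
    entry = doc["_attachments"][f"schema-v{old_schema_version}.json"]
    return json.loads(base64.b64decode(entry["data"]))
```

Keeping the old form next to the new one means a migration can be audited or reversed per document, at the cost of larger documents.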