document modeling - Tizra/Tizra-Customer-Tracker GitHub Wiki
Document Modeling in Tizra
These are a few notes on the kinds of representation that Tizra supports, and the kinds of extensions that we expect to support over time, specifically as they apply to the representation of publications. By describing some features in terms of their larger application, I am trying not only to show something of what we think the future holds, but to point out the scope of modeling that makes sense within Tizra.
Tizra was designed to work with long-form content, specifically books, and so is capable of dealing with content that can be subdivided, extracted, and regrouped. It is also designed specifically as a publication platform, not an editorial platform; this means that we do not attempt to represent all information about documents. Rather, we specifically address those aspects of document structure that affect site design, content search, and access control. We do expect Tizra to move beyond its current PDF-based content approach to include incoming data in XML, ePub and other formats, but we expect to continue to focus on the web and mobile publishing problem space.
This focus on long-form content and disaggregation of content makes Tizra somewhat different from a typical journal publishing system, where the most important requirements from a search and access control point of view can be resolved at the article level. This is true at a high level, and is not intended to imply a simplistic view of journal content: there are many reasons to represent articles in more detail (extraction of illustrations, specialized indexing, and reflowable content among them), and these are all things that expanded data formats would bring to Tizra as well.
Tizra modeling principles
Content modeling generally focuses on identifying the hierarchy of elements that compose a document, in a uniform, complete model that extends from highlighted words and meaningful components of bibliographic entries, up to large-scale components like Chapters, Sections, Volumes ... whatever makes sense for the particular document. This is almost always done in XML. Tizra's design as a publication system means that this kind of complete model is not actually needed (PDF, in particular, lacks dependable access to much meaningful structure below the page level).
The primary use of structural information in Tizra is to control disaggregation of content, and access control to content. This means that the Tizra system operates at a larger level of granularity than, say, an XML database. The Tizra principle is that access control and presentation and marketing of content are the goals, rather than acting as an authoritative Metadata archive and management tool.
MetaObjects
The standard Tizra Metadata Object (MetaObject) is best thought of as a metadata record, with a variety of associated content. I use the term record advisedly, as Tizra metadata itself is flat, as described below. A MetaObject is the smallest possible granular unit of the access control system.
All MetaObjects (potentially) consist of the following things:
- Properties: Every MetaObject has a set of (name, value) pairs that can be used to group objects into collections, or as fields for value or full-text search.
- MetaSources: A MetaSource is a group of resources (files or URLs of content) associated with the object. Access control Licenses that grant access to a MetaObject can limit or grant access to the associated MetaSources by name. The contents of a MetaSource have filenames (used in their URLs) and display names (used when generating links to the resource).
  Some MetaSources are "processed" on publishing: each source file within such a source will be processed and may be broken down into multiple pages in the final site. Such files may also create additional metadata for their containing document on initial upload (or explicit reprocessing request), including cross-reference links and table of contents entries.
- System behavior: MetaObjects are also used to represent certain things that are not publication objects. These objects have additional properties to control those behaviors. Such system object types include:
  - Excerpts: a set of pages in a publication, defined by a list of page ranges in that document,
  - UserData: data about a user login,
  - Offers: these define the price and access rights granted for a sale of content,
  - Collections: these contain other objects as defined by a boolean predicate, and
  - Virtual Collections: collections which are automatically created as needed, based on content sharing values for a certain property.
- tizra-id: There is a unique database ID associated with each MetaObject.
- URL: An object's tizra-id can be used as a URL, which will display one or more appropriate views. A custom URL can be defined, and will then be used by the system for all linking involving that object.
When source files are uploaded to a MetaSource that can be processed
MetaTypes
There is a type system associated with MetaObjects, and each MetaType is associated with a simple schema definition, enumerating the names and types of the object's properties, and the names of its MetaSources (and whether they contain content to be processed on publication, or are to be delivered as-is). The built-in MetaTypes are associated with their system behavior, but subtypes can be created. Subtypes inherit property and MetaSource definitions from their parents.
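As a rough illustration of the inheritance rule described above, here is a minimal Python sketch. The `MetaType` class, the `JournalIssue` subtype, the `Title`, `Authors` and `Volume` property names, and the rule that a subtype's own definition wins on a name clash are all assumptions made for the example; they are not Tizra's actual implementation or API. Marking `Attachments` and `FreeAttachments` as delivered as-is is likewise an assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class MetaType:
    """Hypothetical schema record: property definitions plus MetaSource names."""
    name: str
    parent: Optional["MetaType"] = None
    # property name -> property type name (e.g. "keyword-list")
    properties: Dict[str, str] = field(default_factory=dict)
    # MetaSource name -> True if processed on publication, False if delivered as-is
    sources: Dict[str, bool] = field(default_factory=dict)

    def all_properties(self) -> Dict[str, str]:
        """Merge inherited definitions; the subtype's own entries win (assumption)."""
        inherited = self.parent.all_properties() if self.parent else {}
        return {**inherited, **self.properties}

    def all_sources(self) -> Dict[str, bool]:
        """Merge inherited MetaSource definitions in the same way."""
        inherited = self.parent.all_sources() if self.parent else {}
        return {**inherited, **self.sources}


# Illustrative types only; the property names are made up for this sketch.
book = MetaType(
    "Book",
    properties={"Title": "string", "Authors": "string-list"},
    sources={"PdfSource": True, "Attachments": False, "FreeAttachments": False},
)
journal_issue = MetaType("JournalIssue", parent=book, properties={"Volume": "integer"})
```

Here `journal_issue.all_properties()` contains `Title` and `Authors` from `Book` as well as its own `Volume`, and `journal_issue.all_sources()` is the same as the parent's.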
Metadata properties
There is no notion of a traditional hierarchical metadata record in Tizra. The primary metadata in the Tizra system is a simple property system. (Tizra could store records in MARC, XML, or other formats, but only as a single field.) Any object that can have metadata has a list of named properties with values of various types, as described in this incomplete list:
- keyword: a string that is indexed as an unparsed primitive value; can be used to implement controlled vocabularies as described in Advanced Properties settings
- boolean: a value of true or false
- integer: a number (will sort numerically rather than as a string). Numbers are not normalized by the system, but are stored as they were entered.
- float: numbers with fractional parts (they are actually sorted as fixed-point decimal strings internally)
- string: a string with tokenized words, indexed for full-text search (unlike keyword values)
- text: same as the string type, but intended for longer texts.
- html: a text field that is intended to hold HTML-tagged text. Administrators get an inline HTML editor for these fields, and text indexing might be extended to ignore tags, etc.
Most primitive value types can also appear as list types. The most important list types are:
- keyword-list: This is a list of unparsed tokens, each indexed as a complete string for search. These are used throughout Tizra as the primary way of creating collections by means of boolean queries on combinations of tags.
- string-list: This is a list of tokenized strings. Used in Tizra for things like author lists, where there is a series of components, but each one may be tokenized for search (e.g. on last name).
- boolean-list: This is used in conjunction with value restrictions to implement Keyword Checklist properties as described in Advanced Properties settings.
- integer-list, etc.
JSON Note: All list types are represented internally by semicolon-separated escaped strings, but JSON APIs return list-format properties as JSON arrays rather than encoded strings.
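To make the difference between the internal string form and the JSON array form concrete, here is a small Python sketch. The note above does not say which escape scheme Tizra uses, so the backslash escaping of `;` and `\` below is purely an assumption, as are the function names `encode_list` and `decode_list`.

```python
from typing import List


def encode_list(values: List[str]) -> str:
    """Join values with ';', backslash-escaping '\\' and ';' inside each value.

    The escape character is an assumption; the real internal format may differ.
    """
    return ";".join(v.replace("\\", "\\\\").replace(";", "\\;") for v in values)


def decode_list(encoded: str) -> List[str]:
    """Inverse of encode_list: split on unescaped ';' and drop the escapes."""
    if not encoded:
        # In this sketch an empty string means an empty list, so a list holding
        # a single empty string cannot be represented.
        return []
    values: List[str] = []
    current: List[str] = []
    escaped = False
    for ch in encoded:
        if escaped:
            current.append(ch)      # take the escaped character literally
            escaped = False
        elif ch == "\\":
            escaped = True          # next character is literal
        elif ch == ";":
            values.append("".join(current))
            current = []
        else:
            current.append(ch)
    values.append("".join(current))
    return values
```

A client of the JSON APIs would never see the encoded form; it would receive the equivalent of `decode_list(...)` serialized as a JSON array.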
JSON-valued fields are available for use in FreeMarker templates and in calculated values. They are not currently indexed by the search engine directly, so additional calculated properties may need to be created if you want to implement searching of complex field content.
This is the complete list of property types in Tizra: integer, integer-list, string, string-list, date, date-list, boolean, boolean-list, float, float-list, reference (not currently used; intended for cross-references at some point), reference-list (not used), keyword, keyword-list, encrypted-url (for Amazon streaming content), html, text, css-color, auto-uuid, isbn, isbn13, json-array, json-hash, json-value.
MetaSources
These represent assets associated with a document, which might be files to be presented as downloadable attachments, HTML sub-sites, or files to be processed as the content of a document (e.g. source PDF files). MetaSources can also contain URLs representing externally stored resources.
Book: A publication
The Book MetaType is a "normal", full-fledged Tizra publication, with user-managed metadata and three MetaSources: PdfSource, Attachments, and FreeAttachments. In regular Tizra usage, files in the PdfSource may be processed to extract metadata and PDF bookmarks (to populate the table of contents). At publication time, each file is broken up into pages to create the full text content of the document.
Associated with a book is a table of contents, ToC, which is a hierarchical index of pages in the document. While there is only one ToC today, we expect that it will make sense in the long run to use the same abstraction as a way to represent any kind of external index of the content of a Book, e.g. to create lists of Illustrations and tables, or even topic indexes.
A ToC entry consists of the following:
- a title: a string which labels the referenced content,
- a ToC level: the absolute nesting level of this entry,
- a visibility flag: allows the entry to be omitted from the table of contents display (needed because the structures that come from publishers sometimes include extra nesting levels that may not make sense for online display),
- a page number: the page number of the target page,
- a logical page number: we might use this at a later stage,
- a set of props: these properties are only available via the API, and are discussed in the documentation of the publication API POST-source-file-api.
Tables of contents are currently extracted from the source files in the PdfSource when they are uploaded (or upon later administrative request). Because page references are stable in PDF books, these are explicitly manageable objects, but that will not make sense for XML content.
The system allows users to use table of contents entries to automatically create Excerpts for the page ranges defined by the start and end of a hierarchy level in the table of contents.
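One plausible way to derive such a page range is sketched below in Python. The rule used here (an entry runs until just before the next entry at the same or a shallower level, or to the end of the document) is an assumption about how "start and end of a hierarchy level" could be interpreted, not a description of Tizra's code. `TocEntry` and `entry_page_range` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class TocEntry:
    """Simplified stand-in for a ToC entry (props and logical page omitted)."""
    title: str
    level: int           # absolute nesting level (1 = top level in this sketch)
    page: int            # physical page number of the target page
    visible: bool = True


def entry_page_range(toc: List[TocEntry], index: int, last_page: int) -> Tuple[int, int]:
    """Return the (first, last) physical pages covered by toc[index].

    The range ends just before the next entry at the same or a shallower
    level, or at last_page if no such entry follows.
    """
    entry = toc[index]
    for later in toc[index + 1:]:
        if later.level <= entry.level:
            # max() keeps the range non-empty when two entries share a page
            return entry.page, max(entry.page, later.page - 1)
    return entry.page, last_page


toc = [
    TocEntry("Chapter 1", 1, 1),
    TocEntry("Section 1.1", 2, 3),
    TocEntry("Chapter 2", 1, 10),
]
chapter_1 = entry_page_range(toc, 0, last_page=20)    # (1, 9)
section_1_1 = entry_page_range(toc, 1, last_page=20)  # (3, 9)
chapter_2 = entry_page_range(toc, 2, last_page=20)    # (10, 20)
```

An Excerpt for "Chapter 1" would then be defined by the page range 1-9.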
Currently, the table of contents for a book is made up of the concatenation of the lists of ToC entries for each source file. To enable more flexible management of tables of contents, we would like to implement a level offset that would allow the ToC entries created by a source file to be nested rather than placed at the top level.
We also think that it may make sense to have an option that allows a metadata update to specify that a file should automatically define an Excerpt, and to specify the metadata that should be associated with that excerpt (allowing for additional sub-part ToCs, for instance).
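The following Python sketch shows what concatenation with a per-file level offset might look like. Since the offset is a proposed feature, everything here is hypothetical: the `Entry` tuple, the `concatenate_tocs` function, and the assumption that entries already carry book-wide page numbers.

```python
from typing import List, Tuple

# (title, level, page): a simplified stand-in for a ToC entry
Entry = Tuple[str, int, int]


def concatenate_tocs(files: List[Tuple[List[Entry], int]]) -> List[Entry]:
    """Concatenate per-file ToC lists, adding each file's level offset.

    files holds one (entries, level_offset) pair per source file, in order.
    An offset of 0 for every file reproduces plain concatenation.
    """
    combined: List[Entry] = []
    for entries, offset in files:
        for title, level, page in entries:
            combined.append((title, level + offset, page))
    return combined


part_file = [("Part I", 1, 1)]
chapter_file = [("Chapter 1", 1, 2), ("Section 1.1", 2, 3)]

# Nest the chapter file's entries one level below "Part I".
book_toc = concatenate_tocs([(part_file, 0), (chapter_file, 1)])
```

With these inputs, "Chapter 1" lands at level 2 and "Section 1.1" at level 3, so both sit under "Part I" instead of beside it.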
Page: The smallest fungible unit
Pages are derived from a Book when it is published, by processing the files in its processable MetaSource (the standard PdfSource). Each file in that MetaSource may produce several Pages, each of which corresponds to a web page in the online published version of a document (for a PDF, it also corresponds to an original printed page). The Pages of a Book are numbered starting with 1 (a so-called physical page number). This is an unbroken, undifferentiated sequence; there are no breaks or structural divisions to mark separate files in the MetaSource.
When handling other data formats like XML, Tizra will preserve this basic notion of processing the source content to a publication form, breaking it into smaller units (if and as appropriate), and creating a numbered list of pages representing the output.
Associated with pages are:
- their parent document: (used for access control),
- a logical page number: used where the user-facing designation of a page needs to differ from the uniform numerical order created by the system, e.g. for roman-numeral pages in books,
- cross-references: Links to other content. These may be
- URLs of web content,
- attachment files in another MetaSource of the Page's parent MetaObject, or
- links to other pages (by page number). Since the final pagination of a document is not known until it is processed, these page links are an output of the publication process. PDF links include their rectangle on the page, so that users can select them; other data formats would record the location within a page differently (if necessary). It is possible to associate link display information, like new window targets, with a cross-reference. It is also possible to create a named internal anchor on a page as a special type of cross-reference. This is implemented as a fragment identifier.
- properties: as a full-fledged MetaObject, pages can have metadata, but currently they simply inherit values from their parents (except for the special case of their full-text content). Given their status as derived objects, we do not currently expect that there will be any metadata associated with pages that is not the result of a calculation carried out during publication.
Excerpt: a set of pages from a Book
An Excerpt is a MetaObject composed of a set of pages drawn from its parent MetaObject. Excerpts can have explicitly managed metadata (though by default they inherit from their parent). This means that Excerpts can have their own Authors list or other metadata, and be indexed appropriately in Title indexes and collections.
The contents of an Excerpt are defined by a list of pages or page ranges (for example, an excerpt defined by 1,5-6 would have 3 pages that are not all contiguous). Excerpts are defined entirely by this page range, and are not constrained in any way by ToC entries or source file boundaries within their MetaSource.
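As an illustration of the page-range notation, the Python sketch below expands an expression such as 1,5-6 into the individual page numbers. The function name `parse_page_spec` and the error handling are assumptions; the exact syntax Tizra accepts (whitespace, descending ranges, duplicates) is not specified here.

```python
from typing import List


def parse_page_spec(spec: str) -> List[int]:
    """Expand a page list such as "1,5-6" into [1, 5, 6].

    Ranges are inclusive. Descending ranges are rejected, which is a choice
    made for this sketch rather than documented behavior.
    """
    pages: List[int] = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"descending range: {part!r}")
            pages.extend(range(start, end + 1))
        else:
            pages.append(int(part))
    return pages


example_pages = parse_page_spec("1,5-6")  # [1, 5, 6]
```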
Excerpts have their own URLs (which display a "Table of Contents page"). It's worth noting that any metadata can be displayed on this page, like chapter Authors and Abstract, for instance. A ToC per se need not actually be included. If tables of contents are not enabled for an excerpt, the system will automatically redirect accesses to its URL to the first page of the excerpt.
One feature that we have considered but not implemented is the idea of allowing page navigation within an Excerpt. If we had an excerpt defined as before, by the expression 1,5-6, and it had the custom URL excerpt, then an access to excerpt/1 would deliver page 1 of the parent document (as page 1 of 3), an access to excerpt/2 would deliver page 5 of the parent document (as page 2 of 3), and excerpt/3 would deliver page 6 of the parent document (as page 3 of 3). Any other page numbers would 404, as they would not exist.
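Because this feature is only under consideration, the following Python sketch is entirely hypothetical. It shows the lookup the paragraph describes: a 1-based position within the excerpt is mapped to a page number in the parent document, and `None` stands in for a 404. The name `resolve_excerpt_page` is invented for the example.

```python
from typing import List, Optional


def resolve_excerpt_page(excerpt_pages: List[int], position: int) -> Optional[int]:
    """Map a 1-based position within an excerpt to a parent page number.

    excerpt_pages is the expanded page list of the excerpt, e.g. [1, 5, 6]
    for the expression 1,5-6. Returns None when the position does not exist,
    which a web layer would turn into a 404.
    """
    if 1 <= position <= len(excerpt_pages):
        return excerpt_pages[position - 1]
    return None


pages = [1, 5, 6]                              # the excerpt 1,5-6
second = resolve_excerpt_page(pages, 2)        # parent page 5 (page 2 of 3)
missing = resolve_excerpt_page(pages, 4)       # None, i.e. a 404
```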
Excerpts are part of access control, and licensing an excerpt to a user grants access to those pages that it includes (and no other parts of that document).
Excerpts also have their own Attachments MetaSource, and those attachments are offered to a user for download, when the user is viewing a page that is part of the excerpt.
Tizra's focus on publication and marketing rather than modeling shows up quite clearly in the excerpt concept, since Excerpts are completely non-hierarchical. They grant a great deal of flexibility in access control and in the ability to associate downloadable content with a particular set of pages in a document, but they pay a cost in modeling because they are not structurally linked to a hierarchy like the table of contents.
In the future implementation of non-PDF content, some Excerpts may be treated as derived objects in the same way that Pages are, as it may make more sense to use structural information available during the publishing process to decide when to create Excerpts and what metadata to associate with them.