Heterogeneous Collection Design Proposal - abartov/bybeconv GitHub Wiki

Initial discussion

==heterogeneous collections design==

So far, we've worked on texts, not books. The system has no concept of a "book", or of a collection of works. (It does have "Anthologies", but those are user-curated, not system-curated, and don't necessarily correspond to any authorial intent.)

Books with a single work (e.g. novels, monographs) have one manifestation.

Books with multiple works by the same author (a book of poems, articles, or short stories) are just a bunch of manifestations that don't know about each other and don't know they are part of a collection. The only thing showing that they belong to a particular book and not another is their listing (in free markdown, not in DB terms) under a title in the author's TOC.

While this is a simplistic data model, we have gotten away with it until now.

We want to begin adding more complex kinds of works: anthologies (e.g. "the best articles of 1930", "memoirs of the first settlers of city X"), which are single books with individual works by multiple authors, sometimes in different genres too; and periodicals, i.e. publications that have a series of issues, each of which is like an anthology, with multiple works by different authors, in different genres, etc.

Finally, it is time to distinguish between primary and secondary texts. Secondary texts are forewords, introductions, afterwords, notes from the publisher, an editor's preface to a magazine issue, etc. These texts need to be preserved and made accessible when browsing a periodical's issue, for example, but should not appear in the list of texts (/works). They should be included in full-text search, but not in facet search when no keyword is given.
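The visibility rule above can be sketched in plain Ruby. This is only an illustration under assumptions: the `Text` struct, its `role` field, and the `works_list` and `search` helpers are hypothetical names, not existing code in the project.

```ruby
# Hypothetical record; the 'role' field is an assumption, not an existing column.
Text = Struct.new(:title, :role, :body)

# The /works listing: primary texts only.
def works_list(texts)
  texts.select { |t| t.role == :primary }
end

# Search: with a keyword, match primary and secondary texts alike;
# without one (facets only), fall back to primary texts.
def search(texts, keyword: nil)
  return works_list(texts) if keyword.nil? || keyword.strip.empty?
  texts.select { |t| t.title.include?(keyword) || t.body.include?(keyword) }
end

texts = [
  Text.new("A poem", :primary, "verses about the sea"),
  Text.new("Editor's preface", :secondary, "a note about the sea poems in this issue")
]

puts works_list(texts).map(&:title).inspect
puts search(texts, keyword: "sea").map(&:title).inspect
puts search(texts).map(&:title).inspect
```

The preface shows up only in the keyword search; the listing and the keyword-less search return the poem alone.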

===What is needed===

A thorough review of the data model, and a proposed design that would accommodate these more complex entities. Once we come up with and refine a robust data model, we can proceed to an implementation and migration plan.

Relations refactoring

The first proposal I can make here is to convert the existing Work-Expression and Expression-Manifestation relations from many-to-many to one-to-many. This will simplify the data model and allow us to drop two tables:

expressions_manifestations
expressions_works

We will add expressions.work_id and manifestations.expression_id columns instead. This should be a relatively simple change, and I believe it will also reduce the load on the database, as we'll need fewer joins.

@abartov says: yes, let's do this. This means giving up on the maximalist interpretation of the possible relations between works and expressions (viz. that a single Expression can be expressing more than one work), but that seems unrealistic at this point. Instead, a single text that is expressing more than one work will be implemented as a composite expression, with each component expression still expressing one work.
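A minimal sketch of the data step this refactoring implies, in plain Ruby rather than an actual Rails migration. The row shape and the `plan_foreign_keys` helper are hypothetical; only the table and column names come from the proposal. It derives a single `work_id` per expression from `expressions_works` rows and reports any expression linked to more than one work, since those would need manual handling (for example, conversion to a composite expression) before the join table can be dropped.

```ruby
# Hypothetical row from the expressions_works join table.
JoinRow = Struct.new(:expression_id, :work_id)

# Returns [assignments, conflicts]:
#   assignments: { expression_id => work_id } for expressions with exactly one work
#   conflicts:   { expression_id => [work_id, ...] } for expressions with several works
def plan_foreign_keys(join_rows)
  grouped = join_rows.group_by(&:expression_id)
                     .transform_values { |rows| rows.map(&:work_id).uniq.sort }
  conflicts, single = grouped.partition { |_id, work_ids| work_ids.size > 1 }
  [single.to_h.transform_values(&:first), conflicts.to_h]
end

rows = [
  JoinRow.new(1, 10),
  JoinRow.new(2, 11),
  JoinRow.new(3, 12),
  JoinRow.new(3, 13) # one expression expressing two works: cannot become a single FK
]

assignments, conflicts = plan_foreign_keys(rows)
puts "expressions.work_id values: #{assignments.inspect}"
puts "needs manual review: #{conflicts.inspect}"
```

The same check would apply to `expressions_manifestations` with the roles reversed (one `expression_id` per manifestation).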

Splitting Manifestation and Text entities

As we discussed over the phone, Manifestation is a term for an actual printed book, while in the Bybeconv project it represents the actual text. We can split it into two tables:

'Manifestation' to keep information about actual printed books, related to given Expression.
'Text' (or propose another name for it). It should contain the actual Markdown.

In this case we will move publishing info such as 'publisher', 'publishing_year', etc. from the Expression entity to the Manifestation entity.

In theory we can gather information about all publications of a given Expression (e.g. this book was issued in 2010 by Publisher 1, and in 2012 by Publisher 2, etc.). I'm not sure this information is actually required, but it would be possible.

As a side note, all records in the Manifestation table currently have publisher = 'פרויקט בן-יהודה', and I'm not sure this makes any sense at all. We can consider storing the actual publishers and publication years in this modified Manifestations table instead.

Alternatively, we can avoid adding a 'Text' entity and move the markdown to the Expression table instead. My point here is that our markdown is actually an 'electronic version' of the Expression, not of the Manifestation. We can still keep a link to the actual Manifestation (printed book) this text was grabbed from, and maybe augment it with data about who grabbed it.

@abartov says: this makes sense. Technically, electronic publishing is publishing, so FRBR would consider our online version a manifestation. That's how I originally designed it, and that's why all our Manifestations have the project name as publisher. But this perhaps only makes sense for other catalogs or external entities describing our electronic resources; for us, internally, it is rather unnecessary. So yes, I am in favor of moving the original publisher information and markdown to the Expression entity. We may want to rename it to Text, since it would no longer quite correspond to the FRBR concept of expression. FRBR separates expression from manifestation because the exact same expression of a text can be published many times in different books, but we don't care about that: we are not presuming to map all of Hebrew bibliography, only the source editions for our own electronic editions. This entity will now represent our electronic edition of the particular text, but will also record its provenance from a printed edition, which could be done via a foreign key to a (re-purposed) Manifestation entity.

Planning this change would be complex, as Expressions and Manifestations are all over the codebase, and there are practically no tests, except for the partial coverage provided by the API work.
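To make the agreed direction concrete, here is a rough plain-Ruby sketch of a Text record that carries the markdown and points at a repurposed Manifestation for provenance. The struct shapes, the `source_manifestation` field, and the `provenance` helper are assumptions for illustration; only 'publisher' and 'publishing_year' are taken from the discussion above, and the sample values reuse its "Publisher 1, 2010" example.

```ruby
# Hypothetical shapes for the repurposed entities; field names other than
# publisher and publishing_year are assumptions.
Manifestation = Struct.new(:publisher, :publishing_year, keyword_init: true)

Text = Struct.new(:title, :markdown, :source_manifestation, keyword_init: true) do
  # Human-readable provenance of this electronic edition.
  def provenance
    src = source_manifestation
    src ? "#{src.publisher}, #{src.publishing_year}" : "unknown printed source"
  end
end

printed = Manifestation.new(publisher: "Publisher 1", publishing_year: 2010)
text = Text.new(title: "Some text", markdown: "# Some text", source_manifestation: printed)
orphan = Text.new(title: "Other text", markdown: "# Other text")

puts text.provenance
puts orphan.provenance
```

In the real schema the link would presumably be a nullable foreign key, so texts whose printed source is not yet recorded remain valid.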

Remove genre from Expression

This is actually something that has puzzled me from the start. Why do we have genre defined on both the Work and Expression entities?

I believe we can keep it on Work and drop the column from the Expressions table.

@abartov says: agreed. There is no real reason to have it at the Expression, as what defines a genre is inherent in the Work, regardless of any Expressions it may have. (And, say, a prose adaptation of Evgeny Onegin would be a different Work, not just an Expression of the original Work.) This was some early laziness to save some JOINs when all that was needed from the Work was the genre, but it violates DRY, and should be refactored.
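Before dropping the column, it may be worth checking that no Expression currently disagrees with its Work, since such rows would silently change genre. A small sketch in plain Ruby, assuming the one-to-many relation proposed earlier (an `expressions.work_id` column); the structs and the `genre_mismatches` helper are hypothetical.

```ruby
# Hypothetical in-memory rows standing in for the works and expressions tables.
Work = Struct.new(:id, :genre)
Expression = Struct.new(:id, :work_id, :genre)

# Expressions whose own genre differs from the genre of their work.
def genre_mismatches(works, expressions)
  genre_by_work = works.to_h { |w| [w.id, w.genre] }
  expressions.reject { |e| e.genre == genre_by_work[e.work_id] }
end

works = [Work.new(1, "poetry"), Work.new(2, "prose")]
expressions = [
  Expression.new(10, 1, "poetry"),
  Expression.new(11, 2, "poetry") # disagrees with its work; review before dropping the column
]

mismatched = genre_mismatches(works, expressions)
puts "expressions to review: #{mismatched.map(&:id).inspect}"
```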

Adding CompositeExpression entity

First of all, I want to note that I'm not 100% sure whether this should be CompositeWork or CompositeExpression. But I assume that CompositeExpression will allow us to implement the minimal required functionality. And if we go the CompositeWork way, we will need CompositeExpression anyway.

Initially I wanted to create a separate CompositeExpression table for this:

CompositeExpression should represent complex works consisting of several independent works. This could be:

multivolume books, e.g. The Lord of the Rings (consisting of three volumes)

almanacs, anthologies, collections, etc. consisting of separate works by different authors

periodical issues (but it will require additional work later).

At a minimum, every CompositeExpression must have the following attributes:

CompositeExpression

string Name

enum Type (we need to come up with possible values, from our discussion I can propose: multivolume, poetry collection, anthology)

has_many :composite_expression_items

CompositeExpressionItem will be a separate table implementing the many-to-many relation between regular Expressions and CompositeExpressions. It should also have additional columns:

CompositeExpressionItem

composite_expression_id

expression_id

order - position of the individual expression within this CompositeExpression

type - enum (primary, secondary) used to distinguish primary texts from secondary ones

But later I realized there could be nested CompositeExpressions, e.g. a 'Collected works of Author X' book containing a collection of short stories, etc.

So maybe a better approach would be to add a 'composite' boolean flag to the Expressions table instead. If an expression is composite, it will not have a 'Text', but may have a 'Manifestation'. Instead of a Text, it will have a collection of 'CompositeExpressionItems'. This way we can support nested composite expressions.
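The 'composite' flag approach can be sketched in plain Ruby as follows. This is an in-memory illustration, not the proposed schema: the class, the `add` and `leaves` methods, and the sample titles are all hypothetical. It shows a composite expression holding ordered items, one of which is itself composite, and a depth-first walk that yields the leaf texts in reading order.

```ruby
# Hypothetical item row: links a composite expression to a member expression.
Item = Struct.new(:expression, :order, :type) # type: :primary or :secondary

class Expression
  attr_reader :title, :items

  def initialize(title, composite: false)
    @title = title
    @composite = composite
    @items = []
  end

  def composite?
    @composite
  end

  # Attach a member expression; only composite expressions may have items.
  def add(expression, order:, type: :primary)
    raise ArgumentError, "only composite expressions have items" unless composite?
    @items << Item.new(expression, order, type)
    self
  end

  # Depth-first list of leaf (non-composite) expressions in reading order.
  def leaves
    return [self] unless composite?
    items.sort_by(&:order).flat_map { |item| item.expression.leaves }
  end
end

stories = Expression.new("Short stories", composite: true)
stories.add(Expression.new("Story B"), order: 2)
stories.add(Expression.new("Story A"), order: 1)

collected = Expression.new("Collected works of Author X", composite: true)
collected.add(Expression.new("Editor's foreword"), order: 1, type: :secondary)
collected.add(stories, order: 2)
collected.add(Expression.new("A novella"), order: 3)

puts collected.leaves.map(&:title).inspect
```

Because an item can point at either a plain or a composite expression, nesting needs no extra table; the cost is that traversal is recursive, so cycles would have to be prevented when items are created.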

@abartov says: yes, I think we want to support nested composition, not just for the scenario you described, but eventually also for modeling single texts that are composite by nature -- for example, in the literary journals content we are aiming to begin to process, there are quite a few cases where a translator contributes an article plus translations as an item in the journal.

For example, a few introductory paragraphs (authored by the translator, not translated), then a translated poem, where the author is X and the translator is this article/item's author, then more explanatory text, then another poem, etc. Ultimately, we would want to be able to model this, so that:

the full item is cataloged and discoverable by its title, genre (article/essay), author (the translator), period, etc.
each included poem is cataloged and discoverable by its title, genre (poetry), author (the original author), source language, etc.

And since this composite item would itself be part of an issue (of a journal), nesting seems necessary.

The one thing still bothering me is sequencing: theoretically, most sequencing is relevant at the Work level. A four-part sonnet sequence, for example, or The Lord of the Rings, whose order is determined by the author. However, in derivative publications -- especially translations -- there may well be a different order. For instance, a translator may offer, in a single issue of a literary journal, four selected sonnets by Shakespeare, and they are Sonnet 130, Sonnet 64, Sonnet 2, and Sonnet 17, in that order. This is obviously different from the sequence that may be defined at the Work level.

So I am wondering: Should we implement sequencing at the Work level at all?

Pragmatically, we may adopt the position that we don't care about modeling the ideal sequencing at the Work level, only the actual sequencing in a given text (Expression). If it's a sonnet cycle originally in Hebrew, presumably the Expression-level sequence would equal the Work-level sequence; and if it's some other selection, like the example above, then that would be the sequence. In other words, it may not be a realistic use case to let users follow the Work-level sequence per se (what if we don't have any complete translation of Shakespeare's sonnets, only various selections?).

I'm leaning toward this pragmatic approach, but welcome your thoughts.
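A small sketch of what Expression-level sequencing means in practice, using the sonnet example above. The `Item` struct and `sequence` helper are hypothetical, and the "Complete sonnets" collection is an invented contrast case, reduced to the same four sonnets for brevity. The position lives on the item that places a work in a collection, so the same works can appear in a different order elsewhere without any Work-level sequence.

```ruby
# Hypothetical item: places a work in a collection at a given position.
Item = Struct.new(:collection, :work, :position)

items = [
  Item.new("Journal selection", "Sonnet 130", 1),
  Item.new("Journal selection", "Sonnet 64", 2),
  Item.new("Journal selection", "Sonnet 2", 3),
  Item.new("Journal selection", "Sonnet 17", 4),
  # Invented contrast case: the same works in another collection, in numerical order.
  Item.new("Complete sonnets", "Sonnet 2", 1),
  Item.new("Complete sonnets", "Sonnet 17", 2),
  Item.new("Complete sonnets", "Sonnet 64", 3),
  Item.new("Complete sonnets", "Sonnet 130", 4)
]

# Works of one collection in that collection's own order.
def sequence(items, collection)
  items.select { |i| i.collection == collection }.sort_by(&:position).map(&:work)
end

puts sequence(items, "Journal selection").inspect
puts sequence(items, "Complete sonnets").inspect
```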