Lemma Lattices - Hedera-Lang-Learn/hedera GitHub Wiki

The lemma lattices approach, outlined by James at the meeting on 2018-11-09, will allow referring to lexical items at whatever level of granularity is known or appropriate.

Benefits of this approach include:

  • handling of homographs (both before and after disambiguation)
  • optionally treating lexicalised inflected forms as first-class vocabulary items
  • optionally handling distinct word senses

Previous Descriptions

App Implementation

I'll build an app initially within Hedera, but as this is relevant to many other things, I'll eventually break it out. The model for the lattices themselves could be as simple as:

  • pk
  • childset: set of foreign keys back to self

If the underlying database supports querying based on the child set, we avoid the need for a denormalised parent set, although pre-calculating ancestors and descendants may still be necessary for performance.
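To make the trade-off concrete, here is a minimal sketch in plain Python rather than an actual model definition. It assumes the lattice is available as a mapping from pk to childset; the function names (`descendants`, `parents`, `ancestors`) and the pks 1 to 4 are hypothetical and only for illustration.

```python
# Hypothetical in-memory stand-in for the lattice table: pk -> childset.
# The pks 1-4 are made up; they form a small diamond (1 above 2 and 3, both above 4).
lattice = {1: {2, 3}, 2: {4}, 3: {4}, 4: set()}


def descendants(childsets, pk):
    """Every node reachable from pk by following childset links."""
    seen = set()
    stack = list(childsets[pk])
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(childsets[node])
    return seen


def parents(childsets, pk):
    """Derive the parent set by querying on the child set (nothing denormalised)."""
    return {p for p, children in childsets.items() if pk in children}


def ancestors(childsets, pk):
    """Every node from which pk is reachable."""
    seen = set()
    stack = list(parents(childsets, pk))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents(childsets, node))
    return seen


# Optional pre-calculation, if computing these on every lookup proves too slow.
closure = {
    pk: {"ancestors": ancestors(lattice, pk), "descendants": descendants(lattice, pk)}
    for pk in lattice
}
```

Here `parents` scans every row, which is the kind of query the database would need to support efficiently; `closure` is the pre-calculated alternative.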

Any node-level information besides the lattice structure itself should preferably live elsewhere. This includes the lemma itself. In other words, if 100 is "ambiguous est" and 101 is "est when sum" and 102 is "est when edo" then those facts will be stored elsewhere but with reference to PKs 100, 101, and 102.

All that would be in the lemma lattices model would be:

pk   childset
100  {101, 102}
101  {}
102  {}

An ambiguous "est" in a text would be tagged as 100 and if, say, disambiguated as a form of "sum", would then be retagged as 101.
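As a rough sketch of that retagging step, assume tokens are stored as (form, pk) pairs and that disambiguation may only move a tag to a direct child of its current node. Both the token representation and the `retag` helper are assumptions for illustration, not part of the app.

```python
# The three nodes from the example above.
CHILDSETS = {100: {101, 102}, 101: set(), 102: set()}


def retag(tokens, index, new_pk, childsets=CHILDSETS):
    """Return a copy of tokens with one token moved to a child of its current node."""
    form, old_pk = tokens[index]
    if new_pk not in childsets[old_pk]:
        raise ValueError(f"{new_pk} is not a child of {old_pk}")
    updated = list(tokens)
    updated[index] = (form, new_pk)
    return updated


tokens = [("est", 100)]                 # ambiguous "est"
disambiguated = retag(tokens, 0, 101)   # now "est when sum"
```

Restricting retagging to children is a design choice of this sketch; a real implementation might allow any descendant, or allow moving back up when a disambiguation is withdrawn.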

"sum" as a lemma (as opposed to form) would also have a node. Let's say 50. Of course, "sum" the form could also have a node. Let's say 83.

Then we'd also have:

pk   childset
50   {83, 101, ...other forms of "sum"... }

Note that we don't have to have a node for every form, only those we want to unambiguously refer to.
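One thing this structure makes possible is asking whether a tagged token falls under a given node. The sketch below uses only the nodes named so far (the other forms of "sum" are left out, as in the text); the helper names `definitely_under` and `possibly_under` are hypothetical.

```python
CHILDSETS = {
    50: {83, 101},    # "sum" the lemma (other forms omitted here)
    83: set(),        # "sum" the form
    100: {101, 102},  # ambiguous "est"
    101: set(),       # "est when sum"
    102: set(),       # "est when edo"
}


def self_and_descendants(pk, childsets=CHILDSETS):
    """The node itself plus everything below it in the lattice."""
    result = {pk}
    for child in childsets[pk]:
        result |= self_and_descendants(child, childsets)
    return result


def definitely_under(tag, query):
    """True if a token tagged `tag` is known to fall under `query`."""
    return tag in self_and_descendants(query)


def possibly_under(tag, query):
    """True if `tag` and `query` share at least one node, so a match is not ruled out."""
    return bool(self_and_descendants(tag) & self_and_descendants(query))
```

With these definitions, a token tagged 101 is definitely under 50, while a token still tagged 100 is only possibly under 50 (they share 101) and cannot be the form 83.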

I've been talking as if the nodes are lemmas or forms (or potentially other things), but it occurs to me that it's not quite correct to type them this way. 50 could be taken to mean "the lemma sum" OR it could be taken to mean "one of the forms of the lemma sum, we just don't know/care which". It's not clear to me that this is a distinction we need to make. I guess it's possible that we might want to attach some property to a lemma which it doesn't make sense to apply to the forms of that lemma. We'll have to flesh out some of the nuances here.

We may also have notions of properties on nodes that are just defaults and which are overridden by children. Glosses are a great example of this, where we may have a default for an entire lemma but override it for specific senses or even specific forms.

Note that inheritance of properties (including how to resolve diamond conflicts) need not be baked into this app.
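For illustration only, here is one way code outside the lattice app could treat a property as a default that children override. It assumes properties live in a separate mapping keyed by pk, and it surfaces diamond conflicts as a set of candidate values rather than resolving them. The `resolve` function and the gloss strings are hypothetical.

```python
CHILDSETS = {50: {83, 101}, 83: set(), 100: {101, 102}, 101: set(), 102: set()}


def parents(pk, childsets=CHILDSETS):
    """Nodes whose childset contains pk."""
    return {p for p, children in childsets.items() if pk in children}


def resolve(pk, props, childsets=CHILDSETS):
    """Own value if set, otherwise the values of the nearest ancestors that set one.

    More than one value in the result means a conflict the caller must resolve.
    """
    if pk in props:
        return {props[pk]}
    found = set()
    for parent in parents(pk, childsets):
        found |= resolve(parent, props, childsets)
    return found


# Illustrative glosses only, stored outside the lattice and keyed by pk.
default_only = {50: "be"}
conflicting = {50: "be", 100: "is or eats"}
overridden = {50: "be", 100: "is or eats", 101: "is"}
```

Here `resolve(101, default_only)` yields just the lemma-level default, `resolve(101, conflicting)` yields two candidates because 101 sits under both 50 and 100, and `resolve(101, overridden)` yields only the value set on 101 itself.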