Hyphenation - sc-voice/ms-dpd GitHub Wiki

Hyphenation

Lamentably, although our modern view now encompasses the world and beyond, our actual viewpoint has shrunk to fit rather claustrophobically in the palm of our hand. And if we wish to read while traveling, we often must rely on that tiny thing in our hand, our mobile phone. For English, with its small words, a phone works fine. However, for languages such as Pali that rely on compound words, the readability of text suffers on a phone.

For readability, we need to hyphenate.

attahita-parahita-ubhayahita-sabbalokahitameva

Hyphenation and Meaning

Sometimes a word is just the sum of its parts. "Birthday" is literally "birth+day".

However, in other cases meanings can shift subtly as we break down words. And this meaning shift can also be compounded by ambiguous composition. For example, the word saññānirodha could naively be split in several ways:

| word | meaning |
|---|---|
| saññānirodha | ending of perception; cessation of recognition |
| saññāni-rodha | perceiving; regarding (as); with perception |
| | obstruction; obstacle; hazard |
| saññā-nirodha | perceiving; regarding (as); with perception ... |
| | ending (of); cessation (of); termination (of); finishing (of)... |

Splitting words too much leads to a dilution of meaning (is it an "obstacle to perception" or "cessation of perception"?). And since meaning is so important in a dictionary, we should preserve meaning by being very conservative about hyphenation. For this reason, MS-DPD hyphenation is not exhaustive. MS-DPD hyphenation stops when the hyphenated parts are each:

  • in the MS-DPD Early Buddhist Text (EBT) dictionary
  • at or below maxLength
  • at or above minLength
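
The stopping rule above can be sketched as a simple predicate. This is only an illustration, not the MS-DPD implementation: `isAcceptablePart`, `canStopHyphenating` and the `lexicon` Set (a stand-in for the EBT dictionary) are hypothetical names, and the defaults are taken from the `minLength` and `maxLength` options described in the API section.

```javascript
// Hypothetical helpers, not part of the MS-DPD API.
// `lexicon` is a Set of words standing in for the MS-DPD EBT dictionary.
function isAcceptablePart(part, lexicon, opts = {}) {
  const { minLength = 5, maxLength = 17 } = opts;
  // String length counts UTF-16 code units, which equals the letter
  // count for precomposed Pali letters such as ā or ñ.
  return (
    lexicon.has(part) &&
    part.length >= minLength &&
    part.length <= maxLength
  );
}

// Hyphenation may stop once every part satisfies all three conditions.
function canStopHyphenating(parts, lexicon, opts) {
  return parts.every((part) => isAcceptablePart(part, lexicon, opts));
}

// Toy dictionary for illustration only.
const toyLexicon = new Set(['saññānirodha', 'saññā', 'nirodha']);
const stopWhole = canStopHyphenating(['saññānirodha'], toyLexicon);
const stopSplit = canStopHyphenating(['saññā', 'nirodha'], toyLexicon);
const stopNaive = canStopHyphenating(['saññāni', 'rodha'], toyLexicon);
```

With this toy dictionary, the whole word and the saññā-nirodha split both pass, while the naive saññāni-rodha split is rejected because its parts are not dictionary entries.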

Hyphenation with Dictionary Rigor

MS-DPD hyphenation requires that all hyphenated components be found in the dictionary. Hyphenation with dictionary rigor is simple, data-driven and efficient. It does lead to larger hyphenation components, but those larger components will all be in the dictionary with meanings more exact than hyphenations with smaller components.

Traditionally, hyphenation has been used by typesetters to maximize page usage by eliminating large gaps of white space on consecutive lines of a paragraph. Such gaps stand out as ugly and unprofessional. Excess white space also wastes paper. The solution is to hyphenate words to fill up the page. But is that the best thing to do for dictionaries?

MS-DPD is intended for use on mobile phones, which do not have any paper that needs to be saved. Because we do not need to save paper, we can afford to hyphenate with dictionary rigor. The user may need to scroll more, but what they get in return are hyphenation components that also have dictionary entries.

To see how hyphenating with dictionary rigor can yield coarser hyphenation than DPD, let's look at an example. The word sacchikiriyāya has one headword. However, sacchi and kiriyāya have six headwords altogether. MS-DPD stops hyphenating as soon as a dictionary entry is found, unless the hyphenation component is too large (see maxLength).

| app | hyphenation |
|---|---|
| DPD | vijjā-vimutti-phala-sacchi-kiriyāya (finer) |
| MS-DPD | vijjāvimutti-phala-sacchikiriyāya (coarser) |
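
To make the "stop as soon as a dictionary entry is found" idea concrete, here is a minimal recursive sketch. It is not the MS-DPD algorithm: `sketchHyphenate`, the `lexicon` Set and the outward search order around the `splitFactor` position are all assumptions made for illustration, and the tiny dictionary is invented for the example.

```javascript
// Hypothetical sketch, not the MS-DPD implementation.
// Returns an array of parts, or null if no dictionary-backed split exists.
function sketchHyphenate(word, lexicon, opts = {}) {
  const { minLength = 5, maxLength = 17, splitFactor = 0.5 } = opts;

  // Stop immediately when the word is a dictionary entry of acceptable size.
  if (lexicon.has(word) && word.length <= maxLength) return [word];

  // Try split positions starting at splitFactor and moving outward.
  const start = Math.round(word.length * splitFactor);
  for (let d = 0; d <= word.length; d++) {
    const candidates = d === 0 ? [start] : [start - d, start + d];
    for (const i of candidates) {
      // Both sides of the split must be at least minLength long.
      if (i < minLength || word.length - i < minLength) continue;
      const left = sketchHyphenate(word.slice(0, i), lexicon, opts);
      if (!left) continue;
      const right = sketchHyphenate(word.slice(i), lexicon, opts);
      if (right) return [...left, ...right];
    }
  }

  // maxLength is only a preference: keep an oversized entry whole,
  // otherwise give up.
  return lexicon.has(word) ? [word] : null;
}

// Toy dictionary containing both coarse and fine entries.
const sketchLexicon = new Set([
  'vijjāvimutti', 'vijjā', 'vimutti', 'phala',
  'sacchikiriyāya', 'sacchi', 'kiriyāya',
]);
const sketchParts = sketchHyphenate(
  'vijjāvimuttiphalasacchikiriyāya',
  sketchLexicon
);
const sketchJoined = sketchParts.join('-');
```

Because each component is accepted as soon as it is found in the dictionary, the sketch never descends into sacchi or kiriyāya once sacchikiriyāya matches, which is the coarser behavior shown in the table above. A word with no dictionary-backed split yields null in this sketch.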

DPD hyphenation also uses sandhi rules to find hyphenation candidates. However, MS-DPD hyphenation currently does not use sandhi rules. Indeed, sometimes MS-DPD gives up entirely. There is clearly room for improvement:

| app | hyphenation |
|---|---|
| DPD | pariyāya-bhatta-bhojanānuyogamanuyuttā (āy => a-ay) |
| MS-DPD | pariyāyabhattabhojanānuyogamanuyuttā (āy => ?) |

Hyphenating with dictionary rigor can introduce other differences:

| app | hyphenation |
|---|---|
| DPD | vivekaja-pītisukha-sukhumasacca-saññīyeva (saññīyeva not in dictionary?) |
| MS-DPD | vivekaja-pītisukha-sukhuma-saccasaññīyeva (saccasaññīyeva in dictionary) |

Hyphenation will also fail if there are no dictionary entries for the components (e.g., bhesajja-parikkhārahetu?):

| app | hyphenation |
|---|---|
| DPD | cīvara-piṇḍapāta-senāsana-gilānappaccaya-bhesajja-parikkhāra-hetu |
| MS-DPD | cīvara-piṇḍapāta-senāsana-gilānappaccayabhesajjaparikkhārahetu |

Fortunately, as dictionary support increases, hyphenation with dictionary rigor will benefit as well.

API: hyphenate(word, opts)

MS-DPD hyphenation is provided by the Dictionary class. It has several options.

let dict = await Dictionary.create();
let parts = dict.hyphenate(word, opts);

| parameter | default | description |
|---|---|---|
| maxLength | 17 | the preferred maximum length of a hyphenated component |
| minLength | 5 | the minimum word length of a hyphenated component |
| splitFactor | 0.5 | the position at which the word to be hyphenated is first split |
| strategy | plain | hyphenation strategy (plain, sandhi) |

Example:

let opts = { maxLength:17, minLength:5, splitFactor:0.5 };
let word = "attahitaparahitaubhayahitasabbalokahitameva";
dict.hyphenate(word, opts);
// [ attahita, parahita, ubhayahita, sabbalokahitameva ]

This API is extensible and future versions of MS-DPD may include a sandhi hyphenation strategy.
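
For display, the array returned in the example above can be joined back into a single string. The sketch below assumes the result is an array of strings as shown in the example; `formatHyphenated` is a hypothetical helper, not part of MS-DPD, and it falls back to the original word when no parts are available, since this page does not say what `hyphenate` returns when it gives up.

```javascript
// Hypothetical display helper, not part of the MS-DPD API.
// Joins hyphenated parts, or returns the word unchanged if there are none.
function formatHyphenated(word, parts, separator = '-') {
  if (!Array.isArray(parts) || parts.length === 0) return word;
  return parts.join(separator);
}

const exampleWord = 'attahitaparahitaubhayahitasabbalokahitameva';
const exampleParts = ['attahita', 'parahita', 'ubhayahita', 'sabbalokahitameva'];

// Visible hyphens:
const visible = formatHyphenated(exampleWord, exampleParts);
// attahita-parahita-ubhayahita-sabbalokahitameva

// Soft hyphens (U+00AD) are shown only where a line actually breaks,
// which suits narrow phone screens.
const soft = formatHyphenated(exampleWord, exampleParts, '\u00AD');

// Fallback when no hyphenation is available:
const unchanged = formatHyphenated(exampleWord, null);
```

Using a soft hyphen leaves the word looking unbroken wherever it fits on one line, while still giving the renderer dictionary-backed break points.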