Case Study: RU - sc-voice/ms-dpd GitHub Wiki

Case Study: Russian Language

Overview

The Digital Pali Dictionary (DPD) is the sole source for MS-DPD EN definitions. Indeed, the EN DPD definitions are the basis for other MS-DPD languages whose translations are in progress (DE, PT, ES, FR).

However, the DPD is itself also multilingual and provides data tables for other languages (Russian, Sinhala).

This document addresses change management and translation issues related to one of those languages (i.e., Russian) that may affect DPD or MS-DPD:

  • What background knowledge do we need for this document?
  • How do we get RU DPD definitions into MS-DPD?
  • How do we update RU DPD when EN DPD definitions are updated or added?
  • How do we update RU MS-DPD when EN DPD definitions are updated or added?

Background

The following is a brief summary of background knowledge that may help the reader.

Early Buddhist Texts (EBT)

SuttaCentral.net bases its aligned translations on the Early Buddhist Texts (EBTs) found in the Mahāsaṅgīti. The term "aligned" refers to the process of aligning Pali documents using "segment identifiers" to distinguish translatable semantic units of text. For example, the segment identifier "mn8:1.1" aligns all translations of that segment across multiple languages:

scid: mn8:1.1
 pli: Evaṁ me sutaṁ—
 ref: So habe ich es gehört: 
 en: So I have heard. 
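As an illustrative sketch only, aligned segments can be modeled as one mapping per language, keyed by segment identifier. The `SEGMENTS` layout and the `aligned` helper are assumptions made for this page; they are not the actual SuttaCentral or SC-Voice data format.

```python
# Hypothetical in-memory layout: one mapping per language, keyed by segment id.
# "ref" is the reference translation (German in the example above).
SEGMENTS = {
    "pli": {"mn8:1.1": "Evaṁ me sutaṁ—"},
    "ref": {"mn8:1.1": "So habe ich es gehört:"},
    "en": {"mn8:1.1": "So I have heard."},
}


def aligned(scid, segments=SEGMENTS):
    """Return every available rendering of one segment, keyed by language."""
    return {lang: texts[scid] for lang, texts in segments.items() if scid in texts}


print(aligned("mn8:1.1"))
```

Because every language uses the same key, adding a language means adding one more mapping without touching the others.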

SC-Voice.net

SC-Voice.net is a website that lets users read and hear suttas such as

mn8:1.1 Evaṁ me sutaṁ—
mn8:1.1 So I have heard.

Although originally developed for the sight-impaired, SC-Voice.net has also become useful to sighted users, who benefit from listening to the suttas in the manner in which they were originally spoken.

SC-Voice.net is multilingual and aligned, but others have benefited from using SC-Voice.net technology dedicated to particular languages.

MS-DPD

MS-DPD is a JavaScript library developed for hosting DPD and its translations within SC-Voice.net. MS-DPD does not include the full DPD and is designed for web use as a library:

Application   Lookup words   Definition headwords   Data (MB)
DPD           1264999        80303                  2027.88
MS-DPD        131735         61394                  22.99

Language-specific MS-DPD data is dynamically imported on demand and is about 3-4 MB per language. MS-DPD is available in npm as multiple packages.

DPD Database

MS-DPD uses the DPD SQLite database as the exclusive source for DPD information. The tables currently used by MS-DPD comprise:

  • lookup
  • dpd_headwords
  • inflection_templates
  • (other tables may be included as needed)

DPD stores definition translations in other tables with similar but not identical schemas. For this document, we will focus on the RU translations. DPD stores Russian translations in a separate table tied to dpd_headwords:

CREATE TABLE russian (
        id INTEGER NOT NULL, 
        ru_meaning VARCHAR NOT NULL, 
        ru_meaning_raw VARCHAR NOT NULL, 
        ru_meaning_lit VARCHAR NOT NULL, 
        ru_notes VARCHAR NOT NULL, 
        PRIMARY KEY (id), 
        FOREIGN KEY(id) REFERENCES dpd_headwords (id)
);

To illustrate, let us examine the first definition for "dhamma". For EN, we have:

> select id, meaning_1, meaning_2, meaning_lit from dpd_headwords where id = 34626
id|meaning_1|meaning_2|meaning_lit
34626|nature; character|nature|

And for RU, we have:

> select id, ru_meaning, ru_meaning_raw, ru_meaning_lit, ru_notes from russian  where id = 34626
id|ru_meaning|ru_meaning_raw|ru_meaning_lit|ru_notes
34626|характер; качество; природа|||

Translating the RU definition back to EN with DeepL gives the following (note the subtle differences from the EN original, such as the added "quality"):

character; quality; nature|||
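To view EN and RU definitions side by side, the two tables can be joined on id. The sketch below uses reduced stand-in tables in an in-memory SQLite database; the headword with id 999999 is made up to show what an EN entry without an RU row looks like.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Reduced stand-ins for the two DPD tables; only the columns needed here.
    CREATE TABLE dpd_headwords (id INTEGER PRIMARY KEY, meaning_1 VARCHAR NOT NULL);
    CREATE TABLE russian (
        id INTEGER PRIMARY KEY REFERENCES dpd_headwords (id),
        ru_meaning VARCHAR NOT NULL
    );
    INSERT INTO dpd_headwords VALUES (34626, 'nature; character');
    INSERT INTO russian VALUES (34626, 'характер; качество; природа');
    -- Made-up headword with no RU row, to show the unmatched case.
    INSERT INTO dpd_headwords VALUES (999999, 'example without RU');
    """
)

# LEFT JOIN keeps EN headwords that have no RU translation yet (ru_meaning is NULL).
side_by_side = conn.execute(
    """
    SELECT h.id, h.meaning_1, r.ru_meaning
    FROM dpd_headwords h
    LEFT JOIN russian r ON r.id = h.id
    ORDER BY h.id
    """
).fetchall()

for row in side_by_side:
    print(row)
```

Rows whose last column is None are EN headwords that still lack an RU definition.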

Importing RU for MS-DPD

Importing RU into MS-DPD is straightforward and is handled by the build-dpd script, which is executed with each DPD update. Since MS-DPD relies on the SQLite database published with each update, the script uses the following SQL to get all the RU definitions:

 select id, ru_meaning meaning_1, ru_meaning_raw meaning_2, ru_meaning_lit meaning_lit from russian

The output of this SQL command is processed by sql-dpd.mjs to generate the RU MS-DPD definitions.
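The effect of the column aliases can be checked against an in-memory table. This is a minimal sketch: it reuses the russian schema shown earlier (without the foreign key, so that it runs standalone) and says nothing about how sql-dpd.mjs actually processes the rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # lets us read columns by (aliased) name

# Schema from this page; the foreign key to dpd_headwords is omitted here.
conn.execute(
    """
    CREATE TABLE russian (
        id INTEGER NOT NULL,
        ru_meaning VARCHAR NOT NULL,
        ru_meaning_raw VARCHAR NOT NULL,
        ru_meaning_lit VARCHAR NOT NULL,
        ru_notes VARCHAR NOT NULL,
        PRIMARY KEY (id)
    )
    """
)
conn.execute(
    "INSERT INTO russian VALUES (?, ?, ?, ?, ?)",
    (34626, "характер; качество; природа", "", "", ""),
)

RU_QUERY = """
    select id, ru_meaning meaning_1, ru_meaning_raw meaning_2,
           ru_meaning_lit meaning_lit
    from russian
"""
rows = [dict(row) for row in conn.execute(RU_QUERY)]
print(rows)
```

The aliases give RU rows the same column names as the EN query on dpd_headwords, so downstream code can presumably treat both languages alike.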

The latest version of MS-DPD CLI now understands RU:

[Image: RU devi]

Synchronizing DPD EN/RU definitions

SQL databases generally do not track changes to individual rows, although SQLite does support triggers, which can record changes in audit tables. The triggers would be on dpd_headwords, and the trigger handlers could monitor changes to the meaning_1, meaning_2 and meaning_lit fields. This simple design could support the translations included within DPD itself by maintaining a log of EN deletions, updates and additions. Individual translators could then consult the audit logs as a guide to keeping their own DPD language tables synchronized with the main dpd_headwords table.
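A minimal sketch of such an audit design follows. The headword_audit table and the three trigger names are hypothetical and not part of the DPD schema, dpd_headwords is reduced to the monitored columns, and the edited meaning is invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Reduced stand-in for dpd_headwords: only the monitored columns.
    CREATE TABLE dpd_headwords (
        id INTEGER PRIMARY KEY,
        meaning_1 VARCHAR NOT NULL DEFAULT '',
        meaning_2 VARCHAR NOT NULL DEFAULT '',
        meaning_lit VARCHAR NOT NULL DEFAULT ''
    );

    -- Hypothetical audit table; not part of the DPD schema.
    CREATE TABLE headword_audit (
        audit_id INTEGER PRIMARY KEY AUTOINCREMENT,
        headword_id INTEGER NOT NULL,
        action TEXT NOT NULL,
        changed_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
    );

    CREATE TRIGGER headword_added AFTER INSERT ON dpd_headwords
    BEGIN
        INSERT INTO headword_audit (headword_id, action) VALUES (NEW.id, 'insert');
    END;

    -- Log an update only when one of the EN meaning fields really changed.
    CREATE TRIGGER headword_updated
    AFTER UPDATE OF meaning_1, meaning_2, meaning_lit ON dpd_headwords
    WHEN OLD.meaning_1 IS NOT NEW.meaning_1
      OR OLD.meaning_2 IS NOT NEW.meaning_2
      OR OLD.meaning_lit IS NOT NEW.meaning_lit
    BEGIN
        INSERT INTO headword_audit (headword_id, action) VALUES (NEW.id, 'update');
    END;

    CREATE TRIGGER headword_deleted AFTER DELETE ON dpd_headwords
    BEGIN
        INSERT INTO headword_audit (headword_id, action) VALUES (OLD.id, 'delete');
    END;
    """
)

conn.execute(
    "INSERT INTO dpd_headwords (id, meaning_1, meaning_2) "
    "VALUES (34626, 'nature; character', 'nature')"
)
# Illustrative edit only, not a real DPD change.
conn.execute(
    "UPDATE dpd_headwords SET meaning_1 = 'nature; character; quality' WHERE id = 34626"
)
# Rewriting an unchanged value is filtered out by the WHEN clause.
conn.execute("UPDATE dpd_headwords SET meaning_2 = 'nature' WHERE id = 34626")
conn.execute("DELETE FROM dpd_headwords WHERE id = 34626")

audit_log = conn.execute(
    "SELECT headword_id, action FROM headword_audit ORDER BY audit_id"
).fetchall()
print(audit_log)
```

Here audit_log holds one insert, one update and one delete entry for headword 34626. A translator could query such a log by changed_at to find the rows that need attention since the last review.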

Synchronizing MS-DPD EN/RU definitions

The user experience for MS-DPD RU content is governed by several files maintained by different teams:

  • @sc-voice.net/ms-dpd/dpd/ru/definition-ru.mjs: DPD/EBT headword content from the russian table (updated entirely by build-dpd)
  • @sc-voice.net/ms-dpd/dpd/ru/abbreviation-ru.mjs: MS-DPD RU grammatical abbreviations (updated entirely by build-dpd)
  • @sc-voice.net/ebt-vue3/src/i18n/ru.ts: web page user interface for RU (maintained by RU Tipitaka translators)

These files are maintained in different ways, as described below.

definition-ru.mjs EN/RU Synchronized in DPD

If DPD RU is always kept synchronized with DPD EN, then the process for MS-DPD RU updates is quite simple: overwrite the MS-DPD RU definitions with their updated versions from the russian table. Since the merges would happen in DPD, integrating them would be a simple automatic replacement of the content of definition-ru.mjs with the latest content from russian.

abbreviation-ru.mjs EN/RU Synchronized in DPD

Given that RU abbreviations are already in DPD, we assume they are updated along with RU definitions.

MS-DPD abbreviations for RU are drawn from the lookup table fields ru_abbrev and ru_meaning. Untranslated definitions are copied over from their EN counterparts. All RU abbreviations in MS-DPD are automatically generated from lookup.
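The fallback rule can be sketched as follows. Only ru_abbrev and ru_meaning are named on this page; the EN field names abbrev and meaning, the helper name, the assumption that the abbreviation itself also falls back to EN, and the sample values are all invented for illustration.

```python
def ru_abbreviation(entry):
    """Prefer RU abbreviation fields; fall back to EN where RU is empty.

    `entry` is a hypothetical dict for one abbreviation. The keys `abbrev`
    and `meaning` (EN) are assumed names, not confirmed DPD fields.
    """
    return {
        "abbrev": entry.get("ru_abbrev") or entry["abbrev"],
        "meaning": entry.get("ru_meaning") or entry["meaning"],
    }


# Sample values invented for illustration.
translated = {
    "abbrev": "adj",
    "meaning": "adjective",
    "ru_abbrev": "прил",
    "ru_meaning": "прилагательное",
}
untranslated = {"abbrev": "adv", "meaning": "adverb", "ru_abbrev": "", "ru_meaning": ""}

print(ru_abbreviation(translated))
print(ru_abbreviation(untranslated))
```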

i18n/ru.ts RU Tipitaka translators

Different translation teams are working with SuttaCentral Bilara. I18n considerations for RU web content will be handled directly by those RU teams and will not impact the MS-DPD RU translation team.

Other

meaning_raw

In the main DPD, the meaning_2 field holds definitions per Buddhadatta. In the Russian schema, the ru_meaning_raw field holds unreviewed definitions generated via AI. Both of these fields have the semantics of "unreviewed text", which can hold content from either source.

In MS-DPD definition files, unreviewed content is stored in the meaning_raw column, which is only populated in the absence of content in the meaning_1 field.
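Read literally, this rule amounts to the small sketch below. The function name and signature are hypothetical, and the actual MS-DPD build code may differ.

```python
def to_ms_dpd_meanings(meaning_1, unreviewed):
    """Return (meaning_1, meaning_raw) following the rule described above.

    `unreviewed` stands for meaning_2 (EN) or ru_meaning_raw (RU).
    meaning_raw is only populated when meaning_1 has no content.
    """
    meaning_raw = "" if meaning_1 else unreviewed
    return meaning_1, meaning_raw


print(to_ms_dpd_meanings("nature; character", "nature"))
print(to_ms_dpd_meanings("", "nature"))
```

With the "dhamma" row shown earlier, meaning_1 is present, so meaning_raw stays empty; the second call shows the case where only unreviewed text exists.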