Case Study: RU - sc-voice/ms-dpd GitHub Wiki
Case Study: Russian
Overview
The Digital Pali Dictionary (DPD) is the sole source for MS-DPD EN definitions. Indeed, the EN DPD definitions are the basis for other MS-DPD languages whose translations are in progress (DE, PT, ES, FR).
However, the DPD is itself also multilingual and provides data tables for other languages (Russian, Sinhala).
This document addresses change management and translation issues related to one of those languages (i.e., Russian) that may affect DPD or MS-DPD:
- What background knowledge do we need for this document?
- How do we get RU DPD definitions into MS-DPD?
- How do we update RU DPD when EN DPD definitions are updated or added?
- How do we update RU MS-DPD when EN DPD definitions are updated or added?
Background
The following brief summary of background knowledge may help the reader.
Early Buddhist Texts (EBT)
SuttaCentral.net bases its aligned translations on the Early Buddhist Texts (EBTs) found in the Mahāsaṅgīti. The term "aligned" refers to the process of aligning Pali documents using "segment identifiers" to distinguish translatable semantic units of text. For example, the segment identifier "mn8:1.1" aligns all translations of that segment across multiple languages:
scid: mn8:1.1
pli: Evaṁ me sutaṁ—
ref: So habe ich es gehört:
en: So I have heard.
SC-Voice.net
SC-Voice.net is a website that lets users read and hear suttas such as
Although originally developed for the sight-impaired, SC-Voice.net has become useful to the normally sighted as well, who find benefit in listening to the suttas in the manner in which they were originally spoken.
SC-Voice.net itself is multilingual and aligned, and other sites apply SC-Voice.net technology to particular languages:
- (DE) https://dhammaregen.net
- (FR) https://fr.sc-voice.net
- (ES, PT websites planned)
MS-DPD
MS-DPD is a JavaScript library developed for hosting DPD and its translations within SC-Voice.net. MS-DPD is designed for web use as a library and does not include the full DPD:
| Application | Lookup words | Definition headwords | Data (MB) |
|---|---|---|---|
DPD | 1264999 | 80303 | 2027.88 |
MS-DPD | 131735 | 61394 | 22.99 |
Language-specific MS-DPD data is dynamically imported on demand and amounts to about 3-4 MB per language. MS-DPD is available in npm as multiple packages:
- @sc-voice/[email protected] (Pali lookup and general code)
- @sc-voice/[email protected] (language specific library)
- @sc-voice/[email protected] (language specific library)
- @sc-voice/[email protected] (language specific library)
- @sc-voice/[email protected] (language specific library)
- @sc-voice/[email protected] (language specific library)
- @sc-voice/[email protected] (language specific library)
DPD Database
MS-DPD uses the DPD SQLite database as the exclusive source for DPD information. The tables currently used by MS-DPD comprise:
- lookup
- dpd_headwords
- inflection_templates
- (other tables may be included as needed)
DPD stores definition translations in other tables with similar but not identical schemas.
For this document, we will focus on the RU translations.
DPD stores Russian translations in a separate table tied to `dpd_headwords`:
CREATE TABLE russian (
id INTEGER NOT NULL,
ru_meaning VARCHAR NOT NULL,
ru_meaning_raw VARCHAR NOT NULL,
ru_meaning_lit VARCHAR NOT NULL,
ru_notes VARCHAR NOT NULL,
PRIMARY KEY (id),
FOREIGN KEY(id) REFERENCES dpd_headwords (id)
);
To illustrate, let us examine the first definition for "dhamma". For EN, we have:
> select id, meaning_1, meaning_2, meaning_lit from dpd_headwords where id = 34626
id|meaning_1|meaning_2|meaning_lit
34626|nature; character|nature|
And for RU, we have:
> select id, ru_meaning, ru_meaning_raw, ru_meaning_lit, ru_notes from russian where id = 34626
id|ru_meaning|ru_meaning_raw|ru_meaning_lit|ru_notes
34626|характер; качество; природа|||
From DeepL, we have the round-trip translation of RU ⇒ EN (note the subtle differences):
character; quality; nature|||
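The two queries above can also be combined into one join that shows EN and RU side by side. The following is a minimal sketch using Python's built-in `sqlite3` module against an in-memory stand-in, not the published DPD database. The `dpd_headwords` table is reduced to the four columns quoted above, and empty fields are assumed to be stored as empty strings.

```python
import sqlite3

# In-memory stand-in for the DPD database (assumption: reduced schema).
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE dpd_headwords (
  id INTEGER NOT NULL PRIMARY KEY,
  meaning_1 VARCHAR NOT NULL,
  meaning_2 VARCHAR NOT NULL,
  meaning_lit VARCHAR NOT NULL
);
CREATE TABLE russian (
  id INTEGER NOT NULL,
  ru_meaning VARCHAR NOT NULL,
  ru_meaning_raw VARCHAR NOT NULL,
  ru_meaning_lit VARCHAR NOT NULL,
  ru_notes VARCHAR NOT NULL,
  PRIMARY KEY (id),
  FOREIGN KEY(id) REFERENCES dpd_headwords (id)
);
INSERT INTO dpd_headwords VALUES (34626, 'nature; character', 'nature', '');
INSERT INTO russian VALUES (34626, 'характер; качество; природа', '', '', '');
""")

# LEFT JOIN keeps EN headwords that have no RU row yet (ru_meaning is then NULL).
side_by_side = db.execute(
    """
    SELECT h.id, h.meaning_1, r.ru_meaning
    FROM dpd_headwords h
    LEFT JOIN russian r ON r.id = h.id
    WHERE h.id = ?
    """,
    (34626,),
).fetchall()

for row in side_by_side:
    print(row)
```

A `LEFT JOIN` is used rather than an inner join so that headwords without a Russian translation remain visible in the result.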
Importing RU for MS-DPD
Importing RU for MS-DPD is straightforward and is handled by the `build-dpd` script, which is executed with each DPD update.
Since MS-DPD relies on the SQLite database published with each update, the script uses the following SQL to get all the RU definitions:
select id, ru_meaning meaning_1, ru_meaning_raw meaning_2, ru_meaning_lit meaning_lit from russian
The output of this SQL command is processed by sql-dpd.mjs to generate the RU MS-DPD definitions.
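To show what the column aliases in that query do, here is a minimal sketch that runs the same SQL with Python's built-in `sqlite3` module against an in-memory copy of the `russian` table. The dictionary keyed by `id` is only an illustration; it is not the format that `sql-dpd.mjs` actually emits.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.row_factory = sqlite3.Row  # rows become addressable by column name

# Stand-in data: the single RU row quoted earlier in this document.
db.executescript("""
CREATE TABLE russian (
  id INTEGER NOT NULL,
  ru_meaning VARCHAR NOT NULL,
  ru_meaning_raw VARCHAR NOT NULL,
  ru_meaning_lit VARCHAR NOT NULL,
  ru_notes VARCHAR NOT NULL,
  PRIMARY KEY (id)
);
INSERT INTO russian VALUES (34626, 'характер; качество; природа', '', '', '');
""")

# The import query from the text: RU columns are renamed to the EN column names.
IMPORT_SQL = """
select id, ru_meaning meaning_1, ru_meaning_raw meaning_2, ru_meaning_lit meaning_lit
from russian
"""

# Hypothetical in-memory shape, keyed by headword id.
definitions = {row["id"]: dict(row) for row in db.execute(IMPORT_SQL)}
print(definitions[34626])
```

Because the RU columns are aliased to `meaning_1`, `meaning_2`, and `meaning_lit`, any downstream code written for EN rows can consume RU rows without a separate code path.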
The latest version of MS-DPD CLI now understands RU:
Synchronizing DPD EN/RU definitions
SQL databases generally do not track changes to individual rows, although SQLite does support audit tables maintained by triggers.
The triggers would be defined on `dpd_headwords`, and the trigger handlers could monitor changes to the `meaning_1`, `meaning_2`, and `meaning_lit` fields.
This simple design could support the translations included within DPD itself by maintaining a log of EN deletions, updates, and additions.
Individual translators could then consult the audit log as a guide to keeping their own DPD language tables synchronized with the main `dpd_headwords` table.
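The audit design described above could look roughly like the following. This is a minimal sketch under stated assumptions, run with Python's built-in `sqlite3` module on an in-memory database: the `en_audit` table, the trigger names, and the edited definition text are all hypothetical, and `dpd_headwords` is reduced to the monitored columns. Nothing here exists in DPD today.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE dpd_headwords (
  id INTEGER NOT NULL PRIMARY KEY,
  meaning_1 VARCHAR NOT NULL,
  meaning_2 VARCHAR NOT NULL,
  meaning_lit VARCHAR NOT NULL
);

-- Hypothetical audit table: one row per EN addition, update or deletion.
CREATE TABLE en_audit (
  audit_id INTEGER PRIMARY KEY AUTOINCREMENT,
  headword_id INTEGER NOT NULL,
  action TEXT NOT NULL,
  changed_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);

CREATE TRIGGER en_insert AFTER INSERT ON dpd_headwords
BEGIN
  INSERT INTO en_audit (headword_id, action) VALUES (NEW.id, 'insert');
END;

-- Fires only for the monitored columns, and only when a value really changed.
CREATE TRIGGER en_update
AFTER UPDATE OF meaning_1, meaning_2, meaning_lit ON dpd_headwords
WHEN OLD.meaning_1 IS NOT NEW.meaning_1
  OR OLD.meaning_2 IS NOT NEW.meaning_2
  OR OLD.meaning_lit IS NOT NEW.meaning_lit
BEGIN
  INSERT INTO en_audit (headword_id, action) VALUES (NEW.id, 'update');
END;

CREATE TRIGGER en_delete AFTER DELETE ON dpd_headwords
BEGIN
  INSERT INTO en_audit (headword_id, action) VALUES (OLD.id, 'delete');
END;
""")

# Exercise the triggers with an illustrative (invented) edit.
db.execute("INSERT INTO dpd_headwords VALUES (34626, 'nature; character', 'nature', '')")
db.execute("UPDATE dpd_headwords SET meaning_1 = 'nature; character; quality' WHERE id = 34626")
db.execute("UPDATE dpd_headwords SET meaning_1 = meaning_1 WHERE id = 34626")  # no change, not logged
db.execute("DELETE FROM dpd_headwords WHERE id = 34626")

audit_log = db.execute(
    "SELECT headword_id, action FROM en_audit ORDER BY audit_id"
).fetchall()
print(audit_log)
```

A translator could then select the audit rows newer than their last synchronization and review only the affected headword ids in their own language table.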
Synchronizing MS-DPD EN/RU definitions
The user experience for MS-DPD RU content is governed by several files maintained by different teams:
- `@sc-voice.net/ms-dpd/dpd/ru/definition-ru.mjs`: DPD/EBT headword content from `russian`, updated entirely by `build-dpd`
- `@sc-voice.net/ms-dpd/dpd/ru/abbreviation-ru.mjs`: MS-DPD RU grammatical abbreviations, updated entirely by `build-dpd`
- `@sc-voice.net/ebt-vue3/src/i18n/ru.ts`: web page user interface for RU (maintained by RU Tipitaka translators)
These files are maintained in different ways, as described below.
definition-ru.mjs EN/RU Synchronized in DPD
If the DPD RU is always synchronized with DPD EN, then the process for MS-DPD RU updates is quite simple: overwrite the MS-DPD RU definitions with their updated versions from the `russian` table. Since the merges would happen in DPD, integrating them would be a simple automatic replacement of the content of `definition-ru.mjs` with the latest content from `russian`.
abbreviation-ru.mjs EN/RU Synchronized in DPD
Given that RU abbreviations are already in DPD, we assume they are updated along with RU definitions.
MS-DPD abbreviations for RU are drawn from the `lookup` table fields `ru_abbrev` and `ru_meaning`. Untranslated definitions are copied over from their EN counterparts. All RU abbreviations in MS-DPD are automatically generated from `lookup`.
i18n/ru.ts RU Tipitaka translators
Different translation teams are working with SuttaCentral Bilara. I18n considerations for RU web content will be handled directly by those RU teams and will not impact the MS-DPD RU translation team.
Other
meaning_raw
In the main DPD, the `meaning_2` field holds definitions per Buddhadatta.
In the Russian schema, the `ru_meaning_raw` field holds unreviewed definitions generated via AI.
Both of these fields have the semantics of "unreviewed text", which can hold content from either source.
In MS-DPD definition files, unreviewed content is stored in the `meaning_raw` column, which is only populated in the absence of content in the `meaning_1` field.
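The fallback rule for `meaning_raw` can be stated compactly in code. The following is a minimal sketch only: the function name `to_ms_dpd_fields` and the returned dictionary shape are hypothetical and do not reflect the actual MS-DPD definition file format.

```python
def to_ms_dpd_fields(meaning_1: str, unreviewed: str) -> dict:
    """Apply the rule described above (hypothetical helper).

    `unreviewed` stands for either `meaning_2` (EN) or `ru_meaning_raw` (RU).
    It is kept only when there is no reviewed `meaning_1` text.
    """
    reviewed = meaning_1.strip()
    return {
        "meaning_1": reviewed,
        "meaning_raw": "" if reviewed else unreviewed.strip(),
    }


# Reviewed text present: the unreviewed text is dropped.
print(to_ms_dpd_fields("nature; character", "nature"))
# No reviewed text: the unreviewed text is carried in meaning_raw.
print(to_ms_dpd_fields("", "placeholder unreviewed text"))
```

Treating both source fields as one "unreviewed" input keeps the EN and RU paths identical after the column aliasing done at import time.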