Observations - davidguzmanr/AmericasNLP-Machine-Translation GitHub Wiki

The dataset appears to follow Classical Nahuatl orthography.

Because of the orthographic discrepancies between authors, we should probably use a Soundex-esque phonetic conversion system.

A good first approach could be to adopt the conventions in Notes on Nahuatl Orthography.
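A minimal sketch of what such a phonetic key could look like. The rewrite rules below are illustrative assumptions (they are not taken from Notes on Nahuatl Orthography), and `phonetic_key` and `RULES` are hypothetical names. The idea is only that different spellings of the same sounds collapse to one key:

```python
import unicodedata

# Hypothetical rewrite rules, applied in order. Each pair maps a spelling
# variant to one canonical symbol. These are assumptions for illustration.
RULES = [
    ("tz", "ts"),  # tz and ts treated as the same affricate
    ("hu", "w"),   # hu as a spelling of the w consonant
    ("uh", "w"),   # uh as a spelling of the w consonant
    ("z", "s"),    # z and s treated as the same sibilant
]


def strip_diacritics(text: str) -> str:
    """Drop all combining marks, so that e.g. ā becomes a."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def phonetic_key(word: str) -> str:
    """Return a rough spelling-independent key for a word.

    Limitation: plain substring replacement cannot tell a saltillo h
    followed by the vowel u apart from the digraph hu.
    """
    key = strip_diacritics(word.lower())
    for old, new in RULES:
        key = key.replace(old, new)
    return key


print(phonetic_key("Nāhuatl"), phonetic_key("nawatl"))
```

Two tokens would then be treated as the same word when their keys match, regardless of which author's orthography they came from.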

Some notes about our development dataset:

  • Vowel length is NOT marked. This means that for any extra corpora we use, we must preprocess long vowels (for example, ā -> a).
  • Outside of English words, w seems to appear only in kwakwaltsin. hu and uh are the more common spellings of this consonant.
  • b and v are present only in loanwords.
  • y: both the y and i spellings are present.
  • The s sound is ONLY written as z. The letter s is present in loanwords only.
  • Saltillos ARE present (as h).
  • Punctuation is present as: ellipses, commas, question marks, and punctuation inside names (I'll Fly). Period use is inconsistent, so we should probably strip periods.
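Two of the notes above (unmarked vowel length and inconsistent periods) translate directly into a cleanup step for extra corpora. The sketch below is one possible way to do it. `normalize_line` is a hypothetical helper, and keeping ellipses while dropping single periods is an assumption based on the punctuation note:

```python
import re
import unicodedata

MACRON = "\u0304"  # combining macron, the mark on long vowels such as ā

# A period that is not part of a run of periods (so "..." is kept).
LONE_PERIOD = re.compile(r"(?<!\.)\.(?!\.)")


def normalize_line(line: str) -> str:
    """Strip vowel-length macrons and lone periods from one line of text."""
    # Decompose so that ā becomes a + combining macron, remove only the
    # macron, then recompose so other accented letters (e.g. ñ) survive.
    decomposed = unicodedata.normalize("NFD", line)
    without_macrons = unicodedata.normalize("NFC", decomposed.replace(MACRON, ""))
    return LONE_PERIOD.sub("", without_macrons)


print(normalize_line("Tlen... āmo."))
```

Only the macron is removed, rather than every diacritic, so that loanwords such as lasaña keep their original spelling.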

Idea for the preprocessing algorithm:

  • For each token, first decide whether to exclude it because it looks like a loanword or a foreign name (Bill Gates, lasaña, "Blacc Eyed Pea"). An RNN could perhaps be used for this purpose.
  • If we find spellings that are invalid by the standards of the dev set, we try to normalize them.
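The two steps above can be sketched as follows. This is a rough baseline under stated assumptions, not the final algorithm: instead of an RNN, the loanword check is a simple character heuristic using the letters the notes list as loanword-only (b, v, s); "exclude" is read as "skip normalization"; and the only normalization rule shown is rewriting w as hu before a vowel and uh elsewhere. All function names are hypothetical.

```python
import re

# Letters that, per the dev-set notes, show up only in loanwords.
# Assumption: this set is a starting point and would need tuning.
LOANWORD_LETTERS = set("bvs")


def looks_like_loanword(token: str) -> bool:
    """Heuristic stand-in for a learned classifier (e.g. an RNN)."""
    return any(ch in LOANWORD_LETTERS for ch in token.lower())


def normalize_spelling(token: str) -> str:
    """Rewrite w using the hu/uh spellings that are more common in the dev set.

    Assumption: w before a vowel becomes hu, any other w becomes uh.
    The sequence kw is left untouched, since kwakwaltsin occurs in the dev set.
    """
    token = re.sub(r"(?<!k)w(?=[aeiou])", "hu", token)
    return re.sub(r"(?<!k)w", "uh", token)


def preprocess(tokens: list[str]) -> list[str]:
    """Leave suspected loanwords as they are and normalize everything else."""
    return [t if looks_like_loanword(t) else normalize_spelling(t) for t in tokens]


print(preprocess(["Bill", "Gates", "lasaña", "nawatl"]))
```

A character heuristic like this will misfire on extra corpora that use s for native words, so it is only meaningful for text that already follows the dev-set conventions. Dropping loanwords instead of passing them through is a one-line change in `preprocess`.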