Observations - davidguzmanr/AmericasNLP-Machine-Translation GitHub Wiki
The dataset appears to follow Classical Nahuatl orthography.
Because of the orthographic discrepancies between authors, we should probably use a Soundex-esque phonetic conversion system.
A good first approach could be to adopt the conventions in Notes on Nahuatl Orthography.
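As a rough illustration of what a Soundex-esque conversion could look like, the sketch below collapses a few common classical and modern spelling variants into one key. The rule table and the name `phonetic_key` are assumptions made for this example. They are not taken from Notes on Nahuatl Orthography, and some cases are ambiguous (for instance, `uh` can also be a vowel followed by a saltillo).

```python
import re
import unicodedata

# Illustrative rules only (an assumption, not an established standard).
# They are applied in order, so earlier rewrites feed later ones.
_RULES = [
    (r"ch", "C"),              # protect the digraph ch from the c rules below
    (r"tz", "ts"),
    (r"qu(?=[ei])", "k"),      # que/qui -> ke/ki
    (r"c(?=[ei])", "s"),       # ce/ci -> se/si
    (r"c", "k"),               # any remaining c -> k
    (r"z", "s"),
    (r"hu(?=[aeio])", "w"),    # hu before a vowel -> w
    (r"(?<=[aeio])uh", "w"),   # uh after a vowel -> w
    (r"ku(?=[aeio])", "kw"),   # cua/cue/cui -> kwa/kwe/kwi
    (r"C", "ch"),              # restore the protected digraph
]

def phonetic_key(word):
    """Return a spelling-insensitive key for a single word."""
    # Lowercase and drop all combining marks (this also removes macrons).
    decomposed = unicodedata.normalize("NFD", word.lower())
    key = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    for pattern, replacement in _RULES:
        key = re.sub(pattern, replacement, key)
    return key

# Two spellings of the same word should share a key.
print(phonetic_key("cihuatl") == phonetic_key("siwatl"))
```

Two tokens would then be treated as the same word whenever their keys match, regardless of which author's conventions they follow.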
Some notes about our development dataset:
- Vowel length is NOT marked. This means that any extra corpora we use must be preprocessed to remove length marks (for example, ā -> a).
- Outside of English words, w seems to appear only in kwakwaltsin. hu and uh are the more common spellings of this consonant.
- B and v are present only in loanwords.
- y: both the y and i spellings are present.
- The s sound is ONLY written as z. The letter s itself is present in loanwords only.
- Saltillos ARE present (as h).
- Punctuation present: ellipses, commas, question marks, and marks inside names (I'll Fly). Period use is inconsistent, so periods should probably be stripped.
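Two of the notes above translate directly into small text filters. The following is a minimal sketch, assuming that vowel length in extra corpora is marked with macrons and that ellipses are written as consecutive periods. The function names are hypothetical, and other length conventions (doubled vowels, colons) are not handled.

```python
import re
import unicodedata

MACRON = "\u0304"  # combining macron, the length mark on ā, ē, ī, ō

def strip_vowel_length(text):
    """Remove macrons only, leaving other diacritics (e.g. ñ) intact."""
    decomposed = unicodedata.normalize("NFD", text)
    return unicodedata.normalize("NFC", decomposed.replace(MACRON, ""))

# A period that is not next to another period, so "..." is left alone.
_LONE_PERIOD = re.compile(r"(?<!\.)\.(?!\.)")

def drop_lone_periods(text):
    """Delete sentence-final periods but keep ellipses."""
    return _LONE_PERIOD.sub("", text)

print(drop_lone_periods(strip_vowel_length("Quema. Amo... tlen?")))
```

Stripping only the macron, rather than every combining mark, keeps loanwords such as lasaña unchanged. Note that the period filter would also remove decimal points, so numbers may need separate handling.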
Idea for the preprocessing algorithm:
- For each token, first decide whether to exclude it because it looks like a loanword or foreign name (Bill Gates, lasaña, "Blacc Eyed Pea"). (Maybe an RNN could be used for this purpose.)
- If we find invalid spellings (by the standards of the dev set), we try to normalize them.
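The two steps above could be wired together roughly as follows. This is only a sketch under stated assumptions: "exclude" is read as "leave the token untouched", `looks_like_loanword` is a hypothetical letter-based stand-in for the proposed classifier, and `normalize_spelling` is a placeholder that handles only macron vowels.

```python
# From the notes: b and v occur only in loanwords.
LOANWORD_ONLY_LETTERS = set("bv")

# Placeholder normalization table (assumption): macron vowels -> plain vowels.
_LENGTH_MAP = str.maketrans("\u0101\u0113\u012b\u014d\u016b", "aeiou")

def looks_like_loanword(token):
    """Hypothetical stand-in for a learned classifier such as an RNN."""
    return any(ch in LOANWORD_ONLY_LETTERS for ch in token.lower())

def normalize_spelling(token):
    """Placeholder normalizer: maps long vowels to short ones."""
    return token.translate(_LENGTH_MAP)

def preprocess(tokens, is_loanword=looks_like_loanword,
               normalize=normalize_spelling):
    """Leave loanword-like tokens untouched and normalize the rest."""
    return [tok if is_loanword(tok) else normalize(tok) for tok in tokens]

print(preprocess(["Bill", "tl\u0101catl", "vaca"]))
```

Passing the predicate and the normalizer as arguments means the letter heuristic can later be swapped for a trained model without changing the loop. A name such as Gates contains neither b nor v and would slip past this heuristic, which is one reason a learned classifier may be worth trying.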