Chinese Sub Char BPE - ufal/NPFL095 GitHub Wiki
Questions for Chinese-English NMT with Sub-Character BPE
1
One of the main topics of the paper is BPE. Let’s see if you understand what is going on regarding that.
1.1
There is a passage in the text (section 4.2): “Different numbers of BPE operations (0, 500, 1,000, 2,000, 3,000 or 4,000) were applied…”
Q: Assuming that 25 English letters denote Chinese radicals in the Wubi system, how many different subword types will there be if 0 BPE operations are applied? And if 4,000 operations are applied?
1.2
Suppose that you have an excerpt of the Chinese training corpus. It looks like this:
请讲...清论文...在青岛...从你和从他们 (the text is nonsense, I picked the characters with illustrative radicals).
Its Wubi representation is the following:
yge yfj . . . ige ywx yygy . . . d gef qynmu . . . ww wq t ww wb wu
Suppose that you need to perform the BPE operations yourself. Ignore the ellipses (they just separate different phrases) and the blank spaces.
Q: What would the first 4 operations on this input look like? Denote each substitution by a capital Latin letter (i.e., replace the first merged pair with A, the second with B, etc.).
Hint: the order of some of the operations (the 3rd and 4th) can be swapped.
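As a reminder of what a single BPE operation does, here is a minimal Python sketch on a different toy input (not the excerpt above). It is an illustration under stated assumptions, not the paper's implementation: the names `learn_bpe` and `merge_pair` are hypothetical, merged pairs are renamed to capital letters as in the question, and ties between equally frequent pairs are broken lexicographically, which is an arbitrary choice made only so the output is deterministic.

```python
from collections import Counter


def merge_pair(seq, pair, new_symbol):
    """Replace every non-overlapping occurrence of `pair` in `seq` with `new_symbol`."""
    out = []
    i = 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_symbol)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merges (at most 26, one per capital letter).

    Returns the list of merges and the segmented words.
    Assumes the input uses lowercase symbols only, so capital letters are free.
    """
    seqs = [list(w) for w in words]
    merges = []
    for step in range(min(num_merges, 26)):
        # Count adjacent symbol pairs inside each word (never across words).
        pairs = Counter()
        for seq in seqs:
            for a, b in zip(seq, seq[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        # Most frequent pair; ties broken lexicographically (an assumption).
        best = min(pairs, key=lambda p: (-pairs[p], p))
        new_symbol = chr(ord("A") + step)
        merges.append((best, new_symbol))
        seqs = [merge_pair(seq, best, new_symbol) for seq in seqs]
    return merges, seqs


merges, segmented = learn_bpe(["abab", "abc", "cab"], 2)
print(merges)
print(segmented)
```

On this toy input the first operation merges the pair `a b` (it occurs four times) into `A`; after that, all remaining pairs occur once, so the second operation depends only on the tie-breaking rule.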
2
Section 3.2 says: “For two similar languages, the joint BPE method can be used to generate the vocabularies for both the source and target language.”
Q: Does it make sense to use the joint BPE for English + Wubicized Chinese data? Why?
3
Q: According to the authors, why did they expect Wubi-based translation to be slower than character-based translation? Were they right?
4
Q: For the comparison with the Wubi encoding, the authors also used word-based, character-based, and subword-based Chinese data. What does “subword” mean in this case?
Bonus Question:
5
In Chinese, there are approximately 80,000 characters, but only 5,000 to 8,000 of them are used regularly. Moreover, no new characters can appear. (These statements are a simplification, but let us assume they hold.)
Q: From a theoretical perspective, why do we even bother going to the sub-character level? Wouldn’t it be enough to just map each character to a number and that’d be it? What are the drawbacks of such an approach?
Double Bonus Question (strictly for those who do not have anything else to do)
6
Interestingly, Japanese and Korean writing systems have significant differences. For example, Koreans currently use their own writing system, Hangul, which at first sight seems similar to the Chinese one.
Q: Based on this short description from Wikipedia, would you say that applying such a sub-character approach to Hangul is a good idea? Why (not)? If not, what would be an easier approach?