Chinese Sub Char BPE - ufal/NPFL095 GitHub Wiki

Questions for Chinese-English NMT with Sub-Character BPE

1

One of the main topics of the paper is BPE. Let’s see if you understand what is going on regarding that.

1.1

There is a passage in the text (section 4.2): “Different numbers of BPE operations (0, 500, 1,000, 2,000, 3,000 or 4,000) were applied…”

Q: Assuming that there are 25 English letters that denote Chinese radicals in the Wubi system, how many different subword types will there be if 0 BPE operations are applied? What if 4,000 operations are applied?

1.2

Suppose that you have an excerpt of the Chinese training corpus. It looks like this:

请讲...清论文...在青岛...从你和从他们 (the text is nonsense, I picked the characters with illustrative radicals).

Its Wubi representation is the following:

yge yfj . . . ige ywx yygy . . . d gef qynmu . . . ww wq t ww wb wu

Suppose that you need to perform the BPE operations yourself. Ignore the ellipses (they just separate different phrases) and the blank spaces.

Q: What would the first 4 operations on this input look like? Denote each substitution with a capital Latin letter (i.e., replace the first pair with A, the second pair with B, etc.).

Hint: some of the operations (the 3rd and 4th) can be swapped in order.
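As a reminder of the mechanics, here is a minimal sketch of the BPE merge loop on an unrelated toy string (not the excerpt above). The function names `most_frequent_pair`, `merge_pair` and `learn_bpe` are hypothetical, and so are the assumptions that pairs are counted only inside whitespace-separated tokens and that ties are broken by first occurrence; the paper's tooling may differ on both points.

```python
from collections import Counter


def most_frequent_pair(words):
    """Return the most frequent adjacent symbol pair, or None if there is none."""
    counts = Counter()
    for symbols in words:
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += 1
    if not counts:
        return None
    # Assumption: ties go to the pair that was seen first.
    return max(counts, key=counts.get)


def merge_pair(words, pair, new_symbol):
    """Replace every non-overlapping occurrence of `pair` with `new_symbol`."""
    merged = []
    for symbols in words:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(new_symbol)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged.append(out)
    return merged


def learn_bpe(corpus, num_ops):
    """Learn up to `num_ops` merges, naming them A, B, C, ... (at most 26)."""
    words = [list(token) for token in corpus.split()]
    ops = []
    for k in range(num_ops):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        new_symbol = chr(ord("A") + k)
        words = merge_pair(words, pair, new_symbol)
        ops.append((pair, new_symbol))
    return ops, words


# Toy input, chosen only for illustration.
ops, words = learn_bpe("abab abc bc", 1)
print(ops)    # [(('a', 'b'), 'A')]
print(words)  # [['A', 'A'], ['A', 'c'], ['b', 'c']]
```

In the toy input, the pair (a, b) occurs three times and every other pair at most twice, so it is merged first. After that merge all remaining pairs occur once, which is the kind of tie where the order of operations becomes arbitrary.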

2

Section 3.2 says: “For two similar languages, the joint BPE method can be used to generate the vocabularies for both the source and target language.”

Q: Does it make sense to use the joint BPE for English + Wubicized Chinese data? Why?

3

Q: According to the authors, why did they expect the speed of the Wubi-based translation to be slower than the character-based translation? Were they right?

4

Q: For the comparison with Wubi encoding, the authors also used word-based, character-based and subword-based Chinese data. What does “subword” mean in this case?

Bonus Question:

5

In Chinese, there are approximately 80,000 characters, but only 5,000 to 8,000 are used regularly. Moreover, no new characters can appear. (These statements are a simplification, but let us assume they hold.)

Q: From a theoretical perspective, why do we even want to bother going to the sub-character level? Wouldn’t it be enough to just map each character to a number and leave it at that? What are the drawbacks of such an approach?
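To make the baseline in this question concrete, "map each character to a number" could look like the sketch below. The helper name `build_char_vocab` is hypothetical, and assigning IDs in order of first appearance is just one possible convention; the sample string is the excerpt from question 1.2 with the ellipses removed.

```python
def build_char_vocab(text):
    """Assign each distinct character an integer ID in order of first appearance."""
    vocab = {}
    for ch in text:
        vocab.setdefault(ch, len(vocab))
    return vocab


text = "请讲清论文在青岛从你和从他们"
vocab = build_char_vocab(text)
ids = [vocab[ch] for ch in text]

print(len(vocab))  # 13 distinct characters in a 14-character string
print(ids)
```

The only repeated character, 从, gets the same ID both times; every other ID is an arbitrary integer.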

Double Bonus Question (strictly for those who do not have anything else to do)

6

Interestingly, the Japanese and Korean writing systems differ significantly. For example, Koreans currently use their own writing system, Hangul, which at first sight seems similar to the Chinese one.

Q: Based on this short description from Wikipedia, would you say that applying such a sub-character approach to Hangul is a good idea? Why (not)? If not, what would be an easier approach?