What is the main disadvantage of BPE that BPE-Dropout tries to solve?
Bonus: In the Charagrams approach each word is represented by its character n-grams.
E.g., when using 4-grams and 5-grams, the word "unrelated" is represented by the 4-grams <w>unr, unre, nrel, rela, elat, late, ated, ted</w> and the 5-grams <w>unre, unrel, nrela, relat, elate, lated, ated</w>, where <w> and </w> mark the word boundaries and each counts as a single symbol (in practice, only the subset of n-grams that are frequent in the training data is used).
Guess what the (dis)advantages of BPE-Dropout vs. Charagrams are.
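
The n-gram extraction described in the bonus question can be sketched as follows. This is only an illustrative sketch: the function name char_ngrams and its parameters are hypothetical, each boundary marker is assumed to count as one symbol (as the example above suggests), and the frequency filtering against the training data is left out.

```python
def char_ngrams(word, n_min=4, n_max=5, bow="<w>", eow="</w>"):
    """Return all character n-grams of `word` for n in [n_min, n_max].

    Assumption: the boundary markers `bow` and `eow` each count as a
    single symbol, so "<w>unr" is a 4-gram.
    """
    symbols = [bow] + list(word) + [eow]
    ngrams = []
    for n in range(n_min, n_max + 1):
        # Slide a window of n symbols over the marked-up word.
        for i in range(len(symbols) - n + 1):
            ngrams.append("".join(symbols[i:i + n]))
    return ngrams


if __name__ == "__main__":
    # Unfiltered n-grams; a real system would keep only the frequent ones.
    print(char_ngrams("unrelated"))
```

Note that every word yields a fixed, overlapping set of n-grams here, whereas BPE-Dropout yields one (randomly varying) non-overlapping segmentation per occurrence; that contrast is a useful starting point for the comparison.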