Participle - Potato-W/Poetic_Language GitHub Wiki

Word segmentation is the foundation of NLP, whether the downstream task is sentiment analysis, content understanding, or anything else. A good beginning is half the battle.

Segmenting modern Chinese is already a huge challenge, not to mention Middle Chinese. The meaning of a character (字) inside a word (词) may differ from its meaning on its own. My strategy is as follows:

  1. First, segment the text with jieba to recover most of the words.
  2. Second, use information entropy to find unregistered (out-of-vocabulary) words, which are more common in Tang poetry than in modern Chinese.

jieba
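
A minimal sketch of the first step, assuming the poems are available as a list of strings with punctuation already removed. `segment_lines` and its `cut` parameter are hypothetical names used here for illustration; `jieba.lcut` is the library call that returns a list of words. jieba's dictionary is built for modern Chinese, so the output on Tang poetry should be treated as a first pass only.

```python
def segment_lines(lines, cut=None):
    """Segment each line into a list of words.

    `cut` is any callable mapping a string to a list of tokens.
    When it is omitted, jieba.lcut is used (assumption: jieba is installed).
    """
    if cut is None:
        import jieba  # third-party: pip install jieba
        cut = jieba.lcut
    return [list(cut(line)) for line in lines]


# Example call (requires jieba):
# segment_lines(["床前明月光", "疑是地上霜"])
```

Keeping the tokenizer as a parameter makes it easy to swap in another segmenter later, or to fall back to per-character splitting with `cut=list` for comparison.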

information entropy
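
The page does not say how entropy is applied, so the following is only one common reading: left/right branching entropy. A character sequence that is followed (and preceded) by many different characters is likely to be a free-standing word. All function names, `max_len`, `min_count`, and `min_entropy` are assumptions chosen for illustration, and a real pipeline would probably add a cohesion measure such as mutual information.

```python
import math
from collections import Counter, defaultdict


def entropy(counter):
    """Shannon entropy (in nats) of a Counter of neighbouring characters."""
    total = sum(counter.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counter.values())


def branching_entropy(lines, max_len=2):
    """Return {ngram: (count, left_entropy, right_entropy)} for all n-grams up to max_len."""
    counts = Counter()
    left = defaultdict(Counter)   # characters seen immediately before each n-gram
    right = defaultdict(Counter)  # characters seen immediately after each n-gram
    for line in lines:
        for n in range(1, max_len + 1):
            for i in range(len(line) - n + 1):
                gram = line[i:i + n]
                counts[gram] += 1
                if i > 0:
                    left[gram][line[i - 1]] += 1
                if i + n < len(line):
                    right[gram][line[i + n]] += 1
    return {g: (counts[g], entropy(left[g]), entropy(right[g])) for g in counts}


def candidate_words(stats, known=(), min_count=2, min_entropy=0.5):
    """Keep multi-character n-grams that are frequent, free on both sides, and not already known."""
    known = set(known)
    return sorted(
        g for g, (count, le, re) in stats.items()
        if len(g) >= 2 and g not in known
        and count >= min_count and min(le, re) >= min_entropy
    )
```

Passing the vocabulary produced by jieba as `known` leaves only the candidates that the dictionary-based step missed. The thresholds need tuning on the actual corpus; the defaults here are placeholders.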