README1: Overview - ksugi/german_textsegmentation GitHub Wiki

This is a German text segmentation tool (GETS).

GETS segments texts into tokens and sentences. It is based on Conditional Random Fields (CRFs) and is designed for modern German texts in private communication, where some orthographic deviations are expected. GETS achieved high accuracy on a newspaper corpus (TüBA D/Z) and on a postcard corpus (ANKO).
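To illustrate how token and sentence segmentation can be treated as one sequence-labelling task, here is a minimal sketch that turns per-character labels into sentences of tokens. The label scheme (`S` = first character of a sentence, `T` = first character of a token, `I` = inside a token, `O` = outside any token) and the function name `decode` are assumptions made for this example. They are not taken from GETS, and the labels are written by hand here rather than predicted by a CRF.

```python
def decode(text, labels):
    """Group characters into sentences of tokens from per-character labels.

    Hypothetical label scheme (an assumption, not necessarily the one GETS uses):
      S = first character of a sentence (and of its first token)
      T = first character of a token inside a sentence
      I = inside the current token
      O = outside any token (e.g. whitespace)
    """
    if len(text) != len(labels):
        raise ValueError("need exactly one label per character")

    sentences = []  # list of sentences, each a list of token strings
    current = []    # characters of the token being built

    def flush():
        # Close the token in progress and attach it to the last sentence.
        if current:
            sentences[-1].append("".join(current))
            current.clear()

    for ch, label in zip(text, labels):
        if label == "O":
            flush()
            continue
        if label in ("S", "T"):
            flush()
        if label == "S" or not sentences:
            sentences.append([])
        current.append(ch)

    flush()
    return sentences


# The second sentence starts in lower case, a typical deviation in private texts.
print(decode("Bin da. bis bald", "SIIOTITOSIIOTIII"))
# prints [['Bin', 'da', '.'], ['bis', 'bald']]
```

In a real tagger the label string would come from the trained model, so the sentence boundary before "bis" would be found from context rather than from capitalisation.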

Details are described in the paper: K. Sugisaki. Word and sentence segmentation in German: Overcoming idiosyncrasies in the use of punctuation in private communication. In Proceedings of the International Conference of the German Society for Computational Linguistics and Language Technology (GSCL), 2017 (to appear).

NOTE:

The CRF model I provide here is trained on TüBA D/Z 10, so its accuracy may be better than that reported in the paper.