200 Languages - HeidelTime/heideltime GitHub Wiki

Automatically Created Resources for 200+ Languages

Starting with version 2.0, HeidelTime contains automatically created resources for 200+ languages. These language resources are located in HeidelTime's "resources" folder and are named following the pattern "auto-x".
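As a rough illustration of the naming scheme, the sketch below lists the automatically created resources found in a folder. It assumes that each resource is a subfolder whose name starts with "auto-" and that the "x" part identifies the language; the helper `list_auto_languages` and the folder names in the demo are hypothetical and not part of HeidelTime.

```python
import tempfile
from pathlib import Path


def list_auto_languages(resources_dir):
    """Return the 'x' part of every subfolder named 'auto-x' (assumed layout)."""
    prefix = "auto-"
    root = Path(resources_dir)
    return sorted(
        entry.name[len(prefix):]
        for entry in root.iterdir()
        if entry.is_dir() and entry.name.startswith(prefix)
    )


# Demo on a throwaway directory; the folder names here are made up.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("auto-afrikaans", "auto-basque", "english"):
        (Path(tmp) / name).mkdir()
    found = list_auto_languages(tmp)

print(found)
```

Pointing the helper at a local copy of the "resources" folder instead of the temporary directory would show which automatically created languages are present, provided the layout assumption holds.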

For linguistic preprocessing, we developed the AllLanguagesTokenizer, a simple, whitespace-based yet generic tool that can be used with all of these languages. It works reasonably well for languages that separate tokens with whitespace.
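To show what whitespace-based tokenization means in practice, here is a minimal sketch. It is not the AllLanguagesTokenizer itself; the function `whitespace_tokenize` and both sample sentences are assumptions made for illustration only.

```python
def whitespace_tokenize(text):
    """Split text on runs of whitespace (illustrative stand-in only)."""
    return text.split()


# Sample sentences chosen for illustration; not taken from any corpus.
english = "The meeting is on 12 May 2015"
japanese = "会議は2015年5月12日です"

english_tokens = whitespace_tokenize(english)
japanese_tokens = whitespace_tokenize(japanese)

print(english_tokens)   # one token per word
print(japanese_tokens)  # the whole sentence comes back as a single token
```

The English sentence splits into separate words, so the date expression can be matched token by token. The Japanese sentence contains no whitespace and comes back as one token, which is why languages without whitespace tokenization are harder for this approach.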

A description of how we developed these resources can be found in our EMNLP'15 paper and the EMNLP'15 poster.

Temporal tagging of most of these languages had never been addressed before, so HeidelTime with its automatically created language resources can be considered a baseline temporal tagger for all these languages. Of course, the temporal tagging quality is lower when using automatically created language resources, and the quality differs significantly between languages.

As described in the paper, we performed an evaluation on publicly available temporally annotated corpora, and the results are promising, although some issues still need to be addressed, e.g., tokenization of non-whitespace languages. For languages without annotated corpora, a direct evaluation is not possible. We therefore provide an overview of the pattern translation completeness statistics for all the languages. A first version of this plot can be found here: auto-language completeness statistics