Evaluation Results - HeidelTime/heideltime GitHub Wiki

Introduction

This page contains the evaluation results of version 2.2 of HeidelTime.

Operating system: Debian Linux

Java version: 1.8.0_101

Locale: en_GB (unless stated otherwise in the workflow description in ReproduceEvaluationResults)

Tokenization and POS-Tagging: TreeTaggerWrapper, JVnTextProWrapper (Vietnamese corpora: JVnTextPro 2.0, Maxent model), StanfordPOSTaggerWrapper (Arabic corpora: Stanford POS Tagger 3.3.1, arabic.tagger model), HunPosTaggerWrapper (Croatian WikiWarsHR: HunPos 1.0, Croatian model from 09.05.2013)

ACE Tern 2004 Training Corpus

Precision Recall F-Score
Extraction (lenient) 95.8% 79.0% 86.6%
Extraction (strict) 87.3% 72.0% 78.9%
Normalization (value) 86.8% 87.3% 87.1%
Extraction & Normalization (lenient + VAL) 83.1% 68.6% 75.1%
Extraction & Normalization (strict + VAL) 78.2% 64.6% 70.7%
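The F-Score column in these tables is consistent with the balanced F1 measure, i.e. the harmonic mean of precision and recall. The sketch below is only an illustration under that assumption and is not part of HeidelTime or its evaluation scripts; the function name `f_score` is hypothetical. It recomputes the F-Score of two rows from the ACE Tern 2004 table above.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F1 (harmonic mean) of precision and recall, both given in percent.

    Assumption: the tables on this page report the balanced F1 measure.
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rows taken from the ACE Tern 2004 Training Corpus table.
lenient = f_score(95.8, 79.0)
strict = f_score(87.3, 72.0)

print(f"Extraction (lenient): {lenient:.1f}%")  # reported: 86.6%
print(f"Extraction (strict):  {strict:.1f}%")   # reported: 78.9%
```

Because the published precision and recall values are already rounded, a recomputed F-Score can differ from the reported one in the last digit for some rows.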

AncientTimes Arabic

Precision Recall F-Score
Extraction (strict) 83.33% 74.26% 78.53%
Extraction (relaxed) 93.33% 83.17% 87.96%
  • Attribute value F1: 83.77%
  • Attribute type F1: 87.96%

AncientTimes German

Precision Recall F-Score
Extraction (strict) 86.75% 71.98% 78.68%
Extraction (relaxed) 95.36% 79.12% 86.49%
  • Attribute value F1: 81.08%
  • Attribute type F1: 85.89%

AncientTimes English

Precision Recall F-Score
Extraction (strict) 88.85% 78.88% 83.57%
Extraction (relaxed) 97.03% 86.14% 91.26%
  • Attribute value F1: 84.97%
  • Attribute type F1: 90.56%

AncientTimes Spanish

Precision Recall F-Score
Extraction (strict) 80.85% 72.04% 76.19%
Extraction (relaxed) 96.28% 85.78% 90.73%
  • Attribute value F1: 85.71%
  • Attribute type F1: 88.22%

AncientTimes French

Precision Recall F-Score
Extraction (strict) 89.07% 77.19% 82.71%
Extraction (relaxed) 98.38% 85.26% 91.35%
  • Attribute value F1: 90.23%
  • Attribute type F1: 91.35%

AncientTimes Italian

Precision Recall F-Score
Extraction (strict) 79.63% 75.11% 77.3%
Extraction (relaxed) 91.2% 86.03% 88.54%
  • Attribute value F1: 79.55%
  • Attribute type F1: 85.84%

AncientTimes Dutch

Precision Recall F-Score
Extraction (strict) 81.67% 78.4% 80.0%
Extraction (relaxed) 94.17% 90.4% 92.24%
  • Attribute value F1: 88.16%
  • Attribute type F1: 88.16%

AncientTimes Vietnamese

Precision Recall F-Score
Extraction (strict) 87.27% 82.76% 84.96%
Extraction (relaxed) 97.27% 92.24% 94.69%
  • Attribute value F1: 92.04%
  • Attribute type F1: 93.81%

ACE Tern 2005 Corpus

Precision Recall F-Score
Extraction (lenient) 89.3% 75.5% 81.8%
Extraction (strict) 77.3% 65.3% 70.8%
Normalization (value) 74.8% 77.3% 76%
Extraction & Normalization (lenient + VAL) 66.8% 56.4% 61.2%
Extraction & Normalization (strict + VAL) 62.8% 53.1% 57.5%

Arabic test-150 Corpus

Precision Recall F-Score
Extraction (lenient) 80.1% 90.9% 85.2%
Extraction (strict) 64.9% 73.7% 69.0%

Arabic test-50 Corpus

Precision Recall F-Score
Extraction (lenient) 79.7% 90.4% 84.7%
Extraction (strict) 62.8% 71.3% 66.8%

Arabic test-50-star Corpus

Precision Recall F-Score
Extraction (lenient) 91.9% 91.3% 91.6%
Extraction (strict) 84.8% 84.2% 84.5%
Normalization (value) 91.9% 91.9% 91.9%
Extraction & Normalization (lenient + VAL) 84.5% 83.9% 84.2%
Extraction & Normalization (strict + VAL) 80.1% 79.5% 79.8%

Arabic test-50-star Corpus evaluated with TE3-Tools

Precision Recall F-Score
Extraction (strict) 80.99% 80.99% 80.99%
Extraction (relaxed) 90.91% 90.91% 90.91%
  • Attribute value F1: 82.23%
  • Attribute type F1: 84.3%

I-CAB Test Corpus

Precision Recall F-Score
Extraction (lenient) 92.7% 81.5% 86.8%
Extraction (strict) 64.1% 56.4% 60.0%
Normalization (value) 75.6% 78.3% 76.9%
Extraction & Normalization (lenient + VAL) 70.1% 61.7% 65.6%
Extraction & Normalization (strict + VAL) 51.4% 45.2% 48.1%

TempEval2 Evaluation Corpus

Precision Recall F-Score
88.0% 86.0% 87.0%
  • Attribute type: 96.0 %
  • Attribute value: 86.0 %

TempEval2 Spanish Evaluation Corpus

The Spanish TempEval2 Evaluation Corpus is essentially the same as the TempEval 3 version further down in this document, which includes some improvements. Please refer to that version, as it also uses our preferred evaluation method.

TempEval2 Italian Evaluation Corpus

Precision Recall F-Score
93.1% 89.6% 91.3%
  • Attribute type: 98.0 %
  • Attribute value: 94.0 %

TempEval 2 Italian Training Corpus evaluated with TE3-Tools

Precision Recall F-Score
Extraction (strict) 73.3% 88.72% 80.28%
Extraction (relaxed) 77.41% 93.69% 84.78%
  • Attribute value F1: 76.47%
  • Attribute type F1: 82.18%

TempEval 2 Italian Test Corpus evaluated with TE3-Tools

Precision Recall F-Score
Extraction (strict) 77.93% 89.68% 83.39%
Extraction (relaxed) 83.45% 96.03% 89.3%
  • Attribute value F1: 81.18%
  • Attribute type F1: 85.61%

TempEval 2 Chinese Original Training Corpora

Precision Recall F-Score
96.0% 93.9% 94.9%
  • Attribute type: 92.0 %
  • Attribute value: 79.0 %

TempEval 2 Chinese CLEAN Training Corpora

Precision Recall F-Score
80.1% 95.7% 87.2%
  • Attribute type: 94.0 %
  • Attribute value: 90.0 %

TempEval 2 Chinese IMPROVED Training Corpora

Precision Recall F-Score
97.4% 95.6% 96.5%
  • Attribute type: 94.0 %
  • Attribute value: 91.0 %

TempEval 2 Chinese Original Evaluation Corpora

Precision Recall F-Score
93.8% 87.5% 90.5%
  • Attribute type: 93.0 %
  • Attribute value: 70.0 %

TempEval 2 Chinese CLEAN Evaluation Corpora

Precision Recall F-Score
62.4% 91.8% 74.3%
  • Attribute type: 96.0 %
  • Attribute value: 89.0 %

TempEval 2 Chinese IMPROVED Evaluation Corpora

Precision Recall F-Score
95.8% 89.3% 92.4%
  • Attribute type: 96.0 %
  • Attribute value: 86.0 %

TimeBank 1.2 Corpus

Precision Recall F-Score
Extraction (lenient) 92.6% 91.5% 92.0%
Extraction (strict) 86.6% 85.6% 86.1%
Normalization (value) 87.6% 87.6% 87.6%
Extraction & Normalization (lenient + VAL) 81.0% 80.1% 80.6%
Extraction & Normalization (strict + VAL) 77.0% 76.2% 76.6%

WikiWars Corpus

Precision Recall F-Score
Extraction (lenient) 98.3% 86.1% 91.8%
Extraction (strict) 93.3% 81.8% 87.2%
Normalization (value) 90.5% 91.1% 90.8%
Extraction & Normalization (lenient + VAL) 89.0% 78.0% 83.1%
Extraction & Normalization (strict + VAL) 85.9% 75.3% 80.2%

WikiWarsDE Corpus

Precision Recall F-Score
Extraction (lenient) 98.7% 89.3% 93.8%
Extraction (strict) 92.6% 83.8% 88.0%
Normalization (value) 88.5% 88.5% 88.5%
Extraction & Normalization (lenient + VAL) 87.4% 79.1% 83.0%
Extraction & Normalization (strict + VAL) 83.2% 75.3% 79.1%

WikiWarsVN Corpus

Precision Recall F-Score
Extraction (lenient) 92.1% 97.8% 94.8%
Extraction (strict) 72.9% 77.4% 75.1%
Normalization (value) 95% 95% 95%
Extraction & Normalization (lenient + VAL) 87.5% 92.9% 90.1%
Extraction & Normalization (strict + VAL) 69.2% 73.5% 71.2%

WikiWarsVN Corpus evaluated with TE3-Tools

Precision Recall F-Score
Extraction (strict) 94.09% 94.09% 94.09%
Extraction (relaxed) 98.18% 98.18% 98.18%
  • Attribute value F1: 91.36%
  • Attribute type F1: 93.64%

WikiWarsHR Corpus evaluated with TE3-Tools

Precision Recall F-Score
Extraction (strict) 88.93% 86.86% 87.88%
Extraction (relaxed) 92.62% 90.46% 91.53%
  • Attribute value F1: 80.8%
  • Attribute type F1: 89.74%

Time4SCI Corpus

Precision Recall F-Score
Extraction (lenient) 96.2% 70.6% 81.4%
Extraction (strict) 88.9% 65.3% 75.3%
Normalization (value) 88.9% 88.9% 88.9%
Extraction & Normalization (lenient + VAL) 85.5% 62.8% 72.4%
Extraction & Normalization (strict + VAL) 80.0% 58.8% 67.7%

Time4SMS Corpus

Precision Recall F-Score
Extraction (lenient) 99.4% 91.3% 95.2%
Extraction (strict) 98.2% 90.2% 94.1%
Normalization (value) 97.1% 97.1% 97.1%
Extraction & Normalization (lenient + VAL) 96.5% 88.7% 92.4%
Extraction & Normalization (strict + VAL) 96.1% 88.3% 92.1%

TempEval 3 AQUAINT Training Corpus

Precision Recall F-Score
Extraction (strict) 80.99% 81.69% 81.34%
Extraction (relaxed) 92.12% 92.92% 92.52%
  • Attribute value F1: 73.09%
  • Attribute type F1: 84.44%

TempEval 3 TimeBank Training Corpus

Precision Recall F-Score
Extraction (strict) 86.4% 84.31% 85.34%
Extraction (relaxed) 93.08% 90.83% 91.94%
  • Attribute value F1: 79.56%
  • Attribute type F1: 89.66%

TempEval 3 trainT3 Spanish Training Corpus

Precision Recall F-Score
Extraction (strict) 90.83% 81.44% 85.88%
Extraction (relaxed) 96.33% 86.38% 91.08%
  • Attribute value F1: 84.14%
  • Attribute type F1: 89.54%

TempEval 3 Platinum English Evaluation Corpus

Precision Recall F-Score
Extraction (strict) 83.97% 79.71% 81.78%
Extraction (relaxed) 93.13% 88.41% 90.71%
  • Attribute value F1: 78.07%
  • Attribute type F1: 83.27%

TempEval 3 Spanish Evaluation Corpus

Precision Recall F-Score
Extraction (strict) 91.48% 80.9% 85.87%
Extraction (relaxed) 96.02% 84.92% 90.13%
  • Attribute value F1: 85.33%
  • Attribute type F1: 87.47%

French TimeBank 1.1 Corpus

Precision Recall F-Score
Extraction (strict) 86.81% 85.18% 85.99%
Extraction (relaxed) 91.85% 90.12% 90.97%
  • Attribute value F1: 73.63%
  • Attribute type F1: 82.66%

EVALITA 2014 Test Corpus

Precision Recall F-Score Type F1 Value F1
Strict extraction/normalization 85.1% 79.0% 82.0% 78.5% 71.0%
Relaxed extraction/normalization 92.7% 86.1% 89.3% 84.0% 75.0%

Portuguese TimeBank 1.0 Corpus (test subset)

Precision Recall F-Score
Extraction (strict) 76.98% 66.9% 71.59%
Extraction (relaxed) 87.3% 75.86% 81.18%
  • Attribute value F1: 63.47%
  • Attribute type F1: 76.75%