Evaluation Results

Introduction

This page contains the evaluation results of version 2.2 of HeidelTime.

Operating system: Debian Linux

Java version: 1.8.0_101

Locale: en_GB (unless stated otherwise in the workflow description on the ReproduceEvaluationResults page)

Tokenization and POS-Tagging: TreeTaggerWrapper, JVnTextProWrapper (Vietnamese corpora: JVnTextPro 2.0, Maxent model), StanfordPOSTaggerWrapper (Arabic corpora: Stanford POS Tagger 3.3.1, arabic.tagger model), HunPosTaggerWrapper (Croatian WikiWarsHR: HunPos 1.0, Croatian model from 09.05.2013)

ACE Tern 2004 Training Corpus

	Precision	Recall	F-Score
Extraction (lenient)	95.8%	79.0%	86.6%
Extraction (strict)	87.3%	72.0%	78.9%
Normalization (value)	86.8%	87.3%	87.1%
Extraction & Normalization (lenient + VAL)	83.1%	68.6%	75.1%
Extraction & Normalization (strict + VAL)	78.2%	64.6%	70.7%
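
The F-Score column in the tables on this page is consistent with the balanced harmonic mean of precision and recall (F1). As a minimal sketch, assuming that definition, the following snippet recomputes the two extraction F-scores of the ACE Tern 2004 table above. The function name `f_score` is illustrative and is not part of HeidelTime or of any evaluation script.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F-score (harmonic mean of precision and recall).

    Inputs and output use the same scale, here percentages.
    Assumption: the tables report F1, i.e. equal weight on both measures.
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall copied from the ACE Tern 2004 table above.
lenient = round(f_score(95.8, 79.0), 1)
strict = round(f_score(87.3, 72.0), 1)
print(lenient, strict)
```

Because the published precision and recall values are already rounded, a recomputed F-score can differ from the reported one by about 0.1 points in other tables.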

AncientTimes Arabic

	Precision	Recall	F-Score
Extraction (strict)	83.33%	74.26%	78.53%
Extraction (relaxed)	93.33%	83.17%	87.96%

Attribute value F1: 83.77%
Attribute type F1: 87.96%
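
Several tables on this page report extraction under a strict and a relaxed setting. The sketch below illustrates one common reading, assumed here and not taken from the evaluation tools themselves: a strict match requires identical span boundaries, while a relaxed match only requires that a system span overlaps a gold span. The spans, offsets and function names are made up for illustration.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def overlaps(a: Span, b: Span) -> bool:
    """True if the two spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def precision_recall(system: List[Span], gold: List[Span], strict: bool):
    """Span-level precision and recall under strict or relaxed matching."""
    # Assumption: strict = exact boundaries, relaxed = any overlap.
    match = (lambda s, g: s == g) if strict else overlaps
    system_hits = sum(any(match(s, g) for g in gold) for s in system)
    gold_hits = sum(any(match(s, g) for s in system) for g in gold)
    precision = system_hits / len(system) if system else 0.0
    recall = gold_hits / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical annotations: the second system span is off by two characters.
gold = [(0, 10), (20, 30), (40, 50)]
system = [(0, 10), (22, 30)]

strict_pr = precision_recall(system, gold, strict=True)
relaxed_pr = precision_recall(system, gold, strict=False)
print(strict_pr, relaxed_pr)
```

Under this reading the boundary error costs a match in the strict setting but not in the relaxed one, which is why relaxed scores are never lower than strict scores in the tables.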

AncientTimes German

	Precision	Recall	F-Score
Extraction (strict)	86.75%	71.98%	78.68%
Extraction (relaxed)	95.36%	79.12%	86.49%

Attribute value F1: 81.08%
Attribute type F1: 85.89%

AncientTimes English

	Precision	Recall	F-Score
Extraction (strict)	88.85%	78.88%	83.57%
Extraction (relaxed)	97.03%	86.14%	91.26%

Attribute value F1: 84.97%
Attribute type F1: 90.56%

AncientTimes Spanish

	Precision	Recall	F-Score
Extraction (strict)	80.85%	72.04%	76.19%
Extraction (relaxed)	96.28%	85.78%	90.73%

Attribute value F1: 85.71%
Attribute type F1: 88.22%

AncientTimes French

	Precision	Recall	F-Score
Extraction (strict)	89.07%	77.19%	82.71%
Extraction (relaxed)	98.38%	85.26%	91.35%

Attribute value F1: 90.23%
Attribute type F1: 91.35%

AncientTimes Italian

	Precision	Recall	F-Score
Extraction (strict)	79.63%	75.11%	77.3%
Extraction (relaxed)	91.2%	86.03%	88.54%

Attribute value F1: 79.55%
Attribute type F1: 85.84%

AncientTimes Dutch

	Precision	Recall	F-Score
Extraction (strict)	81.67%	78.4%	80.0%
Extraction (relaxed)	94.17%	90.4%	92.24%

Attribute value F1: 88.16%
Attribute type F1: 88.16%

AncientTimes Vietnamese

	Precision	Recall	F-Score
Extraction (strict)	87.27%	82.76%	84.96%
Extraction (relaxed)	97.27%	92.24%	94.69%

Attribute value F1: 92.04%
Attribute type F1: 93.81%

ACE Tern 2005 Corpus

	Precision	Recall	F-Score
Extraction (lenient)	89.3%	75.5%	81.8%
Extraction (strict)	77.3%	65.3%	70.8%
Normalization (value)	74.8%	77.3%	76.0%
Extraction & Normalization (lenient + VAL)	66.8%	56.4%	61.2%
Extraction & Normalization (strict + VAL)	62.8%	53.1%	57.5%

Arabic test-150 Corpus

	Precision	Recall	F-Score
Extraction (lenient)	80.1%	90.9%	85.2%
Extraction (strict)	64.9%	73.7%	69.0%

Arabic test-50 Corpus

	Precision	Recall	F-Score
Extraction (lenient)	79.7%	90.4%	84.7%
Extraction (strict)	62.8%	71.3%	66.8%

Arabic test-50-star Corpus

	Precision	Recall	F-Score
Extraction (lenient)	91.9%	91.3%	91.6%
Extraction (strict)	84.8%	84.2%	84.5%
Normalization (value)	91.9%	91.9%	91.9%
Extraction & Normalization (lenient + VAL)	84.5%	83.9%	84.2%
Extraction & Normalization (strict + VAL)	80.1%	79.5%	79.8%

Arabic test-50-star Corpus evaluated with TE3-Tools

	Precision	Recall	F-Score
Extraction (strict)	80.99%	80.99%	80.99%
Extraction (relaxed)	90.91%	90.91%	90.91%

Attribute value F1: 82.23%
Attribute type F1: 84.3%

I-CAB Test Corpus

	Precision	Recall	F-Score
Extraction (lenient)	92.7%	81.5%	86.8%
Extraction (strict)	64.1%	56.4%	60.0%
Normalization (value)	75.6%	78.3%	76.9%
Extraction & Normalization (lenient + VAL)	70.1%	61.7%	65.6%
Extraction & Normalization (strict + VAL)	51.4%	45.2%	48.1%

TempEval2 Evaluation Corpus

Precision	Recall	F-Score
88.0%	86.0%	87.0%

Attribute type: 96.0 %
Attribute value: 86.0 %

TempEval2 Spanish Evaluation Corpus

The Spanish TempEval 2 evaluation corpus is essentially the same as the TempEval 3 Spanish evaluation corpus listed further down in this document; the two differ only in some improvements. Please refer to the TempEval 3 results, which also use our preferred evaluation method.

TempEval2 Italian Evaluation Corpus

Precision	Recall	F-Score
93.1%	89.6%	91.3%

Attribute type: 98.0 %
Attribute value: 94.0 %

TempEval 2 Italian Training Corpus evaluated with TE3-Tools

	Precision	Recall	F-Score
Extraction (strict)	73.3%	88.72%	80.28%
Extraction (relaxed)	77.41%	93.69%	84.78%

Attribute value F1: 76.47%
Attribute type F1: 82.18%

TempEval 2 Italian Test Corpus evaluated with TE3-Tools

	Precision	Recall	F-Score
Extraction (strict)	77.93%	89.68%	83.39%
Extraction (relaxed)	83.45%	96.03%	89.3%

Attribute value F1: 81.18%
Attribute type F1: 85.61%

TempEval 2 Chinese Original Training Corpora

Precision	Recall	F-Score
96.0%	93.9%	94.9%

Attribute type: 92.0 %
Attribute value: 79.0 %

TempEval 2 Chinese CLEAN Training Corpora

Precision	Recall	F-Score
80.1%	95.7%	87.2%

Attribute type: 94.0 %
Attribute value: 90.0 %

TempEval 2 Chinese IMPROVED Training Corpora

Precision	Recall	F-Score
97.4%	95.6%	96.5%

Attribute type: 94.0 %
Attribute value: 91.0 %

TempEval 2 Chinese Original Evaluation Corpora

Precision	Recall	F-Score
93.8%	87.5%	90.5%

Attribute type: 93.0 %
Attribute value: 70.0 %

TempEval 2 Chinese CLEAN Evaluation Corpora

Precision	Recall	F-Score
62.4%	91.8%	74.3%

Attribute type: 96.0 %
Attribute value: 89.0 %

TempEval 2 Chinese IMPROVED Evaluation Corpora

Precision	Recall	F-Score
95.8%	89.3%	92.4%

Attribute type: 96.0 %
Attribute value: 86.0 %

TimeBank 1.2 Corpus

	Precision	Recall	F-Score
Extraction (lenient)	92.6%	91.5%	92.0%
Extraction (strict)	86.6%	85.6%	86.1%
Normalization (value)	87.6%	87.6%	87.6%
Extraction & Normalization (lenient + VAL)	81.0%	80.1%	80.6%
Extraction & Normalization (strict + VAL)	77.0%	76.2%	76.6%

WikiWars Corpus

	Precision	Recall	F-Score
Extraction (lenient)	98.3%	86.1%	91.8%
Extraction (strict)	93.3%	81.8%	87.2%
Normalization (value)	90.5%	91.1%	90.8%
Extraction & Normalization (lenient + VAL)	89.0%	78.0%	83.1%
Extraction & Normalization (strict + VAL)	85.9%	75.3%	80.2%
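
In the WikiWars table above, the "lenient + VAL" row is numerically close to the lenient extraction scores scaled by the value normalization precision. The snippet below checks this arithmetic. It is only an observation on the published numbers, under the assumption that normalization is scored on the leniently matched expressions; it does not describe how the evaluation scripts compute the combined row, and the variable names are illustrative.

```python
# Values copied from the WikiWars table above (percentages).
lenient_precision = 98.3
lenient_recall = 86.1
value_precision = 90.5  # "Normalization (value)" precision

# Assumption: a combined hit is a lenient match whose value is also correct,
# so the combined scores are roughly the product of the two rates.
combined_precision = round(lenient_precision * value_precision / 100, 1)
combined_recall = round(lenient_recall * value_precision / 100, 1)

reported_precision, reported_recall = 89.0, 78.0  # "lenient + VAL" row
print(combined_precision, combined_recall)
```

The recomputed values agree with the reported row to within about 0.1 points, which is the precision lost by working from already rounded percentages.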

WikiWarsDE Corpus

	Precision	Recall	F-Score
Extraction (lenient)	98.7%	89.3%	93.8%
Extraction (strict)	92.6%	83.8%	88.0%
Normalization (value)	88.5%	88.5%	88.5%
Extraction & Normalization (lenient + VAL)	87.4%	79.1%	83.0%
Extraction & Normalization (strict + VAL)	83.2%	75.3%	79.1%

WikiWarsVN Corpus

	Precision	Recall	F-Score
Extraction (lenient)	92.1%	97.8%	94.8%
Extraction (strict)	72.9%	77.4%	75.1%
Normalization (value)	95.0%	95.0%	95.0%
Extraction & Normalization (lenient + VAL)	87.5%	92.9%	90.1%
Extraction & Normalization (strict + VAL)	69.2%	73.5%	71.2%

WikiWarsVN Corpus evaluated with TE3-Tools

	Precision	Recall	F-Score
Extraction (strict)	94.09%	94.09%	94.09%
Extraction (relaxed)	98.18%	98.18%	98.18%

Attribute value F1: 91.36%
Attribute type F1: 93.64%

WikiWarsHR Corpus evaluated with TE3-Tools

	Precision	Recall	F-Score
Extraction (strict)	88.93%	86.86%	87.88%
Extraction (relaxed)	92.62%	90.46%	91.53%

Attribute value F1: 80.8%
Attribute type F1: 89.74%

Time4SCI Corpus

	Precision	Recall	F-Score
Extraction (lenient)	96.2%	70.6%	81.4%
Extraction (strict)	88.9%	65.3%	75.3%
Normalization (value)	88.9%	88.9%	88.9%
Extraction & Normalization (lenient + VAL)	85.5%	62.8%	72.4%
Extraction & Normalization (strict + VAL)	80.0%	58.8%	67.7%

Time4SMS Corpus

	Precision	Recall	F-Score
Extraction (lenient)	99.4%	91.3%	95.2%
Extraction (strict)	98.2%	90.2%	94.1%
Normalization (value)	97.1%	97.1%	97.1%
Extraction & Normalization (lenient + VAL)	96.5%	88.7%	92.4%
Extraction & Normalization (strict + VAL)	96.1%	88.3%	92.1%

TempEval 3 AQUAINT Training Corpus

	Precision	Recall	F-Score
Extraction (strict)	80.99%	81.69%	81.34%
Extraction (relaxed)	92.12%	92.92%	92.52%

Attribute value F1: 73.09%
Attribute type F1: 84.44%

TempEval 3 TimeBank Training Corpus

	Precision	Recall	F-Score
Extraction (strict)	86.4%	84.31%	85.34%
Extraction (relaxed)	93.08%	90.83%	91.94%

Attribute value F1: 79.56%
Attribute type F1: 89.66%

TempEval 3 trainT3 Spanish Training Corpus

	Precision	Recall	F-Score
Extraction (strict)	90.83%	81.44%	85.88%
Extraction (relaxed)	96.33%	86.38%	91.08%

Attribute value F1: 84.14%
Attribute type F1: 89.54%

TempEval 3 Platinum English Evaluation Corpus

	Precision	Recall	F-Score
Extraction (strict)	83.97%	79.71%	81.78%
Extraction (relaxed)	93.13%	88.41%	90.71%

Attribute value F1: 78.07%
Attribute type F1: 83.27%

TempEval 3 Spanish Evaluation Corpus

	Precision	Recall	F-Score
Extraction (strict)	91.48%	80.9%	85.87%
Extraction (relaxed)	96.02%	84.92%	90.13%

Attribute value F1: 85.33%
Attribute type F1: 87.47%

French TimeBank 1.1 Corpus

	Precision	Recall	F-Score
Extraction (strict)	86.81%	85.18%	85.99%
Extraction (relaxed)	91.85%	90.12%	90.97%

Attribute value F1: 73.63%
Attribute type F1: 82.66%

EVALITA 2014 Test Corpus

	Precision	Recall	F-Score	Type F1	Value F1
Strict extraction/normalization	85.1%	79%	82%	78.5%	71%
Relaxed extraction/normalization	92.7%	86.1%	89.3%	84%	75%

Portuguese TimeBank 1.0 Corpus (test subset)

	Precision	Recall	F-Score
Extraction (strict)	76.98%	66.9%	71.59%
Extraction (relaxed)	87.3%	75.86%	81.18%

Attribute value F1: 63.47%
Attribute type F1: 76.75%