ReproducibilityStandards - delph-in/docs GitHub Wiki
To facilitate comparison with, and reproducibility of, experiments that use DELPH-IN data and tool sets, this page documents standard training and testing data sets for each grammar, along with standard evaluation metrics and terminology. We encourage everyone to use the standards listed here, or to describe any deviations in terms of these standards.
Data
Evaluation Metrics
Coverage
- observed coverage: percentage of items that received at least one parse
- verified coverage: percentage of items for which a gold standard analysis was found during treebanking
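The two coverage figures above can be computed as in the following minimal sketch. The `Item` record and its field names (`n_parses`, `has_gold`) are hypothetical and chosen only for illustration; they are not part of any DELPH-IN tool or profile schema.

```python
from dataclasses import dataclass


@dataclass
class Item:
    """Hypothetical per-item record; the field names are illustrative only."""
    n_parses: int   # number of analyses the parser returned for this item
    has_gold: bool  # True if a gold standard analysis was found during treebanking


def coverage(items):
    """Return (observed, verified) coverage as percentages of all items."""
    total = len(items)
    if total == 0:
        raise ValueError("cannot compute coverage over an empty data set")
    # Observed coverage: items that received at least one parse.
    observed = sum(1 for item in items if item.n_parses > 0)
    # Verified coverage: items with a gold standard analysis.
    verified = sum(1 for item in items if item.has_gold)
    return 100.0 * observed / total, 100.0 * verified / total


# Toy data: three of four items parse, two of four have a gold analysis.
items = [Item(3, True), Item(1, False), Item(0, False), Item(12, True)]
observed, verified = coverage(items)
print(f"observed: {observed:.1f}%  verified: {verified:.1f}%")
```

Both figures use the full data set as the denominator, which is what makes them comparable across experiments.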
Accuracy
It is important to specify whether these metrics are calculated over:
- all items in a data set
- all items that have a gold standard analysis
- all items that received a parse
- the intersection of the last two
Metrics
- exact match: percentage of items for which the top analysis was the gold analysis
- DMRS
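To illustrate why the choice of item set matters, the sketch below computes exact match over each of the four item sets listed under Accuracy. It is an assumption-laden example rather than a reference implementation: the dictionary keys (`parsed`, `has_gold`, `top_is_gold`) and the scope names are hypothetical and do not correspond to any DELPH-IN tool.

```python
def exact_match(items, scope="all"):
    """Exact match as a percentage over the chosen item set.

    Each item is a dict with hypothetical boolean keys:
      'parsed'      - the item received at least one parse
      'has_gold'    - the item has a gold standard analysis
      'top_is_gold' - the top analysis is the gold analysis
    """
    selectors = {
        "all": lambda it: True,                              # all items in the data set
        "gold": lambda it: it["has_gold"],                   # items with a gold analysis
        "parsed": lambda it: it["parsed"],                   # items that received a parse
        "gold_and_parsed": lambda it: it["has_gold"] and it["parsed"],  # intersection
    }
    selected = [it for it in items if selectors[scope](it)]
    if not selected:
        raise ValueError(f"no items in scope {scope!r}")
    # An item can only count as a hit if it has both a parse and a gold analysis.
    hits = sum(
        1 for it in selected
        if it["parsed"] and it["has_gold"] and it["top_is_gold"]
    )
    return 100.0 * hits / len(selected)


# Toy data: one hit among five items.
items = [
    {"parsed": True,  "has_gold": True,  "top_is_gold": True},
    {"parsed": True,  "has_gold": True,  "top_is_gold": False},
    {"parsed": True,  "has_gold": False, "top_is_gold": False},
    {"parsed": False, "has_gold": False, "top_is_gold": False},
    {"parsed": False, "has_gold": True,  "top_is_gold": False},
]
for scope in ("all", "gold", "parsed", "gold_and_parsed"):
    print(f"{scope}: {exact_match(items, scope):.1f}%")
```

On this toy data the same single correct item yields a different score for each denominator, which is why a reported accuracy figure should always state the item set it was calculated over.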