FAQs - adobe-research/deft_corpus GitHub Wiki

  1. What is this DeftEval thing? Where can I find more information about SemEval shared tasks?

DeftEval is Task 6 of the SemEval 2020 shared tasks. You can find more information about how to join and participate here.

  1. I think I found a tokenization error in the corpus!

There may be occasional tokenization inconsistencies in the corpus: our tokenization methods are automatic, and our annotators are only human. Help us spot and fix these errors by reporting them with the "Tokenization Error" template in the issues tab on GitHub.

  1. I found a bug in the data or evaluation methods. What now?

If you've found a bug that is not related to tokenization, please report it in the issues tab on GitHub, providing as much information as possible so that we can reproduce and fix the issue.

  1. I have questions about DeftEval or CodaLab. Who do I ask?

If you have questions about the DeftEval shared task or about CodaLab (e.g., how to submit predictions via CodaLab), please post them to the DeftEval Google Group. If you need to contact the task organizers for any reason, please email [email protected].

  1. I only see dev and train data in the repo right now. When will the full dataset be released?

Because of our participation in SemEval 2020, we will be releasing the full dataset in line with relevant deadlines found here. The full labeled test set for all subtasks will be available after the final evaluation period ends in March 2020.

  1. What's the deal with these repeated sentences in the data?

In order to work with the CoNLL-2003-like BIO sequence labeling format while still allowing any given token to have one or more tag labels, we repeat sentences that contain tokens with overlapping labels. Occasionally, you may see a sentence that contains a nested definition (e.g., a term and definition appear inside another definition), a definition that is valid for more than one term, or an otherwise overlapping label sequence. In these cases, you will find the sentence repeated once for each overlapped term.
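
If you would rather work with one copy of each sentence, the repeats can be collapsed so that every token carries all of its labels. The following is a minimal sketch, not the official loader: it assumes each sentence has already been parsed into a list of (token, tag) pairs, and the function name `merge_repeated_sentences`, the example sentence, and the tag names shown are all illustrative. It also assumes that repeats have identical token sequences; on real data you would probably want to key on the source file and character offsets as well, so that two different sentences with the same tokens are not merged by accident.

```python
def merge_repeated_sentences(sentences):
    """Collapse repeated sentences into one entry per unique token sequence.

    `sentences` is a list of sentences, each a list of (token, tag) pairs
    (a hypothetical in-memory form of the BIO data). The result pairs each
    token with the list of distinct non-"O" tags seen across all repeats.
    """
    merged = {}  # token tuple -> one tag list per token (insertion-ordered)
    for sentence in sentences:
        tokens = tuple(token for token, _ in sentence)
        tag_lists = merged.setdefault(tokens, [[] for _ in tokens])
        for tags, (_, tag) in zip(tag_lists, sentence):
            if tag != "O" and tag not in tags:
                tags.append(tag)
    return [list(zip(tokens, tag_lists)) for tokens, tag_lists in merged.items()]


# Illustrative input: the same sentence appears twice because "unit"
# is both inside a definition and the start of another term.
example = [
    [("A", "O"), ("cell", "B-Term"), ("is", "O"),
     ("a", "B-Definition"), ("unit", "I-Definition")],
    [("A", "O"), ("cell", "O"), ("is", "O"),
     ("a", "O"), ("unit", "B-Term")],
]

merged = merge_repeated_sentences(example)
for token, tags in merged[0]:
    print(token, tags)
```

After merging, the two repeats above become a single sentence in which "unit" holds both of its labels, while tokens tagged "O" in every repeat end up with an empty list.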