Mindtagger - HazyResearch/mindbender GitHub Wiki

Mindtagger is an interactive data annotation tool primarily for developing knowledge base construction systems.

It is part of the Mindbender tool chain, which can be downloaded from the releases page. The DeepDive tutorial on labeling data products documents how to prepare its input and use its output.

The following sections (will) serve as a reference manual for Mindtagger configuration and templates.

Using Mindtagger

TBD

Launching Mindtagger

Choosing a task

Adding tags to data items

Exporting tags

Keyboard shortcuts

Creating a new task

TBD

mindtagger.conf

Specifying input

Defining tag schema

The tag schema dictates what types of annotations Mindtagger accepts for each input item. Mindtagger supports several types of tags:

Simple tags that do not have associated values and can be annotated to an item at most once.
Parametric tags that have associated values.
- by Arity
  - Unary parametric tags have a single value associated to them.
  - n-ary parametric tags have n parameter values associated to them. The values are identified by a parameter name, or an index number from 0 to n-1 if unnamed.
- by Multiplicity
  - Singleton parametric tags can be annotated to an item at most once.
  - Multiple parametric tags can be annotated to an item more than once with distinct parameter values.

The tag schema defined for a task is a JSON object whose keys are the tag names and whose values describe the type of each tag. Each value object describing a tag type can have the following keys defined:

type: either simple (default) or parametric.
params: one of the following where valid parameter type names are string, boolean, int, float, number, object. Ignored when type is simple.
- a parameter type name implying a unary tag
- a positive number for the arity
- an object whose keys are the parameter names and each value is the name of the parameter type.
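values: optionally, an array enumerating the values permitted for a unary tag.
multiple: true for a multiple parametric tag. As the examples below suggest, a tag without this key is a singleton.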

Here's an example schema for precision mode tasks, which defines a singleton unary tag is_correct, a multiple unary tag comment, and simple tags input error and duplicate:

{
  "is_correct": {
    "type": "parametric",
    "params": 1,
    "values": [true, false, "?"]
  },
  "comment": {
    "type": "parametric",
    "params": "string",
    "multiple": true
  },
  "input error": { "type": "simple" },
  "duplicate": {}
}
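
As a rough illustration of how these keys combine, the following Python sketch classifies each tag of the precision mode schema by type, arity, and multiplicity. The `describe_tag` helper is hypothetical and not part of Mindtagger. It only mirrors the rules stated in this section, and it assumes that a missing `multiple` key means a singleton tag and that a parametric tag without `params` is unary.

```python
import json

# The precision mode example schema from this section.
SCHEMA_JSON = """
{
  "is_correct": {"type": "parametric", "params": 1, "values": [true, false, "?"]},
  "comment": {"type": "parametric", "params": "string", "multiple": true},
  "input error": {"type": "simple"},
  "duplicate": {}
}
"""


def describe_tag(spec):
    """Summarize one schema entry (hypothetical helper, not Mindtagger code)."""
    if spec.get("type", "simple") == "simple":
        # "type" defaults to simple, and "params" is ignored for simple tags.
        return "simple"
    params = spec.get("params")
    if isinstance(params, dict):
        arity = len(params)  # an object of named parameters
    elif isinstance(params, int) and not isinstance(params, bool):
        arity = params  # a positive number gives the arity
    else:
        # A type name implies a unary tag; a missing value is
        # treated the same way here (assumption).
        arity = 1
    # Assumption: a missing "multiple" key means a singleton tag.
    multiplicity = "multiple" if spec.get("multiple", False) else "singleton"
    return f"{multiplicity} {arity}-ary parametric"


schema = json.loads(SCHEMA_JSON)
summary = {name: describe_tag(spec) for name, spec in schema.items()}
for name, kind in summary.items():
    print(f"{name}: {kind}")
```

An entry with named parameters, such as one whose `params` is `{"gene": "object", "phenotype": "object"}`, would be reported as 2-ary by the same helper.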

Here's another example schema for recall mode tasks over text. It defines two multiple unary tags, gene and phenotype, each of which takes as its only parameter the position of the mention in the sentence where it appears (described as an opaque object), and a multiple binary tag, expresses, which takes the two mention positions as its two parameters:

{
  "gene": {
    "type": "parametric",
    "params": "object",
    "multiple": true
  },
  "phenotype": {
    "type": "parametric",
    "params": "object",
    "multiple": true
  },
  "expresses": {
    "type": "parametric",
    "params": {
      "gene": "object",
      "phenotype": "object"
    },
    "multiple": true
  }
}

The tag schema for a task can be provided in several ways:

In a schema.json file next to the mindtagger.conf for the task.
Using MindtaggerTask.defineTags(Object) in the task template.

Mindtagger uses distinct UI elements and data representations for different types of tags:

Simple tags are annotated with a push-toggle button displayed per item, and each item will either have the tag defined or not in its tag storage.
Parametric tags receive their parameter values through various user interactions, e.g., by selecting a span of text. Multiple buttons and UI elements are displayed for composing the parameter values and adding/removing tags to an item.
- Singleton parametric tags are represented directly, whereas multiple parametric tags are represented as an array under the name of the tag.
- Unary tags are represented directly as their only parameter value unless they are named.
- n-ary tags or named unary tags are represented as an object mapping the parameter names to their values.
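
To make these representation rules concrete, here is a hedged Python sketch that builds the stored value of a parametric tag from a list of annotations. The functions `represent_annotation` and `represent_tag` are illustrative assumptions, not Mindtagger's implementation. Each annotation is passed as a list of unnamed parameter values or as a dict of named ones; keying unnamed n-ary parameters by the strings "0" to "n-1" and the shape of the mention position objects are also assumptions.

```python
def represent_annotation(values):
    """Represent the parameter values of a single annotation."""
    if isinstance(values, dict):
        # n-ary tags and named unary tags: an object mapping names to values.
        return dict(values)
    if len(values) == 1:
        # Unnamed unary tags: the bare parameter value.
        return values[0]
    # Unnamed n-ary tags: keyed by index 0..n-1 (string keys are an assumption).
    return {str(i): v for i, v in enumerate(values)}


def represent_tag(annotations, multiple=False):
    """Return the stored value for one parametric tag on one item."""
    rendered = [represent_annotation(a) for a in annotations]
    if multiple:
        # Multiple tags: an array under the name of the tag.
        return rendered
    if len(rendered) > 1:
        raise ValueError("a singleton tag can be annotated at most once")
    # Singleton tags: represented directly (None when the tag is absent).
    return rendered[0] if rendered else None


# Hypothetical tag storage for one item; the {"start": ...} objects stand in
# for the opaque mention positions.
item_tags = {
    "is_correct": represent_tag([[True]]),
    "comment": represent_tag([["typo"], ["see item 3"]], multiple=True),
    "expresses": represent_tag(
        [{"gene": {"start": 4}, "phenotype": {"start": 20}}], multiple=True
    ),
}
print(item_tags)
```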

Creating/extending task template

A task template defines the presentation of data items as well as the interactions supported in each task. It uses ordinary HTML syntax with a few extensions:

A data item field can be inserted by surrounding the name with two curly braces, e.g., {{item.foo}} for inserting the value of foo.
Special directives and attributes can be added to existing HTML tags (or they can appear as tags wrapping a block of HTML).

In fact, Mindtagger is built with AngularJS, and any AngularJS directives and expressions can be used in the template.

General structure of a task template

TBD

<mindtagger mode="...">
  <template for="...">
    HTML fragment
  </template>
  ...
</mindtagger>

Rendering word arrays

In text-based DeepDive apps, you will find arrays of words used everywhere along with their NLP markups. Mindtagger provides a directive for presenting such an array as a normal sentence.

<span mindtagger-word-array="item.words"
      array-format="postgres">
  (words in the "words" column of the input will be rendered here as a normal sentence)
</span>

If item.words is a flat string that holds a serialized array, array-format should be specified so that Mindtagger knows how to parse it. Valid values for array-format are:

postgres, which lets Mindtagger parse the string as a PostgreSQL array
python
json (default)
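
For a sense of what the array-format values distinguish, the Python sketch below reads the same word list from a JSON string and from a PostgreSQL-style array literal. It is a simplified illustration, not Mindtagger's parser: the `parse_postgres_array` helper is hypothetical and handles only flat arrays with optional double-quoted elements.

```python
import json


def parse_postgres_array(text):
    """Parse a flat PostgreSQL-style array literal such as '{a,"b c",d}'.

    Simplified sketch: no nested arrays and no NULL handling.
    """
    body = text.strip()
    if not (body.startswith("{") and body.endswith("}")):
        raise ValueError("expected a literal wrapped in braces")
    body = body[1:-1]
    items, current, in_quotes = [], [], False
    i = 0
    while i < len(body):
        ch = body[i]
        if in_quotes:
            if ch == "\\" and i + 1 < len(body):
                # A backslash escapes the next character inside quotes.
                current.append(body[i + 1])
                i += 1
            elif ch == '"':
                in_quotes = False
            else:
                current.append(ch)
        elif ch == '"':
            in_quotes = True
        elif ch == ",":
            # An unquoted comma ends the current element.
            items.append("".join(current))
            current = []
        else:
            current.append(ch)
        i += 1
    if body:
        items.append("".join(current))
    return items


words_pg = parse_postgres_array('{Mutations,in,BRCA1,cause,"breast cancer"}')
words_json = json.loads('["Mutations", "in", "BRCA1", "cause", "breast cancer"]')
print(" ".join(words_pg))
```

Both calls yield the same Python list, which is the form a renderer would need before joining the words into a sentence.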

Highlighting words

When presenting the sentence, certain words can be highlighted, i.e., styled differently. To do so, add a mindtagger-highlight-words directive under the mindtagger-word-array directive, specifying the word indexes and the desired CSS style. For example, if you have a column named mention_pos that holds an array of integer indexes of the words whose background should be highlighted in yellow, you can write as follows:

<span mindtagger-word-array="...">
  <mindtagger-highlight-words
   index-array="item.mention_pos"
   array-format="postgres"
   with-style="background-color: yellow;" />
</span>

There are several ways for specifying the word indexes:

from and to for a contiguous range of words by the beginning and ending indexes, e.g.:
```
<mindtagger-highlight-words from="4" to="12" ...>
```
from and length for a contiguous range of words by the beginning index and the number of words, e.g.:
```
<mindtagger-highlight-words from="4" length="9" ...>
```
froms and tos that hold arrays of beginning and ending indexes for multiple contiguous ranges of words. The two arrays should have the same length. Here's an example:
```
<mindtagger-highlight-words froms="[4,20,25]" tos="[12,21,29]" ...>
```
froms and lengths that hold arrays of beginning indexes and lengths for multiple contiguous ranges of words. The two arrays should have the same length. Here's an example:
```
<mindtagger-highlight-words froms="[4,20,25]" lengths="[9,2,5]" ...>
```
index-array for an array of possibly non-contiguous words, e.g.:
```
<mindtagger-highlight-words
 index-array="'{4,5,6,7,8,9,10,11,12}'"
 array-format="postgres" ...>
```
An optional array-format attribute can hint at how to parse the serialized string.

index-arrays that holds arrays of arrays of word indexes for multiple non-contiguous ranges of words, e.g.:

<mindtagger-highlight-words
 index-arrays="'{{4,5,6,7,8,9,10,11,12},{20,21},{25,27,29}}'"
 array-format="postgres" ...>
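
The alternatives above all describe the same thing: a set of word indexes. As a sketch of how they relate, the following Python function expands each form into an explicit index list. The function name and keyword arguments are hypothetical (`from_` stands in for `from`, which is a Python keyword), and treating `to` as inclusive is an inference from the examples, where from=4 to=12 and from=4 length=9 appear to cover the same nine words.

```python
def expand_indexes(from_=None, to=None, length=None,
                   froms=None, tos=None, lengths=None,
                   index_array=None, index_arrays=None):
    """Expand one highlight specification into a sorted list of word indexes."""
    indexes = set()
    if from_ is not None:
        # A single range is handled as a one-element list of ranges.
        froms = [from_]
        tos = [to] if to is not None else None
        lengths = [length] if length is not None else None
    if froms is not None:
        if tos is not None:
            if len(froms) != len(tos):
                raise ValueError("froms and tos should have the same length")
            for start, end in zip(froms, tos):
                # Assumption: the ending index is inclusive.
                indexes.update(range(start, end + 1))
        elif lengths is not None:
            if len(froms) != len(lengths):
                raise ValueError("froms and lengths should have the same length")
            for start, count in zip(froms, lengths):
                indexes.update(range(start, start + count))
        else:
            raise ValueError("a range needs an ending index or a length")
    if index_array is not None:
        indexes.update(index_array)
    if index_arrays is not None:
        # Group boundaries are dropped here; only the union is kept.
        for group in index_arrays:
            indexes.update(group)
    return sorted(indexes)


by_ends = expand_indexes(froms=[4, 20, 25], tos=[12, 21, 29])
by_lengths = expand_indexes(froms=[4, 20, 25], lengths=[9, 2, 5])
print(by_ends == by_lengths)
```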

Making words selectable

TBD