Alternative annotation file formats - UCDenver-ccp/CRAFT GitHub Wiki

Previous releases of CRAFT included a variety of file formats for the different annotation types. In order to reduce the overall size of the distribution, redundant annotation files have been removed as of CRAFT v3.1. In their place is machinery that can produce alternative annotation file formats at the request of the user. Instructions for creating alternative file formats are below.

System requirements (please install before proceeding)

  • Java 8 (or later)
  • Clojure Boot (see the Boot installation instructions)

Important: run boot from the base directory of the distribution

Clojure Boot is a command line utility. It makes use of the build.boot file that is in the base directory of the CRAFT distribution. Make sure you are in the base directory of the distribution (where the build.boot file is located) when trying to run a boot command. If you see an error message containing java.lang.IllegalArgumentException: No such task ([TASK_NAME]) then you may not be in the correct directory, or you may have a typo in [TASK_NAME].
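
A guard along the following lines can catch the wrong-directory case before boot is invoked. This is only a minimal sketch: the function name in_craft_base_dir is hypothetical and not part of the CRAFT distribution, and the only thing it checks is that a build.boot file exists in the given directory.

```shell
#!/bin/sh
# Hypothetical helper (not part of CRAFT): succeeds only if the given
# directory (default: the current directory) contains a build.boot file.
in_craft_base_dir() {
    [ -f "${1:-.}/build.boot" ]
}

# Report whether boot tasks should be resolvable from the current directory.
if in_craft_base_dir; then
    echo "build.boot found: boot tasks can be run from this directory"
else
    echo "build.boot not found: change to the base directory of the CRAFT distribution first"
fi
```

The check does not rule out a typo in the task name, which produces the same "No such task" error.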

API

Clojure Boot is a build utility that provides a straightforward interface for chaining various tasks together. Below are examples demonstrating tasks that a user might perform. In general, boot commands for the CRAFT distribution take the following form:

boot [ANNOTATION_TYPE]+ [ANNOTATION_TYPE_PARAMS]* [ACTION] [ACTION_PARAMS]*

where + signifies 1 or more and * signifies 0 or more.
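
To make the general form concrete, the sketch below assembles a command line from its three parts and prints it without running boot. The function name compose_boot_command is a hypothetical illustration; the two commands it prints are the ones used in the examples on this page.

```shell
#!/bin/sh
# Hypothetical helper: print a boot command assembled from its parts.
#   $1: one or more annotation types, each with its own parameters
#   $2: the action
#   $3: action parameters (optional)
compose_boot_command() {
    printf 'boot %s %s' "$1" "$2"
    if [ -n "${3:-}" ]; then
        printf ' %s' "$3"
    fi
    printf '\n'
}

# One annotation type, an action, and no action parameters.
compose_boot_command "dependency" "formats?"

# Two annotation types (the second with -x), then an action with parameters.
compose_boot_command "concept -t CL concept -t PR -x" "convert" "-k -o /tmp/cl+pr/knowtator-2"
```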

Annotation Types

| Annotation type | Description |
| --- | --- |
| all-concepts | Indicates that all concept types should be used. If the -x parameter is set, the extension class annotations are used. |
| concept | Indicates that a single concept type should be used. Use -t to specify the type (one of CHEBI, CL, GO_BP, GO_CC, GO_MF, MOP, NCBITaxon, PR, SO, or UBERON) and -x if extension class annotations are desired. Note: concept types are case-sensitive. |
| dependency | Indicates that dependency parse annotations and relations should be included. |
| document-section | Indicates that document section boundary annotations should be included. Typography annotations (e.g. italic, bold) will also be included. |
| coreference | Indicates that coreference identity and apposition annotations should be included. |
| part-of-speech | Indicates that tokens with part-of-speech tags should be included. Sentences are also included. |
| treebank | Indicates that treebank annotations should be included. |
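
Because concept types are case-sensitive, it can help to validate a value before passing it to -t. The sketch below is one way to do that in the shell; the function name is_valid_concept_type is hypothetical, and the list of accepted values is copied from the concept description above.

```shell
#!/bin/sh
# Hypothetical helper: succeeds only for an exact, case-sensitive match
# against the concept types accepted by the -t parameter.
is_valid_concept_type() {
    case "$1" in
        CHEBI|CL|GO_BP|GO_CC|GO_MF|MOP|NCBITaxon|PR|SO|UBERON) return 0 ;;
        *) return 1 ;;
    esac
}

# Lower-cased variants are rejected because matching is case-sensitive.
for t in GO_BP go_bp NCBITaxon; do
    if is_valid_concept_type "$t"; then
        echo "$t: valid"
    else
        echo "$t: not a valid concept type"
    fi
done
```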

Actions

| Action | Description |
| --- | --- |
| formats? | Requires at least one preceding annotation type in the command. This command outputs the formats to which the preceding annotation type(s) can be converted. |
| convert | Requires at least one preceding annotation type in the command. This command converts the preceding annotation type(s) to a user-specified annotation file format. Use the -o parameter to specify the output directory. If more than one annotation type is specified, the -o parameter must be set; if only a single annotation type is specified, -o is optional, and when it is not set a directory is created in the same directory that contains the native file format. Use the -n parameter to name the output files using the PubMed Central identifier instead of the default PubMed identifier. See the table below for the target file format parameters. |
| knowtator-project | Requires at least one preceding annotation type in the command. This command creates the directory structure required for a Knowtator2 annotation project. Use the -o parameter to specify the directory where the Knowtator2 project files will be written. See the Knowtator2 annotation project creation wiki page for details. |
| treebank-to-dependency | This command does not require preceding annotation types, and will ignore any that are present. This command automatically derives dependency parse files from the CRAFT treebank data. For details, see the Dependency derivation from treebank data wiki page. |
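
The rule for when convert needs -o can be written down as a small pre-flight check. The sketch below only encodes the rule as described for the convert action; check_output_dir_rule is a hypothetical name, and boot itself is not called.

```shell
#!/bin/sh
# Hypothetical pre-flight check for the convert action.
#   $1: number of annotation types in the command
#   $2: output directory passed via -o (empty if -o is not set)
check_output_dir_rule() {
    if [ "$1" -gt 1 ] && [ -z "${2:-}" ]; then
        echo "error: -o must be set when more than one annotation type is specified" >&2
        return 1
    fi
    echo "ok"
}

check_output_dir_rule 1 ""                         # single type: -o is optional
check_output_dir_rule 2 "/tmp/cl+pr/knowtator-2"   # two types with -o set
check_output_dir_rule 2 "" || echo "rejected: two types without -o"
```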

File format parameters for convert

| Parameter (short form, long form) | Description |
| --- | --- |
| -b, --bionlp | output BioNLP format |
| -r, --brat | output BRAT format |
| -i, --conll-coref-ident | output identity chains in CoNLL Coreference 2011/12 format |
| -k, --knowtator2 | output Knowtator2 format |
| -p, --pubannotation | output PubAnnotation format |
| -m, --uima | output UIMA format |
| -s, --sentence | output one sentence per line |
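
For scripting, the short and long forms above can be kept in a single lookup. The sketch below maps a long option name (without the leading dashes) to its short flag; format_flag is a hypothetical helper, and the mapping is taken directly from the parameter list above.

```shell
#!/bin/sh
# Hypothetical helper: print the short flag for a convert format,
# given the long option name without its leading dashes.
format_flag() {
    case "$1" in
        bionlp)            echo "-b" ;;
        brat)              echo "-r" ;;
        conll-coref-ident) echo "-i" ;;
        knowtator2)        echo "-k" ;;
        pubannotation)     echo "-p" ;;
        uima)              echo "-m" ;;
        sentence)          echo "-s" ;;
        *) echo "unknown format: $1" >&2; return 1 ;;
    esac
}

# Print (but do not run) a convert command that uses the short flag.
echo "boot dependency convert $(format_flag brat)"
```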

Examples

Example format

Note: The examples below show:

[BOOT_COMMAND]

[RESULTS_OF_BOOT_COMMAND]

Example: To what alternative formats can dependency annotations be converted?

boot dependency formats?

Annotation type: :dependency can be converted to the following formats: :uima,:bionlp,:brat,:knowtator2,:pubannotation
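
If the list of formats is needed in a script, the output line can be split into one format per line. The sketch below assumes the output has exactly the shape shown above (a single line ending in a comma-separated list of :keywords); list_formats is a hypothetical helper, and the sample line is copied from this example rather than produced by running boot.

```shell
#!/bin/sh
# Hypothetical helper: read "formats?" output on stdin and print one
# format name per line, without the leading colon.
list_formats() {
    sed -n 's/^.*can be converted to the following formats: *//p' \
        | tr ',' '\n' \
        | sed 's/^ *://'
}

# Sample line copied from the example output above.
sample='Annotation type: :dependency can be converted to the following formats: :uima,:bionlp,:brat,:knowtator2,:pubannotation'
printf '%s\n' "$sample" | list_formats
```

In real use the sample would be replaced by piping the output of boot dependency formats? into list_formats. Quoting the task name as 'formats?' avoids the shell treating ? as a filename wildcard.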

Example: Convert dependency annotations to the bionlp format

boot dependency convert --bionlp

converting (:dependency) annotations to :bionlp ...
output directory: /path/to/distribution/CRAFT.git/structural-annotation/dependency/bionlp

Example: Convert dependency annotations to the bionlp format using a user-specified output directory

boot dependency convert --bionlp -o /tmp/dependency/bionlp

converting (:dependency) annotations to :bionlp ...
output directory: /tmp/dependency/bionlp
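
To produce several formats, one option is to issue one convert command per format, each with its own output directory. The sketch below only prints the commands so they can be reviewed first; print_convert_commands is a hypothetical helper, and whether convert accepts more than one format flag in a single invocation is not covered here, so separate invocations are assumed.

```shell
#!/bin/sh
# Hypothetical helper: print one convert command per requested format
# for the dependency annotations, using <base>/<format> as output directory.
#   $1: base output directory; remaining arguments: long format names
print_convert_commands() {
    out_base=$1
    shift
    for fmt in "$@"; do
        printf 'boot dependency convert --%s -o %s/%s\n' "$fmt" "$out_base" "$fmt"
    done
}

# Dry run: the commands are printed, not executed.
print_convert_commands /tmp/dependency bionlp brat pubannotation
```

The first printed command matches the example above; piping the output to sh from the base directory of the distribution would execute the commands.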

Example: Show usage instructions for the concept task

boot concept -h

Indicates that concept annotations will be processed.

Options:
-h, --help Print this help info.
-t, --concept-type VAL VAL sets indicates annotation type to be processed. Must be one of CHEBI, CL, GO_BP, GO_CC, GO_MF, MOP, NCBITaxon, PR, SO, or UBERON. To indicate all concept types should be processed, use the all-concepts task instead. Note case-sensitivity in the concept types.
-x, --include-extensions indicates that extension classes should be included

Example: Merge concept annotation types CL and PR+extensions and output to knowtator2 format

boot concept -t CL concept -t PR -x convert -k -o /tmp/cl+pr/knowtator-2

converting (:PR+extensions :CL) annotations to :knowtator2 ...
output directory: /tmp/cl+pr/knowtator-2

Notes

  • The first time boot is run, it will download some dependencies and may take a while to run (minutes). Subsequent runs will be faster as the dependencies generally only need to be downloaded once.
  • The file format conversions are driven by the build.boot script that is part of the CRAFT distribution and also rely on a separate code base; consult both if you are interested in the underlying details of the conversion.