Alternative annotation file formats - UCDenver-ccp/CRAFT GitHub Wiki
Previous releases of CRAFT included a variety of file formats for the different annotation types. In order to reduce the overall size of the distribution, redundant annotation files have been removed as of CRAFT v3.1. In their place is machinery that can produce alternative annotation file formats at the request of the user. Instructions for creating alternative file formats are below.
- Java 8 (or better)
- Clojure Boot (see installation instructions here)
Clojure Boot is a command line utility. It makes use of the `build.boot` file that is in the base directory of the CRAFT distribution. Make sure you are in the base directory of the distribution (where the `build.boot` file is located) when trying to run a `boot` command. If you see an error message containing `java.lang.IllegalArgumentException: No such task ([TASK_NAME])` then you may not be in the correct directory, or you may have a typo in `[TASK_NAME]`.
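The directory check described above can be automated before invoking `boot`. The following is a minimal sketch and is not part of the CRAFT distribution. The helper name `is_craft_base_dir` is hypothetical, and it only tests for the presence of a `build.boot` file.

```python
from pathlib import Path


def is_craft_base_dir(directory="."):
    """Return True if `directory` contains a build.boot file.

    Hypothetical helper: it only checks that the file exists, not that
    it is the build.boot shipped with the CRAFT distribution.
    """
    return (Path(directory) / "build.boot").is_file()


if is_craft_base_dir():
    print("build.boot found; boot tasks should resolve from here")
else:
    print("build.boot not found; change to the CRAFT base directory first")
```

Running such a check first can help distinguish a wrong working directory from a mistyped task name when the `No such task` error appears.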
Clojure Boot is a build utility that provides a straightforward interface for chaining various tasks together. Below are examples demonstrating tasks that a user might perform. In general, `boot` commands for the CRAFT distribution take the following form:
`boot [ANNOTATION_TYPE]+ [ANNOTATION_TYPE_PARAMS]* [ACTION] [ACTION_PARAMS]*`
where `+` signifies 1 or more and `*` signifies 0 or more.
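For scripted use, the general command form can be assembled programmatically. The sketch below is illustrative only: `build_boot_command` is a hypothetical helper, not something provided by CRAFT or Boot, and it covers only the general form (one or more annotation types followed by a single action), so actions that need no annotation type, such as `treebank-to-dependency`, are outside its scope.

```python
import shlex


def build_boot_command(annotation_tasks, action, action_params=()):
    """Assemble a boot command line as an argument list.

    annotation_tasks: list of (task_name, [params]) pairs; at least one.
    action:           the action task name, e.g. "convert".
    action_params:    parameters for the action, e.g. ["--bionlp"].
    """
    if not annotation_tasks:
        raise ValueError("at least one annotation type is required")
    args = ["boot"]
    for name, params in annotation_tasks:
        args.append(name)      # [ANNOTATION_TYPE]
        args.extend(params)    # [ANNOTATION_TYPE_PARAMS]*
    args.append(action)        # [ACTION]
    args.extend(action_params) # [ACTION_PARAMS]*
    return args


cmd = build_boot_command(
    [("dependency", [])],
    "convert",
    ["--bionlp", "-o", "/tmp/dependency/bionlp"],
)
# shlex.join quotes any argument that the shell would otherwise interpret.
print(shlex.join(cmd))
```

Keeping the command as a list of arguments, rather than a single string, avoids shell quoting problems if the list is later passed to something like `subprocess.run`.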
Annotation type | Description |
---|---|
all-concepts | indicates that all concept types should be used. If the -x parameter is set, then the extension class annotations are used |
concept | indicates a concept type should be used. Use -t to specify the type (one of CHEBI, CL, GO_BP, GO_CC, GO_MF, MOP, NCBITaxon, PR, SO, or UBERON) and -x if extension class annotations are desired. Note: concept types are case-sensitive. |
dependency | indicates that dependency parse annotations and relations should be included |
document-section | indicates that document section boundary annotations should be included. Typography annotations (e.g. italic and bold) will also be included. |
coreference | indicates that coreference identity and apposition annotations should be included |
part-of-speech | indicates that tokens with part-of-speech tags should be included. Sentences are also included. |
treebank | indicates that treebank annotations should be included |
Action | Description |
---|---|
formats? | Requires at least one preceding annotation type in the command. This command outputs the formats to which the preceding annotation type(s) can be converted. |
convert | Requires at least one preceding annotation type in the command. This command converts the preceding annotation type(s) to a user-specified annotation file format. Use the -o parameter to specify the output directory. If more than one annotation type has been specified, the -o parameter must be set; if just a single annotation type is specified, setting -o is optional. If -o is not set, a directory will be created in the same directory that contains the native file format. Use the -n parameter to name the output files using the PubMed Central identifier instead of the default PubMed identifier. See the table below indicating the target file format parameters. |
knowtator-project | Requires at least one preceding annotation type in the command. This command creates the directory structure required for a Knowtator2 annotation project. Use the -o parameter to specify the directory where the Knowtator2 project files will be written. See the Knowtator2 annotation project creation wiki page for details. |
treebank-to-dependency | This command does not require preceding annotation types, and will ignore any that are present. This command automatically derives dependency parse files from the CRAFT treebank data. For details, see the Dependency derivation from treebank data wiki page. |
Parameter (short form, long form) | Description |
---|---|
-b, --bionlp | output BioNLP format |
-r, --brat | output BRAT format |
-i, --conll-coref-ident | output identity chains in CoNLL Coreference 2011/12 format |
-k, --knowtator2 | output Knowtator2 format |
-p, --pubannotation | output PubAnnotation format |
-m, --uima | output UIMA format |
-s, --sentence | output one sentence per line |
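The rules for the `convert` action and its target-format parameters can be captured in a small validation routine. This is a sketch under stated assumptions: `convert_params` and `FORMAT_FLAGS` are hypothetical names, the flags are copied from the table above, and the count of annotation types is assumed to mean the number of annotation type tasks on the command line.

```python
# Long-form flags copied from the target-format table above.
FORMAT_FLAGS = {
    "bionlp": "--bionlp",
    "brat": "--brat",
    "conll-coref-ident": "--conll-coref-ident",
    "knowtator2": "--knowtator2",
    "pubannotation": "--pubannotation",
    "uima": "--uima",
    "sentence": "--sentence",
}


def convert_params(target_format, annotation_type_count, output_dir=None):
    """Build the parameter list for the convert action.

    Assumption: annotation_type_count is the number of annotation type
    tasks that precede convert on the command line.
    """
    if target_format not in FORMAT_FLAGS:
        raise ValueError(f"unknown target format: {target_format}")
    if annotation_type_count < 1:
        raise ValueError("convert requires at least one annotation type")
    # With more than one annotation type, -o must be given explicitly.
    if annotation_type_count > 1 and output_dir is None:
        raise ValueError("-o is required when more than one annotation type is given")
    params = [FORMAT_FLAGS[target_format]]
    if output_dir is not None:
        params += ["-o", output_dir]
    return params


print(convert_params("bionlp", 1))
print(convert_params("knowtator2", 2, "/tmp/cl+pr/knowtator-2"))
```

Not every format applies to every annotation type, so a script would still want to consult the `formats?` action before calling `convert`.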
Note: Each example below shows the boot command on the first line ([BOOT_COMMAND]), followed by the output of that command ([RESULTS_OF_BOOT_COMMAND]).
Example: List the formats to which dependency annotations can be converted
boot dependency formats?
Annotation type: :dependency can be converted to the following formats: :uima,:bionlp,:brat,:knowtator2,:pubannotation
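If the list of available formats is needed inside a script, the `formats?` output line can be parsed. The sketch below assumes the output has exactly the shape shown in the example above for a single annotation type; the function name `parse_formats_output` is hypothetical, and the layout of the output for multiple annotation types is not covered here.

```python
def parse_formats_output(line):
    """Extract the annotation type and format names from a `formats?` line.

    Assumes the single-type layout shown in the example above.
    """
    prefix = "Annotation type: "
    marker = " can be converted to the following formats: "
    if not line.startswith(prefix) or marker not in line:
        raise ValueError("unrecognized formats? output: " + line)
    annotation_type, formats = line[len(prefix):].split(marker, 1)
    # Names are printed as Clojure keywords, e.g. ":bionlp"; drop the colon.
    names = [f.strip().lstrip(":") for f in formats.split(",") if f.strip()]
    return annotation_type.strip().lstrip(":"), names


line = ("Annotation type: :dependency can be converted to the following "
        "formats: :uima,:bionlp,:brat,:knowtator2,:pubannotation")
print(parse_formats_output(line))
```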
Example: Convert dependency annotations to the bionlp format using the default output directory
boot dependency convert --bionlp
converting (:dependency) annotations to :bionlp ...
output directory: /path/to/distribution/CRAFT.git/structural-annotation/dependency/bionlp
Example: Convert dependency annotations to the bionlp format using a user-specified output directory
boot dependency convert --bionlp -o /tmp/dependency/bionlp
converting (:dependency) annotations to :bionlp ...
output directory: /tmp/dependency/bionlp
Example: Show the help message for the concept task
boot concept -h
Indicates that concept annotations will be processed.
Options:
-h, --help Print this help info.
-t, --concept-type VAL VAL sets indicates annotation type to be processed. Must be one of CHEBI, CL, GO_BP, GO_CC, GO_MF, MOP, NCBITaxon, PR, SO, or UBERON. To indicate all concept types should be processed, use the all-concepts task instead. Note case-sensitivity in the concept types.
-x, --include-extensions indicates that extension classes should be included
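Example: Convert CL annotations and PR annotations (including extension classes) to the Knowtator2 format
boot concept -t CL concept -t PR -x convert -k -o /tmp/cl+pr/knowtator-2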
converting (:PR+extensions :CL) annotations to :knowtator2 ...
output directory: /tmp/cl+pr/knowtator-2
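Because concept types are case-sensitive, a script that chains several `concept` tasks may want to validate the type names before calling `boot`. The following is a minimal sketch; `concept_task` and `CONCEPT_TYPES` are hypothetical names, and the list of types is copied from the `concept -h` help text above.

```python
# Concept types as listed in the help text; matching is case-sensitive.
CONCEPT_TYPES = ("CHEBI", "CL", "GO_BP", "GO_CC", "GO_MF",
                 "MOP", "NCBITaxon", "PR", "SO", "UBERON")


def concept_task(concept_type, include_extensions=False):
    """Return the arguments for one `concept` task."""
    if concept_type not in CONCEPT_TYPES:
        raise ValueError(f"unknown concept type (check case): {concept_type}")
    args = ["concept", "-t", concept_type]
    if include_extensions:
        args.append("-x")  # include extension class annotations
    return args


# Rebuild the command from the example above: CL without extensions,
# PR with extensions, converted to Knowtator2 format.
args = (["boot"]
        + concept_task("CL")
        + concept_task("PR", include_extensions=True)
        + ["convert", "-k", "-o", "/tmp/cl+pr/knowtator-2"])
print(" ".join(args))
```

Note that `-x` applies per `concept` task, which is why only PR carries extension classes in this example.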
- The first time `boot` is run, it will download some dependencies and may take a while to run (minutes). Subsequent runs will be faster, as the dependencies generally only need to be downloaded once.
- If you are interested in the underlying details of the file format conversion: along with the `build.boot` script that is part of the CRAFT distribution, the format conversions also rely on this code base.