Home - WormBase/textpresso_classifiers GitHub Wiki
Welcome to the tpclassifier wiki!
Tpclassifier is a Python library for training and applying classifiers to text documents. It is built on the Python scikit-learn library and provides a simple interface for training and using its classifiers. In addition, tpclassifier includes functions that convert PDF files and Textpresso CAS files (generated from either PDF or XML files) into text, and that simplify managing these documents to create training and test sets for the classifiers.
This wiki provides practical examples of how to use tpclassifier, especially the scripts shipped with the library. See ReadTheDocs for detailed documentation of the library's functions.
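As a rough illustration of the kind of scikit-learn workflow that tpclassifier wraps, the sketch below trains a text classifier on a few toy sentences and applies it to a new one. It uses scikit-learn directly, not the tpclassifier API; the example sentences, the labels, and the choice of TF-IDF features with logistic regression are assumptions made for illustration only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training set (hypothetical): 1 = relevant to a curation data type, 0 = not relevant.
train_texts = [
    "RNAi knockdown of the gene caused an embryonic lethal phenotype",
    "mutant animals showed a strong phenotype after RNAi treatment",
    "we describe the expression pattern of the reporter in neurons",
    "the GFP reporter expression pattern was observed in the pharynx",
]
train_labels = [1, 1, 0, 0]

# Chain feature extraction and the classifier so that both are fitted together.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(train_texts, train_labels)

# Apply the trained model to unseen text.
new_texts = ["RNAi of this gene produced a lethal phenotype"]
predictions = model.predict(new_texts)
print(predictions)
```

In practice the input text would come from converted PDF or CAS documents rather than hand-written sentences, and the training set would be far larger.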
Tpclassifier executables
tpclassifier comes with a set of executable programs that use the library's functions to provide an easy interface for training, testing, and applying classifiers to PDF or CAS documents. These programs are located in the bin directory of the project and are installed automatically when the library is installed (that is, they are added to the system PATH variable and can be run from the command line in any location).
These are the programs included in the library:
Wormbase
This section is specific to Wormbase curation classifiers.
Wormbase tools
The package provides a set of scripts designed specifically for Wormbase. The main one, tp_classification_pipeline.sh, runs the classification pipeline on Textpresso CAS files for all papers in Wormbase. It is an incremental script: it classifies the new papers found in the Textpresso CAS directory and generates a report of the predictions.
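The idea of an incremental run can be sketched in a few lines of Python. This is not the code of tp_classification_pipeline.sh; the one-subdirectory-per-paper layout, the JSON state file, the tab-separated report, and every function name below are hypothetical choices made to illustrate the pattern.

```python
import json
from pathlib import Path


def load_seen(state_file):
    """Return the set of paper IDs recorded by earlier runs (hypothetical state file)."""
    state_path = Path(state_file)
    if not state_path.exists():
        return set()
    return set(json.loads(state_path.read_text()))


def find_new_papers(cas_dir, state_file):
    """List paper subdirectories of cas_dir that no earlier run has processed."""
    seen = load_seen(state_file)
    current = sorted(p.name for p in Path(cas_dir).iterdir() if p.is_dir())
    return [paper_id for paper_id in current if paper_id not in seen]


def run_incremental(cas_dir, state_file, report_file, classify):
    """Classify only new papers, append predictions to the report, update the state."""
    new_papers = find_new_papers(cas_dir, state_file)
    with open(report_file, "a") as report:
        for paper_id in new_papers:
            # classify is any callable that maps a paper ID to a prediction label.
            report.write(f"{paper_id}\t{classify(paper_id)}\n")
    seen = load_seen(state_file) | set(new_papers)
    Path(state_file).write_text(json.dumps(sorted(seen)))
    return new_papers
```

Calling run_incremental twice on the same directory processes each paper once: the second call returns only the papers added since the first.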