tp_doc_classifier - TextpressoDevelopers/textpresso_classifiers GitHub Wiki

This Python 3 script trains a binary classifier on pdf or CAS files, saves it to file, and uses it to classify new documents.

To execute it, run it from the command line as shown below.

$ python3 tp_doc_classifier.py -h

The detailed documentation of the arguments that can be passed to the script is included in the help page of the program.

These are some example use cases for the program:

Train a classifier from a set of pdf files

Place the documents in a single directory with two sub-directories: "positive", containing the positive samples, and "negative", containing the negative ones. Then pass the path to that directory to the program with the option -t (train).

$ python3 tp_doc_classifier.py -t /path/to/training/dir -c /path/to/file/to/save -f pdf

Note that the trained classifier is not saved to file by default. Use the option -c to save it to the given path. The option -f specifies the file type and defaults to "pdf" when omitted. See the help page for other possible values.
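Before starting a training run, it can help to confirm that the directory follows the layout described above. The following is a minimal sketch, not part of tp_doc_classifier.py; the function name `count_training_samples` and the `.pdf` suffix check are assumptions made for illustration.

```python
from pathlib import Path


def count_training_samples(training_dir, extension=".pdf"):
    """Count the files in the 'positive' and 'negative' sub-directories.

    Hypothetical helper: it only checks the directory layout expected by
    the -t option and does not read or parse the documents themselves.
    """
    root = Path(training_dir)
    counts = {}
    for label in ("positive", "negative"):
        class_dir = root / label
        if not class_dir.is_dir():
            raise FileNotFoundError(f"missing sub-directory: {class_dir}")
        # Count only regular files with the expected extension.
        counts[label] = sum(
            1
            for path in class_dir.iterdir()
            if path.is_file() and path.suffix.lower() == extension
        )
    return counts
```

Calling `count_training_samples("/path/to/training/dir")` returns a dictionary with the number of files found for each class, or raises an error if one of the two sub-directories is missing.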

Model options

There are several parameters that can be tuned while training the classifier:

  • model type (option -m): Specify the model to use for the classifier. Several models are available. See the help page for the full list.
  • tokenizer type (option -z): Specify the tokenizer to use for feature extraction (transformation of text documents into feature vectors). Two tokenizers are currently supported: BOW for a simple bag of words and TFIDF for a Term Frequency-Inverse Document Frequency tokenizer.
  • ngram size (option -n): Use n-grams with the specified size. N-grams are features composed of multiple words that occur close to each other in the text. This option controls the size of the n-gram. If set to 1, single words are used as features.
  • feature selection (option -b): Use only the k best features according to the chi-squared feature selection method.
  • lemmatization (option -l): Whether to apply lemmatization before feature extraction.
  • include or exclude features manually (options -i and -e): Include (or exclude) a set of features (n-grams) specified in a file, one per line.
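To make the ngram size option more concrete, the sketch below builds a simple bag of word n-grams in plain Python. It only illustrates the idea and is not the program's implementation: the function name `extract_ngrams`, the lowercasing, and the whitespace tokenization are all assumptions, and the actual tokenizers may behave differently.

```python
from collections import Counter


def extract_ngrams(text, n=1):
    """Return a bag of word n-grams (feature -> count) for the given text.

    Illustrative only: tokens are produced by lowercasing and splitting
    on whitespace, which is an assumption rather than the tool's behaviour.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    tokens = text.lower().split()
    # Slide a window of n consecutive tokens over the text.
    ngrams = (" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return Counter(ngrams)
```

With `n=1` each feature is a single word. With `n=2` the text "the gene is expressed" yields the features "the gene", "gene is", and "is expressed".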

Test a classifier

With the option -T (test), the classifier is trained on a random subset of 80% of the training set and tested on the remaining 20% of the observations. The program prints the precision, recall, and accuracy of the model calculated on the test set, separated by tabs.

$ python3 tp_doc_classifier.py -t /path/to/training/dir -T
0.9\t1.0\t0.95
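For reference, the three reported metrics can be computed from true and predicted labels as in the sketch below. This uses the standard definitions for binary classification and is not taken from the program's source; the function names `evaluate` and `format_scores` are hypothetical, and returning 0.0 for an undefined precision or recall is a choice made here.

```python
def evaluate(true_labels, predicted_labels):
    """Return (precision, recall, accuracy) for binary labels 0 and 1."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("at least one observation is required")
    tp = fp = tn = fn = 0
    for truth, predicted in zip(true_labels, predicted_labels):
        if predicted == 1 and truth == 1:
            tp += 1
        elif predicted == 1 and truth == 0:
            fp += 1
        elif predicted == 0 and truth == 0:
            tn += 1
        else:
            fn += 1
    # Precision and recall are undefined when their denominator is zero;
    # this sketch returns 0.0 in that case.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / len(true_labels)
    return precision, recall, accuracy


def format_scores(scores):
    """Join the scores with tabs, mirroring the layout of the -T output."""
    return "\t".join(str(score) for score in scores)
```

For example, a test set of 20 documents with 9 true positives, 1 false positive, no false negatives, and 10 true negatives gives a precision of 0.9, a recall of 1.0, and an accuracy of 0.95.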

Use the classifier on new documents

To predict the classes of a set of documents, use the option -p (predict) and pass a path to a directory containing a set of pdf or CAS files.

$ python3 tp_doc_classifier.py -t /path/to/training/dir -p /path/to/new/docs
WBPaper00035228.pdf\t1
WBPaper00045669.pdf\t0

In this example, the directory /path/to/new/docs contains the files WBPaper00035228.pdf and WBPaper00045669.pdf. For each document to classify, the program prints a line with the file name and the predicted class, separated by a tab.

Note that in the previous example we used the program to train the classifier and predict the classes of new documents at the same time, but the prediction can also be performed from a previously saved classifier.

$ python3 tp_doc_classifier.py -c /path/to/saved/classifier.pkl -p /path/to/new/docs
WBPaper00035228.pdf\t1
WBPaper00045669.pdf\t0
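Because the prediction output is tab-separated, it is straightforward to post-process. The sketch below parses captured output into a dictionary; the function name `parse_predictions` is hypothetical, and it assumes each non-empty line has exactly the "file name, tab, class" layout shown above.

```python
def parse_predictions(output):
    """Map each file name to its predicted class (as an int).

    Assumes one 'file_name<TAB>class' pair per non-empty line.
    """
    predictions = {}
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the last tab so file names containing tabs still parse.
        file_name, label = line.rsplit("\t", 1)
        predictions[file_name] = int(label)
    return predictions
```

The text passed to `parse_predictions` could come, for instance, from redirecting the program's standard output to a file and reading it back.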

Save the set of features of the model and their score to file

The vocabulary used for feature extraction (i.e., the set of n-grams used as features) and the score of each feature (the chi-squared value obtained through feature selection, if applied) can be saved to file by using the option -v.

$ python3 tp_doc_classifier.py -c /path/to/saved/classifier.pkl -v /path/to/vocabulary/file

The saved file will contain, for each feature of the classifier, one row with the text of the feature (n-gram) and its chi-squared score, separated by a tab. If feature selection has not been applied to the classifier, all features will have a score of zero.
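A saved vocabulary file can be inspected with a few lines of Python, for example to list the highest-scoring features. This is a minimal sketch that assumes the "feature, tab, score" row layout described above and UTF-8 encoding; the function names `load_vocabulary` and `top_features` are hypothetical.

```python
def load_vocabulary(path):
    """Read a vocabulary file into a dict mapping feature -> score."""
    vocabulary = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line:
                continue
            # The score is the last tab-separated field on the row.
            feature, score = line.rsplit("\t", 1)
            vocabulary[feature] = float(score)
    return vocabulary


def top_features(vocabulary, k=10):
    """Return the k (feature, score) pairs with the highest scores."""
    ranked = sorted(vocabulary.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]
```

If feature selection was not applied, every score is zero and the ranking carries no information.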

Combining the scenarios

The scenarios described above can be combined by passing the corresponding options in a single invocation of the script.