convert_doc_to_txt - TextpressoDevelopers/textpresso_classifiers GitHub Wiki
This script converts documents from PDF or CAS format to plain text. This is helpful when the same set of documents is used to train or test classifiers more than once. Since tpclassifier converts documents to text files internally, converting them once beforehand and importing the resulting text files into the classifiers can save a lot of time.
Convert a file to txt
The script converts a single input file and prints the extracted text to standard output. To save the result to a file, redirect the output.
python3 convert_doc_to_txt.py -f pdf path/to/input/file > /path/to/output.txt
The option -f defines the input file format and can be "pdf", "cas_pdf", or "cas_xml".
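Because the script handles one file per call, converting a whole directory requires one invocation per document. The following is a minimal sketch of such a wrapper, not part of the repository: the function names, the `<input name>.txt` output naming, and the assumption that convert_doc_to_txt.py sits in the current working directory are all illustrative. It uses only the `-f` option and the stdout redirection described above.

```python
import subprocess
from pathlib import Path

# Formats accepted by -f, as listed above.
SUPPORTED_FORMATS = ("pdf", "cas_pdf", "cas_xml")


def build_command(input_path, fmt="pdf", script="convert_doc_to_txt.py"):
    """Return the argument list for one conversion (mirrors the command above)."""
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported format: {fmt!r}")
    return ["python3", script, "-f", fmt, str(input_path)]


def output_path_for(input_path, output_dir):
    """Hypothetical naming scheme: <input file name>.txt inside output_dir."""
    return Path(output_dir) / (Path(input_path).name + ".txt")


def convert_directory(input_dir, output_dir, fmt="pdf", script="convert_doc_to_txt.py"):
    """Convert every regular file in input_dir, one script call per file."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for input_path in sorted(Path(input_dir).iterdir()):
        if not input_path.is_file():
            continue
        target = output_path_for(input_path, output_dir)
        # Equivalent of the shell redirection: stdout goes to the target file.
        with open(target, "w") as out:
            subprocess.run(build_command(input_path, fmt, script), stdout=out, check=True)
        written.append(target)
    return written
```

For example, `convert_directory("path/to/pdfs", "path/to/txt", fmt="pdf")` would write one text file per document. With `check=True`, a failed conversion raises `subprocess.CalledProcessError` and stops the loop, possibly leaving a partial output file for that document.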