OcrToObjects - seandenigris/SeansPlayground GitHub Wiki

Procedure for turning OCRed scanned documents into objects, e.g. outline objects.

Input Format

HTML: seems best, because it keeps the formatting.
Plain text: the problem is that you lose all formatting. So, for example, if underlining is used to mark paragraph titles, this information will disappear.
RTF: it works, but the format seems complicated, and there doesn't seem to be any benefit over HTML.
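
To illustrate why keeping the formatting matters, here is a minimal Python sketch that pulls paragraph titles out of HTML by looking for underlined text. It assumes the OCR export marks underlining with plain `<u>` tags, which may not hold for every tool (some emit CSS styles instead). The names `UnderlineOutlineParser` and `parse_outline` are made up for this example, and only the standard library's `html.parser` is used.

```python
from html.parser import HTMLParser


class UnderlineOutlineParser(HTMLParser):
    """Collects [title, body] pairs, treating <u> text as a paragraph title.

    Assumption: underlining appears as literal <u> tags in the OCR output.
    """

    def __init__(self):
        super().__init__()
        self.sections = []          # list of [title, body] pairs
        self._in_underline = False  # True while inside a <u> element

    def handle_starttag(self, tag, attrs):
        if tag == "u":
            # Every underlined run starts a new section.
            self._in_underline = True
            self.sections.append(["", ""])

    def handle_endtag(self, tag):
        if tag == "u":
            self._in_underline = False

    def handle_data(self, data):
        if not self.sections:
            return  # text before the first title is ignored in this sketch
        index = 0 if self._in_underline else 1
        self.sections[-1][index] += data


def parse_outline(html):
    """Return a list of (title, body) tuples found in the HTML string."""
    parser = UnderlineOutlineParser()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return [(title.strip(), body.strip()) for title, body in parser.sections]


if __name__ == "__main__":
    sample = "<p><u>Scope</u> This covers A.</p><p><u>Terms</u> B applies.</p>"
    for title, body in parse_outline(sample):
        print(f"{title}: {body}")
```

With plain text input there would be nothing to key on, since the `<u>` markers are exactly what gets lost.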