Conversion Issues - tdclemens/pdf2htmlEX GitHub Wiki
Description: This document catalogs issues found when converting PDFs to HTML with pdf2htmlEX. For each problem it gives examples, an estimate of how often the problem occurs, and the steps taken to reproduce it.
Frequency (low-medium)
*SS taken from HTML converted MFT-TEST-ASSEMBLED-LINKED-RGB.pdf
*SS taken from HTML converted clifford.pdf
How it was reproduced: Have not been able to reproduce this yet using InDesign.
Frequency (medium-high)
pdf2htmlEX applies a width and a margin to spans to correct for kerning.
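The idea of correcting kerning with a width or a margin can be sketched roughly as follows. This is only an illustration: the function names `offset_span` and `render_line`, the inline `style` attribute, and the pixel units are assumptions made for this example, and the markup pdf2htmlEX actually emits may differ.

```python
def offset_span(offset_px: float) -> str:
    """Return a hypothetical HTML span that shifts the following text.

    A positive offset widens the gap with a fixed-width inline block.
    A negative offset pulls the text back with a negative left margin.
    """
    if offset_px >= 0:
        style = f"display:inline-block;width:{offset_px:.2f}px"
    else:
        style = f"display:inline-block;width:0;margin-left:{offset_px:.2f}px"
    return f'<span style="{style}"></span>'


def render_line(pieces):
    """Join text runs (str) and kerning offsets (float) into one HTML line."""
    out = []
    for piece in pieces:
        if isinstance(piece, str):
            out.append(piece)
        else:
            out.append(offset_span(piece))
    return "".join(out)


if __name__ == "__main__":
    # "AV" with a tight kerning pair: pull the V 1.5px toward the A.
    print(render_line(["A", -1.5, "V"]))
```

Because the shift is a fixed pixel value, any rounding error or mismatch between the PDF font metrics and the browser font shows up directly as a visible gap or overlap, which is consistent with the issue being more noticeable at large font sizes.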
*SS taken from HTML converted MFT-TEST-ASSEMBLED-LINKED-RGB.pdf
*SS taken from HTML converted clifford.pdf
How it was reproduced: Many different fonts were used to create text in InDesign. Less traditional fonts at large sizes exhibit the behavior most. More traditional fonts at small sizes don't seem to exhibit it at all.
Frequency (low)
This is not a problem on the Kindle.
*SS taken from HTML converted MFT-TEST-ASSEMBLED-LINKED-RGB.pdf
How it was reproduced: Two text boxes were aligned vertically in InDesign without spaces at the end of the first and the beginning of the second.
Frequency (low)
*SS taken from HTML converted MFT-TEST-ASSEMBLED-LINKED-RGB.pdf
How it was reproduced: Have not been able to reproduce this in InDesign.
Frequency (medium-high): This problem occurs every time text in the PDF is justified. Sometimes the output looks close to justified, and other times it is significantly off.
This problem occurs because we use the optimize text command-line option (--optimize-text) to remove some spans that interfere with word selection. Optimize text reduces the number of spans in a line and adjusts the letter spacing and word spacing of the entire line to account for this reduction. It's an imperfect approximation.
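To see why a single letter spacing and word spacing per line can only approximate the original layout, consider this minimal sketch. It is not pdf2htmlEX's implementation: the input format (one extra advance per character), the averaging strategy, and the names `approximate_spacing` and `max_drift` are all assumptions chosen for illustration.

```python
from statistics import mean


def approximate_spacing(text, extras):
    """Collapse per-character extra advances into two line-wide values.

    text   -- the characters of one line
    extras -- extras[i] is the extra advance (px) applied after text[i]
    Returns (letter_spacing, word_spacing) in the CSS sense: letter spacing
    follows every character, word spacing is added on top for spaces.
    """
    if len(text) != len(extras):
        raise ValueError("need one extra advance per character")
    after_letters = [e for ch, e in zip(text, extras) if ch != " "]
    after_spaces = [e for ch, e in zip(text, extras) if ch == " "]
    letter_spacing = mean(after_letters) if after_letters else 0.0
    word_spacing = mean(after_spaces) - letter_spacing if after_spaces else 0.0
    return letter_spacing, word_spacing


def max_drift(text, extras, letter_spacing, word_spacing):
    """Largest horizontal error (px) that accumulates along the line."""
    drift = worst = 0.0
    for ch, exact in zip(text, extras):
        approx = letter_spacing + (word_spacing if ch == " " else 0.0)
        drift += approx - exact
        worst = max(worst, abs(drift))
    return worst


if __name__ == "__main__":
    line = "ab cd"
    extras = [0.0, 0.0, 3.0, 1.0, 0.0]  # made-up per-character offsets
    ls, ws = approximate_spacing(line, extras)
    print(ls, ws, max_drift(line, extras, ls, ws))
```

In this sketch the averaged values preserve the total line width, but individual characters still drift from their exact positions. On a justified line that drift is what makes the right edge or the word gaps look off.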
*SS taken from HTML converted Generation Kill.pdf
*SS taken from HTML converted clifford.pdf
How it was reproduced: Created a large block of generated "Lorem Ipsum" text in InDesign. When this text was exported as a PDF and converted, it showed the justification issue.
Frequency (low)
pdf2htmlEX guesses when to insert a space in its offset spans. It guesses based on the width of a space and the kerning of characters. If a false positive occurs, a word will be broken by a space character.
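A guess of this kind can be sketched as a simple threshold test. The code below is a hypothetical illustration, not pdf2htmlEX's code: the function `rebuild_text`, the `(character, gap)` input format, and the default `threshold` of half a space width are all assumptions.

```python
def rebuild_text(glyphs, space_width, threshold=0.5):
    """Join glyphs, inserting a space wherever a gap looks like a word break.

    glyphs      -- list of (character, gap_after_px) pairs
    space_width -- width of a space in the current font, in px
    threshold   -- fraction of space_width a gap must reach to count as a space
    """
    out = []
    for ch, gap in glyphs:
        out.append(ch)
        if gap >= threshold * space_width:
            out.append(" ")
    return "".join(out).rstrip()


if __name__ == "__main__":
    # A loosely kerned "f" leaves a 2.2px gap; with a 4px space and a 0.5
    # threshold, the gap is mistaken for a word break (a false positive).
    word = [("f", 2.2), ("i", 0.1), ("r", 0.0), ("e", 0.0)]
    print(rebuild_text(word, space_width=4.0))
    # A stricter threshold keeps the word intact.
    print(rebuild_text(word, space_width=4.0, threshold=0.75))
```

Any fixed threshold trades false positives (words broken by a stray space) against false negatives (words run together), which is why the problem only shows up in some fonts and layouts.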
*SS taken from HTML converted Fire-in-My-Belly-TEST-RGB-LINKED.pdf
How it was reproduced: Have not been able to reproduce this using InDesign.
Frequency (low)
pdf2htmlEX guesses when to insert spaces between characters when it reduces spans with optimize text. It guesses based on the width of a space and the kerning of characters. When this guess produces a false positive, an extra space appears in the text output, sometimes breaking up words.
*SS taken from HTML converted GS-26-pdftk.pdf
How it was reproduced: Have not been able to reproduce this using InDesign.
Frequency (low-medium)
*SS taken from HTML converted clifford.pdf
*SS taken from HTML converted clifford.pdf
Frequency (low-medium)
*SS taken from HTML converted Minecraft.pdf
Frequency (low)
**SS taken from HTML converted
PDFs Referenced