Does Tesseract's hOCR output really contain bounding boxes and confidence levels for each character?

You've seen it: it isn't there.

So you can either modify the Tesseract source code so that its hOCR output includes the x_confs property you want, or use the ResultIterator API class to get the confidence at the character (symbol) level. If you use the API, be sure to call SetVariable("save_blob_choices", "T") after the Init method.
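
As a rough illustration of the ResultIterator route, here is a minimal sketch. It assumes the third-party tesserocr Python bindings rather than the C++ API the answer refers to, and the helper names collect_symbols and symbol_confidences are made up for this example. The import sits inside the function so that the pure helper can be used without tesserocr installed.

```python
def collect_symbols(iterator_steps, level):
    """Read text, confidence and bounding box from each iterator step.

    Values are copied out immediately, because an iterator of this kind
    may hand back the same underlying object on every step.
    """
    results = []
    for step in iterator_steps:
        results.append(
            (step.GetUTF8Text(level), step.Confidence(level), step.BoundingBox(level))
        )
    return results


def symbol_confidences(image_path):
    """Return (symbol, confidence, bbox) tuples for one image.

    Assumption: tesserocr is installed; it is not part of Tesseract itself.
    """
    from tesserocr import PyTessBaseAPI, RIL, iterate_level

    with PyTessBaseAPI() as api:  # the constructor performs Init
        api.SetVariable("save_blob_choices", "T")  # set after Init, as noted above
        api.SetImageFile(image_path)
        api.Recognize()
        steps = iterate_level(api.GetIterator(), RIL.SYMBOL)
        return collect_symbols(steps, RIL.SYMBOL)
```

Calling symbol_confidences("page.png") (a placeholder file name) would then give one tuple per recognized character, with the confidence as a percentage.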


This now seems to be available in Tesseract 4.x.

See my answer at:

https://stackoverflow.com/a/57766860/1021819

Set hocr_char_boxes to 1 in your config file. Or, at the command line, your updated command would be:

tesseract [Image name] outputbase --oem 1 -l eng --psm 8 -c hocr_char_boxes=1 hocr

Note the hocr output option, and look in that file for ..._wconf.
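
Once the hOCR file exists, the per-character entries can be pulled out with a short script. The sketch below uses only the Python standard library and assumes that each character appears as a span with class ocrx_cinfo whose title holds "x_bboxes left top right bottom; x_conf value". That layout and the SAMPLE string are assumptions for illustration, so check them against your own output file.

```python
from html.parser import HTMLParser


def parse_title(title):
    """Split an hOCR title attribute into a bounding box and a confidence."""
    info = {"bbox": None, "conf": None}
    for part in title.split(";"):
        fields = part.split()
        if not fields:
            continue
        key, values = fields[0], fields[1:]
        if key == "x_bboxes" and len(values) == 4:
            info["bbox"] = tuple(int(v) for v in values)
        elif key == "x_conf" and values:
            info["conf"] = float(values[0])
    return info


class CharInfoParser(HTMLParser):
    """Collect text, bbox and confidence for every ocrx_cinfo span."""

    def __init__(self):
        super().__init__()
        self.chars = []
        self._pending = None  # character span currently being read, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span" and attrs.get("class") == "ocrx_cinfo":
            self._pending = parse_title(attrs.get("title") or "")
            self._pending["text"] = ""

    def handle_data(self, data):
        if self._pending is not None:
            self._pending["text"] += data

    def handle_endtag(self, tag):
        if tag == "span" and self._pending is not None:
            self.chars.append(self._pending)
            self._pending = None


def extract_chars(hocr_text):
    """Return a list of dicts, one per character span, in document order."""
    parser = CharInfoParser()
    parser.feed(hocr_text)
    parser.close()
    return parser.chars


# Hypothetical fragment in the assumed layout, not copied from a real run.
SAMPLE = (
    "<span class='ocrx_word' id='word_1_1' title='bbox 36 92 70 116; x_wconf 96'>"
    "<span class='ocrx_cinfo' title='x_bboxes 36 92 53 116; x_conf 98.5'>T</span>"
    "<span class='ocrx_cinfo' title='x_bboxes 54 92 70 116; x_conf 91.25'>o</span>"
    "</span>"
)

for char in extract_chars(SAMPLE):
    print(char["text"], char["bbox"], char["conf"])
```

To run it on a real result, read outputbase.hocr into a string and pass that string to extract_chars.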

Let me know if this works for you, otherwise I'll just delete the answer.

Source: https://github.com/tesseract-ocr/tesseract/issues/1465#issuecomment-513139976