Python Tesseract can't recognize this font

Just train the engine on the ten digits and '.'; that should do it. Also, convert your image to grayscale before running OCR on it.


Training is hard, and it is not really what is needed here. Distinguishing O from 0 and l from 1 is going to be hard, no matter the script. Restricting the OCR to choose only among numerical digits greatly simplifies the problem, if the context permits it.
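
For what it's worth, here is a rough sketch of restricting recognition to digits from Python, without any training. It assumes the pytesseract wrapper and Pillow are installed, and it relies on Tesseract's `tessedit_char_whitelist` variable, whose support has varied across Tesseract versions and engine modes. The function names and the `--psm 7` setting (treat the image as a single line of text) are my own assumptions, not something from the question.

```python
def digits_only_config(allowed="0123456789.", psm=7):
    """Build a Tesseract config string that restricts output to `allowed`.

    psm=7 (single text line) is an assumption; pick the page
    segmentation mode that matches your images.
    """
    # The config string is split on whitespace before it reaches
    # Tesseract, so the character set itself must not contain any.
    if any(ch.isspace() for ch in allowed):
        raise ValueError("allowed characters must not contain whitespace")
    return f"--psm {psm} -c tessedit_char_whitelist={allowed}"


def ocr_digits(image_path, allowed="0123456789."):
    """Grayscale the image, then OCR it with the restricted character set."""
    # Imported lazily so digits_only_config() works without the OCR stack.
    from PIL import Image
    import pytesseract

    gray = Image.open(image_path).convert("L")  # "L" = 8-bit grayscale
    text = pytesseract.image_to_string(gray, config=digits_only_config(allowed))
    return text.strip()
```

Calling `ocr_digits("scan.png")` (a hypothetical file name) would then return only characters from the allowed set, so an ambiguous O or l has to come out as a digit or not at all.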

My interest in Tesseract is in processing lots of numbers from old government reports. In that case, as in the case in question, the character set will be something like '0123456789.'. Following a comment by eric_taj on 2007-03-21 in the old (SourceForge) newsgroup for Tesseract, you can modify Templates->IndexFor and Templates->ClassIdFor in classify/intproto.cpp to mask off characters that are not to be allowed. I modified that approach a bit to read the allowed character set at runtime from an environment variable, so that I can adjust the permitted set on the fly.
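
The same idea (take the permitted set from the environment at runtime) can be approximated from Python without patching the C++ source. This is only a sketch of an analogous approach, not the intproto.cpp change described above: the variable name `OCR_ALLOWED_CHARS` is hypothetical, and the restriction is passed through Tesseract's `tessedit_char_whitelist` variable rather than by masking classes inside the classifier.

```python
import os

DEFAULT_ALLOWED = "0123456789."


def whitelist_config_from_env(env_var="OCR_ALLOWED_CHARS",
                              default=DEFAULT_ALLOWED):
    """Return a Tesseract config string built from an environment variable.

    `env_var` is a hypothetical name chosen for this sketch. If it is
    unset or empty, the digit set `default` is used instead.
    """
    raw = os.environ.get(env_var, "")
    # Drop whitespace: the config string is split on whitespace later.
    allowed = "".join(ch for ch in raw if not ch.isspace())
    if not allowed:
        allowed = default
    return f"-c tessedit_char_whitelist={allowed}"
```

The resulting string can be handed to an OCR call as its config (for example the `config` argument of `pytesseract.image_to_string`), so running with `OCR_ALLOWED_CHARS=0123456789` drops the decimal point from the permitted set without touching the code.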