How to TextRecognize a screenshot of code?

If you are truly interested in this topic, you need to go deeper: TextRecognize uses Tesseract under the hood, and getting familiar with how its recognition model is trained is important for understanding how you can improve the outcome.

To keep this post from getting too long, here are the most important points:

  • Tesseract stores the information for recognizing a language in .traineddata files. You can load such a file with, e.g., Language -> File["wl.traineddata"] in the call to TextRecognize. This makes it possible to train Tesseract on specifically created training data to improve the recognition of source code.
  • During training, Tesseract creates images from the training text, so it has both the ground truth of the text and its image representation. When you train Tesseract yourself, you can keep the training images to see how the rendered text looks and whether it resembles your real input.
  • You can train Tesseract on specific fonts, which is vital for good OCR.

Improving the default TextRecognize

As said in my comment above, TextRecognize is trained for English text, not for code that contains wild variable names and many additional characters that usually don't appear in prose. However, when I trained Tesseract myself, I saw that the text in the training images was rendered pretty bold:

img

Therefore, I first tried to binarize your image myself so that it better resembles that thickness:

img = Import["https://i.stack.imgur.com/QYIYM.png"];
TextRecognize[Binarize[img, 0.9]]

Here is the result:

"we uvmstr = toLLvnIRL"c++ze", "

#include <ctre.hpp>
#include <vector>
#include <cstdio>
#include <string_view>
#include <iostream>

extern \"C\" I
bool is_date(int64_t * in, int64_t len) {
using namespace ctre::literals;
char buf[len + 1];
for (auto ii = 0; ii < len; ii++) buf[ii] = static_cast<char>(in[ii]);
buf[len] = '\\8';
const auto s = std::string(buf);
if (auto m = \"A([6-9](4))/([6-9]{1,2}+)/([6-9](1,2}+)$\"_ctre.match(s)) (
return true;

}

return false;

"1;

This is already not so bad. What is clearly missing are the square brackets [].
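Before retraining anything, it may also be worth scanning a few binarization thresholds and upscaling the image, since Tesseract tends to cope better with larger glyphs. A quick sketch (the threshold range and the scaling factor are just guesses to start from):

```mathematica
(* sketch: upscale the screenshot and scan several binarization thresholds *)
img = Import["https://i.stack.imgur.com/QYIYM.png"];
Table[
 {t, TextRecognize[Binarize[ImageResize[img, Scaled[2]], t]]},
 {t, 0.8, 0.95, 0.05}]
```

Picking the threshold whose output contains the fewest obviously broken tokens is crude, but it is a quick way to find a good preprocessing setting.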

Training Tesseract on WL data

Then I tried to train Tesseract on WL code specifically. There is a tutorial video and a wiki page that show how to do this for Tesseract and its new LSTM neural network. What I basically did was:

  • Cloned the tesseract repository
  • Built tesseract and the training tools from scratch
  • Created training text from Mathematica packages
  • Trained tesseract on the training text and the Source Code Pro font that my front end uses

I really didn't spend much time on this, because my main objective was to find out whether we can use the .traineddata file directly from Mathematica. If you follow the video, you will see that he uses a training script. I used the following:

rm -rf train/*
tesstrain.sh \
  --fonts_dir /usr/local/Wolfram/Mathematica/12.0/SystemFiles/Fonts/ \
  --fontlist 'Source Code Pro Black' \
  --langdata_dir langdata_lstm \
  --lang eng \
  --training_text training_text \
  --wordlist training_words \
  --tessdata_dir tesseract/tessdata \
  --maxpages 10 \
  --save_box_tiff \
  --output_dir train
cp train/eng.traineddata wl.traineddata

My folder structure looked like this; you can see that I also built leptonica and cloned langdata_lstm:

TesseractWL/
├── generate_training_data.sh
├── langdata_lstm
├── leptonica
├── tesseract
├── train
├── training_text
├── training_words
└── wl.traineddata

For the training data, I used a very simple approach that leaves a lot of room for improvement. I joined the packages available as AddOns into one big text file, removing empty lines and indentation and trimming each line to 80 characters. Here is a hacky version:

files = FileNames["*.m", {FileNameJoin[{$InstallationDirectory, "AddOns", 
      "Packages"}]}, Infinity];
packageCode = Function[file,
    Function[str, 
      StringTake[#, Min[80, StringLength[#]]] &[StringTrim[str]]] /@ 
     Select[StringSplit[
       Import[file, "String"],
       EndOfLine
       ], StringLength[StringTrim[#]] > 0 &]
    ] /@ files;

words = StringSplit[Import[#, "String"]] & /@ files // Flatten // DeleteDuplicates;

Export["TesseractWL/training_text", StringRiffle[Flatten[packageCode], "\n"], "String"]
Export["TesseractWL/training_words", StringRiffle[Take[words, 30000], "\n"], "String"]
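Before starting a training run, a quick sanity check on the generated file can save a failed run. This just verifies the line count and that the 80-character trimming above actually worked:

```mathematica
(* sketch: basic sanity checks on the generated training text *)
lines = StringSplit[Import["TesseractWL/training_text", "String"], "\n"];
{Length[lines], Max[StringLength /@ lines]}
(* the second number should be at most 80 if the trimming worked *)
```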

However, if you are going to pursue this further, you should look at how the official training text for English is structured.

After running generate_training_data.sh, I can use this:

TextRecognize[
  img, 
  Language -> File["TesseractWL/wl.traineddata"]
]

Using one page of the training data (first image), TextRecognize now does a pretty good job (second image), although even the default TextRecognize works quite well.
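If you want to compare the default model and the custom one more rigorously than by eyeballing two images, you can exploit the fact that the training pages come with ground truth. A hypothetical character-level accuracy based on edit distance:

```mathematica
(* sketch: normalized character accuracy of an OCR result against known ground truth *)
ocrAccuracy[truth_String, recognized_String] :=
 1. - EditDistance[truth, recognized]/Max[StringLength[truth], 1]

(* usage, assuming groundTruth holds the text of the rendered page: *)
(* ocrAccuracy[groundTruth,
     TextRecognize[page, Language -> File["TesseractWL/wl.traineddata"]]] *)
```

A score of 1 means a perfect match; running this for both models on the same pages gives a concrete number for the improvement.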

i1 i2

Conclusion

To maximize the performance of source-code recognition, one has to spend substantially more time on this topic and examine each step in the chain. The training I did only fine-tuned an already existing classifier that is based on the English language. I'm not sure this is the best way, and it should probably be discussed with one of the Tesseract developers. Maybe it would be advantageous to train a completely new language from the ground up.

As I said earlier, we trained the new LSTM neural network for Tesseract 4.0, but I'm not sure Mathematica even uses it. The approach still works because the .traineddata file stores the information for both the old classifier and the new LSTM one. At least that is how I understood it.

Therefore, if I had to work on this, I would start with plain Tesseract and work towards good source-code recognition. Once that works, you can use it from Mathematica. The Tesseract tools that form the core of TextRecognize are available in source form under

FileNames["TesseractToolsImpl.m", {$InstallationDirectory}, Infinity]
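To inspect that implementation from within Mathematica, one way (SystemOpen is just one option; any text editor works as well) is:

```mathematica
(* sketch: locate and open the Tesseract interface source shipped with Mathematica *)
SystemOpen@
 First@FileNames["TesseractToolsImpl.m", {$InstallationDirectory}, Infinity]
```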