Highlighting pattern strings in a large text

There is another way that is on my machine almost 500x faster then your solution. The idea is to look how Mathematica represents colored strings and use this directly.

When we colorize an input string by selecting text and using the Format menu, we can create something like this

Mathematica graphics

Now, press Ctrl+Shift+E to see the underlying expression.

Cell[BoxData["\"\<Hello 
  \!\(\*StyleBox[\"my\",FontColor->RGBColor[1, 0, 0]]\)
  friend\>\""], "Input"]

I have put the important part in the second like and you see, it's only an inline style-box that is used.

In your updated question, you used a list of words to highlight and for this task, there is another approach useful:

  • we create a function that takes a string and returns the same colorized string when it is in your list of words. Otherwise, it just returns the same string
  • we split your input into words and apply this function to each word
  • we rebuild all words into a string again which now contains normal text and highlighted words.

For this purpose, I use a Module that on-the-fly creates local functions that do the highlighting. This is important, because with each call to highlight you might want to provide a different list of words to highlight. Therefore, the function doHighlight needs to be rebuilt on every call.

Sounds expensive? It is not and the implementation is only a few lines long:

highlight[txt_, words_] := Module[{colorize, doHighlight},
  colorize[str_] := "\!\(\*StyleBox[\"" <> str <> 
    "\",FontColor->RGBColor[0, 0, 1]]\)";
  SetAttributes[doHighlight, {Listable}];
  (doHighlight[#] := colorize[#]) & /@ words;
  doHighlight[s_] := s;
  StringRiffle[doHighlight[StringSplit[txt]]]
]

Let's test it

Mathematica graphics

Now let us time this with the same input that Peter Roberge used. His function needed 3.7 seconds on my machine.

txt = ExampleData[{"Text", "AeneidEnglish"}];
somewords = DictionaryLookup[RegularExpression["[A-Z][a-z]+"]];

output = highlight[txt, somewords]; // AbsoluteTiming
(* {0.168501, Null} *)

And the text is highlighted as expected

enter image description here

Since you were brave enough to read until the end, let me tell you that there is one significant drawback: Mathematica has a bug and does not export colored strings to rtf correctly. At least on my machine, the text is not colorized in the final rtf.

Update

In case you really need to replace not a fixed word, but an expression you need to use StringReplace because it is possible you match more than one word (maybe a group of words). Therefore, splitting the text into words won't always work.

Nevertheless, the basic idea of my answer stays the same: We don't use Row and Style, but we inject inline string styles and transform a string into string.

The function itself becomes very easy:

highlight2[txt_, patterns_] := 
 StringReplace[txt, str : (Alternatives @@ patterns) :> 
   "\!\(\*StyleBox[\"" <> str <> "\",FontColor->RGBColor[0, 0, 1]]\)"
]

Here a short test with different kinds of patterns:

highlight2["Hello bear, what are you doing here?", 
  { "b" ~~ LetterCharacter .., 
    _ ~~ "o" ~~ _, 
    RegularExpression["[A-Z][a-z]+"], 
    "re?"
   }]

Mathematica graphics

Update to provide custom style

Providing a custom style is possible too. You can just add this as parameter and the only thing you have to do inside the function is to transform this into a string and put it at the right place.

That being said:

highlight2[txt_, patterns_] := highlight2[txt, patterns, {Blue}];
highlight2[txt_, patterns_, {style__}] := 
 StringReplace[txt, 
  str : (Alternatives @@ patterns) :> 
   "\!\(\*StyleBox[\"" <> str <> "\"," <> 
    StringRiffle[ToString /@ {style}, ", "]
    <> "]\)"]

You can now give a list of style directives as last argument. When you leave them out, then the matching text becomes blue.

highlight2["Hello bear, what are you doing here?", {"b" ~~ 
 LetterCharacter .., _ ~~ "o" ~~ _},
 {30, Red, Italic}]

Mathematica graphics


I really enjoy Mathematica when I can outsource tough algorithmic decisions to their source code- I believe this is the case here.

It appears as if your code is doing something expensive (searching and replacing) many different times.

I propose to do it all at once.

Benchmark:

txt = ExampleData[{"Text", "AeneidEnglish"}];
somewords = DictionaryLookup[RegularExpression["[A-Z][a-z]+"]];
AbsoluteTiming[txt /. Highlight[somewords, Style[#, Blue] &]][[1]]

82.7385

Set up:

txt = ExampleData[{"Text", "AeneidEnglish"}];
somewords = DictionaryLookup[RegularExpression["[A-Z][a-z]+"]];

Generate your rules:

rl = Flatten[# -> Style[#, Blue, Bold]] & /@ somewords;

Put rules on Virgil's Epic:

a = Row[{##}] & @@ StringReplace[txt , rl];

Second benchmark:

AbsoluteTiming[
  rl = Flatten[# -> Style[#, Blue, Bold]] & /@ somewords;
  a = Row[{##}] & @@ StringReplace[txt , rl];][[1]]

2.4377

Export:

Export["a.rtf", a]