Highlight every occurrence of a list of words?

Here is a solution using LuaTeX callbacks. It also uses the luacolor.lua library from the luacolor package.

First, the package luahighlight.sty:

\ProvidesPackage{luahighlight}
%\RequirePackage{luacolor}
\@ifpackageloaded{xcolor}{}{\RequirePackage{xcolor}}
\RequirePackage{luatexbase}
\RequirePackage{luacode}
\newluatexattribute\luahighlight
\begin{luacode*}
highlight = require "highlight"
luatexbase.add_to_callback("pre_linebreak_filter", highlight.callback, "highlight")
\end{luacode*}

\newcommand\highlight[2][red]{%
  \bgroup
  \color{#1}%
  \luaexec{highlight.add_word("\luatexluaescapestring{\current@color}","\luatexluaescapestring{#2}")}%
  \egroup
}

% save default document color
\luaexec{highlight.default_color("\luatexluaescapestring{\current@color}")}

% stolen from luacolor.sty
\def\luacolorProcessBox#1{%
  \luaexec{%
    oberdiek.luacolor.process(\number#1)%
  }%
}

% process a page box
\RequirePackage{atbegshi}[2011/01/30]
\AtBeginShipout{%
  \luacolorProcessBox\AtBeginShipoutBox
}
\endinput

The package provides the command \highlight, which takes one required and one optional parameter. The required parameter is the word to highlight; the optional one is the color (red by default). In the pre_linebreak_filter callback, words are collected, and when one matches, the color information is inserted.

Lua module, highlight.lua:

local M = {}

require "luacolor"

local words = {}
local chars = {}

-- get attribute allocation number and register it in luacolor
local attribute = luatexbase.attributes.luahighlight
-- local attribute = oberdiek.luacolor.getattribute
oberdiek.luacolor.setattribute(attribute)


-- make local version of luacolor.get

local get_color = oberdiek.luacolor.getvalue

-- we must save default color
local default_color 

function M.default_color(color)
  default_color = get_color(color)
end

local utflower = unicode.utf8.lower
function M.add_word(color,w)
  local w = utflower(w)
  words[w] = color
end

local utfchar = unicode.utf8.char

-- we don't want to include punctuation
local stop = {}
for _, x in ipairs {".",",","!","“","”","?"} do stop[x] = true end


-- look up node type ids by name; hard-coded numbers are LuaTeX-version specific
local glyph_id = node.id("glyph")
local glue_id = node.id("glue")

function M.callback(head)
  local curr_text = {}
  local curr_nodes = {}
  for n in node.traverse(head) do
    if n.id == glyph_id then
      local char = utfchar(n.char)
      -- exclude punctuation
      if not stop[char] then 
        local lchar = chars[char] or utflower(char)
        chars[char] = lchar
        curr_text[#curr_text+1] = lchar 
        curr_nodes[#curr_nodes+1] = n
      end
      -- set default color
      local current_color = node.has_attribute(n,attribute) or default_color
      node.set_attribute(n, attribute,current_color)
    elseif n.id == glue_id then
      -- glue (mainly a space) ends the current word
      local word = table.concat(curr_text)
      curr_text = {}
      local color = words[word]
      if color then
        -- debug output: log the matched word
        print(word)
        local colornumber = get_color(color)
        for _, x in ipairs(curr_nodes) do
          node.set_attribute(x,attribute,colornumber)
        end
      end
      curr_nodes = {}
    end
  end
  return head
end


return M

We use the pre_linebreak_filter callback to traverse the node list. We collect the glyph nodes in a table, and when we find a glue node (mainly spaces), we construct a word from the collected glyphs. The node type numbers 37 (glyph) and 10 (glue) are those of the LuaTeX version this was written for; the numbers are version-specific. Some characters, such as punctuation, are prohibited and stripped out. All characters are lowercased, so we can also detect words at the beginning of sentences, etc.

When a word is matched, we set the attribute field of its glyphs to the value under which the related color is saved in the luacolor library. Attributes are a new concept in LuaTeX; they make it possible to store information in nodes for later processing. That is what happens here: at shipout time, all pages are processed by the luacolor library, and nodes are colored according to their luahighlight attribute.

\documentclass{article}

\usepackage[pdftex]{xcolor}
\usepackage{luahighlight}
\usepackage{lipsum}

\highlight[red]{Lorem}
\highlight[green]{dolor}
\highlight[orange]{world}
\highlight[blue]{Curabitur}
\highlight[brown]{elit}
\begin{document}

\def\world{earth}
\section{Hello world}

Hello world, world? world! \textcolor{purple}{but normal colors works} too\footnote{And also footnotes, for instance. World WORLD wOrld}. Hello \world.

\lipsum[1-12]
\end{document}



Here's another with l3regex.

\documentclass{scrartcl}
\usepackage{xcolor,xparse,l3regex}
\ExplSyntaxOn
\NewDocumentCommand \texthighlight { +m } { \david_texthighlight:n { #1 } }
\cs_new_protected:Npn \david_texthighlight:n #1
 {
  \group_begin:
  \tl_set:Nn \l_tmpa_tl { #1 }
  \seq_map_inline:Nn \g_david_highlight_colors_seq
   {
    \clist_map_inline:cn { g_david_highlight_##1_clist }
     {
      \regex_replace_all:nnN { (\W)####1(\W) }
       { \1\c{textcolor}\cB\{##1\cE\}\cB\{####1\cE\}\2 } \l_tmpa_tl
     }
   }
  \tl_use:N \l_tmpa_tl
  \group_end:
 }
\seq_new:N \g_david_highlight_colors_seq
\NewDocumentCommand \addhighlighting { O{red} m }
 {
  \seq_if_in:NnF \g_david_highlight_colors_seq { #1 }
   { \seq_gput_right:Nn \g_david_highlight_colors_seq { #1 } }
  \clist_if_exist:cF { g_david_highlight_#1_clist }
   { \clist_new:c { g_david_highlight_#1_clist } }
  \clist_gput_right:cn { g_david_highlight_#1_clist } { #2 }
 }
\ExplSyntaxOff

\addhighlighting{amet,Mauris,ut,et,leo}
\addhighlighting[blue]{Phasellus,vestibulum}

\begin{document}
\texthighlight{Lorem ipsum dolor foo sit amet, bar consectetuer adipiscing
elit. Ut purus elit, vestibulum ut, placerat ac, adipiscing vitae, felis.
Curabitur dictum gravida mauris. Nam arcu libero, nonummy eget,
consectetuer id, vulputate a, magna. Donec vehicula augue eu
neque. Pellentesque habitant morbi tristique senectus et netus et
malesuada fames ac turpis egestas. Mauris ut leo. Cras viverra metus
rhoncus sem. Nulla et lectus foo vestibulum urna fringilla ultrices.
Phasellus eu tellus sit amet tortor gravida placerat. Integer sapien
est, iaculis in, pretium quis, viverra ac, nunc. Praesent eget sem
vel leo ultrices bibendum. Aenean faucibus. Morbi dolor nulla,
malesuada eu, pulvinar at, mollis ac, nulla. Curabitur auctor semper
nulla. Donec varius orci eget risus. Duis nibh mi, congue eu,
accumsan eleifend, bar sagittis quis, diam. Duis eget orci sit amet orci
dignissim rutrum.

Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Ut
purus elit, vestibulum ut, placerat ac, adipiscing vitae, felis. Curabitur
dictum gravida mauris. Nam arcu libero, nonummy eget,
consectetuer id, foo vulputate a, magna. Donec vehicula augue eu
neque. Pellentesque habitant morbi tristique senectus et netus et
malesuada fames ac turpis egestas. Mauris ut leo. Cras viverra metus
rhoncus sem. Nulla et lectus vestibulum urna fringilla ultrices.
Phasellus eu tellus sit amet tortor gravida placerat. Integer sapien
est, iaculis in, pretium quis, viverra ac, bar nunc. Praesent eget sem
vel leo ultrices bibendum. Aenean faucibus. Morbi dolor nulla,
malesuada eu, pulvinar at, mollis ac, nulla. Curabitur auctor semper
nulla. Donec varius orci eget risus. Duis nibh mi, congue eu,
accumsan eleifend, sagittis quis, diam. Duis eget orci sit amet orci
dignissim rutrum.}
\end{document}
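
One thing to watch for in the code above: the pattern (\W)word(\W) needs a non-word token on both sides, so a listed word at the very start or very end of the argument is not matched. A possible variant, given here only as a minimal sketch, uses the \b word-boundary assertion of the expl3 regex module instead, which also treats the ends of the token list as boundaries. The command name \demohighlight and its three-argument interface are made up for illustration, and I am assuming a LaTeX kernel from 2020-10 or later, so that \NewDocumentCommand and the \regex_... functions are available without loading xparse or l3regex.

```latex
% Sketch only: \demohighlight is a hypothetical name, not part of any package.
% Assumes a LaTeX kernel from 2020-10 or later (expl3 and \NewDocumentCommand built in).
\documentclass{article}
\usepackage{xcolor}
\ExplSyntaxOn
% #1 = colour name, #2 = word to highlight, #3 = text to process
\NewDocumentCommand \demohighlight { m m +m }
 {
  \group_begin:
  \tl_set:Nn \l_tmpa_tl { #3 }
  % \b matches at a word boundary; both ends of the token list count as boundaries.
  % \0 in the replacement stands for the whole match (here: the word itself).
  \regex_replace_all:nnN { \b#2\b }
   { \c{textcolor}\cB\{#1\cE\}\cB\{\0\cE\} } \l_tmpa_tl
  \tl_use:N \l_tmpa_tl
  \group_end:
 }
\ExplSyntaxOff
\begin{document}
% All three occurrences should be coloured, including the first and the last one.
\demohighlight{red}{amet}{amet, sit amet. Lorem amet}
\end{document}
```

Because nothing is captured around the word, the replacement no longer has to put the neighbouring characters back with \1 and \2. The word is inserted into the regular expression as is, so this sketch only suits plain alphanumeric words.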



This is strongly based on my answer to "How to insert a symbol to the beginning of a line for which a word appears?". However, I had to extend the logic to handle multiple color assignments. The syntax is multiple invocations of \WordsToNote{space separated list}{color}, followed by \NoteWords{multiple paragraph input}.

Macros in the input are limited to style (e.g., \textit) and size (e.g., \small) changes. Otherwise, only plain text is accepted.

As detailed in the referenced answer, I adapt my titlecaps package, which normally capitalizes the first letter of each word in its argument, with a user-specified list of exceptions. Here, instead of capitalizing the words, I leave them intact. However, I trap the user-specified word exceptions and use them to set a different color.

In this extension of that method, I had to revise two titlecaps macros: \titlecap and \seek@lcwords.

The method cannot handle word subsets (a flagged word appearing as part of a longer word), but it can ignore punctuation.

EDITED to fix a bug when a flagged word appears with punctuation, and an issue with the first word of paragraphs.

\documentclass{article}
\usepackage{titlecaps}
\makeatletter
\renewcommand\titlecap[2][P]{%
  \digest@sizes%
  \if T\converttilde\def~{ }\fi%
  \redefine@tertius%
  \get@argsC{#2}%
  \seek@lcwords{#1}%
  \if P#1%
    \redefine@primus%
    \get@argsC{#2}%
    \protected@edef\primus@argi{\argi}%
  \else%
  \fi%
  \setcounter{word@count}{0}%
  \redefine@secundus%
  \def\@thestring{}%
  \get@argsC{#2}%
  \if P#1\protected@edef\argi{\primus@argi}\fi%
  \whiledo{\value{word@count} < \narg}{%
    \addtocounter{word@count}{1}%
    \if F\csname found@word\roman{word@count}\endcsname%
      \notitle@word{\csname arg\roman{word@count}\endcsname}%
      \expandafter\protected@edef\csname%
           arg\roman{word@count}\endcsname{\@thestring}%
    \else
      \notitle@word{\csname arg\roman{word@count}\endcsname}%
      \expandafter\protected@edef\csname%
         arg\roman{word@count}\endcsname{\color{%
           \csname color\romannumeral\value{word@count}\endcsname}%
      \@thestring\color{black}{}}%
    \fi%
  }%
  \def\@thestring{}%
  \setcounter{word@count}{0}%
  \whiledo{\value{word@count} < \narg}{%
    \addtocounter{word@count}{1}%
    \ifthenelse{\value{word@count} = 1}%
   {}{\add@space}%
    \protected@edef\@thestring{\@thestring%
      \csname arg\roman{word@count}\endcsname}%
  }%
  \let~\SaveHardspace%
  \@thestring%
  \restore@sizes%
\un@define}

% SEARCH TERTIUS CONVERTED ARGUMENT FOR LOWERCASE WORDS, SET FLAG
% FOR EACH WORD (T = FOUND IN LIST, F= NOT FOUND IN LIST)
\renewcommand\seek@lcwords[1]{%
\kill@punct%
  \setcounter{word@count}{0}%
  \whiledo{\value{word@count} < \narg}{%
    \addtocounter{word@count}{1}%
    \protected@edef\current@word{%
      \csname arg\romannumeral\value{word@count}\endcsname}%
    \def\found@word{F}%
    \setcounter{lcword@index}{0}%
    \expandafter\def\csname%
            found@word\romannumeral\value{word@count}\endcsname{F}%
    \whiledo{\value{lcword@index} < \value{lc@words}}{%
      \addtocounter{lcword@index}{1}%
      \protected@edef\current@lcword{%
        \csname lcword\romannumeral\value{lcword@index}\endcsname}%
%% THE FOLLOWING THREE LINES ARE FROM DAVID CARLISLE
  \protected@edef\tmp{\noexpand\scantokens{\def\noexpand\tmp%
   {\noexpand\ifthenelse{\noexpand\equal{\current@word}{\current@lcword}}}}}%
  \tmp\ifhmode\unskip\fi\tmp
%%
      {\expandafter\def\csname%
            found@word\romannumeral\value{word@count}\endcsname{T}%
      \expandafter\protected@edef\csname color\romannumeral\value{word@count}\endcsname{%
       \csname CoLoR\csname lcword\romannumeral\value{lcword@index}\endcsname\endcsname}%
      \setcounter{lcword@index}{\value{lc@words}}%
      }%
      {}%
    }%
  }%
\if P#1\def\found@wordi{F}\fi%
\restore@punct%
}
\makeatother
\usepackage{xcolor}
\newcommand\WordsToNote[2]{\Addlcwords{#1}\edef\assignedcolor{#2}%
  \assigncolor#1 \relax\relax}
\def\assigncolor#1 #2\relax{%
  \expandafter\edef\csname CoLoR#1\endcsname{\assignedcolor}%
  \ifx\relax#2\else\assigncolor#2\relax\fi%
}
\newcommand\NoteWords[1]{\NoteWordsHelp#1\par\relax}
\long\def\NoteWordsHelp#1\par#2\relax{%
  \titlecap[p]{#1}%
  \ifx\relax#2\else\par\NoteWordsHelp#2\relax\fi%
}
\begin{document}
\WordsToNote{foo bar at}{red}
\WordsToNote{Nulla dolor nulla}{cyan}
\WordsToNote{amet est et}{orange}
\WordsToNote{Lorem Ut ut felis}{green}
\NoteWords{
\textbf{Lorem ipsum dolor foo sit amet, bar consectetuer adipiscing elit}. Ut
purus elit, vestibulum ut, placerat ac, adipiscing vitae, felis. Curabitur
dictum gravida mauris. Nam arcu libero, nonummy eget,
consectetuer id, vulputate a, magna. Donec vehicula augue eu
neque. Pellentesque habitant morbi tristique senectus et netus et
malesuada fames ac turpis egestas. Mauris ut leo. Cras viverra metus
rhoncus sem. \textit{Nulla et lectus foo} vestibulum urna fringilla ultrices.
Phasellus eu tellus sit amet tortor gravida placerat. Integer sapien
est, iaculis in, pretium quis, viverra ac, nunc. Praesent eget sem
vel leo ultrices bibendum. \scshape Aenean faucibus. Morbi dolor nulla,
malesuada eu, pulvinar at, mollis ac, nulla. Curabitur auctor semper
nulla. Donec varius orci eget risus. \upshape Duis nibh mi, congue eu,
accumsan eleifend, bar sagittis quis, diam. Duis eget orci sit amet orci
dignissim rutrum.

\textsf{Lorem ipsum dolor sit amet}, consectetuer adipiscing elit. Ut
purus elit, vestibulum ut, placerat ac, adipiscing vitae, felis. Curabitur
dictum gravida mauris. Nam arcu libero, nonummy eget,
consectetuer id, foo vulputate a, magna. Donec vehicula augue eu
neque. Pellentesque habitant morbi tristique senectus et netus et
malesuada fames ac turpis egestas. Mauris ut leo. Cras viverra metus
rhoncus sem. Nulla et lectus vestibulum urna fringilla ultrices.
Phasellus eu tellus sit amet tortor gravida placerat. Integer sapien
est, iaculis in, pretium quis, viverra ac, bar nunc. Praesent eget sem
vel leo ultrices bibendum. Aenean faucibus. Morbi dolor nulla,
malesuada eu, pulvinar at, mollis ac, nulla. Curabitur auctor semper
nulla. Donec varius orci eget risus. Duis nibh mi, congue eu,
accumsan eleifend, sagittis quis, diam. \Large Duis eget orci sit amet orci
dignissim rutrum.\normalsize
}
\end{document}
