Is direct utf8 input of combining diacritics in math mode possible with lualatex?

unicode-math does not set a \mathcode for Unicode combining accents the way it does for other Unicode characters such as math italics, so TeX looks for them in the first math font, which is Computer Modern Math Italic (cmmi10 in the log). That font does not have the accents (not in the Unicode positions at least).

But even if unicode-math did set the \mathcode, the math accent would not be positioned properly (as you already noted), because accents must be called with the \(U|XeTeX)mathaccent primitive for TeX to do its math accent positioning magic.
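To make the primitive call concrete, here is a minimal sketch for LuaLaTeX. \Umathaccent takes three numbers (class, family, slot) followed by the nucleus. The family \symoperators is an assumption: it is believed to be where unicode-math puts the main math font, so treat it as a placeholder and check your own setup. The \vec{v} next to it is only there for visual comparison.

```latex
\documentclass{minimal}
\usepackage{unicode-math}
\setmathfont{XITS Math}
\begin{document}
% \Umathaccent <class> <family> <slot> <nucleus>
% class 0 = ordinary, slot "20D7 = COMBINING RIGHT ARROW ABOVE
% \symoperators is an ASSUMED family name for the unicode-math font
$\Umathaccent 0 \symoperators "20D7 v \quad \vec{v}$
\end{document}
```

If the assumption about the family holds, both arrows should sit in the same place over the v, since the accent glyph goes through TeX's accent placement either way.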

It might be possible to make the accents active math characters and map them to the respective macros (unicode-math already does this sort of trickery to allow direct input of other Unicode characters), but this is left as an exercise to the reader (read: I don’t know how to do this and last time I tried to understand that code I was on the verge of losing my sanity).

The engine itself knows nothing about Unicode characters; it is the responsibility of the user (or macro package writer) to tell it which character is to be treated as an accent, a big operator, an opening symbol, etc., using the appropriate primitive and/or math code (otherwise things would be very inflexible).


I got it working with a lua script. Your minimal example becomes:

\documentclass{minimal}
\usepackage{unicode-math}
\setmathfont{XITS Math}
\AtBeginDocument{\directlua{require("combining_preprocessor.lua")}}
\newcommand{\⃗}[1]{\ensuremath{\vec{#1}}}
\begin{document}
$v⃗$
\end{document}

The idea is that it's difficult to make LaTeX handle a command or macro that comes after its argument, which is how Unicode combining characters work, so we would like a preprocessor to move the accent so it comes before its argument. That is, map v⃗ to \⃗{v} in a script, and then define whatever action you want \⃗ to have. (That's a backslash followed by a combining arrow, which should be printed above the backslash.)
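To see what that rewriting amounts to, the sketch below types the rewritten form by hand, so it runs without the Lua script loaded. The comments show the line as you would type it with the preprocessor active. The stacked case is my reading of the script given further down: the accent typed first ends up innermost. The macro definitions are the same kind used elsewhere in this answer.

```latex
\documentclass{minimal}
\usepackage{unicode-math}
\setmathfont{XITS Math}
\newcommand{\⃗}[1]{\ensuremath{\vec{#1}}}
\newcommand{\̂}[1]{\ensuremath{\hat{#1}}}
\begin{document}
% typed with the preprocessor: $v⃗$
$\⃗{v}$

% typed with the preprocessor: $x̂⃗$  (hat first, then arrow)
$\⃗{\̂{x}}$
\end{document}
```

Because the rewritten line is ordinary macro syntax, what each accent does is decided entirely by the \newcommand lines in the preamble.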

My Lua script handles most (all?) of the combining characters, so you just need to define what they should do in the .tex file. Several accents on the same character are possible. Example:

\documentclass{minimal}

\usepackage{unicode-math}
\setmathfont{XITS Math}

\AtBeginDocument{\directlua{require("combining_preprocessor.lua")}}

\newcommand{\̂}[1]{\ensuremath{\hat{#1}}}
\newcommand{\⃑}[1]{\ensuremath{\vec{#1}}}
\newcommand{\̱}[1]{\ensuremath{\underline{#1}}}
\newcommand{\́}[1]{\ensuremath{\acute{#1}}}

\usepackage{stackrel}
\newcommand{\᷽}[1]{\ensuremath{\stackrel[\approx]{}{#1}}}

\begin{document}

Hello

$ℂ̂$ is hat on $ℂ$, more on $ℂ̂⃑$ (stress test)

$ℂ̂ x̂$

Many combining accents on $x᷽̱̂́⃑$ is cool.

\end{document}

(My browser doesn't do the many combining characters justice here, but it looks nice in the PDF file.)

Not sure if this is the ideal way of doing things, but for what it's worth, here is combining_preprocessor.lua:

function minornil(a, b)
   if a == nil and b == nil then
      return nil
   elseif a == nil then
      return b
   elseif b == nil then
      return a
   else
      return math.min(a, b)
   end
end

function findfirstcombining(line, n)
   local a = string.find(line, "\204[\128-\191]", n)     -- From U0300,
   local b = string.find(line, "\205[\128-\175]", n)     -- to U036F.
   a = minornil(a, b)
   b = string.find(line, "\226\131[\144-\176]", n) -- U20D0 to U20F0
   a = minornil(a, b)
   b = string.find(line, "\225\183[\128-\191]", n) -- U1DC0 to U1DFF
   a = minornil(a, b)
   return a
end

function is_utf8_continuation(byte)
   -- UTF-8 continuation bytes are 0x80..0xBF, i.e. 128..191 inclusive
   return byte >= 128 and byte <= 191
end

function find_next_utf8_char(str, n)
   while str:byte(n) ~= nil and is_utf8_continuation(str:byte(n)) do
      n = n + 1
   end
   return n
end

function combining_iter(str)
   local n = 0
   return function ()
      n = (n ~= nil) and findfirstcombining(str, n + 1)
      return n
   end
end

function dobuffer(line)
   local n1 = 0
   local t = {}
   for n2 in combining_iter(line) do
      -- skip a combining mark at the very start of a line: it has no base
      if n2 > n1 and n2 > 1 then
         local n3 = n2
         repeat
            n3 = n3 - 1
         until not is_utf8_continuation(line:byte(n3))
         table.insert(t, string.sub(line, n1, n3 - 1))
         n1 = find_next_utf8_char(line, n2 + 1)
         local comb = {}
         table.insert(comb, "\\" .. string.sub(line, n2, n1 - 1) .. "{")
         table.insert(comb, string.sub(line, n3, n2 - 1) .. "}")
         n2 = findfirstcombining(line, n1)
         while n2 == n1 do
            n1 = find_next_utf8_char(line, n2 + 1)
            table.insert(comb, 1, "\\" .. line:sub(n2, n1 - 1) .. "{")
            table.insert(comb, "}")
            n2 = findfirstcombining(line, n1)
         end
         table.insert(t, table.concat(comb))
      end
   end
   table.insert(t, string.sub(line, n1))
   return table.concat(t)
end

luatexbase.add_to_callback("process_input_buffer",
                           dobuffer, "combining_preprocessor", 1)