Devanagari UTF-8 text extraction code produces a Lua error when using the HarfBuzz renderer

I don't know much about how HarfBuzz fonts are handled by Luaotfload, but I was able to find a way to get the tounicode fields, thanks to table.serialize. So my original code adapted for HarfBuzz looks like this:

\documentclass{article}
\usepackage[lmargin=0.5in,tmargin=0.5in,rmargin=0.5in,bmargin=0.5in]{geometry}
\usepackage{fontspec}
\usepackage{microtype}
\usepackage{luacode}

%\newfontscript{Devanagari}{deva,dev2}
\newfontfamily{\devanagarifam}{Noto Sans Devanagari}[Script=Devanagari, Scale=1, Renderer=HarfBuzz]
\newfontfamily{\arabicfam}{Amiri}[Script=Arabic, Scale=1, Renderer=HarfBuzz]

\begin{document}

\setbox0=\hbox{Příliš žluťoučký \textit{kůň} úpěl \hbox{ďábelské} ódy difference diffierence. \devanagarifam एक गांव -- में मोहन नाम का लड़का रहता था। उसके पिताजी एक मामूली मजदूर थे।}
\setbox1=\hbox{\arabicfam \textdir TRT  هذه المقالة عن براغ. لتصفح عناوين مشابهة، انظر براغ (توضيح).}

\begin{luacode*}
  -- local fontstyles = require "l4fontstyles"
  local char = unicode.utf8.char
  local glyph_id = node.id("glyph")
  local glue_id  = node.id("glue")
  local hlist_id = node.id("hlist")
  local vlist_id = node.id("vlist")
  local disc_id  = node.id("disc")
  local minglue  = tex.sp("0.2em")
  local usedcharacters = {}
  local identifiers = fonts.hashes.identifiers
  local fontcache = {}

  local function to_unicode_chars(uchar)
    local uchar = uchar or ""
    -- put characters into a table
    local current = {}
    -- each codepoint is encoded as 4 hex digits, so we loop over the tounicode entry and cut it into 4-character chunks
    for i= 1, string.len(uchar), 4 do
      local cchar = string.sub(uchar, i, i + 3)
      -- the codepoint is a hex string; we convert it to a number and then to a UTF-8 character
      table.insert(current,char(tonumber(cchar,16)))
    end
    return current
  end
  -- cache character lookups to speed things up
  local function get_character_from_cache(xchar, font_id)
    local current_font = fontcache[font_id] or {characters = {}}
    fontcache[font_id] = current_font -- initialize font cache for the current font if it doesn't exist
    return current_font.characters[xchar]
  end

  -- save characters to cache for faster lookup
  local function save_character_to_cache(xchar, font_id, replace)
    -- store it where get_character_from_cache looks it up
    fontcache[font_id].characters[xchar] = replace
    -- return value
    return replace
  end

  local function initialize_harfbuzz_cache(font_id, hb)
    -- save some harfbuzz tables for faster lookup
    local current_font = fontcache[font_id]
    -- the unicode data can be in two places
    -- 1. hb.shared.glyphs[glyphid].backmap
    current_font.glyphs = current_font.glyphs or hb.shared.glyphs
    -- 2. hb.shared.unicodes 
    -- it contains mapping between Unicode and glyph id
    -- we must create new table that contains reverse mapping
    if not current_font.backmap then 
      current_font.backmap = {} 
      for k,v in pairs(hb.shared.unicodes) do
        current_font.backmap[v] = k
      end
    end
    -- save it back to the font cache
    fontcache[font_id] = current_font
    return current_font.glyphs, current_font.backmap
  end

  local function get_unicode(xchar,font_id)
    -- try to load character from cache first
    local current_char = get_character_from_cache(xchar, font_id) 
    if current_char then return current_char end
    -- get tounicode for non HarfBuzz fonts
    local characters = identifiers[font_id].characters
    local uchar = characters[xchar].tounicode
    -- stop processing if tounicode exists
    if uchar then return save_character_to_cache(xchar, font_id, to_unicode_chars(uchar)) end
    -- detect if font is processed by Harfbuzz
    local hb = identifiers[font_id].hb
    -- try HarfBuzz data
    if not uchar and hb then 
      -- get glyph index of the character
      local index = characters[xchar].index
      -- load HarfBuzz tables from cache
      local glyphs, backmap = initialize_harfbuzz_cache(font_id, hb)
      -- get tounicode field from HarfBuzz glyph info
      local tounicode = glyphs[index].tounicode
      if tounicode then
        return save_character_to_cache(xchar, font_id, to_unicode_chars(tounicode))
      end
      -- if this fails, try backmap, which contains mapping between glyph index and Unicode
      local backuni = backmap[index]
      if backuni then 
        return save_character_to_cache(xchar, font_id, {char(backuni)})
      end
      -- if this fails too, discard this character
      return save_character_to_cache(xchar, font_id, {})
    end
    -- return just the original char if everything else fails
    return save_character_to_cache(xchar, font_id, {char(xchar)})
  end

  local function nodeText(n)
    -- output buffer
    local t =  {}
    for x in node.traverse(n) do
      -- glyph node
      if x.id == glyph_id then
        -- get table with characters for current node.char
        local chars = get_unicode(x.char,x.font)
        for _, current_char in ipairs(chars) do
          -- save characters to the output buffer
          table.insert(t,current_char)
        end
      -- glue node
      elseif x.id == glue_id and  node.getglue(x) > minglue then
        table.insert(t," ")
      -- discretionaries
      elseif x.id == disc_id then
        table.insert(t, nodeText(x.replace))
      -- recursively process hlist and vlist nodes
      elseif x.id == hlist_id or x.id == vlist_id then
        table.insert(t,nodeText(x.head))
      end
    end
    return table.concat(t)
  end
  local n = tex.getbox(0)
  local n1 = tex.getbox(1)
  print(nodeText(n.head))
  local f = io.open("hello.txt","w")
  f:write(nodeText(n.head))
  f:write(nodeText(n1.head))
  f:close()
\end{luacode*}

\box0

\box1
\end{document}

I've also added an Arabic sample from Wikipedia. Here is the content of hello.txt:

Příliš žluťoučký kůň úpěl ďábelské ódy difference diffierence. एक गांव -- में मोहन नाम का लड़का रहता था। उसके पताजी एक मामूली मजदूर थे।هذه المقالة عن براغ. لتصفح عناوين مشابهة، انظر براغ (توضيح).

The two important functions are these:

  local function to_unicode_chars(uchar)
    local uchar = uchar or ""
    local current = {}
    for i= 1, string.len(uchar), 4 do
      local cchar = string.sub(uchar, i, i + 3)
      table.insert(current,char(tonumber(cchar,16)))
    end
    return current
  end

The to_unicode_chars function splits tounicode entries into four-character hex chunks, which are then converted to UTF-8 characters. It also handles glyphs without a tounicode entry; in that case it just returns an empty table.
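
For instance (the hex strings here are just made-up illustrations of the format), assuming the to_unicode_chars from above is in scope:

  -- a single Latin letter: one 4-digit chunk
  print(table.concat(to_unicode_chars("0061")))          -- a
  -- a conjunct glyph mapping to three codepoints: क ् ष
  print(table.concat(to_unicode_chars("0915094D0937")))  -- क्ष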

  local function get_unicode(xchar,font_id)
    -- try to load character from cache first
    local current_char = get_character_from_cache(xchar, font_id) 
    if current_char then return current_char end
    -- get tounicode for non HarfBuzz fonts
    local characters = identifiers[font_id].characters
    local uchar = characters[xchar].tounicode
    -- stop processing if tounicode exists
    if uchar then return save_character_to_cache(xchar, font_id, to_unicode_chars(uchar)) end
    -- detect if font is processed by Harfbuzz
    local hb = identifiers[font_id].hb
    -- try HarfBuzz data
    if not uchar and hb then 
      -- get glyph index of the character
      local index = characters[xchar].index
      -- load HarfBuzz tables from cache
      local glyphs, backmap = initialize_harfbuzz_cache(font_id, hb)
      -- get tounicode field from HarfBuzz glyph info
      local tounicode = glyphs[index].tounicode
      if tounicode then
        return save_character_to_cache(xchar, font_id, to_unicode_chars(tounicode))
      end
      -- if this fails, try backmap, which contains mapping between glyph index and Unicode
      local backuni = backmap[index]
      if backuni then 
        return save_character_to_cache(xchar, font_id, {char(backuni)})
      end
      -- if this fails too, discard this character
      return save_character_to_cache(xchar, font_id, {})
    end
    -- return just the original char if everything else fails
    return save_character_to_cache(xchar, font_id, {char(xchar)})
  end

This function first tries to load Unicode data from the current font's data. If that fails, it looks into the HarfBuzz tables. Most characters have a tounicode mapping in the glyphs table. If that isn't available, it tries the unicodes table, which maps between glyph indices and Unicode codepoints. If even this fails, the character is discarded.
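
To see where this data lives, you can dump the relevant tables with table.serialize while a HarfBuzz font is active. This is only a sketch (I use font.current() and texio.write_nl here for illustration; any way of picking a font id and printing the result works):

  local tfmdata = fonts.hashes.identifiers[font.current()]
  if tfmdata and tfmdata.hb then
    -- 1. per-glyph data, indexed by glyph id; entries may carry a tounicode field
    texio.write_nl(table.serialize(tfmdata.hb.shared.glyphs))
    -- 2. mapping from Unicode codepoints to glyph ids (the reverse of what we need)
    texio.write_nl(table.serialize(tfmdata.hb.shared.unicodes))
  end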


In node mode, restoring the full text is not generally possible because you get shaped output, and shaped glyphs cannot be uniquely mapped back to the input text. You can only approximate it by using tounicode values. These map to the actual ToUnicode CMap entries in the PDF file and therefore follow their restricted model of glyph-to-Unicode mapping: every glyph is equivalent to a fixed sequence of Unicode codepoints, and these mappings are then concatenated in rendering order. (In Devanagari, for example, the i-matra ि is reordered before its consonant during shaping, so concatenating per-glyph mappings in rendering order cannot always reproduce the original codepoint order.) As you have seen, this model is not sufficient to map Devanagari glyphs back to the input text.

You can use harf mode instead to avoid the issue: harf mode is not affected by this limited model because it does not just give you a shaped list of glyphs, but additionally creates PDF marked-content ActualText entries that override the ToUnicode mappings for sequences which cannot be correctly modeled through ToUnicode. The data necessary for this mapping can be queried from Lua code using the glyph_info property. (This is an undocumented implementation detail and might change in the future.)
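
Reading that property from a glyph node is simple; here is a minimal sketch (the helper name harf_text is made up, the property name glyph_info is the undocumented detail just mentioned):

  local getproperty = node.getproperty

  -- returns the harf-mode replacement text for a glyph node, or nil if the node
  -- has no such property (e.g. in node mode); an empty string is a valid value,
  -- the text is then carried by neighbouring glyphs
  local function harf_text(n)
    local props = getproperty(n)
    return props and props.glyph_info
  end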

If you want to extract as much as possible from any text, you can combine the property-based and the ToUnicode-based approach in your Lua code:

Create a file extracttext.lua with

local type = type
local char = utf8.char
local unpack = table.unpack
local getproperty = node.getproperty
local getfont = font.getfont
local is_glyph = node.is_glyph

-- tounicode is UTF-16 in hex, so we need to handle surrogate pairs...
local utf16hex_to_utf8 do -- Untested, but should more or less work
  local l = lpeg
  local tonumber = tonumber
  local hex = l.R('09', 'af', 'AF')
  local byte = hex * hex
  local simple = byte * byte / function(s) return char(tonumber(s, 16)) end
  local surrogate = l.S'Dd' * l.C(l.R('89', 'AB', 'ab') * byte)
                  * l.S'Dd' * l.C(l.R('CF', 'cf') * byte) / function(high, low)
                      return char(0x10000 + ((tonumber(high, 16) & 0x3FF) << 10 | (tonumber(low, 16) & 0x3FF)))
                    end
  utf16hex_to_utf8 = l.Cs((surrogate + simple)^0)
end
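-- A quick sanity check of the pattern above (example values only):
--   utf16hex_to_utf8:match("0041")     --> "A"
--   utf16hex_to_utf8:match("D835DC00") --> U+1D400, encoded as the surrogate pair D835 DC00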

-- First the non-harf case

-- Standard caching setup
local identity_table = setmetatable({}, {__index = function(_, id) return char(id) end})
local cached_text = setmetatable({}, {__index = function(t, fid)
  local fontdir = getfont(fid)
  local characters = fontdir and fontdir.tounicode == 1 and fontdir.characters
  local font_cache = characters and setmetatable({}, {__index = function(tt, slot)
    local character = characters[slot]
    local text = character and character.tounicode or slot
    -- At this point we have the tounicode value in text. This can have different forms.
    -- The order in the if ... elseif chain is based on how likely it is to encounter them.
    -- This is a small performance optimization.
    local t = type(text)
    if t == 'string' then
      text = utf16hex_to_utf8:match(text)
    elseif t == 'number' then
      text = char(text)
    elseif t == 'table' then
      text = char(unpack(text)) -- I haven't tested this case, but it should work
    end
    tt[slot] = text
    return text
  end}) or identity_table
  t[fid] = font_cache
  return font_cache
end})

-- Now the tounicode case just has to look up the value
local function from_tounicode(n)
  local slot, fid = is_glyph(n)
  return cached_text[fid][slot]
end

-- Now the traversing stuff. Nothing interesting to see here except for the
-- glyph case
local traverse = node.traverse
local glyph, glue, disc, hlist, vlist = node.id'glyph', node.id'glue', node.id'disc', node.id'hlist', node.id'vlist'
local extract_text_vlist
-- We could replace i by #t+1 but this should be slightly faster
local function real_extract_text(head, t, i)
  for n, id in traverse(head) do
    if id == glyph then
      -- First handle harf mode: Look for a glyph_info property. If that does not exist,
      -- use from_tounicode. glyph_info will sometimes/often be an empty string. That's
      -- intentional and it should *not* trigger a fallback. The actual mapping will be
      -- contained in surrounding chars.
      local props = getproperty(n)
      t[i] = props and props.glyph_info or from_tounicode(n)
      i = i + 1
    elseif id == glue then
      if n.width > 1001 then -- 1001 is arbitrary but sufficiently high to be bigger than most weird glue uses
        t[i] = ' '
        i = i + 1
      end
    elseif id == disc then
      i = real_extract_text(n.replace, t, i)
    elseif id == hlist then
      i = real_extract_text(n.head, t, i)
    elseif id == vlist then
      i = extract_text_vlist(n.head, t, i)
    end
  end
  return i
end
function extract_text_vlist(head, t, i) -- glue should not become a space here
  for n, id in traverse(head) do
    if id == hlist then
      i = real_extract_text(n.head, t, i)
    elseif id == vlist then
      i = extract_text_vlist(n.head, t, i)
    end
  end
  return i
end
return function(list)
  local t = {}
  real_extract_text(list.head, t, 1)
  return table.concat(t)
end

This can be used as a normal Lua module:

\documentclass{article}
\usepackage{fontspec}

\newfontfamily{\devharf}{Noto Sans Devanagari}[Script=Devanagari, Renderer=HarfBuzz]
\newfontfamily{\devnode}{Noto Sans Devanagari}[Script=Devanagari, Renderer=Node]

\begin{document}

% Devanagari text is at the right end of following line
% of code, you might have to scroll right to read it
\setbox0=\hbox{Příliš žluťoučký \textit{kůň} úpěl \hbox{ďábelské} ódy difference diffierence. \devharf एक गांव -- में मोहन नाम का लड़का रहता था। उसके पिताजी एक मामूली मजदूर थे।}
\setbox1=\hbox{Příliš žluťoučký \textit{kůň} úpěl \hbox{ďábelské} ódy difference diffierence. \devnode एक गांव -- में मोहन नाम का लड़का रहता था। उसके पिताजी एक मामूली मजदूर थे।}

\directlua{
  local extracttext = require'extracttext'
  local f = io.open("hello.harf.txt","w") % Can reproduce the full input text
  f:write(extracttext(tex.getbox(0)))
  f:close()

  f = io.open("hello.node.txt","w") % In node mode, we only get an approximation
  f:write(extracttext(tex.getbox(1)))
  f:close()
}

\box0
\box1
\end{document}

A more general note: As you can see, there is some work involved when it comes to getting text from a shaped list, especially in the ToUnicode case where we have to handle surrogate pairs and similar details. This is mostly because shaped text is not intended for such use. As soon as the glyph nodes are protected (i.e. subtype(n) >= 256, or equivalently not is_char(n) is true), the .char entries no longer contain Unicode values but internal identifiers, the .font entry might no longer be the value you expect, and some glyphs might not be represented as glyph nodes at all. In most cases where you want to access the actual text behind a box, and not just its visual rendering, you really want to intercept the list before it gets shaped in the first place.
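
As a rough illustration of that last point, this is how you could tell the two kinds of glyph nodes apart in a traversal (describe_glyph is only a sketch, not part of any library):

  local glyph_id = node.id'glyph'

  local function describe_glyph(n)
    if n.id ~= glyph_id then return end
    if node.is_char(n) then
      -- not yet protected: .char is still a Unicode codepoint
      return utf8.char(n.char)
    else
      -- protected: .char is an internal identifier, not a codepoint
      return string.format("[glyph %d in font %d]", n.char, n.font)
    end
  end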