XeTeX text layout strangely sensitive to spaces

I don't have the first or last font. However, Polyglossia works correctly for me. (I assume it would probably also work with just the correct font configuration, but I did it this way as this is presumably what you want in the end.)

\documentclass{article}
\usepackage{polyglossia}
\setmainlanguage{kannada}
\setotherlanguage[variant=british]{english}
\newfontfamily\kannadafont{Noto Serif Kannada}[Script=Kannada]
\newfontfamily\kannadafontsf{Noto Sans Kannada}[Script=Kannada]
\tracinglostchars=2 % https://tex.stackexchange.com/a/41235/48
\def\testtext{R ಶ್ರೀವತ್ಸ \quad Rಶ್ರೀವತ್ಸ}

\begin{document}

% \fontspec{Arial Unicode MS} \testtext

\testtext

\sffamily \testtext

% \fontspec{Kedage} \testtext

\end{document}


(Sharing what I understood as a result of all this.)

Solutions

Firstly, the solutions to the problem:

  • As @cfr's answer pointed out, I should have used [Script=Kannada] for this font, as documented in the fontspec and polyglossia manuals. And when it's used, everything works as expected: with the space or without, the whole text is rendered as appropriate for the Kannada script.
  • Additionally, we actually don't want the non-Kannada characters like the R rendered in the Kannada script: the different-script characters like R must be marked as being in a different language or at least a different font (see below for how to do this).
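As a minimal sketch of the first point, the test document below uses fontspec alone with the script given explicitly. It assumes the Noto Serif Kannada font is installed and that the file is compiled with xelatex; any other Kannada font should work the same way.

```latex
% Compile with xelatex. Assumes Noto Serif Kannada is installed.
\documentclass{article}
\usepackage{fontspec}
% Script=Kannada names the script explicitly, so the shaper does not have to guess it.
\setmainfont{Noto Serif Kannada}[Script=Kannada]
\tracinglostchars=2 % report any glyphs missing from the font in the log
\begin{document}
R ಶ್ರೀವತ್ಸ \quad Rಶ್ರೀವತ್ಸ
\end{document}
```

Removing [Script=Kannada] from the \setmainfont line should reproduce the original discrepancy between the two forms.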

So is this a bug, either in XeTeX or some library it uses? No, I'd say it's a user error. Still, the fact that everything works fine when there are spaces between words (without having to specify the script) perhaps makes this user error more likely.

Explanation

What explains the difference in behaviour depending on whether a space is present? And can this behaviour be changed in XeTeX? Here is what I found.

The library used by XeTeX for text layout, namely HarfBuzz (which is used in Firefox, Chrome, LibreOffice, etc., see What is Harfbuzz?), comes with a command-line program called hb-view which can be invoked with a font and a string of text. With it I get the following output:

  • hb-view NotoSansKannada-Regular.ttf "ಶ್ರೀ", without and with --script=knda:

    [Image: without script, no space] [Image: with script, space]

  • hb-view NotoSansKannada-Regular.ttf " ಶ್ರೀ", without and with --script=knda:

    [Image: without script, starts with space] [Image: with script, starts with space]

  • hb-view NotoSansKannada-Regular.ttf "Rಶ್ರೀ", without and with --script=knda:

    [Image: without script, starts with R no space] [Image: with script, starts with R no space]

  • hb-view NotoSansKannada-Regular.ttf "R ಶ್ರೀ", without and with --script=knda:

    [Image: without script, starts with R and space] [Image: with script, starts with R and space]

What this shows is that the output is correct if either the first non-space character is from the right script, or the script is specified explicitly.

So the behaviour seen in XeTeX (the difference between "Rಶ್ರೀ" and "R ಶ್ರೀ") is explained by the following passage from The XeTeX Companion, which @Ulrike Fischer pointed out:

XeTeX’s approach is the following:

  • the typesetting process collects runs of characters (words) whose widths are obtained via the API to the system libraries […] to determine the widths,

  • a XeTeX paragraph is a sequence of word nodes separated by glue.

Thus XeTeX’s typesetting engine places words rather than glyphs, the latter being drawn by the font rendering engine.

(The “system libraries” and “font rendering engine” above are HarfBuzz now (thanks to Khaled Hosny); they used to be ICU earlier.) So

  • with “Rಶ್ರೀವತ್ಸ”, XeTeX asks HarfBuzz to render that whole string as one unit, which fails (as seen in the hb-view experiments above) because the string does not start with a character from the desired script and we did not specify the script, while

  • with “R ಶ್ರೀವತ್ಸ”, XeTeX asks HarfBuzz separately for each of the two words, and in this case the second word is correctly rendered (even if we didn't specify the script) because it starts with a character from the correct script.

Still, it seems best not to rely on such guessing, and to specify the script explicitly.
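One way to probe the word-by-word explanation is to split the run without a visible space. The sketch below deliberately omits the Script option and inserts a zero-width kern between R and the Kannada text; a kern is a separate node, so the two sides should reach the shaper as separate runs. The family name \noscript is made up for this test, and the expected outcome is inferred from the explanation above, not an observed result.

```latex
% Compile with xelatex. Assumes Noto Serif Kannada is installed.
\documentclass{article}
\usepackage{fontspec}
% \noscript is a hypothetical name; the Script option is left out on purpose
% so that the script has to be guessed for each run.
\newfontfamily\noscript{Noto Serif Kannada}
\begin{document}
\noscript
Rಶ್ರೀವತ್ಸ            % a single run that starts with a Latin letter
\quad
R\kern0pt ಶ್ರೀವತ್ಸ   % the kern node separates R from the Kannada run
\end{document}
```

If the explanation holds, the second form would be expected to render like the spaced input "R ಶ್ರೀವತ್ಸ" while the first stays broken.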

Working with both scripts

To have both scripts work smoothly, we ought to specify that the characters like R are in a different language. We could do this by writing \textenglish{R}ಶ್ರೀವತ್ಸ instead of Rಶ್ರೀವತ್ಸ. If we don't want to change the input though, there is a way to do this using the ucharclasses package.
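A minimal sketch of the explicit-markup route, assuming the same fonts as elsewhere in this answer (Noto Serif Kannada and Georgia, both of which must be installed) and compilation with xelatex:

```latex
% Compile with xelatex. Font names are assumptions; substitute installed fonts.
\documentclass{article}
\usepackage{polyglossia}
\setmainlanguage{kannada}
\setotherlanguage{english}
\newfontfamily\kannadafont{Noto Serif Kannada}[Script=Kannada]
\newfontfamily\englishfont{Georgia} % polyglossia uses this font for English text
\begin{document}
% The Latin letter is marked as English, so it is set in \englishfont
% and the Kannada text that follows forms its own run.
\textenglish{R} ಶ್ರೀವತ್ಸ \quad \textenglish{R}ಶ್ರೀವತ್ಸ
\end{document}
```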

I wasn't able to get it to work for some reason, so I just did it manually, referring to the example in texdoc xetex and a post from the author of ucharclasses. The word-boundary class 255 in those examples is changed to 4095, which is the value newer XeTeX versions use, as mentioned in for example this answer:

\documentclass{article}
\usepackage{fontspec}
\usepackage{polyglossia}

\newfontfamily\kannadafont{Noto Serif Kannada}[Script=Kannada]
\newfontfamily\englishfont{Georgia}
\setdefaultlanguage{kannada}
\setotherlanguage{english}

\XeTeXinterchartokenstate = 1   % Enable the character classes functionality

\newXeTeXintercharclass \CharEnglish
\XeTeXcharclass `R = \CharEnglish

\XeTeXinterchartoks 0 \CharEnglish = {\selectlanguage{english}}
\XeTeXinterchartoks 4095 \CharEnglish = {\selectlanguage{english}}
\XeTeXinterchartoks \CharEnglish 0 = {\selectlanguage{kannada}}
\XeTeXinterchartoks \CharEnglish 4095 = {\selectlanguage{kannada}}

\begin{document}

R ಶ್ರೀವತ್ಸ \quad Rಶ್ರೀವತ್ಸ

\end{document}

This changes the language every time we move between an English character (only R above) and either a word boundary (4095) or a regular (not specified to be English) character (0).
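If only the font needs to change, not the language, the same transitions can switch fonts directly. The following is a sketch of that variant, not the setup used above: it needs only fontspec, and it assumes the same two fonts are installed.

```latex
% Compile with xelatex. Font names are assumptions; substitute installed fonts.
\documentclass{article}
\usepackage{fontspec}
\newfontfamily\kannadafont{Noto Serif Kannada}[Script=Kannada]
\newfontfamily\englishfont{Georgia}

\XeTeXinterchartokenstate = 1   % Enable the character classes functionality
\newXeTeXintercharclass \CharEnglish
\XeTeXcharclass `R = \CharEnglish

% Entering an English run, from a regular character (0) or a boundary (4095)
\XeTeXinterchartoks 0 \CharEnglish = {\englishfont}
\XeTeXinterchartoks 4095 \CharEnglish = {\englishfont}
% Leaving an English run: switch back to the Kannada font
\XeTeXinterchartoks \CharEnglish 0 = {\kannadafont}
\XeTeXinterchartoks \CharEnglish 4095 = {\kannadafont}

\begin{document}
\kannadafont
R ಶ್ರೀವತ್ಸ \quad Rಶ್ರೀವತ್ಸ
\end{document}
```

Because the switches are not grouped, leaving an English run always selects \kannadafont, even if another family (such as a sans-serif one) was active before the run.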

For my original document, to deal with all the English characters, I wrote a loop to do the equivalent of

\XeTeXcharclass `R = \CharEnglish

for every uppercase and lowercase letter of the alphabet:

\newcount\tmpchar
\tmpchar = `A
\loop
  \ifnum \tmpchar < `[          % [ comes just after Z
    \XeTeXcharclass \tmpchar = \CharEnglish
    \XeTeXcharclass \lccode \tmpchar = \CharEnglish
    \advance \tmpchar by 1
\repeat
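The same assignments can also be written with expl3's integer stepping function instead of a plain \loop. This is an alternative sketch, not the loop used in the original document; it assumes \CharEnglish has already been allocated with \newXeTeXintercharclass and that expl3 is available (fontspec loads it).

```latex
\ExplSyntaxOn
% Step through the code points of A..Z; #1 is the current code point in decimal.
\int_step_inline:nnn { `A } { `Z }
  {
    \XeTeXcharclass #1 = \CharEnglish          % uppercase letter
    \XeTeXcharclass \lccode #1 = \CharEnglish  % its lowercase counterpart
  }
\ExplSyntaxOff
```

Spaces are ignored inside \ExplSyntaxOn, so the = sign is what ends each number in these assignments.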
