ucharclasses misbehaves with Spacing Modifier Letters and Combining Diacritical Marks

Mixing Unicode blocks within a word is what humans do when writing; setting a font on entering (or leaving) a Unicode block is what ucharclasses does.

So English and Vietnamese can't be told apart by the block a character belongs to, since both draw on the Latin blocks. English and Old Persian can, because Old Persian has a block (and so a character class) of its own.
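As a rough illustration of that last point, a sketch along these lines would switch font only for Old Persian text. It assumes that ucharclasses accepts OldPersian as a block name (following the CamelCase block names used elsewhere in this answer) and that a font called Noto Sans Old Persian is installed; the macro name \oldpersian is made up for the example.

```latex
% Compile with XeLaTeX.
\documentclass{article}
\usepackage{fontspec}
% Assumption: the block option is spelled OldPersian.
\usepackage[OldPersian]{ucharclasses}

\setmainfont{Noto Serif}
% Assumption: this font is installed; \oldpersian is a hypothetical name.
\newfontfamily\oldpersian{Noto Sans Old Persian}[Colour=red]

% Switch font on entering the Old Persian block, switch back on leaving it.
\setTransitionsFor{OldPersian}{\oldpersian}{\normalfont}

\begin{document}
% U+103A0..U+103A2 are the first three code points of the Old Persian block.
English text \symbol{"103A0}\symbol{"103A1}\symbol{"103A2} English text
\end{document}
```

No such switch is available for Vietnamese mixed into English, because the shared letters sit in the same blocks.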

The combining diacritical marks block is a different block to the Basic Latin one, so, yes, this is possible:

[Image: abc diacritics]

and even this:

[Image: chicken]

MWE

\documentclass[12pt]{article}
\usepackage[no-math]{fontspec}
\usepackage[BasicLatin, CombiningDiacriticalMarks]{ucharclasses}
\usepackage{xcolor}



\setmainfont{Noto Serif}
\newfontfamily\fdiac[Colour=red,Scale=1.5]{Fira Sans Black}

\setTransitionTo{BasicLatin}{\normalfont}
\setTransitionTo{CombiningDiacriticalMarks}{\fdiac}

\begin{document}
\large
a a\symbol{"0302} xyẑ abc \ \ o\symbol{"0302}\symbol{"0344}o\symbol{"0302}\symbol{"0321}\symbol{"0325}\symbol{"032C}

\end{document}

"Disjoint" means that ucharclasses can produce only one output at a time, not two or more, so the sets of characters it processes must not overlap or share elements.

=== Edited to add:

These combining marks could be really useful.

[Image: baboon]

The sign for the "Hm, oh, er, um, that's a really nice..." conversation filler, as used in polite baboon social interactions among deferential individuals, say.

\documentclass[12pt]{article}
\usepackage[no-math]{fontspec}
\usepackage[BasicLatin, CombiningDiacriticalMarks]{ucharclasses}
\usepackage{xcolor}



\setmainfont{Noto Serif}
\newfontfamily\fdiac[Colour=red,Scale=1.5]{Fira Sans Black}
\newfontfamily\fdiacb[Colour=blue,Scale=2.5]{Gentium Plus}


\setTransitionTo{BasicLatin}{\normalfont}
\setTransitionTo{CombiningDiacriticalMarks}{\fdiac}

\begin{document}
\large
 (o\symbol{"0302}\symbol{"032B}{\let\fdiac\fdiacb\symbol{"0308}\symbol{"036A}}o\symbol{"0302}\symbol{"0321}\symbol{"0325}\symbol{"032C})

\end{document}

Further edit

About transitioning -

On the presumption that switching fonts requires a (sequential?) transition event, insert one, using either {} or a zero-width non-joiner, U+200C (both lie outside the relevant code blocks):

[Image: non-Latin transitions]

Diacritical mark and base character function as a unit (in a sense), font-wise, so insert a transition after the base character.

MWE

\documentclass{article}
\usepackage{xcolor}
\usepackage[Latin, Phonetics, Diacritics, SpacingModifierLetters]{ucharclasses}
\usepackage{fontspec}
\defaultfontfeatures{Scale=MatchLowercase,Mapping=tex-text}
\newfontfeature{IPA}{+mgrk}
\setmainfont[IPA]{DejaVu Sans}
\newfontfamily\dejavuserif[IPA]{DejaVu Serif}[Colour=red]

\setTransitionsFor{IPAExtensions}{\dejavuserif}{\normalfont}
\setTransitionsFor{CombiningDiacriticalMarks}{\dejavuserif}{\normalfont}
\setTransitionsFor{SpacingModifierLetters}{\dejavuserif}{\normalfont}

\newcommand\zwnj{^^^^200c}

\begin{document}

thaaw [tʰ{}ɑɑɯ] [tɑɑɯ] [tʰ{}ɑ́{}ɑɯ] [tɑ́{}ɑɯ] thaaw  
\normalfont

thaaw [t^^^^02b0\zwnj ɑɑɯ] [tɑɑɯ] [tʰ\zwnj ɑ́\zwnj ɑɯ] [tɑ́\zwnj ɑɯ] thaaw  


\end{document}

Although, keeping the units of meaning and display synchronized would be less of a cognitive load on the reader:

[Image: phonetics]

MWE

\documentclass{article}
\usepackage{xcolor}
\usepackage{fontspec}
\defaultfontfeatures{Scale=MatchLowercase,Mapping=tex-text}
\newfontfeature{IPA}{+mgrk}
\setmainfont[IPA]{DejaVu Sans}
\newfontfamily\dejavuserif[IPA]{DejaVu Serif}[Colour=red]
\newcommand\ph[1]{[{\dejavuserif #1}]}

\begin{document}

thaaw \ph{tʰɑɑɯ} \ph{tɑɑɯ} \ph{tʰɑ́ɑɯ} \ph{tɑ́ɑɯ} thaaw  

\end{document}

On the matter of stacking diacritics, the font designer's hand and choices come into play.

Some random fonts, to illustrate:

Noto Serif

[Image: Noto Serif]

Acariya

[Image: Acariya]

Ajoure

[Image: Ajoure]

Andika

[Image: Andika]

Arial

[Image: Arial]

DejaVu Serif

[Image: DejaVu Serif]
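A comparison of this kind could be produced with a sketch like the following. The test string is an assumption (the exact string behind the samples above is not recorded here), and \sample and \showfont are hypothetical helper names; the font names are the ones listed above and must be installed.

```latex
% Compile with XeLaTeX; a font that is not installed will raise an error.
\documentclass{article}
\usepackage{fontspec}
\setmainfont{DejaVu Sans}

% Hypothetical test string: o + circumflex + diaeresis + ring below.
\newcommand\sample{o\symbol{"0302}\symbol{"0308}\symbol{"0325}}

% Print the font name in the main font, then the sample in the named font.
\newcommand\showfont[1]{#1: {\fontspec{#1}\Huge\sample}\par\medskip}

\begin{document}
\showfont{Noto Serif}
\showfont{Acariya}
\showfont{Ajoure}
\showfont{Andika}
\showfont{Arial}
\showfont{DejaVu Serif}
\end{document}
```

How the marks stack (or collide) in each line then depends on the anchor data the designer put into the font.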


Looping

Hypothesis: The root cause is that counting starts from 1, and then goes upwards. Once only. So the last font-switch command put into the typesetting stream is the one that has a visible effect.

When A-block text and B-block text are typed next to each other with no separator, the transition code loops through all the blocks. If the A block is examined first, it finds that A is ending and outputs the "coming out of A block" code, then finds that B is starting and outputs the "going into B block" code.

If the A block has a higher Unicode start/end point than the B block, the loop instead finds first that B is starting and outputs the "going into B block" code, then finds that A is ending and outputs the "coming out of A block" code. The user is surprised: we have gone back to the normal font (for example).

In real life, the normal separator between blocks (in the sense of script blocks) is a space (Latin), which TeX converts to glue. Zero-width characters from the punctuation block, as above, can also act as 'separators' between other blocks (technically classes, not blocks).

Higher classes trump lower classes.

Ideally, explicitly specifying all the entry/exit pairwise combinations of code block transitions (where the code blocks are contiguous text) would cover the general case - except for cross-Unicode block text.
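For just two blocks, the explicit pairwise idea could be sketched as below. This is only an illustration under assumptions: ucharclasses is not loaded, the class names \MyCyrillicClass and \MyCopticClass are invented, the fonts are assumed to be installed, and 4095 is taken as the boundary class (as in recent XeTeX versions).

```latex
% Compile with XeLaTeX. Two hand-made classes; ucharclasses is not loaded.
\documentclass{article}
\usepackage{fontspec}
\setmainfont{DejaVu Sans}
\newfontfamily\fa{Noto Sans Coptic}[Colour=red]  % assumed installed
\newfontfamily\fb{Noto Serif}[Colour=blue]       % assumed installed

% Hypothetical class names, allocated by the LaTeX kernel.
\newXeTeXintercharclass\MyCyrillicClass
\newXeTeXintercharclass\MyCopticClass

% Put the Cyrillic block (U+0400..U+04FF) and the Coptic block
% (U+2C80..U+2CFF) into their classes.
\newcount\n
\n="0400 \loop \XeTeXcharclass\n=\MyCyrillicClass \ifnum\n<"04FF \advance\n by 1 \repeat
\n="2C80 \loop \XeTeXcharclass\n=\MyCopticClass \ifnum\n<"2CFF \advance\n by 1 \repeat

% Entering each class from ordinary text (0) or from a boundary (4095).
\XeTeXinterchartoks 0 \MyCyrillicClass = {\fb}
\XeTeXinterchartoks 4095 \MyCyrillicClass = {\fb}
\XeTeXinterchartoks 0 \MyCopticClass = {\fa}
\XeTeXinterchartoks 4095 \MyCopticClass = {\fa}

% Leaving each class for ordinary text or a boundary.
\XeTeXinterchartoks \MyCyrillicClass 0 = {\normalfont}
\XeTeXinterchartoks \MyCyrillicClass 4095 = {\normalfont}
\XeTeXinterchartoks \MyCopticClass 0 = {\normalfont}
\XeTeXinterchartoks \MyCopticClass 4095 = {\normalfont}

% The pairwise cases: exactly one command per direction, so no
% "leaving" code competes with the "entering" code.
\XeTeXinterchartoks \MyCyrillicClass \MyCopticClass = {\fa}
\XeTeXinterchartoks \MyCopticClass \MyCyrillicClass = {\fb}

\XeTeXinterchartokenstate=1

\begin{document}
АБВⲀⲁⲂАБВ

ⲀⲁⲂАБВⲀⲁⲂ
\end{document}
```

Because each ordered pair of classes has its own token list, the result no longer depends on which block has the higher code points.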

[Image: higher unicode block trumps others if no separators are used]

MWE

\documentclass{article}
\usepackage{xcolor}
\usepackage[Latin, Cyrillic, Cuneiform, Coptic]{ucharclasses}
\usepackage{fontspec}

\setmainfont{DejaVu Sans}
\newfontfamily\fa{Noto Sans Coptic}[Colour=red]
\newfontfamily\fb{Noto Serif}[Colour=blue]
\newfontfamily\fc{Noto Sans Cuneiform}[Colour=green]

\setTransitionsFor{Coptic}{\fa}{\normalfont}
\setTransitionsFor{Cyrillic}{\fb}{\normalfont}
\setTransitionsFor{Cuneiform}{\fc}{\normalfont}

\newcommand\zwnj{^^^^200c}

\begin{document}
ⲀⲁⲂⲃⲄⲅxАБВГДЕxxⲀⲁⲂⲃⲄⲅ


ⲀⲁⲂⲃⲄⲅАБВГДЕⲀⲁⲂⲃⲄⲅ


ⲀⲁⲂⲃⲄⲅ АБВГДЕ  ⲀⲁⲂⲃⲄⲅ




xАБВГДЕxⲀⲁⲂⲃⲄⲅxxⲀⲁⲂⲃⲄⲅ


АБВГДЕⲀⲁⲂⲃⲄⲅⲀⲁⲂⲃⲄⲅ


АБВГДЕ ⲀⲁⲂⲃⲄⲅ  ⲀⲁⲂⲃⲄⲅ


АБВГДЕⲀⲁⲂⲃⲄⲅⲀⲁⲂⲃⲄⲅ


 АБВГДЕ ⲀⲁⲂⲃⲄⲅ ⲀⲁⲂⲃⲄⲅ


\end{document}

Using \XeTeXinterchartoks transitions directly

Another way to have single-point transitions, instead of multiple, is to put (in this specific case) all three code blocks -- IPAExtensions, CombiningDiacriticalMarks, and SpacingModifierLetters -- into the same class; ucharclasses is not needed.

[Image: single class triple block]

But that still leaves the semantic ambiguity that phonetic t and non-phonetic t are the same glyph.

MWE

(Code adapted from an answer by Jonathan Kew on a TUG mailing list in 2008.)

\documentclass{article}
\usepackage{xcolor}
\usepackage{fontspec}
\defaultfontfeatures{Scale=MatchLowercase,Mapping=tex-text}
\newfontfeature{IPA}{+mgrk}
\setmainfont[IPA]{DejaVu Sans}
\newfontfamily\dejavuserif[IPA]{DejaVu Serif}[Colour=red]



\newcount\n
\n=`\ɐ \loop \XeTeXcharclass \n=4 \ifnum\n<`\ʯ \advance\n by 1 \repeat
%\n=`\a \loop \XeTeXcharclass \n=4 \ifnum\n<`\z \advance\n by 1 \repeat
\n=`\ʰ \loop \XeTeXcharclass \n=4 \ifnum\n<`\˿ \advance\n by 1 \repeat
\n=`\̀ \loop \XeTeXcharclass \n=4 \ifnum\n<`\ͯ \advance\n by 1 \repeat
% when we encounter class 4, we'll do \startling
\XeTeXinterchartoks 0 4 {\startling}
\XeTeXinterchartoks 4095 4 {\startling}
% and when we encounter class 0, we'll do \finishling
\XeTeXinterchartoks 4095 0 {\finishling}
\XeTeXinterchartoks 4 0 {\finishling}
%\newif\ifling
\newcommand\startling{\dejavuserif}
\newcommand\finishling{\normalfont}
\XeTeXinterchartokenstate=1

\begin{document}


thaaw [tʰɑɑɯ] [tɑɑɯ] [tʰɑ́ɑɯ] [tɑ́ɑɯ] thaaw  


\end{document}

Edit: More on looping

Changing the sequence of the \setTransitionsFor commands affects the outcome:

[Images: results for the orderings ICS, CSI, CIS, etc.]
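To try the orderings, something like the sketch below could be used. It assumes the letters in the labels stand for IPAExtensions (I), CombiningDiacriticalMarks (C) and SpacingModifierLetters (S); that reading is a guess from the block names, not something stated by the package.

```latex
% Compile with XeLaTeX.
\documentclass{article}
\usepackage[SpacingModifierLetters,CombiningDiacriticalMarks,IPAExtensions]{ucharclasses}
\usepackage{fontspec}
\setmainfont{DejaVu Sans}
\newfontfamily\dejavuserif{DejaVu Serif}[Colour=red]

% Ordering ICS (assumed reading of the label).
\setTransitionsFor{IPAExtensions}{\dejavuserif}{\normalfont}
\setTransitionsFor{CombiningDiacriticalMarks}{\dejavuserif}{\normalfont}
\setTransitionsFor{SpacingModifierLetters}{\dejavuserif}{\normalfont}

% Ordering CSI: comment out the three lines above and enable these instead.
%\setTransitionsFor{CombiningDiacriticalMarks}{\dejavuserif}{\normalfont}
%\setTransitionsFor{SpacingModifierLetters}{\dejavuserif}{\normalfont}
%\setTransitionsFor{IPAExtensions}{\dejavuserif}{\normalfont}

\begin{document}
thaaw [tʰɑɑɯ] [tɑɑɯ] [tʰɑ́ɑɯ] [tɑ́ɑɯ] thaaw
\end{document}
```

Comparing the compiled output of the two variants shows which adjacent-block transitions end up in the serif font and which fall back to the main font.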


OK. ucharclasses wasn't designed for multiply-overlapping transitions: more of a 'into Greek, switch to Greek font; into Cyrillic, switch to a Cyrillic font; etc'.

The transitions are (leaving aside CJK matters, which take up classes 1,2,3):

(a) from/to class 0 (any glyph not defined in a class),

(b) from/to class 4095 (any non-glyph = glue, maths, boxes: collectively called 'boundary', as in word boundary; space becomes glue during typesetting, so that's why spaces are what I've been calling 'separators').

(c) any pair-wise transitions between user-defined classes (presumably 5,6,7,...)
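To check which class a given character has ended up in, the class can simply be printed. The following is a minimal sketch; it relies on the \IPAExtensionsClass name that ucharclasses provides (used further down in this answer), and the characters tested are arbitrary.

```latex
% Compile with XeLaTeX.
\documentclass{article}
\usepackage{fontspec}
\usepackage[IPAExtensions]{ucharclasses}

\begin{document}
% \XeTeXcharclass followed by a code point can be read back with \the.
Class of a (U+0061): \the\XeTeXcharclass"0061

Class of U+0250: \the\XeTeXcharclass"0250

% The named class is just a number behind the scenes.
Number behind the named class: \number\IPAExtensionsClass
\end{document}
```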

So, reducing the complexity by having just three named ucharclasses, xxxClass, where xxx is the codeblock name (which makes things easier coding-wise, because we don't need to work out what the class numbers are), we have 12 'single' transitions: 3 into our classes from class 0, 3 into our classes from class 4095, and the 6 corresponding transitions out of our classes into classes 0 and 4095.

%singles ========================
%entering
%encountering our 3 classes
\XeTeXinterchartoks 0 \SpacingModifierLettersClass   = {\dejavuserif}
\XeTeXinterchartoks 0 \IPAExtensionsClass  = {\dejavuserif} 
\XeTeXinterchartoks 0 \CombiningDiacriticalMarksClass  = {\dejavuserif}

% glue, maths, boxes etc = `boundary'
\XeTeXinterchartoks 4095 \SpacingModifierLettersClass   = {\dejavuserif}
\XeTeXinterchartoks 4095 \IPAExtensionsClass   = {\dejavuserif} 
\XeTeXinterchartoks 4095 \CombiningDiacriticalMarksClass   = {\dejavuserif} 

%leaving
%encountering everything else
\XeTeXinterchartoks \SpacingModifierLettersClass 0  = {\normalfont}
\XeTeXinterchartoks \IPAExtensionsClass 0  = {\normalfont} 
\XeTeXinterchartoks \CombiningDiacriticalMarksClass 0  = {\normalfont}

% glue, maths, boxes etc = `boundary'
\XeTeXinterchartoks \SpacingModifierLettersClass 4095  = {\normalfont}
\XeTeXinterchartoks \IPAExtensionsClass 4095  = {\normalfont} 
\XeTeXinterchartoks \CombiningDiacriticalMarksClass 4095  = {\normalfont} 

Next, we have the pairwise-combinations of transitions into/out of our three classes, with respect to each other: 3x2=6 of them.

%pairs ===============
\XeTeXinterchartoks \SpacingModifierLettersClass \CombiningDiacriticalMarksClass  = {\dejavuserif}
\XeTeXinterchartoks \IPAExtensionsClass \CombiningDiacriticalMarksClass  = {\dejavuserif}

\XeTeXinterchartoks \SpacingModifierLettersClass \IPAExtensionsClass = {\dejavuserif}
\XeTeXinterchartoks \CombiningDiacriticalMarksClass \IPAExtensionsClass = {\dejavuserif}

\XeTeXinterchartoks \IPAExtensionsClass \SpacingModifierLettersClass = {\dejavuserif}
\XeTeXinterchartoks \CombiningDiacriticalMarksClass \SpacingModifierLettersClass = {\dejavuserif}

giving:

[Image: all permutations covered]

Full MWE:

\documentclass[varwidth,border=6pt]{standalone}
\usepackage{xcolor}
\usepackage[
%Latin, 
%Phonetics, 
%Diacritics, 
SpacingModifierLetters, 
CombiningDiacriticalMarks, 
IPAExtensions,
]{ucharclasses}
\usepackage{fontspec}
\defaultfontfeatures{Scale=MatchLowercase,Mapping=tex-text}
\newfontfeature{IPA}{+mgrk}
\setmainfont[IPA]{DejaVu Sans}
\newfontfamily\dejavuserif[IPA]{DejaVu Serif}[Colour=red]

%\setTransitionsFor{CombiningDiacriticalMarks}{\dejavuserif}{\normalfont}
%\setTransitionsFor{SpacingModifierLetters}{\dejavuserif}{\normalfont}
%\setTransitionsFor{IPAExtensions}{\dejavuserif}{\normalfont}


%singles ========================
%entering
%encountering our 3 classes
\XeTeXinterchartoks 0 \SpacingModifierLettersClass   = {\dejavuserif}
\XeTeXinterchartoks 0 \IPAExtensionsClass  = {\dejavuserif} 
\XeTeXinterchartoks 0 \CombiningDiacriticalMarksClass  = {\dejavuserif}

% glue, maths, boxes etc = `boundary'
\XeTeXinterchartoks 4095 \SpacingModifierLettersClass   = {\dejavuserif}
\XeTeXinterchartoks 4095 \IPAExtensionsClass   = {\dejavuserif} 
\XeTeXinterchartoks 4095 \CombiningDiacriticalMarksClass   = {\dejavuserif} 

%leaving
%encountering everything else
\XeTeXinterchartoks \SpacingModifierLettersClass 0  = {\normalfont}
\XeTeXinterchartoks \IPAExtensionsClass 0  = {\normalfont} 
\XeTeXinterchartoks \CombiningDiacriticalMarksClass 0  = {\normalfont}

% glue, maths, boxes etc = `boundary'
\XeTeXinterchartoks \SpacingModifierLettersClass 4095  = {\normalfont}
\XeTeXinterchartoks \IPAExtensionsClass 4095  = {\normalfont} 
\XeTeXinterchartoks \CombiningDiacriticalMarksClass 4095  = {\normalfont} 

%pairs ===============
\XeTeXinterchartoks \SpacingModifierLettersClass \CombiningDiacriticalMarksClass  = {\dejavuserif}
\XeTeXinterchartoks \IPAExtensionsClass \CombiningDiacriticalMarksClass  = {\dejavuserif}

\XeTeXinterchartoks \SpacingModifierLettersClass \IPAExtensionsClass = {\dejavuserif}
\XeTeXinterchartoks \CombiningDiacriticalMarksClass \IPAExtensionsClass = {\dejavuserif}

\XeTeXinterchartoks \IPAExtensionsClass \SpacingModifierLettersClass = {\dejavuserif}
\XeTeXinterchartoks \CombiningDiacriticalMarksClass \SpacingModifierLettersClass = {\dejavuserif}



\begin{document}


thaaw [tʰɑɑɯ] [tɑɑɯ] [tʰɑ́ɑɯ] [tɑ́ɑɯ] [tʰɑ́ɑɯʰ] thaaw  

\end{document}
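The repetition in that preamble could be cut down with a pair of helper macros. The sketch below is one possible way to do it; \SingleTransitions and \PairTransition are hypothetical names, not part of ucharclasses, and the explicit \XeTeXinterchartokenstate=1 is added as a precaution.

```latex
% Compile with XeLaTeX.
\documentclass{article}
\usepackage[SpacingModifierLetters,CombiningDiacriticalMarks,IPAExtensions]{ucharclasses}
\usepackage{fontspec}
\setmainfont{DejaVu Sans}
\newfontfamily\dejavuserif{DejaVu Serif}[Colour=red]

% Hypothetical helper: the four 'single' transitions for one class.
% #1 = class, #2 = tokens on entering it, #3 = tokens on leaving it.
\newcommand\SingleTransitions[3]{%
  \XeTeXinterchartoks 0 #1 = {#2}%
  \XeTeXinterchartoks 4095 #1 = {#2}%
  \XeTeXinterchartoks #1 0 = {#3}%
  \XeTeXinterchartoks #1 4095 = {#3}%
}
% Hypothetical helper: one ordered pair of classes.
% #1 = class being left, #2 = class being entered, #3 = tokens to insert.
\newcommand\PairTransition[3]{\XeTeXinterchartoks #1 #2 = {#3}}

% 3 classes x 4 singles = 12 transitions.
\SingleTransitions{\SpacingModifierLettersClass}{\dejavuserif}{\normalfont}
\SingleTransitions{\IPAExtensionsClass}{\dejavuserif}{\normalfont}
\SingleTransitions{\CombiningDiacriticalMarksClass}{\dejavuserif}{\normalfont}

% 3 x 2 = 6 ordered pairs.
\PairTransition{\SpacingModifierLettersClass}{\CombiningDiacriticalMarksClass}{\dejavuserif}
\PairTransition{\IPAExtensionsClass}{\CombiningDiacriticalMarksClass}{\dejavuserif}
\PairTransition{\SpacingModifierLettersClass}{\IPAExtensionsClass}{\dejavuserif}
\PairTransition{\CombiningDiacriticalMarksClass}{\IPAExtensionsClass}{\dejavuserif}
\PairTransition{\IPAExtensionsClass}{\SpacingModifierLettersClass}{\dejavuserif}
\PairTransition{\CombiningDiacriticalMarksClass}{\SpacingModifierLettersClass}{\dejavuserif}

\XeTeXinterchartokenstate=1

\begin{document}
thaaw [tʰɑɑɯ] [tɑɑɯ] [tʰɑ́ɑɯ] [tɑ́ɑɯ] thaaw
\end{document}
```

The macros only wrap the same \XeTeXinterchartoks assignments as above, so the set of transitions is unchanged.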

Adding additional blocks/classes - like Latin or Phonetics - will increase the number of combinations and permutations to be covered (if space, or something else from class 4095 or what remains of class 0, is not going to be used to activate a transition event).
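As a back-of-the-envelope count, using the same bookkeeping as above (four 'single' transitions per class, plus one transition per ordered pair of classes, CJK classes 1 to 3 left aside), the number of token lists to define for n user classes could be written as:

```latex
% Needs amsmath for \text. n = number of user-defined classes.
\[
  T(n) = \underbrace{4n}_{\text{to/from classes 0 and 4095}}
       + \underbrace{n(n-1)}_{\text{ordered pairs}}
       = n(n+3)
\]
\[
  T(3) = 12 + 6 = 18, \qquad
  T(4) = 16 + 12 = 28, \qquad
  T(5) = 20 + 20 = 40
\]
```

So going from three to four classes adds ten more transitions, and the pairwise part grows quadratically.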


In that multi-coloured example using Coptic, Cyrillic, Cuneiform and Latin, the lines with the text strings next to each other with no spaces were corrected by specifying all the combinations:

[Image: all combinations]

Classes 4, 5, 6 and 7 were chosen arbitrarily to classify the characters manually.

MWE

\documentclass[varwidth,border=6pt]{standalone}
\usepackage{xcolor}
\usepackage{fontspec}

\setmainfont{DejaVu Sans}
\newfontfamily\fa{Noto Sans Coptic}[Colour=red]
\newfontfamily\fb{Noto Serif}[Colour=blue]
\newfontfamily\fc{Noto Sans Cuneiform}[Colour=green]

\newcount\n
%===
%latin
\n=`\A \loop \XeTeXcharclass \n=4 \ifnum\n<`\Z \advance\n by 1 \repeat
\n=`\a \loop \XeTeXcharclass \n=4 \ifnum\n<`\z \advance\n by 1 \repeat
% when we encounter class 4, we'll do \startling
\XeTeXinterchartoks 0 4 {\startling}
\XeTeXinterchartoks 4095 4 {\startling}
% and when we encounter class 0, we'll do \finishling
\XeTeXinterchartoks 4095 0 {\finishling}
\XeTeXinterchartoks 4 0 {\finishling}
%\newif\ifling
\newcommand\startling{\normalfont}
\newcommand\finishling{}

%===
%cyrillic
\n=`\Ѐ \loop \XeTeXcharclass \n=5 \ifnum\n<`\ӿ \advance\n by 1 \repeat

% when we encounter class 5, we'll do \startlingcyr
\XeTeXinterchartoks 0 5 {\startlingcyr}
\XeTeXinterchartoks 4095 5 {\startlingcyr}
% and when we go from class 5 to class 0, we'll do \finishlingcyr
%\XeTeXinterchartoks 4095 0 {\finishlingcyr}
\XeTeXinterchartoks 5 0 {\finishlingcyr}
%\newif\ifling
\newcommand\startlingcyr{\fb}
\newcommand\finishlingcyr{\normalfont}
%===
%cuneiform
\n="12000 \loop \XeTeXcharclass \n=6 \ifnum\n<"12399 \advance\n by 1 \repeat
\XeTeXinterchartoks 0 6 {\startlingcun}
\XeTeXinterchartoks 4095 6 {\startlingcun}
\XeTeXinterchartoks 6 0 {\finishlingcun}
\newcommand\startlingcun{\fc}
\newcommand\finishlingcun{\normalfont}
%===
%coptic
\n=`\Ⲁ \loop \XeTeXcharclass \n=7 \ifnum\n<`\⳿ \advance\n by 1 \repeat
\XeTeXinterchartoks 0 7 {\startlingcop}
\XeTeXinterchartoks 1 7 {\startlingcop}
\XeTeXinterchartoks 2 7 {\startlingcop}
\XeTeXinterchartoks 3 7 {\startlingcop}
\XeTeXinterchartoks 5 7 {\startlingcop}
\XeTeXinterchartoks 6 7 {\startlingcop}
\XeTeXinterchartoks 4095 7 {\startlingcop}
\XeTeXinterchartoks 4095 0 {\finishlingcop}
\XeTeXinterchartoks 5 6 {\finishlingcyrc}
\XeTeXinterchartoks 7 0 {\finishlingcop}
\XeTeXinterchartoks 7 5 {\finishlingcopb}
\XeTeXinterchartoks 6 5 {\finishlingcopb}
\XeTeXinterchartoks 7 6 {\finishlingc}
\XeTeXinterchartoks 4 5 {\finishlingcopb}
\XeTeXinterchartoks 4 6 {\finishlingc}
\XeTeXinterchartoks 4 7 {\startlingcop}
\XeTeXinterchartoks 7 4 {\startling}
\XeTeXinterchartoks 6 4 {\startling}
\XeTeXinterchartoks 5 4 {\startling}

\newcommand\startlingcop{\fa}
\newcommand\finishlingcop{}
\newcommand\finishlingcyrc{\fc}
\newcommand\finishlingc{\fc}
\newcommand\finishlingcopb{\fb}

\XeTeXinterchartokenstate=1
\begin{document}
ⲀⲁⲂⲃⲄⲅxАБВГДЕxxⲀⲁⲂⲃⲄⲅ


ⲀⲁⲂⲃⲄⲅАБВГДЕⲀⲁⲂⲃⲄⲅ


ⲀⲁⲂⲃⲄⲅ АБВГДЕ  ⲀⲁⲂⲃⲄⲅ




xАБВГДЕxⲀⲁⲂⲃⲄⲅxxⲀⲁⲂⲃⲄⲅ


АБВГДЕⲀⲁⲂⲃⲄⲅⲀⲁⲂⲃⲄⲅ


АБВГДЕ ⲀⲁⲂⲃⲄⲅ  ⲀⲁⲂⲃⲄⲅ


АБВГДЕⲀⲁⲂⲃⲄⲅⲀⲁⲂⲃⲄⲅ


 АБВГДЕ ⲀⲁⲂⲃⲄⲅ ⲀⲁⲂⲃⲄⲅ


\end{document}