How to get the length of a multi-byte character string passed as a \def macro, using pdftex

You just have to expand #1 once inside the definition of \zz, using \expandafter. This allows the argument to be a macro that contains the actual string of interest.

\documentclass[a4paper,11pt]{article}

% Get the length of a UTF-8 multibyte string.
% The first token of the argument is expanded once, so the string
% may be supplied directly or as a macro defined with \def.
%
% https://tex.stackexchange.com/questions/419215/multibyte-strlen-strlen-for-chinese-characters
%
\def\zz#1{\edef\theresult{\expandafter\zzz\expandafter0#1\relax}}
\def\zzz#1#2{%
    \ifx\relax#2 \the\numexpr#1\relax
    \else
    \expandafter\zzz\expandafter{%
        \the\numexpr(#1+\ifnum\expandafter`\string#2<"80 1\else \ifnum\expandafter`\string#2>"BF 1 \else 0 \fi\fi
        \expandafter)\expandafter\relax\expandafter}%
\fi}%

\begin{document}
    \def\v1{abc}
    \v1 \zz{abc} \theresult\\   % explicit string
    \zz{\v1} \theresult\\       % macro holding the string: works because of the added \expandafter
\end{document}
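
To see what the added \expandafter pair changes, here is a minimal sketch. The names \peek and \str are hypothetical, used only for illustration, and are not part of the code above. \peek prints its two undelimited arguments verbatim, so it shows what \zzz would receive as #2 in each case.

```latex
\documentclass{article}
\begin{document}
\def\str{abc}% hypothetical macro holding the string
% Hypothetical helper: shows its two undelimited arguments verbatim.
\def\peek#1#2{\texttt{\detokenize{[#1][#2]}}}

% Without \expandafter, #2 is the unexpanded macro token \str.
\peek0\str

% With the \expandafter pair, \str is expanded once before \peek runs,
% so #2 is the first character, a; the remaining bc is left in the input.
\expandafter\peek\expandafter0\str
\end{document}
```

In the first case the byte test would be applied to a control sequence rather than a character, which is what triggers the \ifnum error.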


p.s. Using non-alphabetic characters in user-level macro names, such as \v1, is generally not good practice. While it may appear that the macro is named \v1, it is in fact named \v, and its definition requires that every use of \v be followed immediately by the delimiter 1.
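
One way to check this, as a small sketch, is to print the meaning of the macro. The name \vone below is a hypothetical letters-only alternative, chosen only for comparison.

```latex
\documentclass{article}
\begin{document}
\def\v1{abc}% same definition as in the example above
\texttt{\meaning\v}% prints: macro:1->abc

\def\vone{abc}% hypothetical letters-only name
\texttt{\meaning\vone}% prints: macro:->abc
\end{document}
```

The 1 before the arrow in the first output is the delimiter that must follow \v, not part of its name.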


SUPPLEMENT

Here's a different approach that, while not expandable, counts spaces, group delimiters, and unexpanded macro tokens as well as characters. It expands the argument only if it is a single token, which covers the case of interest to the OP.

It counts the individual tokens inside a group, rather than treating the group as a single "token"; the pair of braces itself adds 2 to the count.

\documentclass[a4paper,11pt]{article}
\usepackage{tokcycle}
\newcounter{mycount}
\tokcycleenvironment\countenv
{\stepcounter{mycount}}
{\addtocounter{mycount}{2}\processtoks{##1}}
{\stepcounter{mycount}}
{\stepcounter{mycount}}
\newcommand\countem[1]{%
  \setcounter{mycount}{0}%
  \countenv#1\endcountenv
  \ifnum\themycount=1\relax
    \setcounter{mycount}{0}%
    \expandafter\countenv#1\endcountenv
  \fi
  \themycount
}

\begin{document}
\def\v{abc}

\countem{abc}

\countem{\v}

\countem{\v2345}

\countem{a b{c{\today}e}fg}
\end{document}



Here's a modification of David Carlisle's \zz macro at https://tex.stackexchange.com/a/419216/4427 that distinguishes whether the argument of \utfstrlen is a single macro or not. Don't try squeezing two macros into the argument: it must be either a single macro representing a string of UTF-8 characters or an explicit list of characters.

\documentclass{article}
%\usepackage{xparse} % uncomment if using LaTeX release prior to 2020-10-01

\ExplSyntaxOn

\NewExpandableDocumentCommand{\utfstrlen}{m}
 {
  \egreg_utf_str_len:n { #1 }
 }

\cs_generate_variant:Nn \tl_to_str:n { e }

\cs_new:Nn \egreg_utf_str_len:n
 {
  \bool_lazy_and:nnTF { \tl_if_single_p:n { #1 } } { \token_if_cs_p:N #1 }
   {% #1 is a single control sequence
    \__egreg_utf_str_len:e { \tl_to_str:e { \exp_not:V #1 } }
   }
   {% #1 is a list of characters
    \__egreg_utf_str_len:e { \tl_to_str:n { #1 } }
   }
 }

\cs_new:Nn \__egreg_utf_str_len:n
 {
  \int_eval:n { \tl_map_function:nN { #1 } \__egreg_utf_char:n }
 }
\cs_generate_variant:Nn \__egreg_utf_str_len:n { e }

\cs_new:Nn \__egreg_utf_char:n
 {
  \int_compare:nTF { `#1 < "80 }
   { +1 } % ascii 7-bit
   { \int_compare:nT { `#1 > "BF } { +1 } } % prefix character
 }



\ExplSyntaxOff

\begin{document}

\utfstrlen{容容}

\utfstrlen{abc}

\utfstrlen{¢Àïα}

\def\test{容容abc¢Àïα}

\utfstrlen{\test}

\end{document}


As in David's answer, the list of characters (possibly obtained by expanding a macro once) is parsed one byte at a time. If a 7-bit ASCII character is found, 1 is added; if a byte with a code higher than hexadecimal BF is found, it is the prefix of a multibyte character, so 1 is added; otherwise the byte is a continuation byte and is ignored.

The macro \utfstrlen is fully expandable.
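
As a worked illustration of the byte test, the sketch below applies the same two comparisons to explicit byte values. The helper \bytecontrib is hypothetical and is not part of the code above. The example uses U+5BB9 (the character 容), whose UTF-8 encoding is the three bytes E5 AE B9.

```latex
\documentclass{article}
% Hypothetical helper: #1 is one byte value; yields 1 if the byte starts
% a character (7-bit ASCII or a multibyte prefix), 0 for a continuation byte.
\newcommand\bytecontrib[1]{%
  \ifnum#1<"80 1\else\ifnum#1>"BF 1\else 0\fi\fi}
\begin{document}
% E5 is a prefix byte (counts 1); AE and B9 lie in the range 80..BF,
% so they are continuation bytes (count 0). The sum typeset here is 1.
\the\numexpr\bytecontrib{"E5}+\bytecontrib{"AE}+\bytecontrib{"B9}\relax
\end{document}
```

Only the first byte of each character contributes, so the sum equals the number of characters.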
