Make ascii string from utf8 input

If you are using inputenc (rather than XeLaTeX/LuaLaTeX) you can take advantage of the fact that inputenc turns the extended characters into accent commands. For example, ü expands to \IeC{\"u}. So you can temporarily redefine the accent commands to strip them out.

Example:

\documentclass{article}
\usepackage[utf8]{inputenc}

\makeatletter
\newcommand{\stripaccents}[2]{%
 \begingroup
  % strip accents:
  \let\add@accent\@secondoftwo
  % provide replacement strings:
  \def\AE{AE}%
  \def\ae{ae}%
  \def\OE{OE}%
  \def\oe{oe}%
  \def\AA{AA}%
  \def\aa{aa}%
  \def\L{L}%
  \def\l{l}%
  \def\O{O}%
  \def\o{o}%
  \def\SS{SS}%
  \def\ss{ss}%
  \def\th{th}%
  \def\TH{TH}%
  \def\dh{dh}%
  \def\DH{DH}%
  \xdef#1{#2}%
 \endgroup
}
\makeatother

\begin{document}
\stripaccents\tmp{æüßéñ}
\show\tmp
\end{document}

This shows:

> \tmp=macro:
->aeussen.

If you have any other commands that are likely to occur in your input, you'll need to add them to \stripaccents so that they expand into something sensible.
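
As a sketch of what such an addition could look like: an accented i is usually built on the dotless \i, so input such as í leaves the dotless-i command behind once the accent is stripped. The extra \def lines for \i and \j below are my own illustration, not part of the definition above, and the replacement list is shortened to keep the example small:

\documentclass{article}
\usepackage[utf8]{inputenc}

\makeatletter
\newcommand{\stripaccents}[2]{%
 \begingroup
  % strip accents, as before:
  \let\add@accent\@secondoftwo
  % illustrative additions (assumed to be needed for input like í or ï):
  \def\i{i}% dotless i
  \def\j{j}% dotless j
  % shortened replacement list:
  \def\ss{ss}%
  \xdef#1{#2}%
 \endgroup
}
\makeatother

\begin{document}
\stripaccents\tmp{íïß}
\show\tmp
\end{document}

Any other text command that survives in \tmp can be handled the same way, with one \def per command inside the group.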

For the umlauts, you could temporarily redefine \" so that it appends e to its argument:

\newcommand{\stripaccents}[2]{%
 \begingroup
  \def\"##1{##1e}% umlaut
  \let\add@accent\@secondoftwo
  \def\AE{AE}%
  \def\ae{ae}%
  \def\OE{OE}%
  \def\oe{oe}%
  \def\AA{AA}%
  \def\aa{aa}%
  \def\L{L}%
  \def\l{l}%
  \def\O{O}%
  \def\o{o}%
  \def\SS{SS}%
  \def\ss{ss}%
  \def\th{th}%
  \def\TH{TH}%
  \def\dh{dh}%
  \def\DH{DH}%
  \xdef#1{#2}%
 \endgroup
}

This now shows:

> \tmp=macro:
->aeuessen.

With T1 encoding you also need:

\let\@text@composite@x\@secondoftwo

in the definition of \stripaccents, as mentioned in your comment.
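
Put together, a T1 version might look like the sketch below. It is a trimmed-down variant of the definition above (only \ae and \ss are replaced, to keep it short), and I have not tested it against every LaTeX release:

\documentclass{article}
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}

\makeatletter
\newcommand{\stripaccents}[2]{%
 \begingroup
  % strip accents:
  \let\add@accent\@secondoftwo
  % T1 has precomposed glyphs, so accents go through a composite lookup;
  % always take the fallback branch, which ends in \add@accent:
  \let\@text@composite@x\@secondoftwo
  % shortened replacement list:
  \def\ae{ae}%
  \def\ss{ss}%
  \xdef#1{#2}%
 \endgroup
}
\makeatother

\begin{document}
\stripaccents\tmp{æüßéñ}
\show\tmp
\end{document}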


This approach uses a search-and-replace list. You have to populate the list yourself, following the given examples.

\documentclass{article}
\usepackage{xparse}

\ExplSyntaxOn
\seq_new:N \g_schtandard_search_replace_seq

\seq_gput_right:Nn \g_schtandard_search_replace_seq { {æ}{ae} }
\seq_gput_right:Nn \g_schtandard_search_replace_seq { {ä}{ae} }
\seq_gput_right:Nn \g_schtandard_search_replace_seq { {ö}{oe} }
\seq_gput_right:Nn \g_schtandard_search_replace_seq { {ü}{ue} }
\seq_gput_right:Nn \g_schtandard_search_replace_seq { {ß}{ss} }
\seq_gput_right:Nn \g_schtandard_search_replace_seq { {ñ}{n} }
\seq_gput_right:Nn \g_schtandard_search_replace_seq { {é}{e} }

\tl_new:N \l_schtandard_input_tl

\NewDocumentCommand{\makestring}{om}
 {
  \tl_set:Nn \l_schtandard_input_tl { #2 }
  \seq_map_inline:Nn \g_schtandard_search_replace_seq
   {
    \regex_replace_all:nnN ##1 \l_schtandard_input_tl
   }
  \IfNoValueTF{#1}
   {
    \tl_use:N \l_schtandard_input_tl
   }
   {
    \tl_set_eq:NN #1 \l_schtandard_input_tl
   }
 }
\ExplSyntaxOff

\begin{document}

\makestring{æüßéñ}

\makestring[\foo]{æüßéñ}\texttt{\meaning\foo}

\end{document}

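
Further entries follow the same pattern. As a sketch (these particular mappings are my own choice, not part of the list above), lines like the following could go in the preamble after the sequence has been created. The first item of each pair is passed to \regex_replace_all:nnN as a regular expression, so characters that are special in l3regex would need escaping:

\ExplSyntaxOn
% hypothetical extra entries, in the same {search}{replacement} form
\seq_gput_right:Nn \g_schtandard_search_replace_seq { {ø}{o} }
\seq_gput_right:Nn \g_schtandard_search_replace_seq { {å}{aa} }
\seq_gput_right:Nn \g_schtandard_search_replace_seq { {Ü}{Ue} }
\ExplSyntaxOff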


If you only need filenames, but do not need them to be "human readable", then you could take advantage of \pdfstringdef:

\documentclass{article}

\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\usepackage[unicode]{hyperref}

\makeatletter
\begingroup
\catcode`| 0 \catcode`\\ 12
|gdef|makestring@i\#1#2#3#4%
     {#1#2#3|if|relax#4|expandafter|@gobbletwo|fi|makestring@i#4}
|endgroup
\newcommand*{\makestring}[2]{%
   \pdfstringdef\makestring@{#2}%
   \edef#1{\expandafter\makestring@i\makestring@\relax}%
}
\makeatother



\begin{document}

\makestring{\foo}{æüßéñ}

\texttt{\meaning\foo}

\end{document}



A variation on this theme that is much more efficient: it shows the UTF-8 bytes in decimal. One could produce hexadecimal instead if desired. (In fact, there may be macros in utf8.def that could be used here.)

\documentclass{article}

\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}

\makeatletter
\newcommand*\MakeString[2]{%
   \begingroup
    \def\UTFviii@two@octets##1##2{\the\numexpr`##1\relax\the\numexpr`##2\relax}%
    \def\UTFviii@three@octets##1##2##3{\the\numexpr`##1\relax\the\numexpr`##2\relax\the\numexpr`##3\relax}%
    \def\UTFviii@four@octets##1##2##3##4{\the\numexpr`##1\relax\the\numexpr`##2\relax\the\numexpr`##3\relax\the\numexpr`##4\relax}%
    \xdef#1{#2}%
   \endgroup
}
\makeatother

\begin{document}

\MakeString{\foo}{æüßéñ}

\texttt{\meaning\foo}

\show\foo
\end{document}

Produces:

> \foo=macro:
->195166195188195159195169195177.
l.23 \show\foo

I should improve this so that each byte produces a three-digit decimal number; here leading zeros are stripped!


OK, here it is with no stripping and two hex digits per byte.

Edit: removed the use of an extra package. I defined a \Byte@tohex macro; something similar may already be provided internally by the utf8 support of inputenc, but I have not checked.

\documentclass{article}

\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}

\makeatletter
% I have not checked, but the utf8 support of inputenc may already
% provide a similar macro (perhaps one that does not even use e-TeX)
\def\Byte@tohex #1%
   {\expandafter
    \Byte@tohex@\the\numexpr(`#1+8)/16-1\expandafter
    .\the\numexpr`#1.}%
\def\Byte@tohex@ #1.#2.%
   {\Byte@onehex #1.%
    \expandafter\Byte@onehex\the\numexpr #2-16*#1.%
   }
\def\Byte@onehex #1.%
   {\ifcase #1
    0\or1\or2\or3\or4\or5\or6\or7\or8\or9%
     \or A\or B\or C\or D\or E\or F%
   \fi
   }%
\newcommand*\MakeString[2]{%
   \begingroup
    \def\UTFviii@two@octets##1##2{\Byte@tohex{##1}\Byte@tohex{##2}}%
    \def\UTFviii@three@octets##1##2##3{\Byte@tohex{##1}\Byte@tohex{##2}\Byte@tohex{##3}}%
    \def\UTFviii@four@octets##1##2##3##4{\Byte@tohex{##1}\Byte@tohex{##2}\Byte@tohex{##3}\Byte@tohex{##4}}%
    \xdef#1{#2}%
   \endgroup
}
\makeatother
\begin{document}

\MakeString{\foo}{æüßéñ}

\texttt{\meaning\foo}

\show\foo
\end{document}

This produces in the log:

> \foo=macro:
->C3A6C3BCC39FC3A9C3B1.
l.27 \show\foo
