input tokens through the unicode position

udpated:

Since 2018/04/01 LaTeX release, inputenc+utf8 is active (sic) by default, and this broke the v3 approach below, because the ascii bytes in range 128-255 are now active and they were assumed here of catcode 12.

I have made a small edit to my v3 code to let it be usable both with or without inputenc loaded for old LaTeX or with new LaTeX.

(pdflatex only answer)

Table of contents

  • v1 and v2 exploit already existing macros from the LaTeX kernel to produce the macros ultimately expected by the LaTeX font selection system.

  • v3 is completely different. It produces expandably the UTF-8 bytes. It does not use inputenc at all but does produce active leading bytes so that it is "inputenc+utf8" ready. The active tokens are protected using \protect (no e-TeX \protected here), and to write them to file stream one needs to issue \set@display@protect. They can be use inside an \edef, but need the LaTeX \protected@edef for that, for the same reason. Also macro \UniToUTF now admits numerical input, so that the " must be explicitely used in case of hexadecimal input (I do not understand why LaTeX's \DeclareUnicodeCharacter contrarily to utf8x's one does not do that either). The advantage here is that it is easy to set-up loops producing such characters in a row using an associated counter or other devices. For the sake of illustrating this, the mwe uses package xinttools but this is only for the examples.


v1

Here is quickly jotted down hack which seems to achieve with pdflatex what you want. May need some corrections for working in general, (short on time today).

\documentclass{article}
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}

\DeclareUnicodeCharacter{20AC}{Hallo}

\makeatletter
\newcommand\unicodetoutf[1]{{\count@ "#1 \parse@XML@charref}%
   \UTFviii@tmp}
\let\@preamblecmds\@empty
\makeatother


\begin{document}

€ ^^e2^^82^^ac \unicodetoutf{20AC} %<-- should output Hallo

\end{document}

enter image description here

v2

Here is edit to address issues about avoiding \let\@preamblecmds\@empty and allowing ascii chars:

\documentclass{article}
\usepackage[utf8]{inputenc}
\usepackage[LGR,T1]{fontenc}

%\tracingmacros1
\DeclareUnicodeCharacter{20AC}{Hallo}
%\tracingmacros0

\makeatletter
\let\UF@parse@UTFviii@a\parse@UTFviii@a
\let\UF@parse@UTFviii@b\parse@UTFviii@b
\newcommand*\UF@parse@XML@charref{%
  \ifnum\count@<"A0\relax
     \xdef\UTFviii@tmp{\char\the\count@\space}% \count@ not globally modified
  \else\ifnum\count@<"800\relax
     \UF@parse@UTFviii@a,%
     \UF@parse@UTFviii@b C\UTFviii@two@octets.,%
  \else\ifnum\count@<"10000\relax
     \UF@parse@UTFviii@a;%
     \UF@parse@UTFviii@a,%
     \UF@parse@UTFviii@b E\UTFviii@three@octets.{,;}%
   \else
     \UF@parse@UTFviii@a;%
     \UF@parse@UTFviii@a,%
     \UF@parse@UTFviii@a!%
     \UF@parse@UTFviii@b F\UTFviii@four@octets.{!,;}%
    \fi
    \fi
  \fi
}
\newcommand\unicodetoutf[1]{%
  {\count@ "#1 \UF@parse@XML@charref}%
   \UTFviii@tmp
}

\makeatother

\begin{document}

€ ^^e2^^82^^ac \unicodetoutf{20AC} %<-- should output Hallo

A ^^41 \unicodetoutf{0041}

e ^^65 \unicodetoutf{0065}

é ^^c3^^a9 \unicodetoutf{00E9}

\fontencoding{LGR}\selectfont

α ^^ce^^b1 \unicodetoutf{03B1}
\end{document}

enter image description here

The idea is to plug into the utf8.def Unicode to UTF-8 already coded parsing by the LaTeX team.


v3

Here is the expandable method. I long hesitated with the 7bit ascii range and the problems with catcodes, to start with many of them are declared ignored tokens by TeX/LaTeX. In the end I decided that all will be produced with catcode 12. However, when writing out to a file, they will nevertheless be sometimes output in TeX's core ^^ notation, not as genuine UTF-8 7bit ascii bytes.

It is not completely obvious to set-up macros expanding to all those tokens as catcode 12 tokens, I have chosen a rather systematic way rather than use shortcuts.

Please check out the code comments for more details.

\documentclass{article}

\makeatletter

% First, the most complicated, define macros extending to catcode 12
% tokens in the 7bit ascii range. Unfortunately, when writing out
% to a file, some of them will get converted back to ^^ notation, and
% not be written there as UTF-8 bytes.

%% 7bit ASCII RANGE (with all its TeX catcode complications)

%% First we handle the upper part (ascii code 127 handled later)

\begingroup
\catcode`\^ 12
\catcode`\_ 12
\catcode`\{ 12
\catcode`\} 12
\catcode`\~ 12
\catcode`\? 0
\catcode`\< 1
\catcode`\> 2
\catcode`\\ 12
?def?y<?endgroup
?count@ 64
?@tfor?x:=%
@ABCDEFGHIJKLMNO%
PQRSTUVWXYZ[\]^_%
`abcdefghijklmno%
pqrstuvwxyz{|}~?do<%
    ?expandafter?xdef?csname utf8byte?the?count@?endcsname<?expandafter?string?x>%
    ?advance?count@?@ne>%
>?y

%% Now we handle the lower part

\begingroup
\catcode127 14
\def\z{utf8byte}%
\def\x{%
\count@ 0
\loop
  \catcode\count@ 12
\advance\count@\@ne
\ifnum\count@<64
\repeat}\x^^?
\def\y{^^?
\count@=0\relax^^?
\@tfor\x:=^^?
^^@^^A^^B^^C^^D^^E^^F^^G^^H^^I^^J^^K^^L^^M^^N^^O^^?
^^P^^Q^^R^^S^^T^^U^^V^^W^^X^^Y^^Z^^[^^\^^]^^^^^_^^?
^^`^^a^^b^^c^^d^^e^^f^^g^^h^^i^^j^^k^^l^^m^^n^^o^^?
^^p^^q^^r^^s^^t^^u^^v^^w^^x^^y^^z^^{^^|^^}^^~\do{^^?
\expandafter\xdef\csname\z\the\count@\endcsname{\x}^^?
\advance\count@\@ne}}^^?
\y\endgroup

%% ascii codes 63 and 127 were not handled yet

\@namedef{utf8byte63}{?}%

\begingroup
\catcode127 12
\global\@namedef{utf8byte127}{^^?}%
\endgroup

%% Now, the easier part, we define the UTF-8 continuation bytes
%% (catcode 12) and the UTF-8 leading bytes (as \protect'ed active
%% tokens). This set-up is ready for utf8+inputenc but does not need
%% it for its definitions. (this document does not load inputenc)
%% UPDATE: but LaTeX newer than 2018/04/01 loads inputenc+utf8 per
%% default, so a small change has been made below.

%% CONTINUATION BYTES

\begingroup
% prior to LaTeX 2018/04/01 these 64 bytes were assigned catcode12
% by LaTeX in absence of inputenc; but nowadays these bytes are active
% by default. Hence I modified \x into \expandafter\string\x which
% works whether or not the document incorporating this code is with 
% old (with or without utf8+inputenc) LaTeX or with newer LaTeX
\count@ "80
\@tfor\x:=%
^^80^^81^^82^^83^^84^^85^^86^^87^^88^^89^^8a^^8b^^8c^^8d^^8e^^8f%
^^90^^91^^92^^93^^94^^95^^96^^97^^98^^99^^9a^^9b^^9c^^9d^^9e^^9f%
^^a0^^a1^^a2^^a3^^a4^^a5^^a6^^a7^^a8^^a9^^aa^^ab^^ac^^ad^^ae^^af%
^^b0^^b1^^b2^^b3^^b4^^b5^^b6^^b7^^b8^^b9^^ba^^bb^^bc^^bd^^be^^bf%
\do{\expandafter\xdef\csname utf8byte\the\count@\endcsname{\expandafter\string\x}%
    \advance\count@\@ne}%

%% LEADING BYTES

%% we will make them TeX active, as inputenc does with utf8
%% We \protect them à la LaTeX, and writing to a file will
%% have to be made in a \set@display@protect context

\count@ "C2
\loop
\catcode\count@\active
\advance\count@ 1
\ifnum\count@ < "F5
\repeat

\count@ "C2

\@tfor\x:=^^c2^^c3^^c4^^c5^^c6^^c7^^c8^^c9^^ca^^cb^^cc^^cd^^ce^^cf%
^^d0^^d1^^d2^^d3^^d4^^d5^^d6^^d7^^d8^^d9^^da^^db^^dc^^dd^^de^^df%
^^e0^^e1^^e2^^e3^^e4^^e5^^e6^^e7^^e8^^e9^^ea^^eb^^ec^^ed^^ee^^ef%
^^f0^^f1^^f2^^f3^^f4%
\do{\expandafter\protected@xdef
      \csname utf8byte\the\count@\endcsname{\expandafter\protect\x}%
    \advance\count@\@ne}%

\endgroup

%% Time now to define our expandable Unicode to UTF-8 converter

%% a "case-switch" utility
\long\def\xintdothis #1#2\xintorthat #3{\fi #1}%
\let\xintorthat \@firstofone

%% EXPANDABLE UNICODE TO UTF-8 BYTES CONVERTER

%% The macro accepts any numexpr compatible input.
%% This means hexadecimal input must be prefixed by "
%% (and must use uppercase hex digits.)
%% Advantage with such numerical inputs is to allow easy usage
%% over some arithmetic range with counts or counters.

%% NOTE: \numexpr is the only e-TeX extension used.

\newcommand*\UniToUTF[1]{\expandafter\UniToUTF@\the\numexpr#1.}%

\def\UniToUTF@ #1.{%
  \ifnum#1<"80 \xintdothis\uni@ascii\fi
  \ifnum#1<"800 \xintdothis\uni@twooctets\fi
  \ifnum#1<"10000 \xintdothis\uni@threeoctets\fi
%% maybe add here some out of range check ?
  \xintorthat\uni@fouroctets {#1}%
}

%% CONVERSION FORMULAS MORALLY BASED ON EXPLANATIONS OF
%% https://en.wikipedia.org/wiki/UTF-8
%% USING OCTAL NOTATION

\def\uni@ascii#1{\csname utf8byte#1\endcsname}%

\def\uni@twooctets#1{\expandafter\uni@twooctets@i
    \the\numexpr(#1+32)/64-1.#1.}%
\def\uni@twooctets@i#1.#2.{%
    \csname utf8byte\the\numexpr192+#1\expandafter\expandafter\expandafter
    \endcsname
    \csname utf8byte\the\numexpr128+#2-#1*64\endcsname
}%

\def\uni@threeoctets#1{\expandafter\uni@threeoctets@i
    \the\numexpr(#1+32)/64-1.#1.}%
\def\uni@threeoctets@i#1.#2.{\expandafter\uni@threeoctets@ii
    \the\numexpr(#1+32)/64-1\expandafter.%
    \the\numexpr#1\expandafter\expandafter\expandafter.%
    \csname utf8byte\the\numexpr128+#2-#1*64\endcsname}%
\def\uni@threeoctets@ii#1.#2.{%
    \csname utf8byte\the\numexpr224+#1\expandafter\expandafter\expandafter
    \endcsname
    \csname utf8byte\the\numexpr128+#2-#1*64\endcsname
}%

\def\uni@fouroctets#1{\expandafter\uni@fouroctets@i
    \the\numexpr(#1+32)/64-1.#1.}%
\def\uni@fouroctets@i#1.#2.{\expandafter\uni@fouroctets@ii
    \the\numexpr(#1+32)/64-1\expandafter.%
    \the\numexpr#1\expandafter\expandafter\expandafter.%
    \csname utf8byte\the\numexpr128+#2-#1*64\endcsname}%
\def\uni@fouroctets@ii#1.#2.{\expandafter\uni@fouroctets@iii
    \the\numexpr(#1+32)/64-1\expandafter.%
    \the\numexpr#1\expandafter\expandafter\expandafter.%
    \csname utf8byte\the\numexpr128+#2-#1*64\endcsname}%
\def\uni@fouroctets@iii#1.#2.{%
    \csname utf8byte\the\numexpr240+#1\expandafter\expandafter\expandafter
    \endcsname
    \csname utf8byte\the\numexpr128+#2-#1*64\endcsname}%

\makeatother

% xinttools loaded only for easying up loops in illustrative code

\usepackage{xinttools}

\makeatletter

\newcommand*{\WriteOutSixtyFourUTFchars}[2]{%
% #1 = starting Unicode code point
% #2 = out stream
  \begingroup
    \set@display@protect %%<<<---- makes \protect=\string
    \def\mymacro##1{\UniToUTF{#1+##1}}%
    \immediate\write#2{\xintApplyUnbraced\mymacro{\xintSeq{0}{63}}}%
  \endgroup
}

% For \edef we need to take into account  LaTeX's protection mechanism
% for the UTF-8
% leading bytes, as we defined them to be active for inputenc+utf8 readiness.
\begingroup
\global\let\protectedATedef\protected@edef
\endgroup
\makeatother

\begin{document}

\protectedATedef\x{\UniToUTF{"20AC}}
\typeout{\meaning\x}
\typeout{\x}
\typeout{\UniToUTF{"20AC}}
\typeout{}
\protectedATedef\x{\UniToUTF{"03B1}}
\typeout{\meaning\x}
\typeout{\x}
\typeout{\UniToUTF{"03B1}}
\typeout{}
\protectedATedef\x{\UniToUTF{"0416}}
\typeout{\meaning\x}
\typeout{\x}
\typeout{\UniToUTF{"0416}}

{\catcode127 12
\if\UniToUTF{127}\string^^?\else\ERROR\fi
}
\if\UniToUTF{126}\string~\else\ERROR\fi

{\makeatletter
\if\UniToUTF{37}\expandafter\@gobble\string\%\else\ERROR\fi}

\newwrite\foo

\immediate\openout\foo=\jobname ASCIIMEANINGS.txt

% we don't use \typeout here, but write to another file,
% else Emacs/AUCTeX wrongly
% reports a compilation error due to misinterpreation of log
% containing the ! character

\begingroup
\makeatletter
\set@display@protect %%<<<---- makes \protect=\string
\xintFor* #1 in{\xintSeq{0}{127}}\do
{\protectedATedef\x{\UniToUTF{#1}}%
 \immediate\write\foo{\UniToUTF{#1} (\meaning\x)}}
\endgroup

\immediate\closeout\foo

\immediate\openout\foo=\jobname ASCII.txt

\WriteOutSixtyFourUTFchars{"0000}\foo
\WriteOutSixtyFourUTFchars{"0040}\foo

\immediate\closeout\foo

\immediate\openout\foo=\jobname0380.txt

\WriteOutSixtyFourUTFchars{"0380}\foo
\WriteOutSixtyFourUTFchars{"03C0}\foo

\immediate\closeout\foo

\immediate\openout\foo=\jobname0400.txt

\WriteOutSixtyFourUTFchars{"0400}\foo
\WriteOutSixtyFourUTFchars{"0440}\foo
\WriteOutSixtyFourUTFchars{"0480}\foo
\WriteOutSixtyFourUTFchars{"04C0}\foo
\WriteOutSixtyFourUTFchars{"0500}\foo

\immediate\closeout\foo

\immediate\openout\foo=\jobname2000.txt

\WriteOutSixtyFourUTFchars{"2000}\foo
\WriteOutSixtyFourUTFchars{"2040}\foo

\immediate\closeout\foo

\immediate\openout\foo=\jobname4000.txt

\WriteOutSixtyFourUTFchars{"4000}\foo
\WriteOutSixtyFourUTFchars{"4040}\foo

\immediate\closeout\foo

\immediate\openout\foo=\jobname10000.txt

\WriteOutSixtyFourUTFchars{"10000}\foo
\WriteOutSixtyFourUTFchars{"10040}\foo

\immediate\closeout\foo
\end{document}

To illustrate the functioning here is what one finds in log:

macro:->\protect €
€
€

macro:->\protect α
α
α

macro:->\protect Ж
Ж
Ж

then in file \jobname ASCII.txt

^^@^^A^^B^^C^^D^^E^^F^^G^^H 
^^L^^M^^N^^O^^P^^Q^^R^^S^^T^^U^^V^^W^^X^^Y^^Z^^[^^\^^]^^^^^_ !"#$%&'()*+,-./0123456789:;<=>?
@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~^^?

(the CTRL-I, CTRL-J, CTRL-K are genuine bytes)

then in file \jobname ASCIIMEANINGS.txt (this is from default situation, one wonders why CTRL-K which is vertical tabulation is handled as genuine byte... see remark below about pdflatex's -8bit option ...)

^^@ (macro:->^^@)
^^A (macro:->^^A)
^^B (macro:->^^B)
^^C (macro:->^^C)
^^D (macro:->^^D)
^^E (macro:->^^E)
^^F (macro:->^^F)
^^G (macro:->^^G)
^^H (macro:->^^H)
     (macro:->  )

 (macro:->
)
 (macro:->)
^^L (macro:->^^L)
^^M (macro:->^^M)
^^N (macro:->^^N)
^^O (macro:->^^O)
^^P (macro:->^^P)
^^Q (macro:->^^Q)
^^R (macro:->^^R)
^^S (macro:->^^S)
^^T (macro:->^^T)
^^U (macro:->^^U)
^^V (macro:->^^V)
^^W (macro:->^^W)
^^X (macro:->^^X)
^^Y (macro:->^^Y)
^^Z (macro:->^^Z)
^^[ (macro:->^^[)
^^\ (macro:->^^\)
^^] (macro:->^^])
^^^ (macro:->^^^)
^^_ (macro:->^^_)
  (macro:-> )
! (macro:->!)
" (macro:->")
# (macro:->#)
$ (macro:->$)
% (macro:->%)
& (macro:->&)
' (macro:->')
( (macro:->()
) (macro:->))
* (macro:->*)
+ (macro:->+)
, (macro:->,)
- (macro:->-)
. (macro:->.)
/ (macro:->/)
0 (macro:->0)
1 (macro:->1)
2 (macro:->2)
3 (macro:->3)
4 (macro:->4)
5 (macro:->5)
6 (macro:->6)
7 (macro:->7)
8 (macro:->8)
9 (macro:->9)
: (macro:->:)
; (macro:->;)
< (macro:-><)
= (macro:->=)
> (macro:->>)
? (macro:->?)
@ (macro:->@)
A (macro:->A)
B (macro:->B)
C (macro:->C)
D (macro:->D)
E (macro:->E)
F (macro:->F)
G (macro:->G)
H (macro:->H)
I (macro:->I)
J (macro:->J)
K (macro:->K)
L (macro:->L)
M (macro:->M)
N (macro:->N)
O (macro:->O)
P (macro:->P)
Q (macro:->Q)
R (macro:->R)
S (macro:->S)
T (macro:->T)
U (macro:->U)
V (macro:->V)
W (macro:->W)
X (macro:->X)
Y (macro:->Y)
Z (macro:->Z)
[ (macro:->[)
\ (macro:->\)
] (macro:->])
^ (macro:->^)
_ (macro:->_)
` (macro:->`)
a (macro:->a)
b (macro:->b)
c (macro:->c)
d (macro:->d)
e (macro:->e)
f (macro:->f)
g (macro:->g)
h (macro:->h)
i (macro:->i)
j (macro:->j)
k (macro:->k)
l (macro:->l)
m (macro:->m)
n (macro:->n)
o (macro:->o)
p (macro:->p)
q (macro:->q)
r (macro:->r)
s (macro:->s)
t (macro:->t)
u (macro:->u)
v (macro:->v)
w (macro:->w)
x (macro:->x)
y (macro:->y)
z (macro:->z)
{ (macro:->{)
| (macro:->|)
} (macro:->})
~ (macro:->~)
^^? (macro:->^^?)

Important update: to get genuine bytes for the entire 7bit ascii range, use pdflatex -8bit !

then in file \jobname 0380.txt

΁΂΃΄΅Ά·ΈΉΊ΋Ό΍ΎΏΐΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡ΢ΣΤΥΦΧΨΩΪΫάέήίΰαβγδεζηθικλμνξο
πρςστυφχψωϊϋόύώϏϐϑϒϓϔϕϖϗϘϙϚϛϜϝϞϟϠϡϢϣϤϥϦϧϨϩϪϫϬϭϮϯϰϱϲϳϴϵ϶ϷϸϹϺϻϼϽϾϿ

then in file \jobname 0400.txt

ЀЁЂЃЄЅІЇЈЉЊЋЌЍЎЏАБВГДЕЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯабвгдежзийклмноп
рстуфхцчшщъыьэюяѐёђѓєѕіїјљњћќѝўџѠѡѢѣѤѥѦѧѨѩѪѫѬѭѮѯѰѱѲѳѴѵѶѷѸѹѺѻѼѽѾѿ
Ҁҁ҂҃҄҅҆҇҈҉ҊҋҌҍҎҏҐґҒғҔҕҖҗҘҙҚқҜҝҞҟҠҡҢңҤҥҦҧҨҩҪҫҬҭҮүҰұҲҳҴҵҶҷҸҹҺһҼҽҾҿ
ӀӁӂӃӄӅӆӇӈӉӊӋӌӍӎӏӐӑӒӓӔӕӖӗӘәӚӛӜӝӞӟӠӡӢӣӤӥӦӧӨөӪӫӬӭӮӯӰӱӲӳӴӵӶӷӸӹӺӻӼӽӾӿ
ԀԁԂԃԄԅԆԇԈԉԊԋԌԍԎԏԐԑԒԓԔԕԖԗԘԙԚԛԜԝԞԟԠԡԢԣԤԥԦԧԨԩԪԫԬԭԮԯ԰ԱԲԳԴԵԶԷԸԹԺԻԼԽԾԿ

then in file \jobname 4000.txt

䀀䀁䀂䀃䀄䀅䀆䀇䀈䀉䀊䀋䀌䀍䀎䀏䀐䀑䀒䀓䀔䀕䀖䀗䀘䀙䀚䀛䀜䀝䀞䀟䀠䀡䀢䀣䀤䀥䀦䀧䀨䀩䀪䀫䀬䀭䀮䀯䀰䀱䀲䀳䀴䀵䀶䀷䀸䀹䀺䀻䀼䀽䀾䀿
䁀䁁䁂䁃䁄䁅䁆䁇䁈䁉䁊䁋䁌䁍䁎䁏䁐䁑䁒䁓䁔䁕䁖䁗䁘䁙䁚䁛䁜䁝䁞䁟䁠䁡䁢䁣䁤䁥䁦䁧䁨䁩䁪䁫䁬䁭䁮䁯䁰䁱䁲䁳䁴䁵䁶䁷䁸䁹䁺䁻䁼䁽䁾䁿

as examples. Rendering in your browser will depend on its font set-up.

Here is the \jobname 10000.txt file as it shows in an Emacs buffer:

enter image description here

Update of an image to show how the ASCIIMEANINGS file renders in an Emacs buffer, proving the ^^K is indeed a one-byte there, in the default usage of pdflatex:

enter image description here

Important added note

This is the default situation, but using pdflatex -8bit solves all our anguishes, here is Emacs buffer screen shot of ASCII output file:

enter image description here

Thus the -8bit option we can indeed create arbitrary UTF-8 encoded files from within pdflatex, using only Unicode code points as input.


\documentclass{article}
\usepackage[utf8]{inputenc}
\DeclareUnicodeCharacter{20AC}{Hallo}
\DeclareUnicodeCharacter{12345}{Bye}
\begin{document}

There are two parts to doing what the question desires:

  • Converting a Unicode code point to its UTF-8 encoding, e.g. U+20AC to the sequence of bytes E2 82 AC.
  • Injecting those bytes, or the equivalent of typing those bytes, into the input stream.

The first part is a UTF-8 encoder. It's already implemented in utf8ienc.dtx / utf8.def, but a fresh implementation is below. It was fun to write, and is different from the official implementation. (For one thing, it uses the e-TeX extension \numexpr which could not be assumed when that original implementation was written… but apart from using that new primitive, it does not use any external macros, and works in both plain (e)TeX and LaTeX.)

This is how it works (and how UTF-8 is defined): Given a number n (which stands for a Unicode codepoint, e.g. "20AC is the number 8364, representing the codepoint U+20AC Euro Sign),

  • Look at the size of n, to determine how many bytes will be the UTF-8 encoding
  • Peel off six bits at a time from the right: these bits, with 10 prefixed, become the rightmost bytes
  • For the first byte, use the leftover bits with a prefix that depends on the number of bytes

The code which does it:

% Division and remainder. Based on https://tex.stackexchange.com/a/34449/48 but with bugs fixed. Note: assumes #2 positive
\def\modulo#1#2{\ifnum \numexpr(#1 - (#1/#2)*(#2))\relax < 0 (#1 - (#1/#2)*(#2) + #2) \else (#1 - (#1/#2)*(#2)) \fi}
\def\truncdiv#1#2{((#1 - \modulo{#1}{#2})/(#2))}
% The hypothetical continuation bytes: last byte, the one before that, etc.
\def\utfByteLastOne  #1{\numexpr(128 + \modulo{#1}{64})\relax}
\def\utfByteLastTwo  #1{\numexpr(128 + \modulo{\truncdiv{#1}{64}}{64})\relax}
\def\utfByteLastThree#1{\numexpr(128 + \modulo{\truncdiv{#1}{(64*64)}}{64})\relax}
\def\utfByteLastFour #1{\numexpr(128 + \modulo{\truncdiv{#1}{(64*64*64)}}{64})\relax}
% The actual individual bytes in the stream, for lengths 1 to 4.
\def\utfStreamOneByteOne    #1{#1}
\def\utfStreamTwoByteOne    #1{\numexpr(\utfByteLastTwo{#1} + 64)\relax}
\def\utfStreamTwoByteTwo    #1{\utfByteLastOne{#1}}
\def\utfStreamThreeByteOne  #1{\numexpr(\utfByteLastThree{#1} + 96)\relax}
\def\utfStreamThreeByteTwo  #1{\utfByteLastTwo{#1}}
\def\utfStreamThreeByteThree#1{\utfByteLastOne{#1}}
\def\utfStreamFourByteOne   #1{\numexpr(\utfByteLastFour{#1} + 112)\relax}
\def\utfStreamFourByteTwo   #1{\utfByteLastThree{#1}}
\def\utfStreamFourByteThree #1{\utfByteLastTwo{#1}}
\def\utfStreamFourByteFour  #1{\utfByteLastOne{#1}}
% Expands to \utfCallbackOne{#1} or ... or \utfCallbackFour{#1} depending on whether the code point #1 has 1, 2, 3, or 4 bytes in its UTF-8 encoding.
\def\utfStreamFromNumber#1{%
    \ifnum #1 < 128
        \utfCallbackOne{#1}%
    \else \ifnum #1 < 2048  % 2^11
              \utfCallbackTwo{#1}%
          \else \ifnum #1 < 65536  % 2^16
                    \utfCallbackThree{#1}%
                \else
                    \utfCallbackFour{#1}%
                \fi
          \fi
    \fi
}

The interface of this UTF-8 encoder is quite general (and also completely independent of inputenc) in the sense that you can simply redefine the \utfCallbackOne to \utfCallbackFour macros, to do different things with the bytes of the UTF-8 stream. For example, see the first revision of this answer for a version that simply prints out the bytes (and without \usepackage[utf8]{inputenc}).


The second part (injecting arbitrary bytes/tokens into the input stream, as if the user had typed them) is straightforward in LuaTeX, but nontrivial in other engines. Firstly, if all we want to do is emulate what inputenc would have done if those tokens were inserted (which is what the question asks for), then we can use the fact that inputenc stores the \DeclareUnicodeCharacter definitions in internal macros named u8:<byte1><byte2><byte3> etc., and simply invoke those macros. The code below uses the \lccode and \lowercase trick (inspired by utf8ienc.dtx) to construct the macro names that have “special” bytes in them.

% These get the full byte stream -- here, do whatever you want with the individual bytes.
% In this case, these callbacks call internal control sequences u8:... defined by \DeclareUnicodeCharacter
\def\utfCallbackOne  #1{{\char\utfStreamOneByteOne{#1}}}
\def\utfCallbackTwo  #1{{\lccode`A=\utfStreamTwoByteOne{#1}\lccode`B=\utfStreamTwoByteTwo{#1}\lowercase{\csname u8:AB\endcsname}}}
\def\utfCallbackThree#1{{\lccode`A=\utfStreamThreeByteOne{#1}\lccode`B=\utfStreamThreeByteTwo{#1}\lccode`C=\utfStreamThreeByteThree{#1}\lowercase{\csname u8:ABC\endcsname}}}
\def\utfCallbackFour #1{{\lccode`A=\utfStreamFourByteOne{#1}\lccode`B=\utfStreamFourByteTwo{#1}\lccode`C=\utfStreamFourByteThree{#1}\lccode`D=\utfStreamFourByteFour{#1}\lowercase{\csname u8:ABCD\endcsname}}}
\newcommand\unicodetoutf[1]{\utfStreamFromNumber{"#1}}

€ ^^e2^^82^^ac \unicodetoutf{20AC}

© ^^c2^^a9 \unicodetoutf{00A9}

 ^^f0^^92^^8d^^85 \unicodetoutf{12345}

\end{document}

output

But this is “cheating” in some sense: we're using the fact that we know the internal names of the macros, to call them. Also, if we use a character that has not been given a definition with \DeclareUnicodeCharacter, then \csname u8:... has no definition and TeX treats it as \relax, so we get no error — so we need a wrapper that first checks whether it has been defined, for which we need to reproduce or borrow more from inputenc.

Instead, we can extend the \lccode/\lowercase trick to get the effect of having typed those characters, without assuming anything about the internals of inputenc. Let's think about it: why can't we just write \lowercase{AB}, after setting the \lccodes of A and B? It's because \lowercase changes the character codes, but not the category codes. So we need to change the category codes of A and B as well, to be the same as those of the bytes in the UTF-8 stream. But even after doing that, if we say \lowercase{AB} inside the macro, then it picks up the catcodes at the time of reading the macro, which is not what we want. One solution I was able to find was \scantokens (introduced in e-TeX). Using it, we can get a “true” solution, with no “cheating”: replace the previous snippet with:

% The full byte stream -- here, do whatever you want with the individual bytes.
\newcount\charX
\newcount\charY
\newcount\charZ
\newcount\charW
\def\utfCallbackOne  #1{{%
    \charX=\utfStreamOneByteOne{#1}%
    \catcode255=\catcode\charX%
    \lccode255=\charX%
    \lowercase{\scantokens{^^ff\endlinechar=-1}}}}
\def\utfCallbackTwo  #1{{%
   \charX=\utfStreamTwoByteOne{#1}%
   \charY=\utfStreamTwoByteTwo{#1}%
   \catcode`X=\catcode\charX%
   \catcode`Y=\catcode\charY%
   \lccode`X=\charX%
   \lccode`Y=\charY%
   \lowercase{\scantokens{XY\endlinechar=-1}}}}
\def\utfCallbackThree#1{{%
   \charX=\utfStreamThreeByteOne{#1}%
   \charY=\utfStreamThreeByteTwo{#1}%
   \charZ=\utfStreamThreeByteThree{#1}%
   \catcode`X=\catcode\charX%
   \catcode`Y=\catcode\charY%
   \catcode`Z=\catcode\charZ%
   \lccode`X=\charX%
   \lccode`Y=\charY%
   \lccode`Z=\charZ%
   \lowercase{\scantokens{XYZ\endlinechar=-1}}}}
\def\utfCallbackFour #1{{%
   \charX=\utfStreamFourByteOne{#1}%
   \charY=\utfStreamFourByteTwo{#1}%
   \charZ=\utfStreamFourByteThree{#1}%
   \charW=\utfStreamFourByteFour{#1}%
   \catcode`X=\catcode\charX%
   \catcode`Y=\catcode\charY%
   \catcode`Z=\catcode\charZ%
   \catcode`W=\catcode\charW%
   \lccode`X=\charX%
   \lccode`Y=\charY%
   \lccode`Z=\charZ%
   \lccode`W=\charW%
   \lowercase{\scantokens{XYZW\endlinechar=-1}}}}
\newcommand\unicodetoutf[1]{\utfStreamFromNumber{"#1}}

A ^^41 \unicodetoutf{0041}

\def~{what}
~ ^^7e \unicodetoutf{007E}

€ ^^e2^^82^^ac \unicodetoutf{20AC}

© ^^c2^^a9 \unicodetoutf{00A9}

 ^^f0^^92^^8d^^85 \unicodetoutf{12345}

\end{document}

Now everything works! The same output as before, and in each case we have injected tokens equivalent to having typed out those Unicode characters in the file directly.


This directly implements the Unicode to UTF-8 conversion algorithm, in fully expandable way.

It uses the standard feature that an UTF-8 sequence abcd (at most four characters) is transformed into the control sequence \u8:abcd.

If the Unicode point is not defined, pdflatex stops with an error

! Undefined control sequence.
<argument> \ERROR 
                  BAD UTF (U+AAAA)
l.98 \unicodetoutf{AAAA}

Here's the code

% algorithm from https://home.kpn.nl/vanadovv/uni/utf8conversion.html

\documentclass{article}
\usepackage{xparse}

\ExplSyntaxOn
\NewExpandableDocumentCommand{\unicodetoutf}{m}
 {
  \int_compare:nTF { "#1 < 128 }
   {
    \char_generate:nn { "#1 } { 12 }
   }
   {
    \egreg_unicodetoutf:n { #1 }
   }
 }

\cs_new:Nn \egreg_unicodetoutf:n
 {
  \__egreg_unicodetoutf:nf { #1 }
   {
    \int_compare:nTF { "#1 < 2048 }
     {
      \__egreg_unicodetoutf_two:n { #1 }
     }
     {
      \int_compare:nTF { "#1 < 65536 }
       {
        \__egreg_unicodetoutf_three:n { #1 }
       }
       {
        \__egreg_unicodetoutf_four:n { #1 }
       }
     }
   }
 }

\cs_new:Nn \__egreg_unicodetoutf:nn
 {
  \cs_if_exist_use:cF { #2 } { \ERROR BAD~UTF~(U+#1) }
 }
\cs_generate_variant:Nn \__egreg_unicodetoutf:nn { nf }

\cs_new:Nn \__egreg_unicodetoutf_two:n
 {
  u8:
  \char_generate:nn { 192 + \int_div_truncate:nn { "#1 } { 64 } } { 12 }
  \char_generate:nn { 128 + \int_mod:nn { "#1 } { 64 } } { 12 }
 }

\cs_new:Nn \__egreg_unicodetoutf_three:n
 {
  u8:
  \char_generate:nn { 224 + \int_div_truncate:nn { "#1 } { 4096} } { 12 }
  \char_generate:nn
   {
    128 + \int_mod:nn { \int_div_truncate:nn { "#1 } { 64 } } { 64 }
   }
   { 12 }
  \char_generate:nn { 128 + \int_mod:nn { "#1 } { 64 } } { 12 }
 }

\cs_new:Nn \__egreg_unicodetoutf_four:n
 {
  u8:
  \char_generate:nn { 240 + \int_div_truncate:nn { "#1 } { 262144 } } { 12 }
  \char_generate:nn
   {
    128 + \int_mod:nn { \int_div_truncate:nn { "#1 } { 4096 } } { 64 }
   }
   { 12 }
  \char_generate:nn
   {
    128 + \int_mod:nn { \int_div_truncate:nn { "#1 } { 64 } } { 64 }
   }
   { 12 }
  \char_generate:nn { 128 + \int_mod:nn { "#1 } { 64 } } { 12 }
 }
\ExplSyntaxOff

\DeclareUnicodeCharacter{00E9}{b}
\DeclareUnicodeCharacter{20AC}{Hallo}
\DeclareUnicodeCharacter{10006}{World}

\begin{document}

a ^^61 \unicodetoutf{0061}

é ^^c3^^a9 \unicodetoutf{00E9}

€ ^^e2^^82^^ac \unicodetoutf{20AC} %<-- should output Hallo

 ^^f0^^90^^80^^86 \unicodetoutf{10006} % should output World

\edef\temp{\unicodetoutf{20AC}}
\texttt{\meaning\temp}

\unicodetoutf{AAAA}

\end{document}

enter image description here