utf8x vs. utf8 (inputenc)

The simple answer is that utf8x is to be avoided if possible. It loads the ucs package, which was unmaintained for a long time (although there is now a new maintainer) and which breaks various other things.

See egreg's answer to this question as well, which outlines how to get extra characters using the [utf8] option of inputenc.

Generally, however, the best way to deal with Unicode source (especially with non-Latin scripts) is really XeLaTeX or LuaLaTeX.
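
As a rough illustration of that route, here is a minimal sketch that assumes the file is saved as UTF-8 and compiled with xelatex or lualatex rather than pdflatex. The fontspec package and the sample characters are my own choices for the example; whether a given character actually prints depends on the glyph coverage of the font in use (the default, Latin Modern, should cover these accented Latin letters, but not every script).

```latex
% Compile with xelatex or lualatex, not pdflatex.
\documentclass{article}

% fontspec gives access to OpenType/TrueType fonts;
% neither inputenc nor fontenc is needed with these engines.
\usepackage{fontspec}

\begin{document}

% UTF-8 input is read natively by the engine
Ŵŵ Ŷŷ Āā Ēē Īī Ōō Ūū

\end{document}
```

For scripts that the default font does not cover, you would additionally have to select a suitable font (for instance with fontspec's \setmainfont), which is outside what this sketch shows.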

There's an extended discussion of this here: Encoding remarks. See especially the comments by Philipp Lehman and Philipp Stephani.


In fact, utf8 may not be as restrictive as it seems: it only loads characters that can be displayed by the font encoding.

When typing

\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}

the font encoding is still OT1 when inputenc is loaded, and OT1 has very few characters. By using

\usepackage[T1]{fontenc} 
\usepackage[utf8]{inputenc}

you make all UTF-8 characters that the font encoding can display available as input.


Don't use utf8x; with an up-to-date TeX distribution it should prove necessary only for its most obscure features (faking characters with images from the Web, for instance).

The problem with Greek, which was probably the main reason for adopting utf8x instead of utf8, has since been solved, and

\documentclass{article}

\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage[polutonikogreek,english]{babel}

\begin{document}

This is english
\textgreek{Τηις ις γρεεκ}
This is english again.

\end{document}

will happily print the Greek phrase in Greek letters between the two English sentences.

The occasional missing definitions can be coped with in a simple way. If you're able to input a Unicode character, such as the Welsh letters

Ââ Êê Îî Ôô Ŵŵ Ŷŷ Ïï

or the Latin vowels with prosodic marks

Ăă Ĕĕ Ĭĭ Ŏŏ Ŭŭ Āā Ēē Īī Ōō Ūū Ȳȳ

(y with breve is missing from Unicode, while a with breve is already defined by utf8 because it's a letter in Romanian), you can simply add the unknown ones to the list of known characters:

\documentclass{article}

\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage{newunicodechar}

% missing Welsh coverage
\newunicodechar{Ŵ}{\^W}
\newunicodechar{ŵ}{\^w}
\newunicodechar{Ŷ}{\^Y}
\newunicodechar{ŷ}{\^y}

% Latin vowels with prosodic marks    
\newunicodechar{Ĕ}{\u{E}}
\newunicodechar{ĕ}{\u{e}}
\newunicodechar{Ĭ}{\u{I}}
\newunicodechar{ĭ}{\u{\i}}
\newunicodechar{Ŏ}{\u{O}}
\newunicodechar{ŏ}{\u{o}}
\newunicodechar{Ŭ}{\u{U}}
\newunicodechar{ŭ}{\u{u}}
\newunicodechar{Ā}{\=A}
\newunicodechar{ā}{\=a}
\newunicodechar{Ē}{\=E}
\newunicodechar{ē}{\=e}
\newunicodechar{Ī}{\=I}
\newunicodechar{ī}{\={\i}}
\newunicodechar{Ō}{\=O}
\newunicodechar{ō}{\=o}
\newunicodechar{Ū}{\=U}
\newunicodechar{ū}{\=u}
\newunicodechar{Ȳ}{\=Y}
\newunicodechar{ȳ}{\=y}

\begin{document}

Ââ Êê Îî Ôô Ŵŵ Ŷŷ Ïï

Ăă Ĕĕ Ĭĭ Ŏŏ Ŭŭ

Āā Ēē Īī Ōō Ūū Ȳȳ

\end{document}

Note that, for instance, the line

\newunicodechar{Ŵ}{\^W}

can also be input as

\DeclareUnicodeCharacter{0174}{\^W}

without needing the newunicodechar package, because U+0174 is the code point of LATIN CAPITAL LETTER W WITH CIRCUMFLEX; but \newunicodechar frees you from looking up code points in the Unicode tables.
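
To make the comparison concrete, here is a small sketch of a complete document that relies only on \DeclareUnicodeCharacter. It uses the two y-with-macron letters from the example above; the code points U+0232 and U+0233 are the ones I looked up for them, so treat them as something to verify against the Unicode tables for any other character you add.

```latex
\documentclass{article}

\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}

% U+0232: LATIN CAPITAL LETTER Y WITH MACRON
\DeclareUnicodeCharacter{0232}{\=Y}
% U+0233: LATIN SMALL LETTER Y WITH MACRON
\DeclareUnicodeCharacter{0233}{\=y}

\begin{document}

Ȳȳ

\end{document}
```

The trade-off is the one described above: no extra package is loaded, but each definition needs the four-digit hexadecimal code point instead of the character itself.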


Update, April 2016

With a recent LaTeX kernel almost none of the definitions above are necessary, because t1enc.dfu has been updated and enriched. Of the accented letters in the last example, only Ȳ and ȳ need to be defined (and they may be included in future releases).

\documentclass{article}

\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage{newunicodechar}

\newunicodechar{Ȳ}{\=Y}
\newunicodechar{ȳ}{\=y}

\begin{document}

Ââ Êê Îî Ôô Ŵŵ Ŷŷ Ïï

Ăă Ĕĕ Ĭĭ Ŏŏ Ŭŭ

Āā Ēē Īī Ōō Ūū Ȳȳ

\end{document}
