Strange character in a file

This file contains the bytes C2 96, which are the UTF-8 encoding of code point U+0096. That code point is one of the C1 control characters: SPA, "Start of Guarded Area" (also called "Start of Protected Area"). It isn't a useful character on any modern system, but its presence is unlikely to be harmful.
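
To see where U+0096 comes from, the two bytes can be decoded by hand: in a two-byte UTF-8 sequence the lead byte carries the top five bits of the code point and the continuation byte the low six. Here is a small sketch in shell arithmetic; `decode_two_byte` is a name made up for this illustration, not a standard tool.

```shell
# Decode a two-byte UTF-8 sequence (110xxxxx 10yyyyyy) into its code point.
# decode_two_byte is a hypothetical helper used only for this example.
decode_two_byte() {
    lead=$1   # lead byte, e.g. 0xC2
    cont=$2   # continuation byte, e.g. 0x96
    # Keep the low 5 bits of the lead byte and the low 6 bits of the
    # continuation byte, then join them into one number.
    printf 'U+%04X\n' $(( (($lead & 0x1F) << 6) | ($cont & 0x3F) ))
}

decode_two_byte 0xC2 0x96   # prints U+0096
```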

The original was most likely a single byte 0x96 in some 8-bit encoding that was transcoded incorrectly somewhere along the way. It was probably a Windows CP1252 en dash "–", which has byte value 0x96 in that encoding; most other plausible candidates put the C1 control set at positions 0x80-0x9F. The text was then converted to UTF-8 as though it were Latin-1 (ISO/IEC 8859-1), which is a common mistake. That causes the byte to be interpreted as the control character and encoded accordingly, as you've seen.
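
The suspected mistake is easy to reproduce, assuming an iconv that knows the names `ISO-8859-1` and `CP1252` (glibc's does). The `hex` helper below is only a convenience defined for this sketch.

```shell
# hex is a hypothetical helper: print the bytes on stdin as one hex string.
hex() { od -An -tx1 | tr -d ' \n'; echo; }

# Wrong: read byte 0x96 (octal 226) as Latin-1, giving the C1 control U+0096.
printf '\226' | iconv -f ISO-8859-1 -t UTF-8 | hex   # c296

# Right: read the same byte as CP1252, giving the en dash U+2013.
printf '\226' | iconv -f CP1252 -t UTF-8 | hex       # e28093
```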


You can fix this file with the iconv tool, which is part of glibc.

iconv -f utf-8 -t iso-8859-1 < mwe.txt | iconv -f cp1252 -t utf-8

produces a correct version of your minimal example for me. That works by first converting the UTF-8 to latin-1 (inverting the earlier mistranslation), and then reinterpreting that as cp1252 to convert it back to UTF-8 correctly.

It does depend on what else is in the real file, however. If you have characters outside Latin-1 elsewhere it will fail because it can't encode those correctly at the first step.
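
One way to find out in advance is to run only the first conversion and look at its exit status. This is a rough sketch; `can_roundtrip` and `fix_file` are hypothetical names, and it assumes iconv exits non-zero when it meets a character it cannot convert (glibc's does).

```shell
# can_roundtrip: succeed only if FILE converts to Latin-1 without errors.
can_roundtrip() {
    iconv -f UTF-8 -t ISO-8859-1 < "$1" > /dev/null 2>&1
}

# fix_file: write the repaired text to stdout, or refuse if step one would fail.
fix_file() {
    if can_roundtrip "$1"; then
        iconv -f UTF-8 -t ISO-8859-1 < "$1" | iconv -f CP1252 -t UTF-8
    else
        echo "$1: contains characters outside Latin-1, not converting" >&2
        return 1
    fi
}
```

Calling `fix_file mwe.txt > fixed.txt` would then leave the original untouched and only produce output when the round trip is possible.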

If you don't have iconv, or it doesn't work for the real file, you can replace the bytes directly using sed:

LC_ALL=C sed -e $'s/\xc2\x96/\xe2\x80\x93/g' < mwe.txt

This replaces C2 96 with the UTF-8 en dash encoding E2 80 93. You could also replace it with e.g. a hyphen or two by changing \xe2\x80\x93 into --.
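
The `$'...'` quoting is a bash/zsh/ksh feature. If your shell lacks it, the same bytes can be built with printf octal escapes. A sketch, with `replace_bad` as a made-up helper name:

```shell
# Build the byte sequences with printf octal escapes (works in plain POSIX sh).
bad=$(printf '\302\226')        # C2 96, the mis-encoded character
dash=$(printf '\342\200\223')   # E2 80 93, the UTF-8 en dash

# replace_bad: filter stdin, writing REPLACEMENT wherever C2 96 occurs.
# REPLACEMENT must not contain "/" or "&", which are special to sed here.
replace_bad() {
    LC_ALL=C sed -e "s/$bad/$1/g"
}

printf 'pages={1\302\22695},\n' | replace_bad "$dash"   # en dash
printf 'pages={1\302\22695},\n' | replace_bad "--"      # two hyphens
```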


You can grep in a similar fashion. We're using LC_ALL=C to make sure we're reading the actual bytes, and not having grep interpret things:

LC_ALL=C grep -R $'\xc2\x96' .

will list out everywhere under this directory those bytes appear. You may want to limit it to just text files if you have mixed content around, since binary files will include any pair of bytes fairly often.
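
As a sketch of that, grep's `-I` option (available in GNU and BSD grep) skips files it considers binary, and `-l` prints only the names of matching files. The helper names `list_bad` and `count_bad` are invented for this example.

```shell
bad=$(printf '\302\226')   # the C2 96 byte pair

# list_bad: names of non-binary files under DIR (default: .) containing C2 96.
list_bad() {
    LC_ALL=C grep -rIl -e "$bad" "${1:-.}"
}

# count_bad: number of occurrences (not lines) of C2 96 in one file.
count_bad() {
    LC_ALL=C grep -o -e "$bad" "$1" | wc -l | tr -d ' '
}
```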


0x96 is an en dash in Windows code page 1252. The C2 byte preceding it is the lead byte UTF-8 uses for two-byte sequences encoding U+0080 through U+00BF, so C2 96 together is the UTF-8 form of code point U+0096.

To search for other occurrences, put your cursor over it in normal mode, hit yl (yank one character), then type / followed by Ctrl+R and ". (Ctrl+R lets you insert the contents of a register into the command line, and the " register holds whatever was last yanked.)

Just replace it with two hyphens if you want it to render in your terminal. If this is a BibTeX file, two hyphens are the appropriate way to key in a page range anyway.

To find occurrences of the character outside the editor, you can pipe the file through a hexdump tool like xxd.

$ cat tmp | xxd | grep c296
00000000: 7061 6765 733d 7b31 c296 3935 7d2c 0a70  pages={1..95},.p
00000020: 6765 733d 7b31 c296 3935 7d2c 0a70 6167  ges={1..95},.pag
00000040: 733d 7b31 c296 3935 7d2c 0a70 6167 6573  s={1..95},.pages
00000060: 7b31 c296 3935 7d2c 0a70 6167 6573 3d7b  {1..95},.pages={
00000080: c296 3935 7d2c 0a70 6167 6573 3d7b 31c2  ..95},.pages={1.
00000090: 9639 357d 2c0a 7061 6765 733d 7b31 c296  .95},.pages={1..
000000b0: 357d 2c0a 7061 6765 733d 7b31 c296 3935  5},.pages={1..95
000000d0: 2c0a 7061 6765 733d 7b31 c296 3935 7d2c  ,.pages={1..95},
000000f0: 7061 6765 733d 7b31 c296 3935 7d2c 0a70  pages={1..95},.p
00000110: 6765 733d 7b31 c296 3935 7d2c 0a70 6167  ges={1..95},.pag
00000130: 733d 7b31 c296 3935 7d2c 0a70 6167 6573  s={1..95},.pages
00000150: 7b31 c296 3935 7d2c 0a70 6167 6573 3d7b  {1..95},.pages={
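
Note that xxd groups bytes in pairs by default, so an occurrence that starts at an odd offset shows up as `31c2 9639` and is not matched by `grep c296` (the row at 00000080 above ends with such a split). If you need an exact count, a sketch that sidesteps the grouping is to print one hex pair per byte with od; `count_pair` is a hypothetical helper name.

```shell
# count_pair: count C2 96 pairs in FILE, wherever they fall relative to
# the dump's byte groups or 16-byte rows.
count_pair() {
    od -An -v -tx1 < "$1" |    # one space-separated hex pair per byte
        tr -s ' \n' ' ' |      # join all rows into a single line
        grep -o 'c2 96' |      # one output line per occurrence
        wc -l | tr -d ' '
}
```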