What is the difference between a byte and a character (at least *nixwise)?

POSIXly:

3.87 Character
A sequence of one or more bytes representing a single graphic symbol or control code.

In practice, the exact meaning depends on the locale in effect. E.g. under the "C" locale, printf '\xc3\xa4\xc3\xb6' | wc -m gives 4, since it effectively counts bytes; while under a UTF-8 locale it gives 2, since those are the two UTF-8-encoded characters äö. Assuming your terminal is also set to UTF-8, you could of course just write printf 'äö'.

(Note that wc -c is defined to count bytes, not characters, confusingly enough.)
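
For example, with the same äö written as octal escapes (a sketch assuming a UTF-8 locale is installed under the name en_US.UTF-8; substitute whatever UTF-8 locale your system has):

$ printf '\303\244\303\266' | LC_ALL=C wc -m
4
$ printf '\303\244\303\266' | LC_ALL=en_US.UTF-8 wc -m
2
$ printf '\303\244\303\266' | wc -c
4

wc -c reports 4 in either locale, since it always counts bytes.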

Worse, character support also depends on the utility, and not everything deals with multi-byte characters cleanly (let alone all the quirks of Unicode). E.g. GNU tr deals with bytes, regardless of what its man page says:

$ printf ä | tr ä x; echo
xx
$ printf ö | tr ä x; echo
x�

The first is the same as tr '\303\244' 'x', so both bytes of ä get replaced with x, and the second happens because the first byte of both ä and ö is the same. Of course, if tr really dealt with characters, those should print x and ö.
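
You can verify that byte-level equivalence by writing out the octal escapes yourself; this is the same command as above, just with the two bytes of ä spelled out explicitly:

$ printf '\303\244' | tr '\303\244' x; echo
xx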


A byte is, by convention and by POSIX definition, eight bits. A bit is a binary digit (i.e. the fundamental 1 or 0 that is at the base of nearly all digital computing).

A character is often one byte, and in some contexts (e.g. ASCII) can be defined to be one byte in length. However, Unicode defines a much larger character set, and its encodings such as UTF-8 and UTF-16 can represent a single character (or glyph) with a data payload longer than one byte.

This single character:

Q̴̢̪̘̳̣̞̩̪̑̍̉̆̉͛̑̂̕͝

is composed in Unicode by applying multiple accents (or diacritics) to the base glyph, the simple Q. Its encoding is many more bytes than one in length: putting solely that character into a file and displaying the contents with hexdump rather than cat on my system yields:

$ hexdump -C demo
00000000  51 cc b4 cc 91 cc 8d cc  89 cc 86 cc 89 cd 9d cd  |Q...............|
00000010  9b cc 91 cc 95 cc 82 cc  aa cc 98 cc b3 cc a3 cc  |................|
00000020  a2 cc 9e cc a9 cc aa 0a                           |........|
00000028
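
Counting that same file with wc shows the difference between the two units (assuming a UTF-8 locale):

$ wc -c demo
40 demo
$ wc -m demo
21 demo

That is 40 bytes but only 21 characters: the Q, 19 two-byte combining marks, and the trailing newline. Note that wc -m counts code points, not visible glyphs, so each combining mark is counted separately.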

A byte is the basic unit of data, usually 8 bits long (an 8-bit byte is also called an octet), though there have been (and there probably still are) other sizes. With an 8-bit byte, you can encode 256 different values (from 0 to 255).
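
For instance, od can print the numeric value of each input byte; the largest value a single byte can hold is 255, i.e. 0xFF, or \377 in octal (the exact spacing of od's output may vary):

$ printf '\377' | od -An -tu1
 255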

For characters, things vary based on the encoding and character set used.

  • The simplest and most common encoding/character set is ASCII. Each character uses one byte (actually less: only 7 bits). It includes the lowercase and uppercase letters of the English alphabet without diacritics (accents and the like), digits, common punctuation, and control characters.

  • Then we have a series of 8-bit character sets such as the ISO-8859 series, MS-DOS and Windows codepages, Mac charsets, etc.

    Those are supersets of ASCII (the first 128 values are the same as in ASCII), with the other 128 values used for locale-specific characters (accented characters, alternative scripts such as Greek or Cyrillic...).

    This caused all sorts of headaches when transferring files between computers and even between programs, as not all used the same character set.

    In that case, one character was still one byte.

  • Then came the Unicode family, which tried to unify everything in a single set. That set was obviously larger than 256 characters, so it didn't fit in a single byte.

  • At first, it was thought that 16 bits would be enough, and UCS-2 was devised, using 2 bytes per character (which meant a maximum of 65536 possible characters, though not all of them were assigned; that left-over room is what later made UTF-16 possible).

  • Then it became clear that 2 bytes would not always be enough, so UTF-16 was introduced, which uses surrogate pairs to encode additional characters. Characters in the BMP (Basic Multilingual Plane) still use exactly 2 bytes, but the "extra" characters use 2 code units of 2 bytes each, for a total of 4 bytes.

    UTF-16 is the native encoding of Windows NT and its successors. However, even UTF-16 had issues, as not everybody agreed on the order of the two bytes in a code unit: little-endian or big-endian. So we have UTF-16LE and UTF-16BE, with or without a BOM (byte order mark).

  • There are also UCS-4 and UTF-32, which use 4 bytes per character (UTF-32 is limited to the values that can be expressed in UTF-16), but those are quite rare.

  • UTF-8 is a variable-length encoding which is probably becoming the most common encoding. A character can be encoded as anywhere between 1 and 4 bytes.

    The genius in UTF-8 is that the ASCII part of Unicode (code points 0 to 127) is still encoded as a single byte, and code points beyond that are guaranteed to never include bytes between 0 and 127. This allows a certain level of compatibility with software where specific characters have a special meaning, including / (or \ or :) for paths, a lot of punctuation (!=+-*/^"'<>[]{} etc.) for programming languages and shells, control characters such as CR, LF or tab, spaces, etc.
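
To make those sizes concrete, here is a string containing a 1-, 2-, 3- and 4-byte UTF-8 character (A, é, € and 😀, the last one being U+1F600, outside the BMP), first as UTF-8 and then converted to UTF-16LE with iconv. This is only a sketch assuming a UTF-8 terminal; note how UTF-16 needs a surrogate pair (3d d8 00 de) for the emoji:

$ printf 'Aé€😀' | hexdump -C
00000000  41 c3 a9 e2 82 ac f0 9f  98 80                    |A.........|
0000000a
$ printf 'Aé€😀' | iconv -f UTF-8 -t UTF-16LE | hexdump -C
00000000  41 00 e9 00 ac 20 3d d8  00 de                    |A.... =...|
0000000a

The ASCII A stays a single byte in UTF-8 but takes two bytes in UTF-16, while the emoji takes four bytes in both encodings.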

But in Unicode, there's an additional complexity: code points can be composed. You can encode é as a single character é (U+00E9 LATIN SMALL LETTER E WITH ACUTE), or as e (U+0065 LATIN SMALL LETTER E) followed by ◌́ (U+0301 COMBINING ACUTE ACCENT). As shown in DopeGhoti's answer, you can stack up quite a few combining marks on a single letter!
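
You can see the difference with a printf that understands \u escapes (bash's builtin or GNU printf, for instance; a sketch assuming a UTF-8 locale):

$ printf '\u00e9' | hexdump -C
00000000  c3 a9                                             |..|
00000002
$ printf 'e\u0301' | hexdump -C
00000000  65 cc 81                                          |e..|
00000003

Both render as é, but wc -m would count 1 character for the first and 2 for the second.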

Diacritics are not the only combining code points. There are many which are used to make variations, especially for emoji: you can change their skin colour, their gender or their age, or combine several of them into a single family emoji (for example man + woman + boy, joined together with zero-width joiners). That last one is 5 code points and takes 18 bytes!
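
Building that family from its individual code points shows the arithmetic (a sketch assuming bash or GNU printf for the \u/\U escapes and a UTF-8 locale): U+1F468 is the man, U+200D the zero-width joiner, U+1F469 the woman and U+1F466 the boy, so 4 + 3 + 4 + 3 + 4 = 18 bytes for 5 code points:

$ printf '\U0001F468\u200D\U0001F469\u200D\U0001F466' | wc -m
5
$ printf '\U0001F468\u200D\U0001F469\u200D\U0001F466' | wc -c
18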