Why doesn't Git natively support UTF-16?

I devote a significant chunk of a full chapter of my (currently rather moribund) book (see Chapter 3, which is in better shape than later chapters) to the issue of character encoding, because it is a historical mess. It's worth mentioning here, though, that part of the premise of this question—that Git supports UTF-7 and UTF-32 in some way—is wrong: UTF-7 is a standard that never really caught on and should probably never be used at all (so naturally, older Internet Explorer versions do use it, which leads to the security issue mentioned on the linked Wikipedia page).

That said, let's first separate character encoding from code pages. (See footnote-ish section below as well.) The fundamental problem here is that computers—well, modern ones anyway—work with a series of 8-bit bytes, with each byte representing an integer in the range [0..255]. Older systems had 6, 7, 8, and even 9-bit bytes, though I think calling anything less than 8 bits a "byte" is misleading. (BBN's "C machines" had 10-bit bytes!) In any case, if one byte represents one character-symbol, this gives us an upper limit of 256 kinds of symbols. In those bad old days of ASCII, that was sufficient, since ASCII had just 128 symbols, 33 of them being non-printing symbols (control codes 0x00 through 0x1f, plus 0x7f representing DEL or a deleted punch on paper tape, writing them in hexadecimal here).

When we needed more than 94 printable symbols plus the space (0x20), we—by we I mean people using computers all over the world, not specifically me—said: Well, look at this, we have 128 unused encodings, 0x80 through 0xff, let's use some of those! So the French used some for ç and é and so on, and punctuation like « and ». The Czechs needed one for Z-with-caron, ž. The Russians needed lots, for Cyrillic. The Greeks needed lots, and so on. The result was that the upper half of the 8-bit space exploded into many incompatible sets, which people called code pages.

Essentially, the computer stores some eight-bit byte value, such as 235 decimal (0xEB hex), and it's up to something else—another computer program, or ultimately a human staring at a screen, to interpret that 235 as, say, a Cyrillic л character, or a Greek λ, or whatever. The code page, if we are using one, tells us what "235" means: what sort of semantics we should impose on this.
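
Here is a quick illustration of that, using Python's bundled code-page tables rather than anything in Git. The same byte yields two different characters depending on which code page you pick:

    # One byte, two interpretations: decode 0xEB (235 decimal) under two code pages.
    b = bytes([0xEB])            # 235 decimal
    print(b.decode("cp1251"))    # 'л'  (Cyrillic small el, Windows-1251)
    print(b.decode("cp1253"))    # 'λ'  (Greek small lambda, Windows-1253)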

The problem here is that there is a limit on how many character codes we can support. If we want to have the Cyrillic L (л) coexist with the Greek L (lambda, λ), we can't use both CP-1251 and CP-1253 at the same time, so we need a better way to encode the symbol. One obvious way is to stop using one-byte values to encode symbols: if we use two-byte values, we can encode 65536 values, 0x0000 through 0xffff inclusive; subtract a few for control codes and there is still room for many alphabets. However, we rapidly blew through even this limit, so we went to Unicode, which has room for 1,114,112 of what it calls code points, each of which represents some sort of symbol with some sort of semantic meaning. Somewhat over 100,000 of these are now in use, including many Emoji.

Encoding Unicode into bytes or words

This is where UTF-8, UTF-16, UTF-32, UCS-2, and UCS-4 all come in. These are all schemes for encoding Unicode code points—one of those ~1 million values—into byte-streams. I'm going to skip over the UCS ones entirely and look only at the UTF-8 and UTF-16 encodings, since those are the two that are currently the most interesting. (See also What is Unicode, UTF-8, UTF-16?)

The UTF-8 encoding is straightforward: any code point whose decimal value is less than 128 is encoded as a byte containing that value. This means that ordinary ASCII text characters remain ordinary ASCII text characters. Code points in 0x0080 (128 decimal) through 0x07ff (2047 decimal) encode into two bytes, both of whose value is in the 128-255 range and hence distinguishable from a one-byte encoded value. Code points in the 0x0800 through 0xffff range encode into three bytes in that same 128-255 range, and the remaining valid values encode into four such bytes. The key here as far as Git itself is concerned is that no byte of any multi-byte encoding resembles an ASCII NUL (0x00) or slash (0x2f).
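
A small illustration of the byte counts (Python here purely for demonstration; none of this is Git code):

    # Byte counts for UTF-8 across the ranges described above.
    for ch in ("A",       # U+0041, below 128          -> 1 byte
               "é",       # U+00E9, 0x0080..0x07ff     -> 2 bytes
               "€",       # U+20AC, 0x0800..0xffff     -> 3 bytes
               "🙂"):     # U+1F642, above 0xffff      -> 4 bytes
        encoded = ch.encode("utf-8")
        print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded.hex(' ')}")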

What this UTF-8 encoding does is to allow Git to pretend that text strings—and especially file names—are slash-separated name components whose ends are, or can be anyway, marked with ASCII NUL bytes. This is the encoding that Git uses in tree objects, so UTF-8 encoded tree objects just fit, with no fiddling required.
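
If you want to convince yourself of the "no NUL, no slash" property, a brute-force check is easy (again a Python sketch of mine, not how Git does anything):

    # In UTF-8, the bytes 0x00 and 0x2F never occur inside the encoding of any
    # other code point, so a 0x2F byte in a UTF-8 path always means a real '/'
    # separator, and a 0x00 byte always means an actual NUL terminator.
    for cp in range(0x80, 0x110000):
        if 0xD800 <= cp <= 0xDFFF:        # surrogates cannot be encoded
            continue
        if {0x00, 0x2F} & set(chr(cp).encode("utf-8")):
            print(f"surprise at U+{cp:04X}")
            break
    else:
        print("no multi-byte UTF-8 sequence contains 0x00 or 0x2F")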

UTF-16 encoding uses two paired bytes per character. This has two problems for Git and pathnames. First, a byte within a pair might accidentally resemble /, and all ASCII-valued characters necessarily encode as a pair of bytes in which one byte is 0x00, which resembles ASCII NUL. So Git would need to know: this path name has been encoded in UTF-16, and work on byte-pairs. There's no room in a tree object for this information, so Git would need a new object type. Second, whenever we break a 16-bit value into two separate 8-bit bytes, we do this in some order: either I give you the more significant byte first, then the less significant byte; or I give you the less significant byte first, then the more significant one. This second problem is the reason that UTF-16 has Byte Order Marks. UTF-8 needs no byte order mark and suffices, so why not use that in trees? So Git does.
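
To see both problems at once, compare the encodings of one ordinary path name (Python again, just for illustration):

    path = "dir/file.txt"
    print(path.encode("utf-8").hex(" "))      # 64 69 72 2f 66 69 6c 65 2e 74 78 74
    print(path.encode("utf-16-le").hex(" "))  # 64 00 69 00 72 00 2f 00 ...  NULs everywhere
    print(path.encode("utf-16-be").hex(" "))  # 00 64 00 69 00 72 00 2f ...  same bytes, other order
    print(path.encode("utf-16").hex(" "))     # ff fe 64 00 ...  a BOM first, then (usually) little-endian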

That's fine for trees, but we also have commits, tags, and blobs

Git does its own interpretation of three of these four kinds of objects:

  1. Commits contain hash IDs.
  2. Trees contain path names, file modes, and hash IDs.
  3. Tags contain hash IDs.

The one that's not listed here is the blob, and for the most part, Git does not do any interpretation of blobs.

To make it easy to understand the commits, trees, and tags, Git constrains all three to be in UTF-8 for the most part. However, Git does allow the log message in a commit, or the tag text in a tag, to go somewhat (mostly) uninterpreted. These come after the header that Git interprets, so even if there is something particularly tricky or ugly at this point, that's pretty safe. (There are some minor risks here since PGP signatures, which appear below the headers, do get interpreted.) For commits in particular, modern Git will include an encoding header line in the interpreted section, and Git can then attempt to decode the commit message body, and re-encode it into whatever encoding is used by whatever program is interpreting the bytes that Git spits out.1
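
To make the header-vs-body split concrete, here is a sketch of a raw commit object carrying an encoding header. The layout (headers, blank line, free-form message) is the real one; the hashes, names, and the little parsing loop are invented for illustration. Running git cat-file commit HEAD shows the same layout for a real commit.

    # Illustrative only: the tree hash is the well-known empty tree, and the
    # author/committer lines are made up.
    raw = (
        b"tree 4b825dc642cb6eb9a060e54bf8d69288fbee4904\n"
        b"author A U Thor <author@example.com> 1546300800 +0100\n"
        b"committer A U Thor <author@example.com> 1546300800 +0100\n"
        b"encoding ISO-8859-1\n"
        b"\n"
        b"Commit message in ISO-8859-1: caf\xe9\n"
    )

    headers, _, message = raw.partition(b"\n\n")
    encoding = "utf-8"                      # the documented default when no header is present
    for line in headers.split(b"\n"):
        if line.startswith(b"encoding "):
            encoding = line[len(b"encoding "):].decode("ascii")

    print(encoding)                  # ISO-8859-1
    print(message.decode(encoding))  # Commit message in ISO-8859-1: café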

The same rules could work for annotated tag objects. I'm not sure if Git has code to do that for tags (the commit code could mostly be re-used, but tags much more commonly have PGP signatures, and it's probably wiser just to force UTF-8 here). Since trees are internal objects, their encoding is largely invisible anyway—you do not need to be aware of this (except for the issues that I point out in my book).

This leaves blobs, which are the big gorilla.


1This is a recurring theme in the computing world: everything is repeatedly encoded and decoded. Consider how something arrives over WiFi or a cable network connection: it's been encoded into some sort of radio wave or similar, and then some hardware decodes that into a bit-stream, which some other hardware re-encodes into a byte stream. Hardware and/or software strip off headers, interpret the remaining encoding in some way, change the data appropriately, and re-encode the bits and bytes, for another layer of hardware and software to deal with. It's a wonder anything ever gets done.


Blob encoding

Git likes to claim that it's entirely agnostic to the actual data stored in your files, as Git blobs. This is even mostly true. Or, well, half true. Or something. As long as all Git is doing is storing your data, it's completely true! Git just stores bytes. What those bytes mean is up to you.

This story falls apart when you run git diff or git merge, because the diff algorithms, and hence the merge code, are line-oriented. Lines are terminated with newlines. (If you're on a system that uses CRLF instead of newline, well, the second character of a CRLF pair is a newline, so there's no problem here—and Git is OK with an unterminated final line, though this causes some minor bits of heartburn here and there.) If the file is encoded in UTF-16, a lot of bytes tend to appear to be ASCII NULs, so Git just treats it as binary.

This is fixable: Git could decode the UTF-16 data into UTF-8, feed that data through all of its existing line-oriented algorithms (which would now see newline-terminated lines), and then re-encode the data back to UTF-16. There are a bunch of minor technical issues here; the biggest is deciding that some file is UTF-16, and if so, which endian-ness (UTF-16-LE, or UTF-16-BE?). If the file has a byte order marker, that takes care of the endian issue, and UTF-16-ness could be coded into .gitattributes just as you can currently declare files binary or text, so it's all solvable. It's just messy, and no one has done this work yet.
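
Here is roughly what that decode/re-encode dance would look like, as a Python sketch of my own under the assumption that a BOM is present; it is an outline of the idea, not anything from Git's source:

    # Sniff the BOM to pick an endianness, decode to UTF-8 for the
    # line-oriented algorithms, and re-encode afterwards.
    def utf16_to_utf8(data: bytes) -> tuple[bytes, str]:
        if data.startswith(b"\xff\xfe"):
            enc = "utf-16-le"
        elif data.startswith(b"\xfe\xff"):
            enc = "utf-16-be"
        else:
            raise ValueError("no BOM: endianness would have to come from .gitattributes")
        return data[2:].decode(enc).encode("utf-8"), enc

    def utf8_to_utf16(data: bytes, enc: str) -> bytes:
        bom = b"\xff\xfe" if enc == "utf-16-le" else b"\xfe\xff"
        return bom + data.decode("utf-8").encode(enc)

    original = "line one\nline two\n".encode("utf-16")   # BOM plus native byte order
    as_utf8, enc = utf16_to_utf8(original)
    assert as_utf8.splitlines() == [b"line one", b"line two"]  # newline-terminated lines again
    assert utf8_to_utf16(as_utf8, enc) == original             # and it round-trips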

Footnote-ish: code pages can be considered a (crappy) form of encoding

I mentioned above that the thing we do with Unicode is to encode a 21-bit code point value in some number of eight-bit bytes (1 to 4 bytes in UTF-8, 2 bytes in UTF-16—there's an ugly little trick with what UTF-16 calls surrogates to squeeze 21 bits of value into 16 bits of container, occasionally using pairs of 16-bit values, here). This encoding trick means we can represent all legal 21-bit code point values, though we may need multiple 8-bit bytes to do so.
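
For a concrete example of the surrogate trick (Python, for illustration only): a code point above 0xffff costs two 16-bit code units in UTF-16 and four bytes in UTF-8.

    ch = "\U0001F600"                             # code point 0x1F600, beyond the 16-bit range
    utf16 = ch.encode("utf-16-be")
    print(utf16.hex(" "))                         # d8 3d de 00  -> surrogate pair D83D, DE00
    print(len(utf16) // 2, "UTF-16 code units")   # 2
    print(len(ch.encode("utf-8")), "UTF-8 bytes") # 4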

When we use a code page (CP-number), what we're doing is, or at least can be viewed as, mapping 256 values—those that fit into one 8-bit byte—into that 21-bit code point space. We pick out some subset of no more than 256 such code points and say: These are the code points we'll allow. We encode the first one as, say, 0xa0, the second as 0xa1, and so on. We always leave room for at least a few control codes—usually all 32 in the 0x00 through 0x1f range—and usually we leave the entire 7-bit ASCII subset, as Unicode itself does (see https://en.wikipedia.org/wiki/List_of_Unicode_characters), which is why we most typically start at 0xa0.

When one writes proper Unicode support libraries, code pages simply become translation tables, using just this form of indexing. The hard part is making accurate tables for all the code pages, of which there are very many.
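
As a sketch of that idea (leaning on Python's bundled codec data rather than hand-built tables, so this is illustration, not a proper support library):

    # A code page as a 256-entry translation table: index by the byte value,
    # get back a code point in the 21-bit Unicode space.
    table = [bytes([b]).decode("cp1251", errors="replace") for b in range(256)]
    print(table[0xEB])              # 'л'
    print(hex(ord(table[0xEB])))    # 0x43b, i.e. code point U+043B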

The nice thing about code pages is that characters are once again one-byte-each. The bad thing is that you choose your symbol set once, when you say: I use this code page. From then on, you are locked into this small subset of Unicode. If you switch to another code page, some or all of your eight-bit byte values represent different symbols.


The first mention of UTF-8 in the Git codebase dates back to d4a9ce7 (Aug. 2005, v0.99.6), which was about mailbox patches:

Optionally, with the '-u' flag, the output to .info and .msg is transliterated from its original charset to utf-8. This is to encourage people to use utf8 in their commit messages for interoperability.

This was signed by Junio C Hamano / 濱野 純 <[email protected]>.

Character encoding was clarified in commit 3a59e59 (July 2017, Git v2.6.0-rc0):

That "git is encoding agnostic" is only really true for blob objects.
E.g. the 'non-NUL bytes' requirement of tree and commit objects excludes UTF-16/32, and the special meaning of '/' in the index file as well as space and linefeed in commit objects eliminates EBCDIC and other non-ASCII encodings.

Git expects bytes < 0x80 to be pure ASCII, thus CJK encodings that partly overlap with the ASCII range are problematic as well.
E.g. fmt_ident() removes trailing 0x5C from user names on the assumption that it is ASCII '\'.
However, there are over 200 GBK double byte codes that end in 0x5C.

UTF-8 as default encoding on Linux and respective path translations in the Mac and Windows versions have established UTF-8 NFC as de-facto standard for path names.

See "git, msysgit, accents, utf-8, the definitive answers" for more on that last patch.

The most recent version of Documentation/i18n.txt includes:

Git is to some extent character encoding agnostic.

  • The contents of the blob objects are uninterpreted sequences of bytes. There is no encoding translation at the core level.

  • Path names are encoded in UTF-8 normalization form C.
    This applies to:

    • tree objects,
    • the index file,
    • ref names, as well as path names in
    • command line arguments,
    • environment variables and
    • config files (.git/config, gitignore, gitattributes and gitmodules)

You can see an example of UTF-8 path conversion in commit 0217569 (Jan. 2012, Git v2.1.0-rc0), which added Win32 Unicode file name support.

Changes opendir/readdir to use Windows Unicode APIs and convert between UTF-8/UTF-16.

Regarding command-line arguments, cf. commit 3f04614 (Jan. 2011, Git v2.1.0-rc0), which converts command line arguments from UTF-16 to UTF-8 on startup.


Note: before Git 2.21 (Feb. 2019), the code and tests assumed that the system-supplied iconv() would always use a BOM in its output when asked to encode to UTF-16 (or UTF-32), but apparently some implementations output big-endian without a BOM.
A compile-time knob has been added to help such systems (e.g. NonStop) add a BOM to the output, to increase portability.

utf8: handle systems that don't write BOM for UTF-16

When serializing UTF-16 (and UTF-32), there are three possible ways to write the stream. One can write the data with a BOM in either big-endian or little-endian format, or one can write the data without a BOM in big-endian format.

Most systems' iconv implementations choose to write it with a BOM in some endianness, since this is the most foolproof, and it is resistant to misinterpretation on Windows, where UTF-16 and the little-endian serialization are very common.
For compatibility with Windows and to avoid accidental misuse there, Git always wants to write UTF-16 with a BOM, and will refuse to read UTF-16 without it.

However, musl's iconv implementation writes UTF-16 without a BOM, relying on the user to interpret it as big-endian. This causes t0028 and the related functionality to fail, since Git won't read the file without a BOM.


Git has recently begun to understand encodings such as UTF-16. See the gitattributes docs, and search for working-tree-encoding.

If you want .txt files to be UTF-16 without a BOM on a Windows machine, then add this to your .gitattributes file:

*.txt text working-tree-encoding=UTF-16LE eol=CRLF
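
If you do want the working-tree files written with a BOM, the un-suffixed encoding name should produce one (per the gitattributes documentation and the commit message quoted earlier):

*.txt text working-tree-encoding=UTF-16 eol=CRLF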

Added in response to @jthill's comments above

No doubt UTF-16 is a mess. However, consider:

  • Java uses UTF-16
  • As does Microsoft

    Note the line UTF-16… the one used for native Unicode encoding on Windows operating systems

  • JavaScript uses a mess between UCS-2 and UTF-16
