What character encoding should I use for an HTTP header?

In short: Only ASCII is guaranteed to work. Some non-ASCII bytes are allowed for backwards compatibility, but recipients are supposed to treat them as opaque data, not as displayable text.

HTTPbis (RFC 7230, section 3.2.4) gave up and specified that there is no useful encoding in headers besides ASCII:

Historically, HTTP has allowed field content with text in the ISO-8859-1 charset [ISO-8859-1], supporting other charsets only through use of [RFC2047] encoding. In practice, most HTTP header field values use only a subset of the US-ASCII charset [USASCII]. Newly defined header fields SHOULD limit their field values to US-ASCII octets. A recipient SHOULD treat other octets in field content (obs-text) as opaque data.


Previously, RFC 2616 from 1999 defined this:

Words of *TEXT MAY contain characters from character sets other than ISO-8859-1 [22] only when encoded according to the rules of RFC 2047 [14].

and RFC 2047 is the MIME encoding, so it'd be:

=?UTF-8?Q?=E2=9C=B0?=

but I don't think that many (if any) clients support it.
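For anyone who wants to experiment with the RFC 2047 form anyway, here is a minimal Python sketch of how the encoded-word above can be produced and read back. The helper names rfc2047_q_encode and rfc2047_decode are made up for illustration; only decode_header comes from the standard library. It hex-escapes every byte, which is valid but verbose, and it ignores the 75-character limit RFC 2047 places on a single encoded-word.

```python
from email.header import decode_header


def rfc2047_q_encode(text: str, charset: str = "UTF-8") -> str:
    """Build one RFC 2047 "Q" encoded-word, hex-escaping every byte.

    Hypothetical helper: a sketch only, it does not split long values.
    """
    hex_bytes = "".join(f"={byte:02X}" for byte in text.encode(charset))
    return f"=?{charset}?Q?{hex_bytes}?="


def rfc2047_decode(value: str) -> str:
    """Decode a header value that may contain RFC 2047 encoded-words."""
    parts = []
    for chunk, charset in decode_header(value):
        if isinstance(chunk, bytes):
            # Unencoded runs come back as bytes with charset None.
            parts.append(chunk.decode(charset or "ascii"))
        else:
            # Values without any encoded-word come back as str.
            parts.append(chunk)
    return "".join(parts)


print(rfc2047_q_encode("✰"))                    # =?UTF-8?Q?=E2=9C=B0?=
print(rfc2047_decode("=?UTF-8?Q?=E2=9C=B0?="))  # ✰
```

Whether the receiving side decodes this is a separate matter; as said above, do not count on it.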


Please read the comments first: this answer likely draws the wrong conclusions from the right sources and needs an edit.


You can use any printable ASCII characters, but not special characters like ✰ (which is not ASCII).

Tip: you can encode anything in JSON, because JSON can escape non-ASCII characters as \uXXXX sequences and so produce ASCII-only output.
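A rough sketch of the JSON tip in Python, assuming both ends agree that the header carries JSON. The function names and the "rating" key are invented for the example. json.dumps escapes non-ASCII characters by default (ensure_ascii=True), so the result contains only ASCII.

```python
import json


def to_header_value(obj) -> str:
    """Serialize obj to an ASCII-only JSON string (hypothetical helper)."""
    # ensure_ascii=True is the default; spelled out here for clarity.
    return json.dumps(obj, ensure_ascii=True)


def from_header_value(value: str):
    """Parse the JSON string back into a Python object."""
    return json.loads(value)


value = to_header_value({"rating": "✰"})
print(value)                     # {"rating": "\u2730"}
print(from_header_value(value))  # {'rating': '✰'}
```

The receiver has to know to JSON-decode the value; nothing in HTTP itself tells it to.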

Edit: it may not be obvious at first, but the character encoding declared in a header (for example the charset parameter of Content-Type) applies only to the response body, not to the header itself. (Otherwise it would cause a chicken-and-egg problem.)


I'd like to sum up all the relevant definitions as per the spec linked by Penchant.

message-header = field-name ":" [ field-value ]
field-name     = token
field-value    = *( field-content | LWS )

So, we are after field-value.

LWS            = [CRLF] 1*( SP | HT )
CRLF           = CR LF
CR             = <US-ASCII CR, carriage return (13)>
LF             = <US-ASCII LF, linefeed (10)>
SP             = <US-ASCII SP, space (32)>
HT             = <US-ASCII HT, horizontal-tab (9)>

LWS stands for Linear White Space. Essentially, LWS is Space or Tab, but you can break your field-value into multiple lines by starting a new line before a Space or Tab.
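To make the folding rule concrete, here is a small Python sketch that collapses every LWS run, including a line fold, into a single space, which RFC 2616 allows a recipient to do before interpreting the value. The name unfold and the sample header value are my own, not from the spec.

```python
import re

# LWS = [CRLF] 1*( SP | HT ): an optional CRLF followed by spaces or tabs.
LWS = re.compile(rb"(?:\r\n)?[ \t]+")


def unfold(raw_value: bytes) -> bytes:
    """Replace each LWS run (folded or not) with one space."""
    return LWS.sub(b" ", raw_value)


folded = b"text/html;\r\n charset=utf-8"
print(unfold(folded))  # b'text/html; charset=utf-8'
```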

Let's simplify it to this:

field-value    = <any field-content or Space or Tab>

Now we are after field-content.

field-content  = <the OCTETs making up the field-value
                 and consisting of either *TEXT or combinations
                 of token, separators, and quoted-string>
OCTET          = <any 8-bit sequence of data>
TEXT           = <any OCTET except CTLs,
                 but including LWS>
CTL            = <any US-ASCII control character
                 (octets 0 - 31) and DEL (127)>
token          = 1*<any CHAR except CTLs or separators>
separators     = "(" | ")" | "<" | ">" | "@"
                 | "," | ";" | ":" | "\" | <">
                 | "/" | "[" | "]" | "?" | "="
                 | "{" | "}" | SP | HT

TEXT is the most general rule and includes all the rest, so forget about the rest. For reference, the printable characters of the US-ASCII charset (= ASCII) are octets 32 through 126.

As you can see, all printable ASCII chars are allowed.
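If you want to check a value before sending it, a minimal Python sketch could look like this. The function name is made up, and it is deliberately stricter than the TEXT rule: it accepts only octets 32 through 126 and rejects tabs and folded lines as well.

```python
def is_printable_ascii(value: bytes) -> bool:
    """True if every octet is a printable US-ASCII character (32..126)."""
    return all(0x20 <= octet <= 0x7E for octet in value)


print(is_printable_ascii(b"text/html; charset=utf-8"))  # True
print(is_printable_ascii("✰".encode("utf-8")))          # False
```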
