language+region value of the HTML5 lang attribute

The W3C provides this very long guide on choosing language tags/subtags.

The important bits:

Language tag syntax is defined by the IETF's BCP 47. In the past it was necessary to consult lists of codes in various ISO standards to find the right subtags, but now you only need to look in the IANA Language Subtag Registry. We will describe the new registry below.

This article provides advice on how to choose the components of a language tag. For an overview of the concepts defined in BCP 47, see Language tags in HTML and XML.

...

There are tools available which provide additional help while searching the registry, such as Richard Ishida's Language Subtag Lookup tool.

...

Ensure you have the right language. Sometimes, it pays to check a few alternatives. Mark Davis, co-author of BCP47, writes "Often it is not clear which language identifier to use. For example, what most people call Punjabi in Pakistan actually has the code 'lah', and formal name 'Lahnda'. There are many other cases where the same name is used for different languages, or where the name that people search for is not listed in the IANA registry."

You could look up language information in the SIL Ethnologue and cross-reference that information with Wikipedia. The Ethnologue uses the same three-letter codes as BCP47, but you'll need to convert BCP47 2-letter codes to their ISO 639-3 counterpart to look up a language by code. (Richard Ishida's tool does this for you.)

There are a small number of cases where different language codes are available for what many people would regard as the same language, eg. Filipino and Tagalog, or Twi and Akan. There is no indication in the registry as to which you should use, but you should try to ensure that within a single application or context you are consistent.

(Emphasis mine.)

Note that the IANA Language Subtag Registry is fairly hard to use directly. With the exception of grandfathered tags (like en-GB-oed), you have to look up the primary language subtag and the region/variant subtags separately, and the entries are organized by type rather than hierarchically. So save yourself the time and trouble and use Richard Ishida's awesome lookup tool.
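To make "look up the subtags separately" concrete, here is a rough Python sketch that splits a tag into the pieces you would search for in the registry. The `split_tag` helper and its regex are my own simplification, not anything from BCP 47 or the W3C guide: it only handles language, script, region, and variant subtags, and it does not check that the subtags are actually registered.

```python
import re

# Simplified shape of a tag: language[-script][-region][-variant...].
# Assumption: extensions, private-use subtags, extended language subtags,
# and grandfathered tags are deliberately out of scope for this sketch.
_TAG = re.compile(
    r"^(?P<language>[A-Za-z]{2,3})"
    r"(?:-(?P<script>[A-Za-z]{4}))?"
    r"(?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?"
    r"(?P<variants>(?:-(?:[A-Za-z0-9]{5,8}|[0-9][A-Za-z0-9]{3}))*)$"
)


def split_tag(tag):
    """Return the subtags of `tag` by type, or None if the shape is not recognized."""
    match = _TAG.match(tag)
    if match is None:
        return None
    parts = match.groupdict()
    # The variants group is a run like "-1901-foobar"; turn it into a list.
    parts["variants"] = [v for v in parts["variants"].split("-") if v]
    return parts


print(split_tag("fr-CA"))
print(split_tag("zh-Hant-TW"))
print(split_tag("en-GB-oed"))
```

Each non-empty field is something you would look up under its own type in the registry. A grandfathered tag such as en-GB-oed does not fit the pattern ("oed" is too short to be a variant), so the sketch returns None for it, which matches the point that those tags are registered as a whole.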


[This isn't my strongest area, so I'm just citing documentation here, but it seems you've overlooked something.]

The HTML5 spec requires that the lang value be a valid BCP 47 tag. In that document, the relevant bit seems to be in section 3.4:

For example, an implementation could map the extended language ranges to basic ranges. Another possibility would be for an implementation to return the matching tag that is first in ASCII-order. If the language range were "*-CH" ('CH' represents Switzerland) and the set of tags included "de-CH" (German as used in Switzerland), "fr-CH" (French, Switzerland), and "it-CH" (Italian, Switzerland), then the tag "de-CH" would be returned.

...which when you look at it is basically what you got from the HTML 4 spec citing RFC1766, just in much greater detail.
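For what it's worth, the "*-CH" behaviour in the quoted passage can be sketched in a few lines of Python. This is only one of the possibilities the passage mentions ("an implementation could..."), and the function names are made up for illustration. I am assuming the wildcard is matched the way extended filtering is described in RFC 4647, then picking the first match in ASCII order.

```python
def matches_extended(language_range, tag):
    """Rough extended-filtering match of a range like "*-CH" against a tag."""
    range_subtags = language_range.lower().split("-")
    tag_subtags = tag.lower().split("-")
    # The first subtags must agree unless the range starts with a wildcard.
    if range_subtags[0] != "*" and range_subtags[0] != tag_subtags[0]:
        return False
    i, j = 1, 1
    while i < len(range_subtags):
        if range_subtags[i] == "*":
            i += 1  # a wildcard matches any run of subtags
        elif j >= len(tag_subtags):
            return False  # ran out of tag before the range was satisfied
        elif range_subtags[i] == tag_subtags[j]:
            i += 1
            j += 1
        elif len(tag_subtags[j]) == 1:
            return False  # do not skip past an extension or private-use singleton
        else:
            j += 1  # skip a non-matching subtag in the tag
    return True


def first_match_in_ascii_order(language_range, tags):
    """Return the matching tag that sorts first, or None if nothing matches."""
    matching = sorted(t for t in tags if matches_extended(language_range, t))
    return matching[0] if matching else None


print(first_match_in_ascii_order("*-CH", ["it-CH", "fr-CH", "de-CH", "fr-FR"]))
```

With the three Swiss tags from the quote plus fr-FR, this picks de-CH, the same result the passage describes.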


Using <html lang="fr-FR"> and <html lang="fr-CA"> is fine, if they correspond to the actual content. But they are ignored by search engines, just as <html lang="fr"> is.

HTML5 does not mean to change the use of language codes. The system of codes defined in BCP 47 and its extensions is very elaborate and lets you specify a language variant with painful accuracy. The state of the art in software is at a much, much simpler level: fr-FR and fr-CA represent the best granularity you can achieve these days, and quite often just the main code (here, fr) matters.
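As an illustration of "just the main code matters", here is a minimal Python sketch of how a tool might fall back from fr-CA to fr when picking a resource such as a hyphenation dictionary. The helper names and the list of available resources are hypothetical. The truncation rule (drop the last subtag, and drop a dangling single-character subtag with it) follows the lookup scheme in RFC 4647, but real software may well do something simpler.

```python
def fallback_chain(tag):
    """List `tag` followed by progressively shorter versions of it."""
    subtags = tag.split("-")
    chain = []
    while subtags:
        chain.append("-".join(subtags))
        subtags.pop()
        # A single-character subtag (such as "x") only introduces what
        # follows it, so it is removed together with its trailing subtag.
        if subtags and len(subtags[-1]) == 1:
            subtags.pop()
    return chain


def pick_resource(tag, available):
    """Return the most specific entry of `available` matching `tag`, or None."""
    # Language tags are case-insensitive, so compare in lowercase.
    by_lower = {name.lower(): name for name in available}
    for candidate in fallback_chain(tag):
        hit = by_lower.get(candidate.lower())
        if hit is not None:
            return hit
    return None


print(fallback_chain("fr-CA"))
print(pick_resource("fr-CA", ["en", "fr"]))
```

If only a generic "fr" resource exists, both fr-FR and fr-CA end up using it, which is why the region subtag often makes no practical difference.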

There is no evidence of search engines actually paying any attention to any declarations of language code, such as lang attributes. Other software, such as hyphenators, spelling checkers, speech synthesizers, and default font selection algorithms may take lang attributes into account. But search engines perform their heuristic analyses based on actual content.

It is difficult to blame them for this, since this produces better results than trusting the lang attributes. For example, many authoring tools automatically generate lang="en" irrespective of the actual content, without telling the author.