How to interpret character ranges in charmap files?

glibc accepts both three-dot decimal ranges (as specified by POSIX) and two-dot hexadecimal ranges. The two-dot form doesn't appear to be documented anywhere, but it is visible in the source code. It is not portable behaviour but an extension in glibc, and possibly in other implementations. If you're writing your own charmap files, use the decimal form.

Let's confirm that this is the actual behaviour of glibc.

When processing a range, glibc uses:

   if (decimal_ellipsis)
     while (isdigit (*cp) && cp >= from)
       --cp;
   else
     while (isxdigit (*cp) && cp >= from)
       {
         if (!isdigit (*cp) && !isupper (*cp))
           lr_error (lr, _("\
 hexadecimal range format should use only capital characters"));
         --cp;
       }

Here isxdigit tests for a hexadecimal digit and isdigit for a decimal one. Later, the code converts the consumed substring to an integer, branching on decimal_ellipsis in the same way, and carries on as you'd expect. The kind of ellipsis was determined earlier, during parsing, from the token returned by the lexer.
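To make the two steps concrete (scan back over the trailing digits, then convert them in the matching base), here is a minimal sketch. It is not glibc's code: parse_range_end and digit_suffix_start are hypothetical names, the symbolic name is assumed to arrive without its angle brackets (for example "U3400"), and the capital-letter check from the excerpt is left out.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not part of glibc: return the offset at which the
   trailing run of digits in NAME begins.  The digits are decimal when
   DECIMAL_ELLIPSIS is nonzero and hexadecimal otherwise.  */
static size_t
digit_suffix_start (const char *name, int decimal_ellipsis)
{
  size_t pos = strlen (name);

  while (pos > 0
         && (decimal_ellipsis
             ? isdigit ((unsigned char) name[pos - 1])
             : isxdigit ((unsigned char) name[pos - 1])))
    --pos;
  return pos;
}

/* Split NAME into a prefix length and a numeric value.  Returns 0 on
   success and -1 if NAME does not end in at least one digit.  */
static int
parse_range_end (const char *name, int decimal_ellipsis,
                 size_t *prefix_len, unsigned long *value)
{
  size_t start = digit_suffix_start (name, decimal_ellipsis);

  if (name[start] == '\0')
    return -1;
  *prefix_len = start;
  /* Same branch as the scan: base 10 for "...", base 16 for "..".  */
  *value = strtoul (name + start, NULL, decimal_ellipsis ? 10 : 16);
  return 0;
}
```

With this sketch, "U3400" yields the prefix "U" in both modes, but the value is 0x3400 (13312) for a two-dot range and 3400 for a three-dot range.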

The UTF-8 charmap file is mechanically generated from unicode.org's UnicodeData.txt, creating 64-codepoint ranges with two dots. I suppose that this convenient auto-generation is at least partially behind the extension, but I don't know. Earlier versions of glibc also generated the file, using a different program but the same format.

Again, this doesn't appear to be documented anywhere, and since it's auto-generated right next to where it's used it conceivably could change, but I imagine it will be stable.

If given something like

<U3400>..<U3430> /xe3/x90/x80 <CJK Ideograph Extension A>

then it is a hexadecimal range, because it uses two dots. With three dots, it would be a POSIX decimal range.
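As a rough illustration of what the two readings mean for the line above, the following sketch counts the entries in a range and formats a single expanded entry. It is an assumption-laden sketch rather than glibc's implementation: range_count and format_range_entry are hypothetical names, the symbolic names are assumed to have a one-character prefix followed only by digits, and the byte sequence is simply incremented as a big-endian number with any overflow out of the first byte silently dropped.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: number of characters named by the range FROM..TO.
   BASE is 16 for a two-dot range and 10 for a three-dot range.  Names are
   passed without angle brackets, e.g. "U3400".  */
static unsigned long
range_count (const char *from, const char *to, int base)
{
  unsigned long lo = strtoul (from + 1, NULL, base);
  unsigned long hi = strtoul (to + 1, NULL, base);

  return hi < lo ? 0 : hi - lo + 1;
}

/* Hypothetical helper: write the INDEX-th entry of the range starting at
   FROM into OUT as "<name> /x../x..".  Returns 0 on success, -1 on error.  */
static int
format_range_entry (const char *from, int base, unsigned long index,
                    const unsigned char *first_bytes, size_t nbytes,
                    char *out, size_t outsize)
{
  unsigned char bytes[8];
  int width = (int) strlen (from + 1);
  unsigned long value = strtoul (from + 1, NULL, base) + index;
  size_t used;
  int len;

  if (nbytes == 0 || nbytes > sizeof bytes)
    return -1;
  memcpy (bytes, first_bytes, nbytes);

  /* Add INDEX to the byte sequence, carrying from the last byte upward.  */
  for (size_t i = nbytes; i > 0 && index > 0; --i)
    {
      index += bytes[i - 1];
      bytes[i - 1] = (unsigned char) (index & 0xff);
      index >>= 8;
    }

  /* Regenerate the name with the same prefix and digit width.  */
  if (base == 16)
    len = snprintf (out, outsize, "<%c%0*lX>", from[0], width, value);
  else
    len = snprintf (out, outsize, "<%c%0*lu>", from[0], width, value);
  if (len < 0 || (size_t) len >= outsize)
    return -1;
  used = (size_t) len;

  for (size_t i = 0; i < nbytes; ++i)
    {
      len = snprintf (out + used, outsize - used, "%s/x%02x",
                      i == 0 ? " " : "", (unsigned int) bytes[i]);
      if (len < 0 || (size_t) len >= outsize - used)
        return -1;
      used += (size_t) len;
    }
  return 0;
}
```

Read as hexadecimal, the example covers 0x3400 through 0x3430, which is 49 characters, and the last one comes out as `<U3430> /xe3/x90/xb0`. Read as decimal, the same endpoints would cover only 3400 through 3430, which is 31 characters.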

If you're on another system that doesn't have this extension, it would just be a syntax error. A portable character map file should only use the decimal ranges.

