Unicode character classification

C & C++, 1216 bytes/chars, previously 1922 (BMP only: 1005 bytes/chars, previously 1451)

This is the "shortest" solution that came to mind that is both valid C and C++ and that supports all of Unicode version 6.1, not just the BMP:

int s[]={-74,-68,31054,-50,-49,-48,-47,-46,-44,-43,-39,-38,-37,-34,-33,-32,-30,-29,-28,-27,-26,-25,-24,-23,-22,-17,-16,-13,-12,-11,-10,-9,-8,-7,-6,-5,-4,-3,-2,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,16,19,20,27,28,29,53195,31,34,36,42,49,50,52,54,58,59,62,64,66,73,81,98,102,103,140,194,263,275,290,1066,1222,1230,2685,2732,2890,3379,6009,21254,22391,30996,53209};
char l[]="l5Hf`7HIAH2IH@IH7IJFIFHJIJKHLJKFHFJFIJIHIJKIJHJFHNFJ@IH?IHJIK1IBHFJHILDI!HI4HpIFKFHWZ-HIHKFH<IDHIFJHc&H9IQ4IJBIH(Ie+H{)Hh<HI.Hn I@H$I@HPCHRAHPAHPCHRAHPAHP;HJAHPAHPAHPDHIHNKFHIHPEHJHPAHRFHIHrKHKZFLJHPEHLdx'HJKHEILIHICHM$IHOILT,HIM~8IX=IoBIFH/IAHFIJDILIJIUDIk|BHTDHt5Hv*H\\5HYBHI9HY5HYEHFIBHI>HY5HY5HY5HY5HY5HY5HY5HY5HY3H[6HICHY6HICHY6HICHY6HICHY6HICHIG";
char u[]="i5Hm8HIBH^3IJAIJ7IHFIKHFIHIFHJEHIHIFHKHIHEIHIJIHIFHFIHKOEJAIJ@IFJIFH1IOHIHJIEHDIsIKVIFHFIHI:HI@H_JFHK=IMJIHJ#Ha:IQ3IHBIJ)IR,Hy,HIMz IQ%IQAHPCHRAHPAHPCHSEIPAHjEHTEHTEHTDHSEHqLKFHJFHJKDHNEIEHJEHRHMgw'HbIFHJEIEHIJPFH$IPIL\"8IX=IoBIK0IRFIHDILIJIUCI}5Hu*H\1775HY5HY5HYIHFJHJEHIAHY5HYHIEHJAHIBHZHIEHIDHIKBHZ5HY5HY5HY5HY5HY5H]6H^6H^6H^6H^6H^G";
int x(int a,char*p){int c=0,i=0,b=-1;while(c>0||s[*p-32]){if(!c)c=s[*p-32]<0?-s[*p++-32]:1,i=s[*p++-32];b+=i,--c;if(a==b)return 1;}return 0;}

Please note that "shortest" in this case does not mean that the solution is in any way efficient; it simply means that it is the shortest source code solution. It does not require any Unicode support in the standard library, and in fact does not use the standard library at all. (Which I now understand is what code golf means ... I'm a newbie around here, what can I say.)

To build the lower and upper case data tables I worked as follows:

  1. Downloaded UnicodeData.txt.
  2. Extracted a list of all lines that included the text ";Ll;" or ";Lu;" to indicate a lower case letter or upper case letter respectively.
  3. Built two vectors of code points using the data from step 2.
  4. Converted each absolute code point value to a relative value, the difference between the previous code point and this code point. Because it is convenient for the difference to always be a positive integer (not zero or negative), I use -1 as the value of the first previous code point. In this way zero will not appear anywhere in the sequence.
  5. Used a form of run length encoding to compress runs of identical values into a pair of values. If a value appears two or more times consecutively, replace the whole run with two values: the negative of the run length, followed by the value itself. Otherwise the value only appears once, so just use it as is.
  6. Terminated the compressed vector with zero to mark the end of the sequence.
  7. I noticed there were 96 unique values in the run length encoded vectors, so I built an array of the unique integers and used the index into that array for the lower and upper case vectors.
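Steps 4 to 6 can be sketched roughly as follows. This is a minimal illustration that assumes the code points are already sorted in ascending order; the name encode and its signature are made up for this example and are not the tool that actually produced the tables above.

```c
#include <stddef.h>

/* Hypothetical helper (not part of the original answer).
   Delta-encodes a sorted list of code points, then run-length encodes
   the deltas. A run of n >= 2 equal deltas d is written as the pair
   (-n, d); a single delta d is written as is; a trailing 0 ends the
   sequence. out needs room for n + 1 ints.
   Returns the number of ints written, including the terminator. */
size_t encode(const int *cps, size_t n, int *out)
{
    size_t i = 0, w = 0;
    int prev = -1;                  /* step 4: first delta is always >= 1 */

    while (i < n) {
        int d = cps[i] - prev;
        size_t run = 1;

        /* step 5: count how many of the following deltas equal d */
        while (i + run < n && cps[i + run] - cps[i + run - 1] == d)
            ++run;

        if (run >= 2)
            out[w++] = -(int)run;   /* negative run length first */
        out[w++] = d;               /* then the delta itself */

        prev = cps[i + run - 1];
        i += run;
    }

    out[w++] = 0;                   /* step 6: terminator */
    return w;
}
```

For the code points 97, 98, 99, 100, 181, 223 this yields the sequence 98, -3, 1, 81, 42, 0: one jump to 'a', a run of three steps of 1, two single jumps, and the terminator.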

The above process compressed the vector of lower case code points from 1751 unique values to 350 mostly small non-unique values. In like fashion, the upper case code point vector went from 1441 values to 331 values.

Next I wrote the unique value vector out as a comma separated list of integers suitable for including in source code. I assumed int was 32 bits to avoid using long so that I could save an additional four characters / bytes of source code. Then I wrote out the lower and upper case vectors as strings, where each character in the string has a code 32 greater than its index into the unique integer array. This saves about half the characters necessary to encode the lower and upper case arrays, because the comma characters are omitted. Three of the characters used have to be escaped in the source (\", \\, and \177). They are assigned to unique integers that only appear once in the data so as to minimize the size of the string literals.
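The string step could look something like the sketch below. The function name to_literal, its parameters, and the linear search are assumptions chosen for illustration; only the mapping (index + 32) and the three escapes come from the description above.

```c
#include <stddef.h>

/* Hypothetical helper (not part of the original answer).
   Writes the body of a C string literal, without the surrounding
   quotes, for an encoded sequence. Each value is looked up in the
   table of unique values and emitted as the character (index + 32).
   out needs room for 4 * n + 1 chars.
   Returns the number of chars written, or 0 if a value is missing
   from the table or its index does not map into 32..127. */
size_t to_literal(const int *enc, size_t n,
                  const int *table, size_t m, char *out)
{
    size_t i, j, w = 0;

    for (i = 0; i < n; ++i) {
        int ch;

        for (j = 0; j < m && table[j] != enc[i]; ++j)
            ;                       /* linear search for the index */
        if (j == m || j > 95)
            return 0;

        ch = (int)j + 32;
        if (ch == '"' || ch == '\\') {
            out[w++] = '\\';        /* written as \" or \\ */
            out[w++] = (char)ch;
        } else if (ch == 127) {     /* DEL, written as octal \177 */
            out[w++] = '\\';
            out[w++] = '1';
            out[w++] = '7';
            out[w++] = '7';
        } else {
            out[w++] = (char)ch;
        }
    }

    out[w] = '\0';
    return w;
}
```

An octal escape consumes at most three digits, so \177 stays intact even when the next character in the string is a digit.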

What I wound up with was a global array of integers, two global arrays of characters, and a function: an array named s for the unique signed integers, an array named u for the character based upper case code point indexes, and another array named l for the character based lower case code point indexes. The function x takes a code point a and a pointer p to one of the character arrays, and returns 1 if the code point is in the decompressed / normalized array of code points or 0 otherwise.

If you want to determine if code point cp is lower case, you call "x(cp, l)". If you want to determine if it is upper case, you call "x(cp, u)".

Since the point of this exercise is to make the smallest code that accomplishes the task, I never save the decompressed / normalized data. I decompress it piecemeal, as needed, every time the function is called.
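For contrast, here is a rough sketch of what expanding the data once and keeping it would involve. It works on the plain integer stream (run lengths and deltas, zero terminated) rather than on the character strings, and the name decode and its signature are assumptions made for this example. Something like it could also be used to check a table against the list extracted from UnicodeData.txt.

```c
#include <stddef.h>

/* Hypothetical helper (not part of the original answer).
   Expands a zero-terminated stream of run lengths and deltas back
   into absolute code points. Stores at most cap values in out.
   Returns the total number of code points in the stream. */
size_t decode(const int *enc, int *out, size_t cap)
{
    size_t n = 0;
    int code = -1;                  /* same starting point as the encoder */

    while (*enc != 0) {
        int count = 1, step;

        if (*enc < 0)               /* negative value: run length first */
            count = -*enc++;
        step = *enc++;

        while (count-- > 0) {
            code += step;
            if (n < cap)
                out[n] = code;
            ++n;
        }
    }
    return n;
}
```

Given the stream 98, -3, 1, 81, 42, 0 it produces the six code points 97, 98, 99, 100, 181, 223.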

In the compact version it would be possible to terminate the while loop early once decompression has passed the code point being checked. That would add extra code, so I didn't bother, deciding I'd rather pay the speed penalty.

I admit the code is ugly. I did not write it this way originally. Here is the C++ version that I derived the compact C version from:

const int Ii[] = {-74,-68,31054,-50,-49,-48,-47,-46,-44,-43,-39,-38,-37,-34,-33,-32,-30,-29,-28,-27,-26,-25,-24,-23,-22,-17,-16,-13,-12,-11,-10,-9,-8,-7,-6,-5,-4,-3,-2,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,16,19,20,27,28,29,53195,31,34,36,42,49,50,52,54,58,59,62,64,66,73,81,98,102,103,140,194,263,275,290,1066,1222,1230,2685,2732,2890,3379,6009,21254,22391,30996,53209};
const char Ll[] = "l5Hf`7HIAH2IH@IH7IJFIFHJIJKHLJKFHFJFIJIHIJKIJHJFHNFJ@IH?IHJIK1IBHFJHILDI!HI4HpIFKFHWZ-HIHKFH<IDHIFJHc&H9IQ4IJBIH(Ie+H{)Hh<HI.Hn I@H$I@HPCHRAHPAHPCHRAHPAHP;HJAHPAHPAHPDHIHNKFHIHPEHJHPAHRFHIHrKHKZFLJHPEHLdx'HJKHEILIHICHM$IHOILT,HIM~8IX=IoBIFH/IAHFIJDILIJIUDIk|BHTDHt5Hv*H\\5HYBHI9HY5HYEHFIBHI>HY5HY5HY5HY5HY5HY5HY5HY5HY3H[6HICHY6HICHY6HICHY6HICHY6HICHIG";
const char Lu[] = "i5Hm8HIBH^3IJAIJ7IHFIKHFIHIFHJEHIHIFHKHIHEIHIJIHIFHFIHKOEJAIJ@IFJIFH1IOHIHJIEHDIsIKVIFHFIHI:HI@H_JFHK=IMJIHJ#Ha:IQ3IHBIJ)IR,Hy,HIMz IQ%IQAHPCHRAHPAHPCHSEIPAHjEHTEHTEHTDHSEHqLKFHJFHJKDHNEIEHJEHRHMgw'HbIFHJEIEHIJPFH$IPIL\"8IX=IoBIK0IRFIHDILIJIUCI}5Hu*H\1775HY5HY5HYIHFJHJEHIAHY5HYHIEHJAHIBHZHIEHIDHIKBHZ5HY5HY5HY5HY5HY5H]6H^6H^6H^6H^6H^G";

bool isXx(int cp, const char* Xx)
{
    int count = 0, step = 0, code = -1;

    while ((count > 0) || (Ii[*Xx-32] != 0))
    {
        if (count == 0)
        {
            count = (Ii[*Xx-32] < 0) ? -(Ii[*(Xx++)-32]) : 1;
            step = Ii[*(Xx++)-32];
        }

        code += step;
        --count;

        if (cp < code)
            break;

        if (cp == code)
            return true;
    }

    return false;
}

inline bool isLl(int cp)
{
    return isXx(cp, Ll);
}

inline bool isLu(int cp)
{
    return isXx(cp, Lu);
}

In the end, the compact version takes 1216 bytes, previously 1922 (including CR+LF end of line sequences). The original version takes 1578 bytes, previously 2271.

To put this in the context of the other solutions, which only support the BMP, the equivalent BMP-only compact version takes 1005 bytes, previously 1451 (or 1367 bytes, previously 1800, for the original version).

Tags:

Code Golf