'Reliable' SMS Unicode & GSM Encoding in PHP

To deal with it conceptually before getting into mechanisms, and apologies if any of this is obvious, a string can be defined as a sequence of Unicode characters, Unicode being a database that gives an id number known as a code point to every character you might need to work with. GSM-338 contains a subset of the Unicode characters, so what you're doing is extracting a set of codepoints from your string, and checking to see if that set is contained in GSM-338.

// second column of http://unicode.org/Public/MAPPINGS/ETSI/GSM0338.TXT
$gsm338_codepoints = array(0x0040, 0x0000, ..., 0x00fc, 0x00e0)
$can_use_gsm338 = true;
foreach(codepoints($mystring) as $codepoint){
    if(!in_array($codepoint, $gsm338_codepoints)){
      $can_use_gsm338 = false;
      break;
    }
}

That leaves the definition of the function codepoints($string), which isn't built in to PHP. PHP understands a string to be a sequence of bytes rather than a sequence of Unicode characters. The best way of bridging the gap is to get your strings into UTF8 as quickly as you can and keep them in UTF8 as long as you can - you'll have to use other encodings when dealing with external systems, but isolate the conversion to the interface to that system and deal only with utf8 internally.

The functions you need to convert between php strings in utf8 and sequences of codepoints can be found at http://hsivonen.iki.fi/php-utf8/ , so that's your codepoints() function.

If you're taking data from an external source that gives you Unicode slash-escaped characters ("Let's test \u00f6\u00e4\u00fc..."), that string escape format should be converted to utf8. I don't know offhand of a function to do this, if one can't be found, it's a matter of string/regex processing + the use of the hsivonen.iki.fi functions, for example when you hit \u00f6, replace it with the utf8 representation of the codepoint 0xf6.


Although this is an old thread I recently had to solve a very similar problem and wanted to post my answer. The PHP code is somewhat simple. It starts with a painstakingly large array of GSM valid character codes in an array, then simply checks if the current character is in that array using the ord($string) function which returns the ascii value of the first character of the string passed. Here is the code I use to validate if a string is GSM worth.

    $valid_gsm_keycodes = Array(   
        0x0040, 0x0394, 0x0020, 0x0030, 0x00a1, 0x0050, 0x00bf, 0x0070,
        0x00a3, 0x005f, 0x0021, 0x0031, 0x0041, 0x0051, 0x0061, 0x0071,
        0x0024, 0x03a6, 0x0022, 0x0032, 0x0042, 0x0052, 0x0062, 0x0072,
        0x00a5, 0x0393, 0x0023, 0x0033, 0x0043, 0x0053, 0x0063, 0x0073,
        0x00e8, 0x039b, 0x00a4, 0x0034, 0x0035, 0x0044, 0x0054, 0x0064, 0x0074,
        0x00e9, 0x03a9, 0x0025, 0x0045, 0x0045, 0x0055, 0x0065, 0x0075,
        0x00f9, 0x03a0, 0x0026, 0x0036, 0x0046, 0x0056, 0x0066, 0x0076,
        0x00ec, 0x03a8, 0x0027, 0x0037, 0x0047, 0x0057, 0x0067, 0x0077, 
        0x00f2, 0x03a3, 0x0028, 0x0038, 0x0048, 0x0058, 0x0068, 0x0078,
        0x00c7, 0x0398, 0x0029, 0x0039, 0x0049, 0x0059, 0x0069, 0x0079,
        0x000a, 0x039e, 0x002a, 0x003a, 0x004a, 0x005a, 0x006a, 0x007a,
        0x00d8, 0x001b, 0x002b, 0x003b, 0x004b, 0x00c4, 0x006b, 0x00e4,
        0x00f8, 0x00c6, 0x002c, 0x003c, 0x004c, 0x00d6, 0x006c, 0x00f6,
        0x000d, 0x00e6, 0x002d, 0x003d, 0x004d, 0x00d1, 0x006d, 0x00f1,
        0x00c5, 0x00df, 0x002e, 0x003e, 0x004e, 0x00dc, 0x006e, 0x00fc,
        0x00e5, 0x00c9, 0x002f, 0x003f, 0x004f, 0x00a7, 0x006f, 0x00e0 );


        for($i = 0; $i < strlen($string); $i++) {
            if(!in_array($string[$i], $valid_gsm_keycodes)) return false;
        }

        return true;

Tags:

Php

Unicode

Sms

Gsm