C pointer to array declaration with bitwise and operator

_ctype_ is a pointer to a global array of 257 bytes. I don't know what _ctype_[0] is used for. _ctype_[1] through _ctype_[256] represent the character categories of characters 0, …, 255 respectively: _ctype_[c + 1] represents the category of the character c. This is the same thing as saying that _ctype_ + 1 points to an array of 256 characters where (_ctype_ + 1)[c] represents the category of the character c.

(_ctype_ + 1)[(unsigned char)_c] is not a declaration. It's an expression using the array subscript operator. It's accessing position (unsigned char)_c of the array that starts at (_ctype_ + 1).
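To make the equivalence concrete, here is a minimal sketch using a stand-in table. The names demo_table, demo_ctype, same_element and check_equivalence are hypothetical and not part of any libc; only the 1 + 256 layout follows the description above.

```c
#include <assert.h>

/* Hypothetical stand-in for the real table: 1 leading slot + 256 character slots.
   Only the entry for 'A' is filled in; every other entry is 0. */
static const unsigned char demo_table[1 + 256] = { [1 + 'A'] = 0x01 };
static const unsigned char *demo_ctype = demo_table;

/* p[i] is defined as *(p + i), so both spellings name the same byte. */
static int same_element(unsigned char c)
{
    return &(demo_ctype + 1)[c] == &demo_ctype[c + 1];
}

/* Usage: verify the equivalence for every possible unsigned char value. */
static void check_equivalence(void)
{
    for (int c = 0; c < 256; c++)
        assert(same_element((unsigned char)c));
    assert((demo_ctype + 1)['A'] == 0x01);
}
```

Calling check_equivalence() from main passes without tripping an assertion, since the two expressions differ only in where the + 1 is written.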

Casting _c from int to unsigned char is not strictly necessary: ctype functions take char values cast to unsigned char (char is signed on OpenBSD), so a correct call is char c; … iscntrl((unsigned char)c). The cast has the advantage of guaranteeing that there is no buffer overflow: if the application calls iscntrl with a value that is outside the range of unsigned char and isn't -1, the function returns a value which may not be meaningful but at least won't cause a crash or a leak of private data that happened to be at the address outside of the array bounds. The value is even correct if the function is called as char c; … iscntrl(c), as long as c isn't -1.

The reason for the special case with -1 is that it's EOF. Many standard C functions that operate on a char, for example getchar, represent the character as an int value which is the char value converted to a non-negative range, and use the special value EOF == -1 to indicate that no character could be read. For functions like getchar, EOF indicates the end of the file, hence the name end-of-file. Eric Postpischil suggests that the code was originally just return _ctype_[_c + 1], and that's probably right: _ctype_[0] would be the value for EOF. This simpler implementation leads to a buffer overflow if the function is misused, whereas the current implementation avoids this as discussed above.
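The two variants can be compared side by side in a small sketch. Everything here is an assumption made for illustration: the table demo_table is filled by hand with ASCII control characters only, and the names demo_init, iscntrl_simple, iscntrl_guarded and check_variants are hypothetical. It is not the libc source.

```c
#include <assert.h>

#define DEMO_C 0x20  /* control-character bit, the 0x20 mask discussed here */

static unsigned char demo_table[1 + 256];            /* slot 0 is the EOF slot */
static const unsigned char *demo_ctype = demo_table;

/* Mark the ASCII control characters (0..31 and 127); all other slots stay 0. */
static void demo_init(void)
{
    for (int c = 0; c < 32; c++)
        demo_table[c + 1] = DEMO_C;
    demo_table[127 + 1] = DEMO_C;
}

/* Guessed original form: reads out of bounds if c is outside -1..255. */
static int iscntrl_simple(int c)
{
    return demo_ctype[c + 1] & DEMO_C;
}

/* Guarded form: -1 is handled explicitly and the cast keeps the index in 0..255. */
static int iscntrl_guarded(int c)
{
    return c == -1 ? 0 : ((demo_ctype + 1)[(unsigned char)c] & DEMO_C);
}

/* Usage: both forms agree on valid input; only the guarded one is safe otherwise. */
static void check_variants(void)
{
    demo_init();
    assert(iscntrl_simple('\n') == iscntrl_guarded('\n'));
    assert(iscntrl_simple(-1) == 0 && iscntrl_guarded(-1) == 0);
    assert(iscntrl_guarded(266) == DEMO_C);  /* 266 wraps to 10: in bounds, but not meaningful */
}
```

Note that iscntrl_simple is only ever called with in-range arguments in this sketch; calling it with 266 would index past the end of the table, which is exactly the misuse the guarded form tolerates.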

If v is the value found in the array, v & _C tests if the bit at 0x20 is set in v. The values in the array are masks of the categories that the character is in: _C is set for control characters, _U is set for uppercase letters, etc.
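As a rough illustration of the mask test, the sketch below uses two hypothetical constants, DEMO_U (0x01) and DEMO_C (0x20), standing in for _U and _C. The helper has_category and the sample entries are made up for this example.

```c
#include <assert.h>

#define DEMO_U 0x01  /* stand-in for _U: uppercase letter (bit 0) */
#define DEMO_C 0x20  /* stand-in for _C: control character (bit 5) */

/* Returns the masked value: 0 if the bit is clear, the mask itself if it is set. */
static int has_category(unsigned char v, unsigned char mask)
{
    return v & mask;
}

/* Usage: a table entry is a set of bits, and each test looks at one of them. */
static void check_masks(void)
{
    unsigned char upper_only = DEMO_U;           /* entry for an uppercase letter */
    assert(has_category(upper_only, DEMO_U) == 0x01);
    assert(has_category(upper_only, DEMO_C) == 0);

    unsigned char both = DEMO_U | DEMO_C;        /* artificial entry with two bits set */
    assert(has_category(both, DEMO_C) == 0x20);
}
```

The result of the test is the mask value rather than 1, which is fine because callers only rely on zero versus non-zero.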


_ctype_ appears to be a restricted internal version of the symbol table, and I'm guessing the + 1 is there because they didn't bother saving index 0 of it, since that one isn't printable. Or possibly they are using a 1-indexed table instead of a 0-indexed one, as is customary in C.

The C standard dictates this for all ctype.h functions:

In all cases the argument is an int, the value of which shall be representable as an unsigned char or shall equal the value of the macro EOF

Going through the code step by step:

  • int iscntrl(int _c) The int types are really characters, but all ctype.h functions are required to handle EOF, so they must be int.
  • The check against -1 is a check against EOF, since it has the value -1.
  • _ctype_ + 1 is pointer arithmetic to get the address of an array item.
  • [(unsigned char)_c] is simply an array access of that array, where the cast is there to enforce the standard requirement of the parameter being representable as unsigned char. Note that char can actually hold a negative value, so this is defensive programming. The result of the [] array access is a single character from their internal symbol table.
  • The & masking is there to get a certain group of characters from the symbol table. Apparently all characters with bit 5 set (mask 0x20) are control characters. There's no making sense of this without viewing the table.
  • Anything with bit 5 set will return the value masked with 0x20, which is a non-zero value. This satisfies the requirement that the function return non-zero for boolean true.
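The effect of the cast in the fourth step can be checked in isolation. This is a small sketch assuming 8-bit bytes (CHAR_BIT == 8); table_index and check_cast are hypothetical helper names.

```c
#include <assert.h>

/* The index that the cast produces for a given int argument. */
static unsigned table_index(int c)
{
    return (unsigned char)c;   /* conversion to unsigned char reduces modulo 256 */
}

/* Usage: negative and oversized arguments all land inside 0..255. */
static void check_cast(void)
{
    signed char sc = -23;               /* a char holding a negative value */
    assert(table_index(sc) == 233);     /* 256 - 23 */
    assert(table_index(255) == 255);
    assert(table_index(256) == 0);
    assert(table_index(-2) == 254);
}
```

Whatever int is passed in, the resulting index stays within the 256 entries that follow the first slot, which is why the lookup cannot run off the end of the table.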

I'll start with step 3:

increment the address the undefined pointer points to by 1

The pointer is not undefined. It's just defined in some other compilation unit. That is what the extern part tells the compiler. So when all files are linked together, the linker will resolve the references to it.
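A sketch of how a declaration and a definition fit together, compressed into one file for brevity. The names demo_ctype_ptr, demo_storage, first_slot and check_extern are hypothetical; in a real program the definition at the bottom would sit in a different .c file and the linker would connect the two.

```c
#include <assert.h>

/* Declaration only: promises the compiler that this pointer exists somewhere. */
extern const char *demo_ctype_ptr;

/* Code can use the pointer as soon as it has been declared. */
static int first_slot(void)
{
    return demo_ctype_ptr[0];
}

/* Definition: normally in another compilation unit, resolved at link time. */
static const char demo_storage[1 + 256] = { 0 };
const char *demo_ctype_ptr = demo_storage;

/* Usage: the declared name and the defined object are one and the same. */
static void check_extern(void)
{
    assert(demo_ctype_ptr == demo_storage);
    assert(first_slot() == 0);
}
```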

So what does it point to?

It points to an array with information about each character. Each character has its own entry. An entry is a bitmap representation of characteristics for the character. For example: if bit 5 is set, it means that the character is a control character. Another example: if bit 0 is set, it means that the character is an uppercase character.

So something like (_ctype_ + 1)['x'] will get the characteristics that apply to 'x'. Then a bitwise and is performed to check if bit 5 is set, i.e. check whether it is a control character.

The reason for adding 1 is probably that the real index 0 is reserved for some special purpose.

Tags:

C

Openbsd