Fastest Way to Determine if Character Belongs to a Set of Known Characters C++

I went a little further and wrote two versions, one based on a lookup array, the other on a set using an underlying hash.

class CharLookup {
public:
  CharLookup(const std::string & set) : lookup(*std::max_element(set.begin(), set.end()) + 1) {
    for ( auto c : set) lookup[c] = true;
  }
  inline bool has(const unsigned char c) const {
    return c > lookup.size() ? false : lookup[c];
  }
private:
  std::vector<bool> lookup;
};

class CharSet {
public:
  CharSet(const std::string & cset) {
    for ( auto c : cset) set.insert(c);
  }
  inline bool has(const unsigned char c) const {
    return set.contains(c);
  }
private:
  QSet<unsigned char> set;
};

Then wrote a little benchmark, added a few more containers for the sake of comparison. Lower is better, the data points are for "character set size / text size":

enter image description here

Seems like for short char sets and text, std::string::find_first_of is fastest, even faster than using a lookup array, but quickly dwindles as the test size increases. std::vector<bool> seems like the "golden mean", QBitArray probably has a little different implementation because it pulls ahead as the test size increases, at the largest test QVector<bool> is fastest, presumably because it doesn't have the overhead of bit access. The two hash sets are close, trading places, last and least there is the std::set.

Tested on an i7-3770k Win7 x64 box, using MinGW 4.9.1 x32 with -O3.


You could create an array of booleans and assign the value true for each character in wanted set. For example if your wanted set consists of 'a', 'd', 'e':

bool array[256] = {false};
array['a'] = true;
array['d'] = true;
array['e'] = true;

and then you can check a character c:

if (array[c]) ... 

We could also use a bitset for this purpose:

std::bitset<256> b;
b.set('a');
b.set('d');
b.set('e');

and checking as:

if (b.test(c)) ...

Typically this kind of test is not isolated, i.e. you don't just have

if(c==ch1 || c==ch2 || c=ch3 ) { ... }

But

if(c==ch1 || c==ch2 || c=ch3 ) {
    handle_type_a(c);
}
else if(c==ch4 || c==ch5 || c=ch6 ) {
    handle_type_b(c);
}    
else if(c==ch7 || c==ch8 || c=ch9 ) {
    handle_type_c(c);
}

if(c==ch4 || c==ch6 || c=ch7 ) {
    handle_magic(c);
}

Optimising each of the if statements is possibly less efficient than considering all these parts at once. What this kind of structure usually means is that groups of characters are being considered equivalent in some ways - and that is what we might want to express in the code.

In this case, I'd build up a character traits array that contains the character type information.

// First 2 bits contains the "type" of the character
static const unsigned char CHAR_TYPE_BITS = 3;
static const unsigned char CHAR_TYPE_A = 0;  
static const unsigned char CHAR_TYPE_B = 1;
static const unsigned char CHAR_TYPE_C = 2;
// Bit 3 contains whether the character is magic
static const unsigned char CHAR_IS_MAGIC = 4;

static const unsigned char[256] char_traits = {
  ...,
  CHAR_TYPE_A, CHAR_TYPE_B | CHAR_IS_MAGIC ...
  ...
}

static inline unsigned char get_character_type(char c) {
  return char_traits[(unsigned char)c] & CHAR_TYPE_BITS;
}

static inline boolean is_character_magic(char c) {
 return (char_traits[(unsigned char)c] & CHAR_IS_MAGIC) == CHAR_IS_MAGIC;
}

Now your conditions become

switch(get_character_type(c)) { 
 case CHAR_TYPE_A:
    handle_type_a(c);
    break;
 case CHAR_TYPE_B:
    handle_type_b(c);
    break;
 case CHAR_TYPE_C:
    handle_type_c(c);
    break;
}

if(is_character_magic(c)) {
  handle_magic(c);
}

I'd usually extract the char_traits variable into its own include, and generate that include using a simple program too. This keeps things easy to change going forwards.