A more efficient method for finding the most common character in a string

It is a fast algorithm, but it uses a lot of space.

It does not cover full Unicode: there are code points (Unicode characters, represented as ints) that need two chars (a surrogate pair).
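For example, a code point outside the Basic Multilingual Plane occupies two chars, so a char-indexed count array cannot represent it as a single entry:

```java
public class SurrogateDemo {
    public static void main(String[] args) {
        // U+1D11E MUSICAL SYMBOL G CLEF lies outside the BMP,
        // so it takes a surrogate pair: two chars for one code point
        String s = "\uD834\uDD1E";
        System.out.println(s.length());                        // prints 2
        System.out.println(s.codePointCount(0, s.length()));   // prints 1
    }
}
```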

Small optimizations are still possible:

  • Making extra versions with byte[] and short[], depending on s.length().
  • Keeping the length() in a variable:

    for (int i = 0, n = s.length(); i < n; i++)
    

And yes, a HashMap probably is the most "sensible" solution.

Now with Java 8, you might turn to parallelism and use multiple cores, though it is probably not worth the effort:

int mostFrequentCodePoint = s.codePoints()
    ...
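
Spelled out, the stream version might look like the following sketch (the method name and the empty-string handling are my own choices, not from the original; the boxing makes it slower than a plain count array, and adding `.parallel()` after `codePoints()` is where the multiple cores would come in):

```java
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

public class MostFrequent {
    static int mostFrequentCodePoint(String s) {
        return s.codePoints()
                .boxed()  // IntStream -> Stream<Integer>; this boxing is the main overhead
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()))
                .entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElseThrow(() -> new IllegalArgumentException("empty string"));
    }

    public static void main(String[] args) {
        System.out.println((char) mostFrequentCodePoint("abracadabra"));  // prints a
    }
}
```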

For frequency analysis in natural language, it may suffice to limit the analysis to the first 1000 characters or so of the string.


The fastest way to do this will be to count occurrences of every character, then take the max value in the count array. If your string is long, you'll gain a decent speedup from not tracking the current max while looping over characters in the String.

See How to count frequency of characters in a string? for many other ideas about how to count frequencies.

If your Strings are mostly ASCII, a branch in the count loop choosing between an array for the low 128 char values and a HashMap for the rest should be worth it. The branch will predict well if your strings have no non-ASCII characters. If there's a lot of alternating between ASCII and non-ASCII, the branch might hurt a bit compared to using a HashMap for everything.

public static char getMax(String s) {

    char maxAppearChar = ' ';
    int maxCount = 0;
    int[] asciiCount = new int[128];  // fast path for ASCII
    HashMap<Character,Integer> nonAsciiCount = new HashMap<>();

    for (int i = 0; i < s.length(); i++)
    {
        char ch = s.charAt(i);  // this does appear to be the recommended way to iterate over a String
        // alternatively, iterate over 32-bit Unicode code points, not UTF-16 chars, if that matters
        if (ch < 128) {
            asciiCount[ch]++;
        } else {
            nonAsciiCount.merge(ch, 1, Integer::sum);  // insert 1, or increment the existing count
        }
    }

    // loop over asciiCount and find the highest element
    for (int c = 0; c < 128; c++) {
        if (asciiCount[c] > maxCount) {
            maxCount = asciiCount[c];
            maxAppearChar = (char) c;
        }
    }
    // loop over the keys in nonAsciiCount, and see if any of them are even higher
    for (Map.Entry<Character,Integer> e : nonAsciiCount.entrySet()) {
        if (e.getValue() > maxCount) {
            maxCount = e.getValue();
            maxAppearChar = e.getKey();
        }
    }
    return maxAppearChar;
}

I don't do a lot of Java, so I don't know whether there's a container that can do the insert-1-or-increment operation more efficiently than a HashMap get and put pair. https://stackoverflow.com/a/6712620/224132 suggests Guava MultiSet<Character>, which looks good.
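
For what it's worth, in Java 8+ HashMap.merge expresses insert-1-or-increment in a single call (still one hash lookup plus a put internally, but no separate get/put pair in your code). A minimal sketch; countChars is a made-up helper name:

```java
import java.util.HashMap;

public class CountDemo {
    // countChars is a hypothetical name for illustration
    static HashMap<Character,Integer> countChars(String s) {
        HashMap<Character,Integer> counts = new HashMap<>();
        for (int i = 0; i < s.length(); i++) {
            // insert 1 if the key is absent, otherwise add 1 to the existing count
            counts.merge(s.charAt(i), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countChars("aab"));  // {a=2, b=1}
    }
}
```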


This may do better than your array of 2^16 ints. However, if you only ever touch the low 128 elements of this array, then most of the memory may never be touched. Allocated but untouched memory doesn't really hurt, or use up RAM / swap.

However, looping over all 65536 entries at the end means at least reading them, so the OS would have to soft-pagefault the memory in and wire it up, and it will pollute caches. So actually, updating the max on every character might be a better choice. Microbenchmarks might show that iterating over the String and then looping over a charcnt[Character.MAX_VALUE] array wins, but that wouldn't account for the cache / TLB pollution of touching that much not-really-needed memory.