Java String Split On Non-Alphabetic Characters

There are already several answers here, but none of them deals well with internationalization. Even if the OP's example suggests the question was about "English" letters, that may not be the case for visitors arriving here from a search...

... so it is worth mentioning that Java supports Unicode Technical Standard #18, "Unicode Regular Expressions". Pretty impressive, isn't it? In short, this is an extension of classic (Latin-centric, or even English-centric) regular expressions designed to deal with international characters.

For example, Java supports the full set of binary properties for checking whether a character belongs to one of the Unicode code point character classes. In particular, the \p{IsAlphabetic} character class matches any alphabetic character, that is, a letter in any Unicode-supported language.

Not clear? Here is an example:

    Pattern p = Pattern.compile("\\p{IsAlphabetic}+");
    //                           ^^^^^^^^^^^^^^^^^
    //                         any alphabetic character
    //                    (in any Unicode-supported language)

    Matcher m = p.matcher("L'élève あゆみ travaille _bien_");
    while(m.find()) {
        System.out.println(">" + m.group());
    }

Or, mostly equivalently, using split to break on non-alphabetic characters:

    for (String s : "L'élève あゆみ travaille bien".split("\\P{IsAlphabetic}+"))
        System.out.println(">" + s);

In both cases, the output is properly tokenized into words, taking into account French accented characters and Japanese hiragana characters, just as it would be for words spelled in any Unicode-supported language (including characters from the Supplementary Multilingual Plane).
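
The "mostly" above matters when the input starts with a non-alphabetic character: `split` then returns a leading empty string, while the `find()` loop does not. Here is a minimal sketch of both approaches side by side; the class and method names (`WordTokenizer`, `wordsByFind`, `wordsBySplit`) are made up for illustration and are not part of any library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
public class WordTokenizer {
    // One or more alphabetic characters, per the Unicode Alphabetic property.
    private static final Pattern WORD = Pattern.compile("\\p{IsAlphabetic}+");

    // Collect every run of alphabetic characters found in the text.
    static List<String> wordsByFind(String text) {
        List<String> words = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    // Split on runs of non-alphabetic characters instead.
    static String[] wordsBySplit(String text) {
        return text.split("\\P{IsAlphabetic}+");
    }

    public static void main(String[] args) {
        System.out.println(wordsByFind("_bien_"));                   // [bien]
        // The leading "_" is a delimiter at index 0, so split keeps an
        // empty first element; trailing empty strings are dropped.
        System.out.println(Arrays.toString(wordsBySplit("_bien_"))); // [, bien]
    }
}
```

If the leading empty string is a problem, either use the `find()` variant or filter out empty elements after splitting.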


You could try \P{Alpha}+:

    "Here is an ex@mple".split("\\P{Alpha}+")
    // ["Here", "is", "an", "ex", "mple"]

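\P{Alpha} matches any non-alphabetic character (as opposed to \p{Alpha}, which matches any alphabetic character). Note that, by default, Java treats \p{Alpha} as a POSIX class covering ASCII letters only, so accented or non-Latin letters count as non-alphabetic here unless the pattern is compiled with Pattern.UNICODE_CHARACTER_CLASS. + indicates that we should split on any contiguous run of such characters. For example: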

    "a!@#$%^&*b".split("\\P{Alpha}+")
    // ["a", "b"]
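
One caveat worth illustrating: with Java's default settings, `\p{Alpha}` only covers ASCII letters, so text with accented letters gets cut in the middle of words. The sketch below compares the default behavior with the same pattern compiled using the `Pattern.UNICODE_CHARACTER_CLASS` flag (available since Java 7). The class and method names (`AlphaSplitDemo`, `splitAscii`, `splitUnicode`) are invented for this example, and the non-ASCII letters are written as `\u` escapes so the result does not depend on source file encoding.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical demo class, for illustration only.
public class AlphaSplitDemo {
    // Default mode: \P{Alpha} means "anything that is not an ASCII letter".
    static final Pattern ASCII_NON_ALPHA = Pattern.compile("\\P{Alpha}+");

    // With UNICODE_CHARACTER_CLASS, \p{Alpha} follows the Unicode
    // Alphabetic property, so accented letters are kept inside words.
    static final Pattern UNICODE_NON_ALPHA =
            Pattern.compile("\\P{Alpha}+", Pattern.UNICODE_CHARACTER_CLASS);

    static String[] splitAscii(String s) {
        return ASCII_NON_ALPHA.split(s);
    }

    static String[] splitUnicode(String s) {
        return UNICODE_NON_ALPHA.split(s);
    }

    public static void main(String[] args) {
        String text = "na\u00efve caf\u00e9"; // "naïve café"
        // The accented letters act as delimiters: [na, ve, caf]
        System.out.println(Arrays.toString(splitAscii(text)));
        // Both words survive intact (how they display depends on the console encoding).
        System.out.println(Arrays.toString(splitUnicode(text)));
    }
}
```

For purely ASCII input such as "a!@#$%^&*b", both patterns give the same result, so the flag only matters once the text may contain non-English letters.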