RegEx to split camelCase or TitleCase (advanced)

The following regex works for all of the above examples:

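public class CamelCaseSplit {
    public static void main(String[] args) {
        // Split at the zero-width boundaries the regex matches and print each word.
        for (String w : "camelValue".split("(?<!(^|[A-Z]))(?=[A-Z])|(?<!^)(?=[A-Z][a-z])")) {
            System.out.println(w);
        }
    }
}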

It works by forcing the negative lookbehind to not only ignore matches at the start of the string, but to also ignore matches where a capital letter is preceded by another capital letter. This handles cases like "VALUE".

The first part of the regex on its own fails on "eclipseRCPExt": it does not split between "RCP" and "Ext". That is the purpose of the second clause: (?<!^)(?=[A-Z][a-z]). This clause allows a split before every capital letter that is followed by a lowercase letter, except at the start of the string.
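To see both clauses at work on more than one input, the snippet can be grown into a small harness. This is only a sketch: the class and method names (TwoClauseSplit, split) are made up for illustration, and the inputs are the sample strings quoted elsewhere in this thread.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class TwoClauseSplit {
    // Clause 1: before a capital that is neither at the start nor preceded by a capital.
    // Clause 2: before a capital followed by a lowercase letter, except at the start.
    private static final Pattern BOUNDARY =
            Pattern.compile("(?<!(^|[A-Z]))(?=[A-Z])|(?<!^)(?=[A-Z][a-z])");

    // Hypothetical helper name, not part of the original answer.
    static String[] split(String s) {
        return BOUNDARY.split(s);
    }

    public static void main(String[] args) {
        String[] samples = {"value", "camelValue", "TitleValue", "VALUE", "eclipseRCPExt"};
        for (String s : samples) {
            System.out.println(s + " -> " + Arrays.toString(split(s)));
        }
    }
}
```

With both clauses in place, "eclipseRCPExt" comes back as eclipse, RCP, Ext, while "VALUE" stays in one piece.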


I couldn't get aix's solution to work (and it doesn't work on RegExr either), so I came up with my own, which I've tested and which seems to do exactly what you're looking for:

((^[a-z]+)|([A-Z]{1}[a-z]+)|([A-Z]+(?=([A-Z][a-z])|($))))

and here's an example of using it:

; Regex Breakdown:  This will match against each word in Camel and Pascal case strings, while properly handling acronyms.
;   (^[a-z]+)                       Match against any lower-case letters at the start of the string.
;   ([A-Z]{1}[a-z]+)                Match against Title case words (one upper case followed by lower case letters).
;   ([A-Z]+(?=([A-Z][a-z])|($)))    Match against multiple consecutive upper-case letters, leaving the last upper case letter out of the match if it is followed by lower case letters, and including it if it's followed by the end of the string.
newString := RegExReplace(oldCamelOrPascalString, "((^[a-z]+)|([A-Z]{1}[a-z]+)|([A-Z]+(?=([A-Z][a-z])|($))))", "$1 ")
newString := Trim(newString)

Here I'm separating each word with a space, so here are some examples of how the string is transformed:

  • ThisIsATitleCASEString => This Is A Title CASE String
  • andThisOneIsCamelCASE => and This One Is Camel CASE
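For anyone not using AutoHotkey, roughly the same replace-and-trim idea can be sketched in Java. The class and method names here (WordSpacer, spaceWords) are hypothetical; only the pattern itself comes from the answer.

```java
public class WordSpacer {
    // Same pattern as the AutoHotkey example: a leading lowercase run,
    // a Title-case word, or an acronym that stops before the next Title-case word.
    private static final String WORD =
            "((^[a-z]+)|([A-Z]{1}[a-z]+)|([A-Z]+(?=([A-Z][a-z])|($))))";

    // Hypothetical helper: append a space after every matched word, then trim.
    static String spaceWords(String s) {
        return s.replaceAll(WORD, "$1 ").trim();
    }

    public static void main(String[] args) {
        System.out.println(spaceWords("ThisIsATitleCASEString"));
        System.out.println(spaceWords("andThisOneIsCamelCASE"));
    }
}
```

The trim at the end plays the same role as Trim(newString) in the AutoHotkey version: it drops the space added after the last word.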

The solution above does what the original post asks for, but I also needed a regex to find camel and pascal strings that included numbers, so I came up with this variation:

((^[a-z]+)|([0-9]+)|([A-Z]{1}[a-z]+)|([A-Z]+(?=([A-Z][a-z])|($)|([0-9]))))

and an example of using it:

; Regex Breakdown:  This will match against each word in Camel and Pascal case strings, while properly handling acronyms and including numbers.
;   (^[a-z]+)                               Match against any lower-case letters at the start of the string.
;   ([0-9]+)                                Match against one or more consecutive numbers (anywhere in the string, including at the start).
;   ([A-Z]{1}[a-z]+)                        Match against Title case words (one upper case followed by lower case letters).
;   ([A-Z]+(?=([A-Z][a-z])|($)|([0-9])))    Match against multiple consecutive upper-case letters, leaving the last upper case letter out of the match if it is followed by lower case letters, and including it if it's followed by the end of the string or a number.
newString := RegExReplace(oldCamelOrPascalString, "((^[a-z]+)|([0-9]+)|([A-Z]{1}[a-z]+)|([A-Z]+(?=([A-Z][a-z])|($)|([0-9]))))", "$1 ")
newString := Trim(newString)

And here are some examples of how a string with numbers is transformed with this regex:

  • myVariable123 => my Variable 123
  • my2Variables => my 2 Variables
  • The3rdVariableIsHere => The 3 rdVariable Is Here
  • 12345NumsAtTheStartIncludedToo => 12345 Nums At The Start Included Too
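The "rd" in the third example stays glued to the next word because none of the alternatives matches a lowercase run that is not at the start of the string. A rough Java sketch that lists the individual matches makes this visible. The class and method names (WordFinder, words) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordFinder {
    // Number-aware variant of the pattern, copied from the answer.
    private static final Pattern WORD = Pattern.compile(
            "((^[a-z]+)|([0-9]+)|([A-Z]{1}[a-z]+)|([A-Z]+(?=([A-Z][a-z])|($)|([0-9]))))");

    // Hypothetical helper: collect every match instead of inserting spaces.
    static List<String> words(String s) {
        List<String> out = new ArrayList<>();
        Matcher m = WORD.matcher(s);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(words("my2Variables"));
        System.out.println(words("The3rdVariableIsHere"));
    }
}
```

For "The3rdVariableIsHere" the matches are The, 3, Variable, Is, Here. The "rd" is never matched, so the replace leaves it untouched and no space is added after it.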

It seems you are making this more complicated than it needs to be. For camelCase, the split location is simply anywhere an uppercase letter immediately follows a lowercase letter:

(?<=[a-z])(?=[A-Z])

Here is how this regex splits your example data:

  • value -> value
  • camelValue -> camel / Value
  • TitleValue -> Title / Value
  • VALUE -> VALUE
  • eclipseRCPExt -> eclipse / RCPExt

The only difference from your desired output is with eclipseRCPExt, which I would argue is correctly split here.

Addendum - Improved version

Note: This answer recently got an upvote and I realized that there is a better way...

By adding a second alternative to the above regex, all of the OP's test cases are correctly split.

(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])

Here is how the improved regex splits the example data:

  • value -> value
  • camelValue -> camel / Value
  • TitleValue -> Title / Value
  • VALUE -> VALUE
  • eclipseRCPExt -> eclipse / RCP / Ext

Edit:20130824 Added improved version to handle RCPExt -> RCP / Ext case.
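A quick way to compare the two versions side by side is something like the following Java sketch. The class, constant, and method names (BoundaryCompare, SIMPLE, IMPROVED, join) are hypothetical; the two patterns are the ones given in this answer.

```java
public class BoundaryCompare {
    // Split only where an uppercase letter follows a lowercase letter.
    static final String SIMPLE = "(?<=[a-z])(?=[A-Z])";
    // Also split inside an uppercase run, just before a Title-case word.
    static final String IMPROVED = "(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])";

    // Hypothetical helper: split and re-join with " / " for display.
    static String join(String input, String regex) {
        return String.join(" / ", input.split(regex));
    }

    public static void main(String[] args) {
        String[] samples = {"value", "camelValue", "TitleValue", "VALUE", "eclipseRCPExt"};
        for (String s : samples) {
            System.out.println(s + " -> " + join(s, SIMPLE) + " | " + join(s, IMPROVED));
        }
    }
}
```

The two patterns should agree on every sample except eclipseRCPExt, where only the improved one separates RCP from Ext.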


Another solution would be to use a dedicated method in commons-lang: StringUtils#splitByCharacterTypeCamelCase