How to do CamelCase split in python

Working solution, without regexp

I am not that good at regexp. I like to use them for search/replace in my IDE but I try to avoid them in programs.

Here is a fairly straightforward solution in pure Python:

def camel_case_split(s):
    idx = list(map(str.isupper, s))
    # mark change of case
    l = [0]
    for (i, (x, y)) in enumerate(zip(idx, idx[1:])):
        if x and not y:  # "Ul"
            l.append(i)
        elif not x and y:  # "lU"
            l.append(i+1)
    l.append(len(s))
    # for "lUl", index of "U" will pop twice, have to filter that
    return [s[x:y] for x, y in zip(l, l[1:]) if x < y]
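
To see why the final filter is needed, it can help to look at the raw index list for one input. This is only a sketch: `change_points` is a hypothetical helper name that repeats the marking step of the function above and returns the list before slicing.

```python
def change_points(s):
    # Same marking step as camel_case_split, but returns the raw index list.
    upper = [c.isupper() for c in s]
    points = [0]
    for i, (x, y) in enumerate(zip(upper, upper[1:])):
        if x and not y:      # "Ul": a word starts at this capital
            points.append(i)
        elif not x and y:    # "lU": a word starts at the next capital
            points.append(i + 1)
    points.append(len(s))
    return points

print(change_points("CamelCaseXYZa"))  # [0, 0, 5, 5, 9, 11, 13]
```

The repeated 0 and 5 would give empty slices, which is what the `if x < y` condition in the list comprehension removes.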

And some tests

def test():
    TESTS = [
        ("aCamelCaseWordT", ['a', 'Camel', 'Case', 'Word', 'T']),
        ("CamelCaseWordT", ['Camel', 'Case', 'Word', 'T']),
        ("CamelCaseWordTa", ['Camel', 'Case', 'Word', 'Ta']),
        ("aCamelCaseWordTa", ['a', 'Camel', 'Case', 'Word', 'Ta']),
        ("Ta", ['Ta']),
        ("aT", ['a', 'T']),
        ("a", ['a']),
        ("T", ['T']),
        ("", []),
        ("XYZCamelCase", ['XYZ', 'Camel', 'Case']),
        ("CamelCaseXYZ", ['Camel', 'Case', 'XYZ']),
        ("CamelCaseXYZa", ['Camel', 'Case', 'XY', 'Za']),
    ]
    for (q,a) in TESTS:
        assert camel_case_split(q) == a

if __name__ == "__main__":
    test()

Use re.sub() and split()

import re

name = 'CamelCaseTest123'
splitted = re.sub('([A-Z][a-z]+)', r' \1', re.sub('([A-Z]+)', r' \1', name)).split()

Result

'CamelCaseTest123' -> ['Camel', 'Case', 'Test123']
'CamelCaseXYZ' -> ['Camel', 'Case', 'XYZ']
'XYZCamelCase' -> ['XYZ', 'Camel', 'Case']
'XYZ' -> ['XYZ']
'IPAddress' -> ['IP', 'Address']
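
For reuse, the one-liner can be unpacked into a small function. A minimal sketch; the name `split_camel_case` is my own and not part of the answer, and the two `re.sub()` calls are the same as above, just applied one after the other.

```python
import re

def split_camel_case(name):
    # Step 1: put a space in front of every run of capitals.
    spaced = re.sub(r'([A-Z]+)', r' \1', name)
    # Step 2: put a space in front of each capital followed by lowercase
    # letters, which detaches an acronym from the next word ("IPAddress").
    spaced = re.sub(r'([A-Z][a-z]+)', r' \1', spaced)
    # split() without arguments discards the surplus whitespace.
    return spaced.split()

for name in ('CamelCaseTest123', 'XYZCamelCase', 'IPAddress'):
    print(name, '->', split_camel_case(name))
```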

Most of the time, when you don't need to validate the format of a string, a global search (findall) is simpler than a split and gives the same result:

re.findall(r'[A-Z](?:[a-z]+|[A-Z]*(?=[A-Z]|$))', 'CamelCaseXYZ')

returns

['Camel', 'Case', 'XYZ']

To handle dromedaryCase (a lowercase first word) too, you can use:

re.findall(r'[A-Z]?[a-z]+|[A-Z]+(?=[A-Z]|$)', 'camelCaseXYZ')

Note: (?=[A-Z]|$) can be shortened using a double negation (a negative lookahead with a negated character class): (?![^A-Z])
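
A quick way to convince yourself of the shortened lookahead is to run both patterns side by side. This is a sketch that only checks a few sample strings; the constant names `LONG` and `SHORT` are mine.

```python
import re

# The pattern from above, and the same pattern with the shortened lookahead.
LONG = r'[A-Z]?[a-z]+|[A-Z]+(?=[A-Z]|$)'
SHORT = r'[A-Z]?[a-z]+|[A-Z]+(?![^A-Z])'

for s in ('camelCaseXYZ', 'CamelCaseXYZ', 'XYZCamelCase', 'IPAddress'):
    # Both lookaheads accept "next char is a capital" or "end of string".
    assert re.findall(LONG, s) == re.findall(SHORT, s)
    print(s, '->', re.findall(SHORT, s))
```

One difference I am aware of: `$` also matches just before a trailing newline, while `(?![^A-Z])` does not, so the two can disagree on input that ends with `\n`.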


As @AplusKminus has explained, re.split() never splits on an empty pattern match (this is the behaviour before Python 3.7; later versions do split on empty matches). Therefore, instead of splitting, you should try finding the components you are interested in.

Here is a solution using re.finditer() that emulates splitting:

from re import finditer

def camel_case_split(identifier):
    # Lazily match up to the next case boundary, or to the end of the string.
    matches = finditer(r'.+?(?:(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])|$)', identifier)
    return [m.group(0) for m in matches]
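
A short usage check, repeated here with its import so it runs on its own. The `BOUNDARY` constant is my own factoring of the two lookaround alternatives, not part of the answer. As an aside, on Python 3.7 and later the same zero-width pattern can be passed to re.split() directly, which the last lines compare against.

```python
import re
import sys

# The two zero-width case boundaries used by the pattern: "lU" and "UUl".
BOUNDARY = r'(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])'

def camel_case_split(identifier):
    # Lazily consume characters up to a case boundary or the end of the string.
    matches = re.finditer(r'.+?(?:' + BOUNDARY + r'|$)', identifier)
    return [m.group(0) for m in matches]

for name in ('CamelCaseXYZ', 'XYZCamelCase', 'camelCaseTest123', ''):
    print(repr(name), '->', camel_case_split(name))

# Python 3.7+ only: re.split() also splits on zero-width matches.
if sys.version_info >= (3, 7):
    assert re.split(BOUNDARY, 'XYZCamelCase') == camel_case_split('XYZCamelCase')
```

The two approaches differ on the empty string: the finditer version returns an empty list, while re.split() returns a list containing one empty string.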