Separate title string with no spaces into words

Aiming for simplicity rather than one huge regex, I would recommend this code, which uses several small, simple patterns (explanatory comments are in the code):

string str = "ThisIsAnExampleTitleHELLO-WORLD2019T.E.S.T.(Test)\"Test\"'Test'[Test]";
// insert space when a lowercase letter is followed by an uppercase letter
str = Regex.Replace(str, "(?<=[a-z])(?=[A-Z])", " ");
// insert space whenever a digit is followed by a letter
str = Regex.Replace(str, @"(?<=\d)(?=[A-Za-z])", " ");
// insert space when a letter is followed by a digit
str = Regex.Replace(str, @"(?<=[A-Za-z])(?=\d)", " ");
// insert space when there's one of characters ("'[ followed by letter or digit
str = Regex.Replace(str, @"(?=[(\[""'][a-zA-Z0-9])", " ");
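// insert space after one of the closing characters ])"' unless that character opens a quoted word
// (preceded by whitespace or the start of the string) or is already followed by whitespace or the end of the string
str = Regex.Replace(str, @"(?<=\S[)\]""'])(?=\S)", " ");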

The first few parts are similar to @revo's answer: (?<!^|[A-Z\p{P}])[A-Z]|(?<=\p{P})\p{P}

In addition, I add the following regex to insert a space between a number and a letter: (?<=[a-z])(?=\d)|(?<=\d)(?=[a-z])|(?<=[A-Z])(?=\d)|(?<=\d)(?=[A-Z])

To handle input such as OTPIsADevice, I use a lookahead and a lookbehind to find an uppercase letter next to a lowercase one: (((?<!^)[A-Z](?=[a-z]))|((?<=[a-z])[A-Z]))

Note that | is the alternation (OR) operator, which lets all of these patterns be combined into a single regex.

Regex: (?<!^|[A-Z\p{P}])[A-Z]|(?<=\p{P})\p{P}|(?<=[a-z])(?=\d)|(?<=\d)(?=[a-z])|(?<=[A-Z])(?=\d)|(?<=\d)(?=[A-Z])|(((?<!^)[A-Z](?=[a-z]))|((?<=[a-z])[A-Z]))
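
As a smaller illustration of combining patterns with |, here is a minimal sketch that joins three simple zero-width boundary patterns into one regex and applies it in a single Regex.Replace call. The class name AlternationDemo and the method InsertSpaces are made up for this example, and the patterns are deliberately simpler than the full regex above:

```csharp
using System;
using System.Text.RegularExpressions;

public static class AlternationDemo
{
    // Three zero-width boundary patterns joined with | into one regex:
    //   lowercase followed by uppercase, digit followed by letter, letter followed by digit.
    // (Simplified patterns for illustration, not the full regex from this answer.)
    private const string Boundary =
        @"(?<=[a-z])(?=[A-Z])|(?<=\d)(?=[A-Za-z])|(?<=[A-Za-z])(?=\d)";

    // Inserts a single space at every position where any alternative matches.
    public static string InsertSpaces(string input) =>
        Regex.Replace(input, Boundary, " ");

    public static void Main()
    {
        // prints "Example Title 2019 Test"
        Console.WriteLine(InsertSpaces("ExampleTitle2019Test"));
    }
}
```

Because every alternative is zero-width, nothing is consumed and the replacement only adds spaces; with consuming alternatives such as [A-Z], the replacement string would need to put the matched text back (for example with $0).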


Update

Improved it a bit:

From: (?<!^|[A-Z\p{P}])[A-Z]|(?<=\p{P})\p{P}|(?<=[a-z])(?=\d)|(?<=\d)(?=[a-z])|(?<=[A-Z])(?=\d)|(?<=\d)(?=[A-Z])

To: (?<!^|[A-Z\p{P}])[A-Z]|(?<=\p{P})\p{P}|(?<=\p{L})\d, which does nearly the same thing (a lowercase letter directly after a digit is no longer matched).

(((?<!^)(?<!\p{P})[A-Z](?=[a-z]))|((?<=[a-z])[A-Z]))|(?<!^)(?=[[({&])|(?<=[)\]}!&}]) is adapted from the OP's comment, which adds an exception for some punctuation: (((?<!^)(?<!['([{])[A-Z](?=[a-z]))|((?<=[a-z])[A-Z]))|(?<!^)(?=[[({&])|(?<=[)\\]}!&}])

Final regex: (?<!^|[A-Z\p{P}])[A-Z]|(?<=\p{P})\p{P}|(?<=\p{L})\d|(((?<!^)(?<!\p{P})[A-Z](?=[a-z]))|((?<=[a-z])[A-Z]))|(?<!^)(?=[[({&])|(?<=[)\]}!&}])



Here is a regex which seems to work well, at least for your sample input:

(?<=[a-z])(?=[A-Z])|(?<=[0-9])(?=[A-Za-z])|(?<=[A-Za-z])(?=[0-9])|(?<=\W)(?=\W)

This pattern says to split at any boundary that meets one of the following conditions:

  • what precedes is a lowercase letter and what follows is an uppercase letter
  • what precedes is a digit and what follows is a letter (or vice-versa)
  • what precedes and what follows is a non word character (e.g. quote, parenthesis, etc.)


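// requires: using System.Linq; using System.Text.RegularExpressions;
string title = "ThisIsAnExampleTitleHELLO-WORLD2019T.E.S.T.(Test)[Test]\"Test\"'Test'";
// split at every zero-width boundary described above
string[] split = Regex.Split(title, "(?<=[a-z])(?=[A-Z])|(?<=[0-9])(?=[A-Za-z])|(?<=[A-Za-z])(?=[0-9])|(?<=\\W)(?=\\W)");
// trim each piece, then rejoin the pieces with single spaces
split = (from e in split select e.Trim()).ToArray();
string newtitle = string.Join(" ", split);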

This Is An Example Title HELLO-WORLD 2019 T.E.S.T. (Test) [Test] "Test" 'Test'

Note: You might also want to add this assertion to the regex alternation:

(?<=\W)(?=\w)|(?<=\w)(?=\W)

We got away without it here, because this boundary condition never needs a split in the sample input. But you might need it with other inputs.
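
To make the note concrete, here is a rough sketch comparing the original pattern with the extended one on an input where a word character sits directly next to a non-word character. The class name BoundaryDemo, the method Respace, and the input "Hello(World)" are made up for this illustration:

```csharp
using System;
using System.Text.RegularExpressions;

public static class BoundaryDemo
{
    // The pattern from this answer.
    public const string Original =
        @"(?<=[a-z])(?=[A-Z])|(?<=[0-9])(?=[A-Za-z])|(?<=[A-Za-z])(?=[0-9])|(?<=\W)(?=\W)";

    // The same pattern with the extra word/non-word boundary alternatives from the note.
    public const string Extended =
        Original + @"|(?<=\W)(?=\w)|(?<=\w)(?=\W)";

    // Splits at every zero-width match and rejoins the pieces with single spaces.
    public static string Respace(string title, string pattern) =>
        string.Join(" ", Regex.Split(title, pattern));

    public static void Main()
    {
        // No alternative of the original pattern matches, so the input is unchanged.
        Console.WriteLine(Respace("Hello(World)", Original)); // Hello(World)

        // The extended pattern splits on both sides of each parenthesis.
        Console.WriteLine(Respace("Hello(World)", Extended)); // Hello ( World )
    }
}
```

Note the trade-off: the extended pattern splits at every word/non-word boundary, so it would also turn HELLO-WORLD into HELLO - WORLD. It is worth testing against your real titles before adopting it.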

Tags:

C#

Regex