Replace the spaces between multiple (3+) capital letters

Overview

There is a way to do this in R using regex entirely, but it's not pretty (although I think it looks pretty sweet!). This answer is also customizable to whatever your needs are (two uppercase letters minimum, three minimum, etc.), i.e. it is scalable, and it can match more than one horizontal whitespace character between letters (it doesn't use lookbehinds, which require a fixed width).


Code

(?:(?=\b(?:\p{Lu}\h+){2}\p{Lu})|\G(?!\A))\p{Lu}\K\h+(?=\p{Lu})

Replacement: Empty string


Edit 1 (non-ASCII letters)

My original pattern used \b, which may not work with Unicode characters (such as É). The following alternative is likely a better approach. It checks to ensure what precedes the first uppercase character is not a letter (from any language/script). It also ensures that it doesn't match an uppercase character at the end of the uppercase series if it is followed by any other letter.

If you also need to ensure numbers don't precede uppercase letters, you can use [^\p{L}\p{N}] in the place of \P{L}.

(?:(?<=\P{L})(?=(?:\p{Lu}\h+){2}\p{Lu})|\G(?!\A))\p{Lu}\K\h+(?=\p{Lu}(?!\p{L}))

Usage

x <- c(
    "Welcome to A I: the best W O R L D!",
    "Hi I R is the B O M B for sure: we A G R E E indeed."
)
gsub("(?:(?=\\b(?:\\p{Lu}\\h+){2}\\p{Lu})|\\G(?!\\A))\\p{Lu}\\K\\h+(?=\\p{Lu})", "", x, perl=TRUE)

Results

Input

Welcome to A I: the best W O R L D!
Hi I R is the B O M B for sure: we A G R E E indeed.

Output

Welcome to A I: the best WORLD!
Hi I R is the BOMB for sure: we AGREE indeed.

Explanation

  • (?:(?=\b(?:\p{Lu}\h+){2}\p{Lu})|\G(?!\A)) Match either of the following
    • (?=\b(?:\p{Lu}\h+){2}\p{Lu}) Positive lookahead ensuring what follows matches (used as an assertion in this case to find all locations in the string that are in the format A A A). You can also add \b at the end of this positive lookahead to ensure something like I A Name doesn't get matched
      • \b Assert position at a word boundary
      • (?:\p{Lu}\h+){2} Match the following exactly twice
        • \p{Lu} Match an uppercase character in any language (Unicode)
        • \h+ Match one or more horizontal whitespace characters
      • \p{Lu} Match an uppercase character in any language (Unicode)
    • \G(?!\A) Assert position at the end of the previous match (the (?!\A) rules out the start of the string, where \G would otherwise also match)
  • \p{Lu} Match an uppercase character in any language (Unicode)
  • \K Resets the starting point of the reported match. Any previously consumed characters are no longer included in the final match
  • \h+ Match one or more horizontal whitespace characters
  • (?=\p{Lu}) Positive lookahead ensuring what follows is an uppercase character in any language (Unicode)
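To make the roles of the two branches more concrete, here is a rough procedural sketch of the same idea using only Python's standard re module. It is not the pattern above: re has no \G, \K, \p{Lu} or \h, so this assumes ASCII-only stand-ins ([A-Z] and [ \t]) and emulates \G with an explicit loop. The names RUN_START, LETTER_GAP and strip_gaps are hypothetical.

```python
import re

# ASCII-only stand-ins (assumption): [A-Z] for \p{Lu}, [ \t] for \h.
# RUN_START plays the role of the lookahead branch: a position where "A A A" begins.
RUN_START = re.compile(r"\b(?=(?:[A-Z][ \t]+){2}[A-Z])")
# LETTER_GAP plays the role of \p{Lu}\K\h+(?=\p{Lu}): a letter, a gap, then a capital.
LETTER_GAP = re.compile(r"[A-Z][ \t]+(?=[A-Z])")


def strip_gaps(text):
    """Remove the whitespace inside runs of 3+ spaced capitals (hypothetical helper)."""
    out = []
    pos = 0
    while True:
        start = RUN_START.search(text, pos)
        if start is None:
            break
        i = start.start()
        out.append(text[pos:i])      # copy everything before the run unchanged
        # Emulate \G: every step must begin exactly where the previous one ended.
        step = LETTER_GAP.match(text, i)
        while step is not None:
            out.append(text[i])      # keep the letter (the part \K drops from the match)
            i = step.end()           # skip the gap (the part that gets replaced)
            step = LETTER_GAP.match(text, i)
        pos = i
    out.append(text[pos:])
    return "".join(out)


print(strip_gaps("Welcome to A I: the best W O R L D!"))
# Welcome to A I: the best WORLD!
```

The outer search corresponds to the first alternative (finding where a run starts), and the inner loop corresponds to the \G(?!\A) alternative (continuing only from where the last gap ended).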

Edit 2 (Python)

Below is the Python equivalent of the above (it requires the PyPI regex package to run). I replaced \h with [ \t] as the PyPI regex package doesn't currently support the \h token.

import regex
a = [
    "Welcome to A I: the best W O R L D!",
    "Hi I R is the B O M B for sure: we A G R E E indeed."
]

r = regex.compile(r"(?:(?=\b(?:\p{Lu}[ \t]+){2}\p{Lu})|\G(?!\A))\p{Lu}\K[ \t]+(?=\p{Lu})")
for i in a:
    print(r.sub('',i))

The regex above is based on the first pattern. If you're looking to use the second pattern, use this:

(?:(?<=\P{L})(?=(?:\p{Lu}[ \t]+){2}\p{Lu})|\G(?!\A))\p{Lu}\K[ \t]+(?=\p{Lu}(?!\p{L}))

Using a callback

Please see Wiktor's original answer regarding callbacks; this is simply a port of his R program to Python. It uses the standard re module rather than the PyPI regex library, so \p{Lu} is not available. Because the pattern uses [A-Z], it won't match non-ASCII (Unicode) uppercase letters.

import re
a = [
    "Welcome to A I: the best W O R L D!",
    "Hi I R is the B O M B for sure: we A G R E E indeed."
]

def repl(m):
    return re.sub(r"\s+",'',m.group(0))

for i in a:
    print(re.sub(r"(?:[A-Z]\s+){2,}[A-Z]", repl, i))
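The "customizable minimum" idea from the overview can be sketched with the same callback approach. This is only an illustration under some assumptions: the helper name collapse_spaced_caps and its min_letters parameter are hypothetical, the pattern is ASCII-only, it uses [ \t] instead of \s, and the \b anchors are an addition so that a run doesn't start or end inside a longer word (e.g. I A Name is left alone).

```python
import re


def collapse_spaced_caps(text, min_letters=3):
    """Join runs of at least `min_letters` single capitals separated by spaces/tabs.

    Hypothetical helper: ASCII-only ([A-Z]), standard `re` module.
    """
    if min_letters < 2:
        raise ValueError("min_letters must be at least 2")
    # (min_letters - 1) or more "letter + gap" pairs, then the closing letter.
    # The \b anchors keep the run from starting or ending inside a longer word.
    pattern = r"\b(?:[A-Z][ \t]+){%d,}[A-Z]\b" % (min_letters - 1)
    # The callback strips the gaps from each matched run.
    return re.sub(pattern, lambda m: re.sub(r"[ \t]+", "", m.group(0)), text)


print(collapse_spaced_caps("Welcome to A I: the best W O R L D!"))
# Welcome to A I: the best WORLD!
print(collapse_spaced_caps("Welcome to A I: the best W O R L D!", min_letters=2))
# Welcome to AI: the best WORLD!
```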

As I pointed out in the comments, the problem in the first gsubfn call in the question arises from there being two capture groups in the regex yet only one argument to the function. These need to match: two capture groups imply a need for two arguments. We can see what gsubfn is passing by running this and viewing the print statement's output:

junk <- gsubfn('(([A-Z]\\s+){2,}[A-Z])', ~ print(list(...)), x)

We can address this in any of the following ways:

1) This uses the regex from the question but uses a function that accepts multiple arguments. Only the first argument is actually used in the function.

gsubfn('(([A-Z]\\s+){2,}[A-Z])', ~ gsub("\\s+", "", ..1), x)
## [1] "Welcome to A I: the best WORLD!"              
## [2] "Hi I R is the BOMB for sure: we AGREE indeed."

Note that it interprets the formula as the function:

function (...) gsub("\\s+", "", ..1)

We can view the function generated from the formula like this:

fn$identity( ~ gsub("\\s+", "", ..1) )
## function (...) 
## gsub("\\s+", "", ..1)

2) This uses the regex from the question and also the function from the question but adds the backref = -1 argument which tells it to pass only the first capture group to the function -- the minus means do not pass the entire match either.

gsubfn('(([A-Z]\\s+){2,}[A-Z])', spacrm1, x, backref = -1)

(As @Wiktor Stribiżew points out in his answer backref=0 would also work.)

3) Another way to express this using the regex from the question is:

gsubfn('(([A-Z]\\s+){2,}[A-Z])', x + y ~ gsub("\\s+", "", x), x)

Note that it interprets the formula as this function:

function(x, y) gsub("\\s+", "", x)

The problem here is which items gsubfn passes to the spacrm functions: the number of arguments each spacrm function accepts does not match the number of arguments passed to it.

See the gsubfn docs about backref argument:

Number of backreferences to be passed to function. If zero or positive the match is passed as the first argument to the replacement function followed by the indicated number of backreferences as subsequent arguments. If negative then only the that number of backreferences are passed but the match itself is not. If omitted it will be determined automatically, i.e. it will be 0 if there are no backreferences and otherwise it will equal negative the number of back references. It determines this by counting the number of non-escaped left parentheses in the pattern.

So, in your case, the backref argument was omitted, and the spacrmX functions received two values, one per capture group: "W O R L D" and "L ".

The spacrm1 function that only accepts a single argument got two arguments, hence the unused argument ("L ") error.

When spacrm2 was used, it got both captured values, and they were concatenated (after whitespace removal).
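As an analogy only (this is Python's standard re module, not gsubfn), the same regex from the question can be inspected to see the two captured values. The repeated inner group keeps only its last iteration, which is where the "L " comes from. The variable names here are just for illustration.

```python
import re

text = "Welcome to A I: the best W O R L D!"

# Same pattern as in the question: an outer group around the whole run,
# and an inner group that is repeated for each "letter + whitespace" pair.
m = re.search(r"(([A-Z]\s+){2,}[A-Z])", text)

whole_match = m.group(0)   # the entire match
captures = m.groups()      # one entry per capture group

print(repr(whole_match))   # 'W O R L D'
print(captures)            # ('W O R L D', 'L ')
```

Two capture groups means two values, so a one-argument function receiving both is what triggers the unused argument error described above.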

You may actually just use backref=0 to tell gsubfn to handle only the whole match value, and simplify the pattern by removing the capturing groups and using a single non-capturing group instead:

library(gsubfn)

# With backref=0 only the whole match is passed, so one argument is enough
spacrm1 <- function(string) {gsub('\\s+', '', string)}
x <- c(
     'Welcome to A I: the best W O R L D!',
     'Hi I R is the B O M B for sure: we A G R E E indeed.'
)
gsubfn('(?:[A-Z]\\s+){2,}[A-Z]', spacrm1, x, backref=0)
[1] "Welcome to A I: the best WORLD!"              
[2] "Hi I R is the BOMB for sure: we AGREE indeed."
