re.findall('(ab|cd)', string) vs re.findall('(ab|cd)+', string)

+ is a repeat quantifier that matches one or more times. In the regex (ab|cd)+, you are repeating the capture group (ab|cd) using +. The group takes part in every iteration, but it stores only the text matched by the last one.

You can reason about this behaviour as follows:

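Say your string is abcdla and the regex is (ab|cd)+. The regex engine first matches the group against ab at positions 0 and 1 and exits the capture group. Then it sees the + quantifier, so it tries the group again and matches cd at positions 2 and 3, overwriting the earlier capture. The next character, l, matches neither alternative, so the repetition stops: the overall match is abcd, but the group holds only cd.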
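A minimal sketch of that walkthrough, using the same string abcdla. The variable names are only for illustration; the spans are end-exclusive, so (2, 4) covers positions 2 and 3.

```python
import re

string = 'abcdla'

# Without the quantifier, every match is a single alternative.
print(re.findall('(ab|cd)', string))   # ['ab', 'cd']

# With +, one match spans 'abcd', but the group keeps only the last iteration.
print(re.findall('(ab|cd)+', string))  # ['cd']

m = re.search('(ab|cd)+', string)
print(m.group(0), m.span(0))  # abcd (0, 4)
print(m.group(1), m.span(1))  # cd (2, 4)
```

findall returns the group's content rather than the whole match whenever the pattern contains a group, which is why the second call shows cd and not abcd.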


If you want to capture the whole repeated run rather than just its last iteration, wrap the repeated group in another capture group: ((ab|cd)+). This pattern has two groups: the outer one captures abcd and the inner one still captures only cd. Since we don't care about the inner group's capture, you can make it non-capturing with ((?:ab|cd)+), which captures just abcd.
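Here is a short sketch of both variants on abcdla. The last step, splitting the captured run back into its pieces with a second findall, is one possible follow-up and not the only way to do it.

```python
import re

string = 'abcdla'

# Two groups: the outer one holds the whole run, the inner one the last iteration.
print(re.findall('((ab|cd)+)', string))    # [('abcd', 'cd')]

# Inner group made non-capturing: one group left, so findall returns plain strings.
print(re.findall('((?:ab|cd)+)', string))  # ['abcd']

# To list every iteration, re-scan the captured run with the unquantified pattern.
run = re.search('((?:ab|cd)+)', string).group(1)
print(re.findall('ab|cd', run))            # ['ab', 'cd']
```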

https://www.regular-expressions.info/captureall.html

From the linked page:

Let’s say you want to match a tag like !abc! or !123!. Only these two are possible, and you want to capture the abc or 123 to figure out which tag you got. That’s easy enough: !(abc|123)! will do the trick.

Now let’s say that the tag can contain multiple sequences of abc and 123, like !abc123! or !123abcabc!. The quick and easy solution is !(abc|123)+!. This regular expression will indeed match these tags. However, it no longer meets our requirement to capture the tag’s label into the capturing group. When this regex matches !abc123!, the capturing group stores only 123. When it matches !123abcabc!, it only stores abc.
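The tag example from the quote can be tried in Python roughly like this. The second pattern, with a non-capturing inner group wrapped in an outer capture group, is an assumption about how one might recover the full label; the variable names are illustrative.

```python
import re

tags = ['!abc!', '!abc123!', '!123abcabc!']

for tag in tags:
    # Repeated capture group: keeps only the last iteration.
    last = re.match(r'!(abc|123)+!', tag).group(1)
    # Capture group around the repetition: keeps the whole label.
    label = re.match(r'!((?:abc|123)+)!', tag).group(1)
    print(f'{tag}: last iteration = {last}, full label = {label}')
```

For !abc123! the first pattern stores 123 and the second stores abc123; for !123abcabc! they store abc and 123abcabc.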


I don't know if this will make things clearer, but let's try to imagine what happens under the hood in a simple way. We are going to simulate it using match:

    import re

    # group(0) returns the whole matched string. The captured groups are returned by
    # groups(), or you can access them with group(1), group(2), ... In your case there
    # is only one group, and a group can hold only one piece of text at a time.
    string = 'abcdla'
    print(re.match('(ab|cd)', string).group(0))  # ab: the match, and also what the group captures
    m = re.match('(ab|cd)+', string)
    print(m.group(0))  # abcd: the whole match
    print(m.group(1))  # cd: the group keeps only the last iteration

findall matches and consumes the string as it goes. Let's imagine what happens with the regex '(ab|cd)':

      'abcdabla' ---> 1:   match: 'ab' |  capture : 'ab'  | left to process:  'cdabla'
      'cdabla'   ---> 2:   match: 'cd' |  capture : 'cd'  | left to process:  'abla'
      'abla'     ---> 3:   match: 'ab' |  capture : 'ab'  | left to process:  'la'
      'la'       ---> 4:   no match    |  capture : none  | left to process:  ''

      ---> final result :   ['ab', 'cd', 'ab']

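Now the same thing with '(ab|cd)+'. The single match covers three iterations ('ab', 'cd', 'ab'), and the group keeps only the last one:

      'abcdabla' ---> 1:   match: 'abcdab' |  capture : 'ab'  | left to process:  'la'
      'la'       ---> 2:   no match        |  capture : none  | left to process:  ''

      ---> final result :   ['ab']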
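The two traces above can be reproduced with re.finditer, which yields one match object per step. The helper name trace is hypothetical and only there for illustration; it collects group(1) from each match, which is what findall returns for a pattern with a single group.

```python
import re

def trace(pattern, string):
    """Print each match's span, full text and group 1, then return the captures."""
    captured = []
    for m in re.finditer(pattern, string):
        print(m.span(), repr(m.group(0)), repr(m.group(1)))
        captured.append(m.group(1))
    return captured

print(trace('(ab|cd)', 'abcdabla'))   # ['ab', 'cd', 'ab']
print(trace('(ab|cd)+', 'abcdabla'))  # ['ab']
```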

I hope this clears things up a little.

Tags:

Python

Regex