How can I match nested brackets using regex?

Many regex implementations cannot match arbitrarily deep nesting. However, Perl and PHP (via PCRE) support recursive patterns, and .NET can handle nesting with balancing groups.

A demo in Perl:

#!/usr/bin/perl -w

my $text = '(outer
   (center
     (inner)
     (inner)
   center)
 ouer)
 (outer
   (inner)
 ouer)
 (outer
 ouer)';

while($text =~ /(\(([^()]|(?R))*\))/g) {
  print("----------\n$1\n");
}

which will print:

----------
(outer
   (center
     (inner)
     (inner)
   center)
 ouer)
----------
(outer
   (inner)
 ouer)
----------
(outer
 ouer)

Or, the PHP equivalent:

$text = '(outer
   (center
     (inner)
     (inner)
   center)
 ouer)
 (outer
   (inner)
 ouer)
 (outer
 ouer)';

preg_match_all('/(\(([^()]|(?R))*\))/', $text, $matches);

print_r($matches);

which produces:

Array
(
    [0] => Array
        (
            [0] => (outer
   (center
     (inner)
     (inner)
   center)
 ouer)
            [1] => (outer
   (inner)
 ouer)
            [2] => (outer
 ouer)
        )

...

An explanation:

(          # start group 1
  \(       #   match a literal '('
  (        #   group 2
    [^()]  #     any char other than '(' and ')'
    |      #     OR
    (?R)   #     recursively match the entire pattern
  )*       #   end group 2 and repeat zero or more times
  \)       #   match a literal ')'
)          # end group 1
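
To make the recursion concrete, here is a rough Python sketch of a hand-written matcher that follows the same grammar as the pattern: a '(', then any mix of non-parenthesis characters and nested groups, then a ')'. The names match_group and find_groups are made up for this illustration. It only mirrors how (?R) re-enters the whole pattern; it is not how any regex engine is actually implemented.

```python
def match_group(s, i):
    """Try to match one balanced group starting at s[i].

    Returns the index just past the closing ')' on success, or None.
    """
    if i >= len(s) or s[i] != '(':      # \(
        return None
    i += 1
    while i < len(s) and s[i] != ')':   # ( ... )*
        if s[i] == '(':                 # (?R): re-enter the whole pattern
            i = match_group(s, i)
            if i is None:
                return None
        else:                           # [^()]
            i += 1
    if i < len(s):                      # \)
        return i + 1
    return None                         # ran out of input: unbalanced


def find_groups(s):
    """Collect non-overlapping matches left to right, like a /g loop."""
    groups, i = [], 0
    while i < len(s):
        end = match_group(s, i)
        if end is None:
            i += 1                      # no match here, try the next position
        else:
            groups.append(s[i:end])
            i = end                     # resume after the match
    return groups


print(find_groups('(a (b) c) x (d) (e'))
# prints ['(a (b) c)', '(d)']
```

The trailing '(e' is skipped because it never closes, which is also what the regex would do with it.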

EDIT

Note @Goozak's comment:

A better pattern might be \(((?>[^()]+)|(?R))*\) (from PHP:Recursive patterns). For my data, Bart's pattern was crashing PHP when it encountered a (long string) without nesting. This pattern went through all my data without problem.


Don't use regex.

Instead, a simple recursive function will suffice. Here's the general structure:

def recursive_bracket_parser(s, i):
    while i < len(s):
        if s[i] == '(':
            i = recursive_bracket_parser(s, i+1)
        elif s[i] == ')':
            return i+1
        else:
            # process whatever is at s[i]
            i += 1
    return i

For example, here's a function that will parse the input into a nested list structure:

def parse_to_list(s, i=0):
    result = []
    while i < len(s):
        if s[i] == '(':
            i, r = parse_to_list(s, i+1)
            result.append(r)
        elif s[i] == ')':
            return i+1, result
        else:
            result.append(s[i])
            i += 1
    return i, result

Calling parse_to_list('((a) ((b)) ((c)(d)))efg') returns the tuple (23, [[['a'], ' ', [['b']], ' ', [['c'], ['d']]], 'e', 'f', 'g']): the index where parsing stopped, followed by the nested list.
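
One thing to watch for: parse_to_list accepts unbalanced input silently (an unclosed '(' or a stray ')' raises no error), and very deep nesting could hit Python's recursion limit. If that matters for your data, the same idea can be written with an explicit stack. This is only a sketch; parse_nested is a name I made up, and raising ValueError is my choice rather than anything the code above does.

```python
def parse_nested(s):
    """Parse s into nested lists; raise ValueError on unbalanced brackets."""
    root = []
    stack = [root]                  # stack[-1] is the list being filled
    for pos, ch in enumerate(s):
        if ch == '(':
            child = []
            stack[-1].append(child)
            stack.append(child)     # descend into the new group
        elif ch == ')':
            if len(stack) == 1:
                raise ValueError(f"unmatched ')' at index {pos}")
            stack.pop()             # climb back to the parent group
        else:
            stack[-1].append(ch)
    if len(stack) > 1:
        raise ValueError(f"{len(stack) - 1} unclosed '('")
    return root


print(parse_nested('((a) ((b)) ((c)(d)))efg'))
# prints [[['a'], ' ', [['b']], ' ', [['c'], ['d']]], 'e', 'f', 'g']
```

For balanced input it builds the same nested list as parse_to_list, just without the index in the return value.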
