How to get the longest bracket pairs from a string

First Case

str = "xx <aa <bbb> <bbb> aa> yy<<dfa>a>";
StringCases[str, 
   RegularExpression["(?P<a><([^<>]|(?P>a))*>)"]
]
(* {"<aa <bbb> <bbb> aa>", "<<dfa>a>"} *)

This works as follows:

  • (?P<a> ...) gives the name a to the pattern <([^<>]|(?P>a))*>.
  • A string or substring matching this pattern must start with < and end with >.
  • Between these two characters, the subpattern ([^<>]|(?P>a)) can be repeated 0 or more times.
  • Each repetition matches either a single character other than < or >, or a nested balanced group. When a < is met while reading the string, [^<>] fails, so (?P>a) calls the pattern a recursively and we start again at bullet 2 with the substring starting at this character. When a > is met, the repetition ends and the closing > is matched.
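To see the recursion at work, the same pattern can be tried on a few short inputs. This is only a sketch: the name rx and the test strings are made up for illustration and are not part of the original example.

```mathematica
(* Same named-group recursion as above; rx is just a local name for it. *)
rx = RegularExpression["(?P<a><([^<>]|(?P>a))*>)"];

(* Properly nested: the recursion consumes <b>, so the whole string matches. *)
StringCases["<a<b>c>", rx]
(* {"<a<b>c>"} *)

(* The outer < is never closed, so only the inner pair can match. *)
StringCases["<a<b>", rx]
(* {"<b>"} *)

(* No balanced pair at all. *)
StringCases["a>b<", rx]
(* {} *)
```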

Second Case

str2 = "dd9[ab*[c]d]esiddx(45x(b(x99))"
StringCases[str2, 
   RegularExpression["(?P<a>(\\[|\\()([^\\[\\]\\(\\)]|(?P>a))*(\\]|\\)))"]
]
(* {"[ab*[c]d]", "(b(x99))"} *)

This works as above. Here, instead of < at the beginning of the (sub)string, we allow for [ or ( with (\\[|\\(). The other modifications are in line with this change.

Note that this regular expression may give unsatisfying results when the opening and closing delimiters are of different types, as in

str3 = "dd9[ab*[c]d)esiddx(45x(b(x99))";
(* The square bracket after d is replaced by a parenthesis. *)

StringCases[str3, 
   RegularExpression["(?P<a>(\\[|\\()([^\\[\\]\\(\\)]|(?P>a))*(\\]|\\)))"]
]
(* {"[ab*[c]d)", "(b(x99))"} *)

The first element starts with [ but ends with ). This can be avoided by naming the opening [ and adding a conditional subpattern that tests whether it was matched:

StringCases[str3, 
   RegularExpression["(?P<a>((?P<b>\\[)|\\()([^\\[\\]\\(\\)]|(?P>a))*(?(b)\\]|\\)))"]
]
(* {"[c]", "(b(x99))"} *)

The starting [ is captured as group b. The conditional (?(b)\\]|\\)) says that if b matched, the closing character must be ]; otherwise it must be ).
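The effect of the conditional is easiest to see on minimal inputs. The following sketch compares the two patterns from this case on four short strings; the names rxLoose, rxStrict and tests are assumptions introduced here for illustration.

```mathematica
(* Pattern without the conditional: any opener may pair with any closer. *)
rxLoose = RegularExpression[
   "(?P<a>(\\[|\\()([^\\[\\]\\(\\)]|(?P>a))*(\\]|\\)))"];

(* Pattern with the conditional: the closer must correspond to the opener. *)
rxStrict = RegularExpression[
   "(?P<a>((?P<b>\\[)|\\()([^\\[\\]\\(\\)]|(?P>a))*(?(b)\\]|\\)))"];

tests = {"[x)", "(x]", "[x]", "(x)"};

StringCases[tests, rxLoose]
(* {{"[x)"}, {"(x]"}, {"[x]"}, {"(x)"}} *)

StringCases[tests, rxStrict]
(* {{}, {}, {"[x]"}, {"(x)"}} *)
```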


This works:

str = "xx <aa <bbb> <bbb> aa> yy<<dfa>a>";

StringCases[str, "<" ~~ Shortest@s___ ~~ ">" /; StringCount[s, "<"] == StringCount[s, ">"]]
{"<aa <bbb> <bbb> aa>", "<<dfa>a>"}

Or equivalently

StringCases[str, 
 s : RegularExpression["<.*?>"] /; StringCount[s, "<"] == StringCount[s, ">"]]
{"<aa <bbb> <bbb> aa>", "<<dfa>a>"}

Of course, this isn't a pure regex approach: the method relies on Condition. A similar approach is used in this answer of mine, which gives an extended explanation of how Condition works together with the lazy quantifier Shortest (or *? in regex).
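As a rough illustration of that interplay, one can list by hand the candidates that start at the first < and end at each later >, together with the result of the count test. As far as I understand it, the lazy quantifier offers these candidates from shortest to longest and Condition rejects each one until the counts agree. The names demo, ends and candidates are assumptions made for this sketch.

```mathematica
demo = "<aa <bbb> aa>";

(* Positions of every ">" in the string. *)
ends = StringPosition[demo, ">"][[All, 1]];

(* Candidates starting at the first character and ending at each ">". *)
candidates = StringTake[demo, {1, #}] & /@ ends;

(* Pair each candidate with the outcome of the balance test. *)
{#, StringCount[#, "<"] == StringCount[#, ">"]} & /@ candidates
(* {{"<aa <bbb>", False}, {"<aa <bbb> aa>", True}} *)
```

The first candidate is rejected because it contains two < but only one >, so the match is extended to the next >.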


The second problem can be solved using two patterns of the same type as alternatives:

Clear[balanced]
balanced[{l_, r_}] := 
 HoldPattern[(left : l ~~ Shortest@s___ ~~ right : r) /; 
   StringCount[s, left] == StringCount[s, right]]

str2 = "dd9[ab*[c]d]esiddx(45x(b(x99))";

StringCases[str2, balanced /@ {{"[", "]"}, {"(", ")"}}]
{"[ab*[c]d]", "(b(x99))"}

Or we can combine them into a single pattern as follows:

StringCases[str2, (left : "[" | "(" ~~ Shortest@s___ ~~ right : "]" | ")") /; 
  MatchQ[{left, right}, {"[", "]"} | {"(", ")"}] && 
   StringCount[s, left] == StringCount[s, right]]
{"[ab*[c]d]", "(b(x99))"}

This is not a regular expression, but counting the left and right delimiters and locating the positions where the running counts are equal finds the top-level bracket pairs:

str1 = "xx<aa<bbb> <bbb>aa>yy<<dfa>a>";
str2 = "dd9[ab*[c]d]esiddx(45x(b(x99))";

f[l_, r_, str_] := Module[{sum, pos},
   sum = Accumulate[StringCases[str, l | r] /. {l -> 1, r -> -1}];
   pos = First /@ StringPosition[str, (l | r)];
   Partition[(First /@ 
       SplitBy[Transpose[{sum, pos}], #[[1]] == 0 &])[[All, 2]], 2]
   ];
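The intermediate values make the idea clearer. The sketch below recomputes the running sum and the delimiter positions for str1 outside of f; the names delims, depth and pos are introduced here only for illustration.

```mathematica
str1 = "xx<aa<bbb> <bbb>aa>yy<<dfa>a>";

(* All delimiters in order of appearance. *)
delims = StringCases[str1, "<" | ">"];

(* Running nesting depth: +1 for each "<", -1 for each ">". *)
depth = Accumulate[delims /. {"<" -> 1, ">" -> -1}]
(* {1, 2, 1, 2, 1, 0, 1, 2, 1, 0} *)

(* Position of each delimiter in the string. *)
pos = StringPosition[str1, "<" | ">"][[All, 1]]
(* {3, 6, 10, 12, 16, 19, 22, 23, 27, 29} *)

(* A depth of 0 marks the closing delimiter of a top-level pair. *)
Pick[pos, depth, 0]
(* {19, 29} *)
```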

Works for strings with complete pairs:

f["<", ">", str1]
f["[", "]", str2]
{{3, 19}, {22, 29}}
{{4, 12}}

But it does not work for, e.g., f["(", ")", str2], because str2 has one more opening ( than closing ).
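One possible way around this is to scan the characters with an explicit stack of opener positions, so that an unmatched opener is simply left on the stack. This is a minimal sketch and not part of the approach above: outerPairs is a hypothetical helper, and it assumes that both delimiters are single characters.

```mathematica
(* Hypothetical helper: {start, end} positions of the outermost matched pairs. *)
(* Assumes l and r are single-character strings. *)
outerPairs[l_String, r_String, str_String] :=
 Module[{stack = {}, pairs = {}, open},
  MapIndexed[
   Function[{ch, idx},
    Which[
     (* Opener: remember its position. *)
     ch === l,
     AppendTo[stack, First[idx]],
     (* Closer with a pending opener: pop it and record the pair, *)
     (* dropping earlier pairs that lie inside the new one. *)
     ch === r && stack =!= {},
     (open = Last[stack];
      stack = Most[stack];
      pairs = Append[Select[pairs, First[#] < open &], {open, First[idx]}])
     ]],
   Characters[str]];
  pairs]

str2 = "dd9[ab*[c]d]esiddx(45x(b(x99))";

outerPairs["(", ")", str2]
(* {{23, 30}} *)

StringTake[str2, outerPairs["(", ")", str2]]
(* {"(b(x99))"} *)
```

Closers that arrive while the stack is empty are ignored, and openers still on the stack at the end are treated as unmatched.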