How to set up a grammar that can handle ambiguity

import lark
grammar = r'''start: instruction

?instruction: simple
            | func

MIDTEXTRPAR: /\)+(?!(\)|,,|$))/
SINGLESTR: (LETTER+|DIGIT+|"_"|" ") (LETTER+|DIGIT+|"_"|" "|"("|MIDTEXTRPAR)*
FUNCNAME: (LETTER+) (LETTER+|DIGIT+|"_")* // no parentheses allowed in the func name
DB: "!" SINGLESTR (WORDSEP SINGLESTR)*
TEXT: "$" SINGLESTR
MD: "#" SINGLESTR
simple: TEXT|DB|MD
ARGSEP: ",," // argument separator
WORDSEP: "," // word separator
CONDSEP: ";;" // condition separator
STAR: "*"
func: "&" FUNCNAME "(" [simple|func] (ARGSEP (simple|func))* ")"

%import common.LETTER
%import common.WORD
%import common.DIGIT
%ignore ARGSEP
%ignore WORDSEP
'''

parser = lark.Lark(grammar, parser='earley')
parser.parse("&foo($first arg (has) parentheses,,$second arg)")

Output:

Tree(start, [Tree(func, [Token(FUNCNAME, 'foo'), Tree(simple, [Token(TEXT, '$first arg (has) parentheses')]), Token(ARGSEP, ',,'), Tree(simple, [Token(TEXT, '$second arg')])])])

I hope it's what you were looking for.

These have been a crazy few days. I tried lark and failed. I also tried parsimonious and pyparsing. All of these parsers had the same problem: the 'argument' token consumed the right parenthesis that belonged to the function, and parsing eventually failed because the function's parentheses were never closed.

The trick was to figure out how to define a right parenthesis that's "not special". See the regular expression for MIDTEXTRPAR in the code above. I defined it as a right parenthesis that is not followed by another right parenthesis, by an argument separator, or by the end of the string. I did that with the regular expression extension (?!...), a negative lookahead, which matches only if the text that follows does not match ... and which consumes no characters. Luckily, it even allows matching the end of the string ($) inside the lookahead.
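To see what the lookahead does in isolation, here is a minimal sketch that runs the same pattern through Python's standard re module. The helper name mid_text_rpars is made up for illustration, and I am assuming the pattern behaves the same way inside lark's lexer as it does here.

```python
import re

# Same pattern as MIDTEXTRPAR in the grammar above: one or more ")" that are
# NOT followed by another ")", by the ",," argument separator, or by end of string.
MIDTEXTRPAR = re.compile(r"\)+(?!(\)|,,|$))")

def mid_text_rpars(s):
    """Return (position, text) for every ")" run the pattern accepts as plain text."""
    return [(m.start(), m.group()) for m in MIDTEXTRPAR.finditer(s)]

example = "&foo($first arg (has) parentheses,,$second arg)"

# Only the ")" after "has" is reported; the final ")" is followed by the end
# of the string, so the lookahead rejects it and it stays "special".
print(mid_text_rpars(example))
```

The lookahead only inspects the following characters, so the matched text is just the parenthesis run itself and the separator is left for the rest of the grammar to consume.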

EDIT:

The above-mentioned method only works if no argument ends with a ), because the MIDTEXTRPAR regular expression won't match that ) and the parser will think it's the end of the function even though there are more arguments to process. There may also be ambiguities such as ...asdf),,...: this may be the end of a function declaration inside an argument, or a 'text-like' ) inside an argument, with the function declaration going on.
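The failure case can be reproduced with the regular expression alone. A small sketch using Python's standard re module; plain_rpar_positions is a hypothetical helper, not part of lark:

```python
import re

# Same pattern as MIDTEXTRPAR in the grammar above.
MIDTEXTRPAR = re.compile(r"\)+(?!(\)|,,|$))")

def plain_rpar_positions(s):
    """Start positions of every ")" run the pattern accepts as ordinary text."""
    return [m.start() for m in MIDTEXTRPAR.finditer(s)]

ok_arg = "$first arg (has) parentheses"    # ")" sits in the middle of the text
bad_arg = "$first arg (has),,$second arg"  # ")" is the last character of the argument

print(plain_rpar_positions(ok_arg))   # one position: the ")" after "has"
print(plain_rpar_positions(bad_arg))  # empty list: the ")" before ",," is rejected
```

In the second string the ) is text as far as the author of the input is concerned, but the pattern cannot tell it apart from a ) that closes a function.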

This problem is related to the fact that what you describe in your question is not a context-free grammar (https://en.wikipedia.org/wiki/Context-free_grammar) for which parsers such as lark exist. Instead it is a context-sensitive grammar (https://en.wikipedia.org/wiki/Context-sensitive_grammar).

It is a context-sensitive grammar because you need the parser to 'remember' that it is nested inside a function, and how many levels of nesting there are, and to have this memory available inside the grammar's syntax in some way.

EDIT2:

Also take a look at the following parser, which is context-sensitive and seems to solve the problem, but has exponential time complexity in the number of nested functions, since it tries every possible function boundary until it finds one that works. I believe the complexity has to be exponential since the grammar is not context-free.


_funcPrefix = '&'
_debug = False

class ParseException(Exception):
    pass

def GetRecursive(c):
    if isinstance(c,ParserBase):
        return c.GetRecursive()
    else:
        return c

class ParserBase:
    def __str__(self):
        return type(self).__name__ + ": [" + ','.join(str(x) for x in self.contents) +"]"
    def GetRecursive(self):
        return (type(self).__name__,[GetRecursive(c) for c in self.contents])

class Simple(ParserBase):
    def __init__(self,s):
        self.contents = [s]

class MD(Simple):
    pass

class DB(ParserBase):
    def __init__(self,s):
        self.contents = s.split(',')

class Func(ParserBase):
    def __init__(self,s):
        if not s.endswith(')'):  # also safe when s is empty
            raise ParseException("Can't find right parenthesis: '%s'" % s)
        lparInd = s.find('(')
        if lparInd < 0:
            raise ParseException("Can't find left parenthesis: '%s'" % s)
        self.contents = [s[:lparInd]]
        argsStr = s[(lparInd+1):-1]
        args = list(argsStr.split(',,'))
        i = 0
        while i<len(args):
            a = args[i]
            if not a.startswith(_funcPrefix):  # also safe when a is empty
                self.contents.append(Parse(a))
                i += 1
            else:
                j = i+1
                while j<=len(args):
                    nestedFunc = ',,'.join(args[i:j])
                    if _debug:
                        print(nestedFunc)
                    try:
                        self.contents.append(Parse(nestedFunc))
                        break
                    except ParseException as PE:
                        if _debug:
                            print(PE)
                        j += 1
                if j>len(args):
                    raise ParseException("Can't parse nested function: '%s'" % (',,'.join(args[i:])))
                i = j

def Parse(arg):
    # Dispatch on the prefix character; an empty argument has no prefix.
    if not arg or arg[0] not in _starterSymbols:
        raise ParseException("Bad prefix: '%s'" % arg[:1])
    return _starterSymbols[arg[0]](arg[1:])

_starterSymbols = {_funcPrefix:Func,'$':Simple,'!':DB,'#':MD}

P = Parse("&foo($first arg (has)) parentheses,,&f($asdf,,&nested2($23423))),,&second(!arg,wer))")
print(P)

import pprint
pprint.pprint(P.GetRecursive())