Simply using parsec in Python

I encourage you to define your parser with the provided combinators rather than constructing a Parser directly.

If you do want to construct a Parser by wrapping a function, then, as the documentation states, fn should accept two arguments: the text and the current position. It should return a Value created with Value.success or Value.failure, not a boolean. Grep for @Parser in parsec/__init__.py in this package to find more examples of how it works.

For your case in the description, you could define the parser as follows:

import re  # imported explicitly for re.MULTILINE below

from parsec import *

spaces = regex(r'\s*', re.MULTILINE)
name = regex(r'[_a-zA-Z][_a-zA-Z0-9]*')

tag_start = spaces >> string('<') >> name << string('>') << spaces
tag_stop = spaces >> string('</') >> name << string('>') << spaces

@generate
def header_kv():
    key = yield spaces >> name << spaces
    yield string(':')
    value = yield spaces >> regex('[^\n]+')
    return {key: value}

@generate
def header():
    tag_name = yield tag_start
    values = yield sepBy(header_kv, string('\n'))
    tag_name_end = yield tag_stop
    assert tag_name == tag_name_end
    return {
        'type': 'tag',
        'name': tag_name,
        'values': values
    }

@generate
def body():
    tag_name = yield tag_start
    values = yield sepBy(sepBy1(regex(r'[^\n<,]+'), string(',')), string('\n'))
    tag_name_end = yield tag_stop
    assert tag_name == tag_name_end
    return {
        'type': 'tag',
        'name': tag_name,
        'values': values
    }

parser = header + body

If you run parser.parse(mystr), it yields

({'type': 'tag',
  'name': 'kv',
  'values': [{'key1': '"string"'},
             {'key2': '1.00005'},
             {'key3': '[1,2,3]'}]},
 {'type': 'tag',
  'name': 'csv',
  'values': [['date', 'windspeed', 'direction'],
             ['20190805', '22', 'NNW'],
             ['20190805', '23', 'NW'],
             ['20190805', '20', 'NE']]}
)

You can refine the definition of values in the above code to get the result in the exact form you want.


According to the tests, the proper way to parse your string would be the following:

from parsec import *

possible_chars = letter() | space() | one_of('/.,:"[]') | digit()
parser = many(many(possible_chars) + string("<") >> mark(many(possible_chars)) << string(">"))

parser.parse(mystr)
# [((1, 1), ['k', 'v'], (1, 3)), ((5, 1), ['/', 'k', 'v'], (5, 4)), ((6, 1), ['c', 's', 'v'], (6, 4)), ((11, 1), ['/', 'c', 's', 'v'], (11, 5))]

The construction of the parser:


For the sake of convenience, we first define the characters we wish to match. parsec provides many primitive parsers:

  • letter(): matches any alphabetic character,

  • string(str): matches any specified string str,

  • space(): matches any whitespace character,

  • spaces(): matches zero or more whitespace characters,

  • digit(): matches any digit,

  • eof(): matches the end of the input string,

  • regex(pattern): matches a provided regex pattern,

  • one_of(str): matches any character from the provided string,

  • none_of(str): matches any character that is not in the provided string.


We can combine them with operators, according to the docs:

  • |: This combinator implements choice. The parser p | q first applies p. If it succeeds, the value of p is returned. If p fails without consuming any input, parser q is tried. Note that this choice does not backtrack,

  • +: Joins two or more parsers into one and returns the aggregate of their results,

  • ^: Choice with backtracking. This combinator is used whenever arbitrary lookahead is needed. The parser p ^ q first applies p. If it succeeds, the value of p is returned. If p fails, it pretends that it hasn't consumed any input, and then parser q is tried,

  • <<: Ends with a specified parser: the result of the left parser is kept, and the ending parser consumes its input,

  • <: Ends with a specified parser: the result of the left parser is kept, and the ending parser must match but consumes no input,

  • >>: Sequentially compose two actions, discarding any value produced by the first,

  • mark(p): Marks the line and column information of the result of the parser p.


Then there are multiple "combinators":

  • times(p, mint, maxt=None): Repeats parser p from mint to maxt times,

  • count(p, n): Repeats parser p exactly n times. If n is less than or equal to zero, the parser simply returns an empty list,

  • optional(p, default_value=None): Makes a parser optional. On success it returns the result; otherwise it silently returns default_value without raising any exception. If default_value is not provided, None is returned instead,

  • many(p): Repeats parser p zero or more times,

  • many1(p): Repeats parser p one or more times,

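  • separated(p, sep, mint, maxt=None, end=None): Repeats parser p, separated by sep, from mint to maxt times; end controls whether a trailing separator is required, forbidden or optional,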

  • sepBy(p, sep): parses zero or more occurrences of parser p, separated by delimiter sep,

  • sepBy1(p, sep): parses at least one occurrence of parser p, separated by delimiter sep,

  • endBy(p, sep): parses zero or more occurrences of p, separated and ended by sep,

  • endBy1(p, sep): parses at least one occurrence of p, separated and ended by sep,

  • sepEndBy(p, sep): parses zero or more occurrences of p, separated and optionally ended by sep,

  • sepEndBy1(p, sep): parses at least one occurrence of p, separated and optionally ended by sep.


Putting it all together, we have a parser that repeatedly matches any number of possible_chars followed by a <, discards them, and then marks the run of possible_chars up to the closing >, which is consumed but not returned.