Negated Named Regex, or Character Class Interpolation in Raku

This assumes that you just want to match the same quote character again:

token attribute-value { <string> }

token string {
  # match <quote> and expect to end with "$<quote>"
  <quote> ~ "$<quote>"

  [
    # update match structure in $/ otherwise "$<quote>" won't work
    {}

    <!before "$<quote>"> # next character isn't the same as $<quote>

    .    # any character

  ]*     # any number of times
}

token quote { <["']> }
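To try the idea out, here is a minimal sketch that wraps the tokens above in a grammar. The grammar name SameQuote and the TOP token are my own additions for illustration, and I've written the closing quote as a plain sequence rather than with the tilde so that a missing closer is just an ordinary failed match.

```raku
# Hypothetical wrapper grammar; only the string and quote tokens
# come from the answer above.
grammar SameQuote {
    token TOP { <string> }

    token string {
        <quote>                           # opening quote, single or double
        [ {} <!before "$<quote>"> . ]*    # {} refreshes $/ so "$<quote>" sees the capture
        "$<quote>"                        # the same character has to close the string
    }

    token quote { <["']> }
}

say so SameQuote.parse(q{"it's fine"});   # opening and closing quotes agree
say so SameQuote.parse(q{'value"});       # opening and closing quotes differ
```

The intent is for the first call to succeed (the single quote inside is just content) and for the second to fail, since the closing character is not the one that opened the string.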

For anything more complex, use something like the $*end-quote dynamic variable from the earlier answer.


There are a few different approaches that you can take; which one is best will probably depend on the rest of the structure you're employing.

But first, an observation on your current solution, and why opening it up to other quote characters won't work as written. Consider the string 'value". Should that parse? The structure you laid out would actually match it, because each <quote> token matches either a single or a double quote.
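The mismatch is easy to reproduce. The grammar below is a hypothetical stand-in for that kind of structure (the name LooseQuotes and the exact content pattern are assumptions, not the original code): both ends go through the same <quote> token, so nothing ties the closer to the opener.

```raku
# Hypothetical grammar: opening and closing quote are matched
# independently by the same token.
grammar LooseQuotes {
    token TOP   { <quote> <-['"]>* <quote> }
    token quote { <['"]> }
}

say so LooseQuotes.parse(q{'value'});   # True
say so LooseQuotes.parse(q{'value"});   # True as well, which is the problem
```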

Dealing with the inner part

The simplest solution is to make your inner part a non-greedy wildcard:

<quote> (.*?) <quote>

This will stop the match as soon as you reach a quote again. Also note the alternative syntax using a tilde, which lets the two terminal bits sit closer together:

<quote> ~ <quote> (.*?)
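To see what the non-greedy quantifier buys you, here is a small sketch using a plain regex with a literal double quote in place of the <quote> token. The sample input is made up for the example.

```raku
# Made-up input with two quoted values on one line.
my $input = q{title="first" alt="second"};

# Non-greedy: the capture stops at the first closing quote.
say ~$0 if $input ~~ / '"' (.*?) '"' /;   # first

# Greedy: the capture runs on to the last quote in the string.
say ~$0 if $input ~~ / '"' (.*) '"' /;    # first" alt="second
```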

Your initial attempt wanted to use a sort of non-match. This does exist in the form of an assertion, <!quote>, which will fail if a <quote> is found (which needn't be just a character, but anything arbitrarily complex). It doesn't consume anything, though, so you need to provide that separately. For instance,

[<!quote> .]*

will check that the next character is NOT a quote, and then consume it.

Lastly, you could use either of the two approaches inside a <content> token that handles the inside. This is actually a great approach if you intend to later do more complex things (e.g. escape characters).
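As a rough sketch of the last two ideas combined, the grammar below puts the negative assertion into its own content token. The names Inner and inner are placeholders I picked for the example.

```raku
# Hypothetical grammar: the content lives in its own token.
grammar Inner {
    token TOP   { <quote> <inner> <quote> }
    token inner { [ <!quote> . ]* }   # assert "not a quote", then consume one character
    token quote { <['"]> }
}

my $m = Inner.parse(q{"hello world"});
say ~$m<inner>;   # hello world
```

Because the content is a named token, it shows up as its own capture, and it can be swapped for something smarter later without touching TOP.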

Avoiding a mismatch

As I noted, your solution would parse mismatched quotes. So we need to have a way to ensure that the quote we are (not) matching is the same as the start one. One way to do this is using a multi token:

proto token attribute_value (|) { * }
multi token attribute_value:sym<'> { <sym> ~ <sym> <-[']>* }
multi token attribute_value:sym<"> { <sym> ~ <sym> <-["]>* }

(Using the actual token <sym> is not required; you could write it as { \' <-[']>* \' } if you wanted.)
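Here is a minimal sketch of the proto approach, written with literal quotes as in the parenthetical above and as a plain sequence instead of the tilde. The grammar name AttrValue, the TOP token, and the sym names single and double are my own choices for the example.

```raku
# Hypothetical grammar: one candidate token per quote style.
grammar AttrValue {
    token TOP { <attribute_value> }

    proto token attribute_value {*}
    token attribute_value:sym<single> { \' <-[']>* \' }   # '...' with no ' inside
    token attribute_value:sym<double> { \" <-["]>* \" }   # "..." with no " inside
}

say so AttrValue.parse(q{'value'});   # True
say so AttrValue.parse(q{"it's"});    # True, the ' is plain content here
say so AttrValue.parse(q{'value"});   # False, neither candidate can close
```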

Another way you could do this is by passing a parameter (either literally, or via dynamic variables). For example, you could write attribute_value as

token attribute_value {
    $<start-quote>=<quote>      # your actual start quote
    :my $*end-quote;            # define the variable in the regex scope
    { $*end-quote = ... }       # determine the requisite end quote (e.g. ” for “)
    <attribute_value_contents>  # handle actual content
    $*end-quote                 # fancy end quote
}

token attribute_value_contents {
    # We have access to $*end-quote here, so we can use
    # any ONE of the techniques we've described before
    # (they are alternatives, not a sequence; pick one)
    # (a) using a look ahead
    [<!before $*end-quote> .]*
    # (b) being lazy (the easier)
    .*?
    # (c) using another token (described below)
    <attr_value_content_char>+
}
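To make that concrete, here is a sketch that fills in the pieces with option (a). The %closers lookup table, the PairedQuotes name, the TOP token, and the particular quote characters are all assumptions for illustration; how you determine the end quote is up to you.

```raku
# Hypothetical table mapping each opening quote to its closer.
my %closers = "'" => "'", '"' => '"', '“' => '”';

grammar PairedQuotes {
    token TOP { <attribute_value> }

    token attribute_value {
        $<start-quote>=<quote>
        :my $*end-quote;                                  # visible in tokens called below
        { $*end-quote = %closers{ ~$<start-quote> } }     # look up the matching closer
        <attribute_value_contents>
        $*end-quote
    }

    # Option (a): consume anything that is not the end quote.
    token attribute_value_contents { [ <!before $*end-quote> . ]* }

    token quote { <['"“]> }
}

say so PairedQuotes.parse(q{“value”});   # curly opener, curly closer
say so PairedQuotes.parse(q{'it"s'});    # the double quote is just content
say so PairedQuotes.parse(q{“value"});   # wrong closer for this opener
```

The first two calls are meant to succeed and the last one to fail, since ” is the only closer accepted after “.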

I mention the last one because you can even further delegate if you ultimately decide to allow for escape characters. For example, you could then do

proto token attr_value_content_char (|) { * }
multi token attr_value_content_char:sym<escaped> { \\ $*end-quote }
multi token attr_value_content_char:sym<literal> { . <?{ $/ ne $*end-quote }> }

But if that's overkill for what you're doing, ah well :-)
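For what it's worth, here is a small sketch of the escape idea. Note that it swaps the proto/multi pair for a single token with ordered alternation (||), so that the escaped form is always tried first; the names EscapedQuotes and char are made up for the example, and the end quote is simply assumed to be the same character as the start quote.

```raku
# Hypothetical grammar: a backslash may escape the end quote.
grammar EscapedQuotes {
    token TOP {
        :my $*end-quote;                  # visible in <char> below
        $<start>=<quote>
        { $*end-quote = ~$<start> }       # assumption: same character closes the string
        <char>*
        $*end-quote
    }

    token char {
        || \\ $*end-quote                 # a backslash followed by the end quote
        || <!before $*end-quote> .        # any character except the end quote
    }

    token quote { <['"]> }
}

my $m = EscapedQuotes.parse(Q['it\'s']);
say $m<char>.elems if $m;                 # 4: i, t, \', s
```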

Anyways, there are probably other ways that didn't jump to my mind that others can think of, but that should hopefully put you on the right path. (Also, some of this code is untested, so there may be slight errors; apologies for that.)
