Grammar and Unicode characters

From the « and » "left and right word boundary" doc:

[«] matches positions where there is a non-word character at the left, or the start of the string, and a word character to the right.

The character in question isn't a word character, so the word boundary assertion fails.

What is and isn't a "word character"

"word", in the sense of the \w character class, has the same definition in P6 as it does in P5 (when not using the P5 /a regex modifier), namely letters, some decimal digits, or an underscore:

  • Characters whose Unicode general category starts with an L, which stands for Letter.[1]

  • Characters whose Unicode general category is Nd, which stands for Number, decimal.[2]

  • _, an underscore.
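To make the three bullets concrete, here is a minimal sketch that restates them as a single regex and compares the result with \w. The helper name is-word-char is made up for illustration, and the sample characters are my own pick:

```raku
# Hypothetical helper: the three bullets above, written as one regex.
sub is-word-char(Str $c --> Bool) {
    so $c ~~ / ^ [ <:L> | <:Nd> | '_' ] $ /
}

# Print the category-based answer next to what \w itself says.
for <y Y ら 7 _ ✓ Ⅲ> -> $c {
    say "$c  by category: {is-word-char($c)}  \\w: {so $c ~~ / ^ \w $ /}";
}
```

If the definition above holds, the two columns should agree on every line: True for the letters, the digit and the underscore, False for ✓ and Ⅲ.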

"alpha 'Nd under"

In a comment below @p6steve++ contributes a cute mnemonic that adds "under" to the usual "alphanum".

But "num" is kinda wrong because it isn't any number but only some decimal digits, specifically the characters that match the Unicode General Category Nd (matched by P6 regex /<:Nd>/).[2]

This leads naturally to alphaNdunder (alpha Nd under) pronounced "alpha 'nd under".

Footnotes

[1] Letters are matched by the P6 regex /<:L>/. This includes Ll (Letter, lowercase) (matched by /<:Ll>/) as JJ notes, but also others including Lu (Letter, uppercase) and Lo (Letter, other), the latter of which includes the character JJ also mentions. There are other letter sub-categories too.

[2] Decimal digits with the Unicode general category Nd are matched by the P6 regex /<:Nd>/. This covers decimal digits that can be chained together to produce arbitrarily large decimal numbers where each digit position adds a power of ten. It excludes decimal digits that have a "typographic context" (my phrasing follows the example of Wikipedia). For example, 1 is the English decimal digit denoting one; it is included. But characters such as ¹ are excluded because they have a "typographic context". For a billion+ people their native languages use १ to denote one, and it is included in the Nd category for decimal digits. But for another billion+ people their native languages use 一 for one, and it is excluded from the Nd category (it is in the L category for letters instead). Similarly ६ (Devanagari 6) is included in the Nd category but 六 (Han number 6) is excluded.
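A quick way to check the digit examples yourself; this is only a sketch, and the sample characters are ones I picked to illustrate the footnote:

```raku
# For each sample, print its general category, whether it is Nd,
# and whether \w accepts it.
for '1', '¹', '१', '一', '६', '六' -> $c {
    say "$c  {$c.uniprop}  Nd: {so $c ~~ / <:Nd> /}  \\w: {so $c ~~ / \w /}";
}
```

1, १ and ६ should report Nd, ¹ should report No, and 一 and 六 should report Lo. Note that the Han numerals still match \w, but as letters rather than as digits.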


I keep starting my answers with "Raiph is right". But he is. Also, an example of why this is so:

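for <y ✓ Ⅲ> {
    say $_.uniprops;   # Unicode general category of each character in the string
    say m/<|w>/;       # <|w> is the word boundary anchor; Nil when nothing matches
}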

The second line of the loop matches against the word boundary anchor <|w>; only the first character, which can be part of an actual word, matches it. The first line of the loop prints the Unicode general category: the first character is a letter (Ll); the other two are not. You can use any character in your grammar, but only word characters (letters, Nd decimal digits and the underscore) can actually form words, so only they can sit next to a word boundary.

grammar G {
  proto rule TOP { * }

  rule TOP:sym<y>  { «<.sym>» }
  rule TOP:sym<ら>  { «<.sym>» }
}

say G.parse('y'); # 「y」
say G.parse('ら'); # This is a hiragana letter, so it works.
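For contrast, here is a small sketch of what happens with a non-word character such as the ✓ from the loop above. The grammar names WithBoundary and WithoutBoundary are invented for this example; the only difference between them is the « » pair:

```raku
# Same literal, with and without the word boundary assertions.
grammar WithBoundary    { token TOP { « '✓' » } }
grammar WithoutBoundary { token TOP { '✓' } }

say WithBoundary.parse('✓');     # Nil: ✓ is not a word character, so « cannot match
say WithoutBoundary.parse('✓');  # 「✓」
```

This isn't meant as the one right fix, just a way to see that it is the boundary assertion, not the character itself, that makes the parse fail.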

Tags:

Raku