Is it possible to interpolate Array values in token?

The Unicode::Security module implements confusables by using the Unicode Consortium's tables. It doesn't actually use regular expressions; it just looks up the different characters in those tables.
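To give a rough idea of what a table lookup (as opposed to a regex) looks like, here is a minimal sketch. The %confusables hash and the is-confusable sub are made up for illustration and are not Unicode::Security's API; the real data would come from the Unicode tables.

```raku
# Hypothetical, hand-written excerpt; real data would come from the Unicode tables.
my %confusables =
    'o' => ['0', 'о'],    # DIGIT ZERO, CYRILLIC SMALL LETTER O
    'a' => ['а'],         # CYRILLIC SMALL LETTER A
    'l' => ['1', 'I'];

# True when $candidate is $source itself or is listed as one of its look-alikes.
sub is-confusable(Str $candidate, Str $source --> Bool) {
    # A missing key gives back Any, so fall back to an empty list.
    my @alternatives = (%confusables{$source} // ()).list;
    so $candidate eq $source || $candidate eq any(@alternatives)
}

say is-confusable('0', 'o');    # True
say is-confusable('x', 'o');    # False
```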


I'm not sure this is the best approach to use.

I haven't implemented a confusables[1] module in Intl:: yet, though I do plan on getting around to it eventually. Here are two different ways I could imagine a token looking.[2]

my token confusable($source) {
  :my $i = 0;                                    # create a counter var
  [
    <?{                                          # succeed only if
      my $a = self.orig.substr: self.pos+$i, 1;  #   the test character A
      my $b = $source.substr: $i++, 1;           #   the source character B and

      so $a eq $b                                #   are the same or
      || $a eq %*confusables{$b}.any;            #   the A is one of B's confusables
    }> 
    .                                            # because we succeeded, consume a char
  ] ** {$source.chars}                           # repeat for each grapheme in the source
}

Here I used the dynamic hash %*confusables, which would be populated in some way. How will depend on your module, and it may not even need to be dynamic: you could, for example, use the signature :($source, %confusables) or reference a module variable.
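The difference between the two ways of supplying the table can be sketched with plain subs. The sub names and the sample data below are hypothetical; only the lookup pattern matters.

```raku
# Option 1: a dynamic variable, looked up through the callers at run time.
sub alternatives-dynamic(Str $char) {
    (%*confusables{$char} // ()).list
}

# Option 2: the table is passed in explicitly as an argument.
sub alternatives-explicit(Str $char, %confusables) {
    (%confusables{$char} // ()).list
}

my %table = o => ['0', 'о'];    # hypothetical sample data

{
    # Visible to everything called from inside this block.
    my %*confusables = %table;
    say alternatives-dynamic('o').elems;     # 2
}

say alternatives-explicit('o', %table).elems;    # 2
```

The dynamic version keeps the token's signature short, at the cost of requiring the caller to have %*confusables set up before matching.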

You can then have your code work as follows:

say $foo ~~ /<confusable: 'foo'>/

This is probably the best way to go about things, as it will give you a lot more control. I took a peek at your module, and it's clear you want to enable 2-to-1 glyph relationships; eventually you'll probably want to run code directly over the characters.
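To make the 2-to-1 point concrete, here is a rough sketch of running code over the characters with a plain sub rather than a token. The table and the reads-as sub are hypothetical. It also takes the first option that fits and never backtracks, so treat it as a simplification rather than a full matcher.

```raku
# Hypothetical table: some look-alikes are two characters long.
my %confusables = m => ['rn'], w => ['vv'], o => ['0'];

# True when $text can be read as $source, letting each source character
# stand for itself or for one of its look-alikes, whatever their length.
sub reads-as(Str $text, Str $source --> Bool) {
    my $pos = 0;
    for $source.comb -> $char {
        my @options = $char, |(%confusables{$char} // Empty);
        # Take the first option found at the current position (no backtracking).
        my $hit = @options.first: { $text.substr($pos).starts-with($_) };
        return False without $hit;
        $pos += $hit.chars;    # a 2-char look-alike consumes 2 chars
    }
    $pos == $text.chars        # nothing may be left over
}

say reads-as('rnodern', 'modern');    # True
say reads-as('modem',   'modern');    # False
```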

If you are okay with just 1-to-1 relationships, you can go with a much simpler token:

my token confusable($source) {
  :my @chars = $source.comb;            # split the source into characters
  @(                                    # match the array built from
     |(                                 #   a slip of
        %*confusables{@chars.head}      #     the current char's confusables
        // Empty                        #     (or nothing, if none)
     ),                                 #
     @chars.shift                       #   and the char itself
   )                                    #
   ** {$source.chars}                   # repeating for each source char
}

The @(…) structure lets you effectively create an ad hoc array to be interpolated. In this case, we just slip in the confusables along with the original character, and that's that. You have to be careful, though, because a non-existent hash item returns the type object (Any), and that messes things up here (hence the // Empty).
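The effect of // Empty is easiest to see outside of a regex. This is only an illustration with made-up sample data: it builds the same kind of list once for a known key and once for an unknown key, then interpolates the result as an ordinary array.

```raku
my %confusables = o => ['0', 'о'];    # hypothetical sample data

# Known key: the look-alikes are slipped in next to the character itself.
my @known   = |(%confusables<o> // Empty), 'o';
# Unknown key: the lookup yields Any, which // replaces with an empty slip.
my @unknown = |(%confusables<x> // Empty), 'x';

say @known.elems;      # 3
say @unknown.elems;    # 1

# An array interpolated into a regex matches any one of its elements.
say so 'f0o' ~~ / ^ f @known o $ /;    # True
say so 'fxo' ~~ / ^ f @known o $ /;    # False
```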

In either case, you'll want to use arguments with your token, as constructing regexes on the fly is fraught with potential gotchas and interpolation errors.


[1] Unicode calls homographs both "visually similar characters" and "confusables".

[2] The dynamic hash %*confusables here could be populated in any number of ways, and need not be dynamic at all: it could be passed in via the arguments (using a signature like :($source, %confusables)) or replaced by a reference to a module variable.

Tags:

Raku