Convert a word's characters into its ascii code list concisely in Raku

There are a couple of things we can do here to make it work.

First, let's tackle the @ascii variable. The @ sigil indicates a positional variable, but you assigned a single string to it. This creates a 1-element array ['abc...'], which will cause problems down the road. Depending on how general you need this to be, I'd recommend either creating the array directly:

my @ascii = <a b c d e f g h i j k l m n o p q r s t u v w x y z>;
my @ascii = 'a' .. 'z';
my @ascii = 'abcdefghijklmnopqrstuvwxyz'.comb;

or going ahead and handling the any part:

my $ascii-char = any <a b c d e f g h i j k l m n o p q r s t u v w x y z>;
my $ascii-char = any 'a' .. 'z';
my $ascii-char = 'abcdefghijklmnopqrstuvwxyz'.comb.any;

Here I've used the $ sigil because a junction created with any stands in for a single value, and so can be used wherever a single value is expected (which also makes our life easier). I'd personally call it $ascii, but I'm using a separate name to make the later examples easier to tell apart.
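
To make the difference concrete, here is a minimal sketch. The names @wrong and @right are made up for this illustration and do not come from the question:

```raku
# Hypothetical names, for illustration only.
my @wrong = 'abcdefghijklmnopqrstuvwxyz';   # one string, so a 1-element array
my @right = 'a' .. 'z';                     # 26 single-character elements

say @wrong.elems;            # 1
say @right.elems;            # 26

# `so` collapses the junction produced by the comparison into a plain Bool.
say so 'w' eq @right.any;    # True
say so 'w' eq @wrong.any;    # False
```

The last line is the "problem down the road": no single letter is ever equal to the whole alphabet string.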

Now we can handle the map function. Based on the two versions of ascii above, we can rewrite your map block as either of the following:

{ push @tmp, $_.ord if $_ eq @ascii.any  }
{ push @tmp, $_.ord if $_ eq $ascii-char }

Note that if you prefer to use ==, you can create the numeric values when you first build the ascii list, and then compare against $_.ord. Personally, I also like to name the mapped variable, e.g.:

{ push @tmp, $^char.ord if $^char eq @ascii.any  }
{ push @tmp, $^char.ord if $^char eq $ascii-char }

where $^foo replaces $_ (if you use more than one, they are bound to the arguments in alphabetical order of their names, as if to @_[0], @_[1], etc.).
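
A short sketch of both points, assuming nothing beyond core Raku. The names &pair-up and $ascii-ord are invented for the example:

```raku
# Placeholder variables are bound in alphabetical order of their names,
# not in order of appearance: $^a is the first argument, $^b the second.
my &pair-up = { $^b ~ $^a };          # hypothetical helper
say pair-up('x', 'y');                # yx

# Numeric variant: build a junction of code points and compare with ==.
my $ascii-ord = any 'a'.ord .. 'z'.ord;
my @tmp;
for "wall".comb -> $char {
    push @tmp, $char.ord if $char.ord == $ascii-ord;
}
say @tmp;                             # [119 97 108 108]
```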

But let's get to the more interesting question here: how can we do all of this without needing to predeclare @tmp? Obviously, that just requires creating the array from the map itself. You might think that would be tricky when a character has no ASCII value, but the fact that an if statement returns Empty (or ()) when it doesn't run makes life really easy:

my @tmp = map { $^char.ord if $^char eq $ascii-char }, "wall".comb;
my @tmp = map { $^char.ord if $^char eq @ascii.any  }, "wall".comb;

If we used "wáll", the list collected by map would be 119, Empty, 108, 108, which is automagically returned as 119, 108, 108. Consequently, @tmp is set to just 119, 108, 108.
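
Here is that case as a runnable sketch, together with a smaller illustration of the same Empty behaviour (the $no variable is only there for the demonstration):

```raku
my $ascii-char = any 'a' .. 'z';

# "á" fails the test, so the block returns Empty for it and map skips it.
my @tmp = map { $^char.ord if $^char eq $ascii-char }, "wáll".comb;
say @tmp;                       # [119 108 108]

# The same thing in isolation: a statement-modifier `if` that does not
# run contributes nothing to the surrounding list.
my $no = False;
say (1, (2 if $no), 3);         # (1 3)
```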

Yes, there is a much simpler way.

"wall".ords.grep('az'.ords.minmax);

Of course this relies on a to z being an unbroken sequence. This is because minmax creates a Range object based on the minimum and maximum value in the list.
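
A quick sketch of what is going on, with the Range stored in a variable ($lower is a name chosen for this example) so it can be inspected:

```raku
my $lower = 'az'.ords.minmax;   # minimum and maximum code point as a Range
say $lower;                     # 97..122

# grep smartmatches each code point against the Range.
say "wall".ords.grep($lower);   # (119 97 108 108)
say "wáll".ords.grep($lower);   # (119 108 108)
```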

If they weren't in an unbroken sequence you could use a junction.

"wall".ords.grep( 'az'.ords.minmax | 'AZ'.ords.minmax );

But you said that you want to match other languages. Which to me screams regex.

"wall".comb.grep( /^ <:Ll> & <:ascii> $/ ).map( *.ord )

This matches Lowercase Letters that are also in ASCII.

Actually we can make it even simpler. comb can take a regex which determines which characters it takes from the input.

"wall".comb( / <:Ll> & <:ascii> / ).map( *.ord )
# (119, 97, 108, 108)

"ΓΔαβγδε".comb( / <:Ll> & <:Greek> / ).map( *.ord )
# (945, 946, 947, 948, 949)
# Does not include Γ or Δ, as they are not lowercase
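
If the set of wanted characters is small and fixed, the same comb-with-a-regex idea also works with an explicit character class. This is a swapped-in alternative to the Unicode property match, shown only as a sketch:

```raku
# <[a..z]> is an enumerated character class covering only a through z.
my @codes = "wáll".comb( /<[a..z]>/ ).map( *.ord );
say @codes;                                  # [119 108 108]

# A single property on its own works too: keep only lowercase letters.
say "Wall-42".comb( /<:Ll>/ ).join;          # all
```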

Note that the <:ascii> match above only behaves as expected if the input has no combining accents.

"de\c[COMBINING ACUTE ACCENT]f".comb( / <:Ll> & <:ascii> / )
# ("d", "f")

The Combining Acute Accent combines with the e which composes to Latin Small Letter E With Acute. That composed character is not in ASCII so it is skipped.

It gets even weirder if there isn't a composed value for the character.

"f\c[COMBINING ACUTE ACCENT]".comb( / <:Ll> & <:ascii> / )
# ("f́",)

That is because the f is lowercase and in ASCII. The combining codepoint gets brought along for the ride though.
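
The two cases can be inspected directly. The sketch below assumes only core Str methods; $e and $f are names chosen for the illustration:

```raku
my $e = "e\c[COMBINING ACUTE ACCENT]";   # has a precomposed form, U+00E9
my $f = "f\c[COMBINING ACUTE ACCENT]";   # has no precomposed form

say $e.chars;                 # 1
say $e.ords;                  # (233)
say $e.NFD.list.join(' ');    # 101 769

say $f.chars;                 # 1
say $f.ords;                  # (102 769)
```

Both strings are a single character as far as .chars and .comb are concerned, but only the first collapses to a single code point.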

Basically, if your data has, or can have, combining accents, and they could break things, then you are better off dealing with it while it is still in binary form.

$buf.grep: {
    .uniprop() eq 'Ll' # Lowercase Letter
    && .uniprop('Block') eq 'Basic Latin' # ASCII
}

The above would also work for single character strings because .uniprop works on either integers representing a codepoint, or on the actual character.

"wall".comb.grep: {
    .uniprop() eq 'Ll' # Lowercase Letter
    && .uniprop('Block') eq 'Basic Latin' # ASCII
}

Note again that this would have the same issues with combining codepoints, since it works with strings.

You may also want to use .uniprop('Script') instead of .uniprop('Block') depending on what you want to do.
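
To show how the two properties differ, here is a small sketch over three sample characters of my own choosing. "é" is in the Latin script but outside the Basic Latin block:

```raku
# Collect general category, script, and an "is it Basic Latin?" flag.
my @rows = ('a', 'é', 'α').map: -> $ch {
    join ' ', $ch, $ch.uniprop, $ch.uniprop('Script'),
              $ch.uniprop('Block') eq 'Basic Latin'
};
.say for @rows;
# a Ll Latin True
# é Ll Latin False
# α Ll Greek False
```

So filtering on Script keeps accented Latin letters, while filtering on Block keeps only the ASCII ones.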

Here's a working approach using Raku's trans method (code snippets run in the Raku REPL):

> my @a = "wall".comb;
[w a l l]
> @a.trans('abcdefghijklmnopqrstuvwxyz' => ords('abcdefghijklmnopqrstuvwxyz') ).put;
119 97 108 108

Above, we handle an ASCII string. Below, I add the "é" character and show a two-step solution:

> my @a = "wallé".comb;
[w a l l é]
> my @b = @a.trans('abcdefghijklmnopqrstuvwxyz' => ords('abcdefghijklmnopqrstuvwxyz') );
[119 97 108 108 é]
> @b.trans("é" => ords("é")).put
119 97 108 108 233

Nota bene #1: Although all the code above works fine, when I tried shortening the alphabet to 'a'..'z' I ended up seeing erroneous return values...hence the use of the full 'abcdefghijklmnopqrstuvwxyz'.

Nota bene #2: One open question for me is how to suppress output when trans fails to recognize a character (e.g. how to keep "é" from ending up as the last element of @b in the second example above). I've tried adding the :delete argument to trans, but no luck.

EDITED: To remove unwanted characters, here's code using grep (à la @Brad Gilbert), followed by trans:

> my @a = "wallé".comb;
[w a l l é]
> @a.grep(('a'..'z').any).trans('abcdefghijklmnopqrstuvwxyz' => ords('abcdefghijklmnopqrstuvwxyz') ).put
119 97 108 108
