Building regex from another one

After a crash course on l3regex I managed to pull off two (rather stupid) ways to do what you ask. I'll post them as separate answers because they are completely different both at implementation and usage levels.

Disclaimer: This answer uses internal code of l3regex, which definitely should not be used in a document: it might break as soon as something changes in the code of l3regex. Please do not use this code unless you really know what you are doing.

Inserting a compiled regex into another regex:

This one goes sort of along the lines of what you found in the documentation:

Provide a syntax such as \ur{l_my_regex} to use an already-compiled regex in a more complicated regex. This makes regexes more easily composable.

You first need to compile a regex using \regex_const:Nn or \regex_set:Nn or something like that, and then use the compiled regex inside the new regular expression:

\regex_const:Nn \c_bar_regex { [a-z]+ } % Compile \c_bar_regex
\regex_const:Nn \c_foo_regex { (\w+)( \[ \y{c_bar_regex} \] ) } % Insert it with \y{...}
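
For comparison, here is a minimal sketch of the compile-once, use-later step with documented l3regex functions only (no \y involved). The names \c_demo_word_regex and \l_demo_word_regex are made up for this illustration:

\documentclass{article}
\usepackage{expl3}
\begin{document}
\ExplSyntaxOn
% Hypothetical names, for illustration only.
% \regex_const:Nn creates the variable and compiles the regex into it once.
\regex_const:Nn \c_demo_word_regex { [a-z]+ }
% \regex_set:Nn compiles into an existing variable, which can be set again later.
\regex_new:N \l_demo_word_regex
\regex_set:Nn \l_demo_word_regex { [a-z]+ }
% Both variables hold a compiled regex and are used through the N-type functions.
\regex_match:NnTF \c_demo_word_regex { abc }
  { \iow_term:n { constant:~match } }
  { \iow_term:n { constant:~no~match } }
\regex_match:NnTF \l_demo_word_regex { 123 }
  { \iow_term:n { variable:~match } }
  { \iow_term:n { variable:~no~match } }
\ExplSyntaxOff
\end{document}

Either kind of variable should be usable as the argument of \y, since \y only fetches what the variable already contains.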

I used \y instead of \ur because it doesn't require too many changes in the l3regex code: I just needed to define a \__regex_compile_/y: function. It is the "cleanest" method because it doesn't require changing any of the internals of l3regex (though it uses a handful of them).

I defined an escape sequence \y{<regex var>}, much like \u{<tl var>} (in fact, most of the code is a copy with u replaced by y). The difference is that \u expands \<tl var> and inserts the literal tokens inside it to be matched (exactly as in the question), while \y fetches the contents of \<regex var> and injects them into the regular expression currently being compiled.

This method seems the "cleverest" one, but it has major limitations. Since each regex is compiled in advance, each one needs to be a complete regular expression, otherwise either the compilation will fail or the regex will mean something different. For instance, take the regular expression [a-z]+ and split it into two sub-expressions: A=a-z, and B=[A]+. The complete expression matches the range [a,z], repeated 1 or more times, greedy. The sub-expression a-z, however, matches the three literal characters a-z, since - doesn't have the "range" meaning outside of [...]. Once you put A in B, B will match a, -, or z, repeated 1 or more times, greedy.

The usage of \y inside [...] can be disabled to prevent this type of problem, just un-comment the commented lines in \__regex_compile_/y:.
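
To see the difference between a-z outside and inside [...] for yourself, here is a small sketch that only uses documented functions (\regex_show:n and \regex_match:nnTF), so it doesn't depend on the code of this answer. The messages written to the terminal are just labels I made up:

\documentclass{article}
\usepackage{expl3}
\begin{document}
\ExplSyntaxOn
% Outside [...]: compiled as three separate characters, as described above.
\regex_show:n { a-z }
% Inside [...]: compiled as one range.
\regex_show:n { [a-z]+ }
% The two can also be told apart by trying to match the single letter m.
\regex_match:nnTF { a-z } { m }
  { \iow_term:n { a-z~matches~m } }
  { \iow_term:n { a-z~does~not~match~m } }
\regex_match:nnTF { [a-z]+ } { m }
  { \iow_term:n { [a-z]+~matches~m } }
  { \iow_term:n { [a-z]+~does~not~match~m } }
\ExplSyntaxOff
\end{document}

Comparing the two \regex_show:n outputs is also a quick way to check whether a fragment means the same thing on its own as it does inside the full expression.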


Reply to Edit 2: The initial version of the code would break in more complicated cases, as you noted. I stand by my opinion that I prefer the approach in the other answer, since it requires no knowledge of the inner workings of l3regex (which is about my level of knowledge :-), while this one messes with the internal structure of a compiled regex.

When l3regex compiles a regex of the form (a|b), the underlying regex is something like branch { group { a-branch b-branch } }. However, when you add a token list with \u (the code on which \y is built), the engine adds another branch inside the group: branch { group { branch { whatever-is-in-\u{...} } } }. This additional branch lives in an upper layer of the code, which can't be changed just by adding an escape sequence as this code does. Luckily, l3regex doesn't seem to mind the added branch when matching the regex, so the output of:

% result of compiling "(a|b|c)"
branch { group { a-branch b-branch c-branch } }

and

% result of compiling "(\y{l_tmpa_tl})" with \l_tmpa_tl=a|b|c
branch { group { branch { a-branch b-branch c-branch } } }

seem to be the same (with a rather limited set of tests). So a small tweak to the previous code to take care of multiple branches seems to do the trick. If the code refuses to work with the added branch, then a deeper mess-up of l3regex's code will probably be necessary to fix it. Let me know if something doesn't work right.

Here's the code:

\documentclass{article}
\usepackage{expl3}
\ExplSyntaxOn
\cs_new_protected:cpn { __regex_compile_/y: } #1#2
  {
    % Disable "dangerous" usage in [...]
    % \__regex_if_in_class_or_catcode:TF
    %   { \__regex_compile_raw_error:N y #1 #2 }
    %   {
        \__regex_two_if_eq:NNNNTF #1 #2 \__regex_compile_special:N \c_left_brace_str
          {
            \tl_set:Nx \l__regex_internal_a_tl { \if_false: } \fi:
            \__regex_compile_y_loop:NN
          }
          {
            \__kernel_msg_error:nn { kernel } { u-missing-lbrace }
            \__regex_compile_raw:N y #1 #2
          }
      % }
  }
\cs_new:Npn \__regex_compile_y_loop:NN #1#2
  {
    \token_if_eq_meaning:NNTF #1 \__regex_compile_raw:N
      { #2 \__regex_compile_y_loop:NN }
      {
        \token_if_eq_meaning:NNTF #1 \__regex_compile_special:N
          {
            \exp_after:wN \token_if_eq_charcode:NNTF \c_right_brace_str #2
              { \if_false: { \fi: } \__regex_compile_y_end: }
              { #2 \__regex_compile_y_loop:NN }
          }
          {
            \if_false: { \fi: }
            \__kernel_msg_error:nnx { kernel } { u-missing-rbrace } {#2}
            \__regex_compile_y_end:
            #1 #2
          }
      }
  }
\cs_new_protected:Npn \__regex_compile_y_end:
  {
    \tl_set:Nv \l__regex_internal_a_tl { \l__regex_internal_a_tl }
    \exp_args:NV \__regex_analyse_y:n \l__regex_internal_a_tl
  }
\cs_new_protected:Npn \__regex_analyse_y:n #1
  {
    \tl_if_head_is_N_type:nTF {#1}
      {
        \reverse_if:N \if_meaning:w \__regex_branch:n #1
          \msg_error:nn { siracusa / regex } { unknown-condition }
        \else:
          \exp_args:No \__regex_analyse_y_aux:n { \use_none:n #1 }
        \fi:
      }
      { \msg_error:nn { siracusa / regex } { unknown-condition } }
  }
\cs_new_protected:Npn \__regex_analyse_y_aux:n #1
  {
    \tl_if_empty:oTF { \use_none:n #1 }
      { \exp_args:NNo \tl_build_put_right:Nn \l__regex_build_tl { \use:n #1 } }
      { \tl_build_put_right:Nn \l__regex_build_tl { \__regex_branch:n #1 } }
  }
\msg_new:nnn { siracusa / regex } { unknown-condition }
  { Unknown/unimplemented~condition~in~code. }
\ExplSyntaxOff
\begin{document}
\ExplSyntaxOn
\regex_const:Nn \c_bar_regex { [a-z]+ }
\regex_const:Nn \c_foo_regex { (\w+)( \[ \y{c_bar_regex} \] ) }
% \regex_const:Nn \c_foo_regex { (\w+)( \[ [a-z]+ \] ) }
\regex_show:N \c_foo_regex

\seq_new:N \l_foo_seq
\regex_extract_all:NnN \c_foo_regex { a[x], b[yy], c[zzz] } \l_foo_seq
\seq_show:N \l_foo_seq
\ExplSyntaxOff
\end{document}

Bonus, for thee who disliketh the output of \regex_show

You need to patch \__regex_show:N:

\ExplSyntaxOn
\cs_new:Npn \__regex_show_if_visible_ascii:n #1
  {
    \if_int_compare:w
      \if_int_compare:w \int_eval:n{#1}>31  ~ 1 \else: 0 \fi:
      \if_int_compare:w \int_eval:n{#1}<127 ~ 1 \else: 0 \fi:
      = 11 \exp_stop_f:
      \c_space_tl (\char_generate:nn {#1} {12})
    \fi:
  }
\cs_set_protected:Npn \__regex_show:N #1
  {
    \group_begin:
      \tl_build_begin:N \l__regex_build_tl
      \cs_set_protected:Npn \__regex_branch:n
        {
          \seq_pop_right:NN \l__regex_show_prefix_seq
            \l__regex_internal_a_tl
          \__regex_show_one:n { +-branch }
          \seq_put_right:No \l__regex_show_prefix_seq
            \l__regex_internal_a_tl
          \use:n
        }
      \cs_set_protected:Npn \__regex_group:nnnN
        { \__regex_show_group_aux:nnnnN { } }
      \cs_set_protected:Npn \__regex_group_no_capture:nnnN
        { \__regex_show_group_aux:nnnnN { ~(no~capture) } }
      \cs_set_protected:Npn \__regex_group_resetting:nnnN
        { \__regex_show_group_aux:nnnnN { ~(resetting) } }
      \cs_set_eq:NN \__regex_class:NnnnN \__regex_show_class:NnnnN
      \cs_set_protected:Npn \__regex_command_K:
        { \__regex_show_one:n { reset~match~start~(\iow_char:N\\K) } }
      \cs_set_protected:Npn \__regex_assertion:Nn ##1##2
        {
          \__regex_show_one:n
            { \bool_if:NF ##1 { negative~ } assertion:~##2 }
        }
      \cs_set:Npn \__regex_b_test: { word~boundary }
      \cs_set_eq:NN \__regex_anchor:N \__regex_show_anchor_to_str:N
      \cs_set_protected:Npn \__regex_item_caseful_equal:n ##1
        {
          \__regex_show_one:n
            {
              char~code~\int_eval:n{##1}
              \__regex_show_if_visible_ascii:n {##1} % <-- Added
            }
        }
      \cs_set_protected:Npn \__regex_item_caseful_range:nn ##1##2
        {
          \__regex_show_one:n
            {
              range~[
                \int_eval:n{##1} \__regex_show_if_visible_ascii:n {##1}, % <-- Added
                \int_eval:n{##2} \__regex_show_if_visible_ascii:n {##2}  % <-- Added
              ]
            }
        }
      \cs_set_protected:Npn \__regex_item_caseless_equal:n ##1
        {
          \__regex_show_one:n
            {
              char~code~\int_eval:n{##1}
              \__regex_show_if_visible_ascii:n {##1}~(caseless) % <-- Added
            }
        }
      \cs_set_protected:Npn \__regex_item_caseless_range:nn ##1##2
        {
          \__regex_show_one:n
            {
              Range~[
                \int_eval:n{##1} \__regex_show_if_visible_ascii:n {##1}, % <-- Added
                \int_eval:n{##2} \__regex_show_if_visible_ascii:n {##2}  % <-- Added
              ]~(caseless)
            }
        }
      \cs_set_protected:Npn \__regex_item_catcode:nT
        { \__regex_show_item_catcode:NnT \c_true_bool }
      \cs_set_protected:Npn \__regex_item_catcode_reverse:nT
        { \__regex_show_item_catcode:NnT \c_false_bool }
      \cs_set_protected:Npn \__regex_item_reverse:n
        { \__regex_show_scope:nn { Reversed~match } }
      \cs_set_protected:Npn \__regex_item_exact:nn ##1##2
        {
          \__regex_show_one:n
            {
              char~##2
              \__regex_show_if_visible_ascii:n {##1} % <-- Added
              ,~catcode~##1
            }
        }
      \cs_set_eq:NN \__regex_item_exact_cs:n \__regex_show_item_exact_cs:n
      \cs_set_protected:Npn \__regex_item_cs:n
        { \__regex_show_scope:nn { control~sequence } }
      \cs_set:cpn { __regex_prop_.: } { \__regex_show_one:n { any~token } }
      \seq_clear:N \l__regex_show_prefix_seq
      \__regex_show_push:n { ~ }
      \cs_if_exist_use:N #1
      \tl_build_end:N \l__regex_build_tl
      \exp_args:NNNo
    \group_end:
    \tl_set:Nn \l__regex_internal_a_tl { \l__regex_build_tl }
  }
\ExplSyntaxOff

then \regex_show:n { (\w+)( \[ [a-z]+ \] ) } will print:

+-branch
  ,-group begin
  | Match, repeated 1 or more times, greedy
  |   range [97 (a),122 (z)]
  |   range [65 (A),90 (Z)]
  |   range [48 (0),57 (9)]
  |   char code 95 (_)
  `-group end
  ,-group begin
  | char code 91 ([)
  | range [97 (a),122 (z)], repeated 1 or more times, greedy
  | char code 93 (])
  `-group end.

Inserting a token list to be compiled as a regex:

This one steps into the compilation of a regex right after the regular expression is tokenized but before the actual compilation takes place, replacing all \y{<tl var>} by the contents of \<tl var>. It's a really nasty brute-forcing of a token list into a regular expression, but it apparently works fine.

With this method you can insert any token list to be compiled as a regular expression using \y{<tl var>}:

\tl_const:Nn \c_foo_tl { a-z }
\tl_const:Nn \c_bar_tl { [\y{c_foo_tl}]+ }
\regex_const:Nn \c_foo_regex { (\w+)( \[ \y{c_bar_tl} \] ) }

In the example above the token list \c_bar_tl is inserted into the regular expression. \c_bar_tl itself contains a nested \y{c_foo_tl}. I offer two different macros, from which you can choose whether to allow recursion or not. If you don't allow recursion, the code above matches the literal string y{c_foo_tl}. If you allow recursion, then \y{c_foo_tl} expands and inserts a-z into the expression, which is then properly compiled. Mind you that with recursion enabled, infinite recursion is also possible if your token lists refer to each other in a loop.

I used \y instead of \ur in this one just to keep the same character as in the other answer, but it could easily be \ur or some other string.

I injected a macro into \__regex_escape_use:nnnn which, right before the regular expression is compiled, replaces every \y{<tl var>} by the contents of \<tl var>. This requires adding one line to \__regex_escape_use:nnnn, which is where the code starts. The function then performs the replacements (recursively or not) until it reaches the end of the expression, at which point it hands control back to the regular expression engine to do its job normally.

Because this expands the escape sequence before the regular expression is compiled, there are no restrictions on the contents of each \<tl var>, as long as the final expression, after all \y have been expanded, is a valid regular expression. This allows you to define one token list c_foo_tl with [a- and another with \y{c_foo_tl}z]+, and it would expand to the expected [a-z]+.
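
If you would rather not touch internals at all, the same "paste the text first, compile afterwards" idea can be imitated with documented functions by assembling the pattern in a token list and handing its value to \regex_set:Nn. To be clear, this is not the \y mechanism of this answer, only a sketch of its effect, and all the demo names are made up for the illustration:

\documentclass{article}
\usepackage{expl3}
\begin{document}
\ExplSyntaxOn
% Hypothetical names, for illustration only.
% A fragment that is not a valid regular expression by itself.
\tl_const:Nn \c_demo_head_tl { [a- }
% Complete the character class; \exp_not:V inserts the fragment unexpanded.
\tl_const:Nx \c_demo_class_tl { \exp_not:V \c_demo_head_tl z]+ }
% Assemble the full pattern as plain tokens; \exp_not:n protects \w, \[ and \].
\tl_new:N \l_demo_pattern_tl
\tl_set:Nx \l_demo_pattern_tl
  {
    \exp_not:n { (\w+)( \[ }
    \exp_not:V \c_demo_class_tl
    \exp_not:n { \] ) }
  }
% Compile only now, once the pattern text is complete.
\regex_new:N \l_demo_regex
\exp_args:NNV \regex_set:Nn \l_demo_regex \l_demo_pattern_tl
\regex_show:N \l_demo_regex
\ExplSyntaxOff
\end{document}

The price is that the pieces have to be glued together by hand outside the regex, instead of being written inline as \y{...}.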

The function which controls the behaviour after the expansion of an escape sequence \y is \__regex_prescan_yank_eval_continue:n. Below there are two different versions to choose from, recursive and non-recursive, as discussed earlier.

Here's the code:

\documentclass{article}
\usepackage{expl3}
\ExplSyntaxOn
\cs_gset_protected:Npn \__regex_escape_use:nnnn #1#2#3#4
  {
    \group_begin:
      \tl_clear:N \l__regex_internal_a_tl
      \cs_set:Npn \__regex_escape_unescaped:N ##1 { #1 }
      \cs_set:Npn \__regex_escape_escaped:N ##1 { #2 }
      \cs_set:Npn \__regex_escape_raw:N ##1 { #3 }
      \__regex_standard_escapechar:
      \tl_gset:Nx \g__regex_internal_tl
        { \__kernel_str_to_other_fast:n {#4} }
      \__regex_prescan_yank:N \g__regex_internal_tl % <-- Added
      \tl_put_right:Nx \l__regex_internal_a_tl
        {
          \exp_after:wN \__regex_escape_loop:N \g__regex_internal_tl
          { break } \prg_break_point:
        }
      \exp_after:wN
    \group_end:
    \l__regex_internal_a_tl
  }
\cs_new_protected:Npn \__regex_prescan_yank:N #1
  { \tl_set:Nx #1 { \exp_args:NV \__regex_prescan_yank:n #1 } }
\cs_set:Npn \__regex_tmp:w #1#2
  {
    \cs_new:Npn \__regex_prescan_yank:n ##1
      { \__regex_prescan_yank:w ##1 #1 \q_nil #2 \q_stop }
    \cs_new:Npn \__regex_prescan_yank:w ##1 #1 ##2 #2
      {
        ##1
        \quark_if_nil:nT {##2}
          { \use_none_delimit_by_q_stop:w }
        \__regex_prescan_yank_eval_continue:n {##2}
      }
  }
\exp_args:Nxx \__regex_tmp:w
  { \__kernel_str_to_other_fast:n { \y } \c_left_brace_str }
  { \c_right_brace_str }
% \cs_new:Npn \__regex_prescan_yank_eval_continue:n #1
%   { % Non-recursive
%     \exp_args:Nv \__kernel_str_to_other_fast:n { #1 }
%     \__regex_prescan_yank:w
%   }
\cs_new:Npn \__regex_prescan_yank_eval_continue:n #1
  { % Recursive
    \exp_last_unbraced:Nf \__regex_prescan_yank:w
    \exp_args:Nv \__kernel_str_to_other_fast:n { #1 }
  }
\ExplSyntaxOff
\begin{document}
\ExplSyntaxOn
\tl_const:Nn \c_foo_tl { a-z }
\tl_const:Nn \c_bar_tl { [\y{c_foo_tl}]+ }
\regex_const:Nn \c_foo_regex { (\w+)( \[ \y{c_bar_tl} \] ) }
% \regex_const:Nn \c_foo_regex { (\w+)( \[ [a-z]+ \] ) }
\regex_show:N \c_foo_regex

\seq_new:N \l_foo_seq
\regex_extract_all:NnN \c_foo_regex { a[x], b[yy], c[zzz] } \l_foo_seq
\seq_show:N \l_foo_seq
\ExplSyntaxOff
\end{document}
