Match strings whose length is a fourth power

Another Solution

This is, in my opinion, one of the most interesting problems on the site. I need to thank deadcode for bumping it back up to the top.

^((^|xx)(^|\3\4\4)(^|\4x{12})(^x|\1))*$

39 bytes, without any conditionals or assertions... sort of. The alternations, as they're being used (^|), are a type of conditional in a way, to select between "first iteration," and "not first iteration."

This regex can be seen to work here: http://regex101.com/r/qA5pK3/1

Both PCRE and Python interpret the regex correctly, and it has also been tested in Perl up to n = 128, including n4-1, and n4+1.


Definitions

The general technique is the same as in the other solutions already posted: define a self-referencing expression which on each subsequent iteration matches a length equal to the next term of the forward difference function, Df, with an unlimited quantifier (*). A formal definition of the forward difference function:

Definition 1: forward difference function

Additionally, higher order difference functions may also be defined:

Definition 2: second forward difference function

Or, more generally:

Definition 3: kth forward difference function

The forward difference function has a lot of interesting properties; it is to sequences what the derivative is to continuous functions. For example, Df of an nth order polynomial will always be an n-1th order polynomial, and for any i, if Dfi = Dfi+1, then the function f is exponential, in much the same way that the derivative of ex is equal to itself. The simplest discrete function for which f = Df is 2n.


f(n) = n2

Before we examine the above solution, let's start with something a bit easier: a regex which matches strings whose lengths are a perfect square. Examining the forward difference function:

FDF: n^2

Meaning, the first iteration should match a string of length 1, the second a string of length 3, the third a string of length 5, etc., and in general, each iteration should match a string two longer than the previous. The corresponding regex follows almost directly from this statement:

^(^x|\1xx)*$

It can be seen that the first iteration will match only one x, and each subsequent iteration will match a string two longer than the previous, exactly as specified. This also implies an amazingly short perfect square test in perl:

(1x$_)=~/^(^1|11\1)*$/

This regex can be further generalized to match any n-gonal length:

Triangular numbers:
^(^x|\1x{1})*$

Square numbers:
^(^x|\1x{2})*$

Pentagonal numbers:
^(^x|\1x{3})*$

Hexagonal numbers:
^(^x|\1x{4})*$

etc.


f(n) = n3

Moving on to n3, once again examining the forward difference function:

FDF: n^3

It might not be immediately apparent how to implement this, so we examine the second difference function as well:

FDF^2: n^3

So, the forwards difference function does not increase by a constant, but rather a linear value. It's nice that the initial ('-1th') value of Df2 is zero, which saves an initialization on the second iteration. The resulting regex is the following:

^((^|\2x{6})(^x|\1))*$

The first iteration will match 1, as before, the second will match a string 6 longer (7), the third will match a string 12 longer (19), etc.


f(n) = n4

The forward difference function for n4:

FDF: n^4

The second forward difference function:

FDF^2: n^4

The third forward difference function:

FDF^3: n^4

Now that's ugly. The initial values for Df2 and Df3 are both non-zero, 2 and 12 respectively, which will need to be accounted for. You've probably figured out by now that the regex will follow this pattern:

^((^|\2\3{b})(^|\3x{a})(^x|\1))*$

Because the Df3 must match a length of 12 on the second iteration, a is necessarily 12. But because it increases by 24 each term, the next deeper nesting must use its previous value twice, implying b = 2. The final thing to do is initialize the Df2. Because Df2 influences Df directly, which is ultimately what we want to match, its value can be initialized by inserting the appropriate atom directly into the regex, in this case (^|xx). The final regex then becomes:

^((^|xx)(^|\3\4{2})(^|\4x{12})(^x|\1))*$

Higher Orders

A fifth order polynomial can be matched in with the following regex:
^((^|\2\3{c})(^|\3\4{b})(^|\4x{a})(^x|\1))*$

f(n) = n5 is a fairly easy excercise, as the initial values for both the second and fourth forward difference functions are zero:

^((^|\2\3)(^|\3\4{4})(^|\4x{30})(^x|\1))*$

For six order polynomials:
^((^|\2\3{d})(^|\3\4{c})(^|\4\5{b})(^|\5x{a})(^x|\1))*$

For seventh order polynomials:
^((^|\2\3{e})(^|\3\4{d})(^|\4\5{c})(^|\5\6{b})(^|\6x{a})(^x|\1))*$

etc.

Note that not all polynomials can be matched in exactly this way, if any of the necessary coefficients are non-integer. For example, n6 requires that a = 60, b = 8, and c = 3/2. This can be worked around, in this instance:

^((^|xx)(^|\3\6\7{2})(^|\4\5)(^|\5\6{2})(^|\6\7{6})(^|\7x{60})(^x|\1))*$

Here I've changed b to 6, and c to 2, which have the same product as the above stated values. It's important that the product doesn't change, as a·b·c·… controls the constant difference function, which for a sixth order polynomial is Df6. There are two initialization atoms present: one to initialize Df to 2, as with n4, and the other to initialize the fifth difference function to 360, while at the same time adding in the missing two from b.


This (ir)regular expression seems to work.

^((?(1)((?(2)\2((?(3)\3((?(4)\4x{24}|x{60}))|x{50}))|x{15}))|x))*$

This regex is compatible with PCRE, Perl, .NET flavors.

This basically follows a "difference tree" (not sure if there's a proper name for it), which tells the regex how many more x's to match for the next fourth power:

1     16    81    256   625   1296  2401 ...
   15    65    175   369   671   1105 ...
      50    110   194   302   434 ...
         60    84    108   132 ...
            24    24    24 ...  # the differences level out to 24 on the 4th iteration

\2, \3, \4 stores and updates the difference as shown on the 2nd, 3rd and 4th rows, respectively.

This construct can easily be extended for higher powers.

Certainly not an elegant solution, but it does work.


Here is a solution that does not use conditionals, forward-declared or nested backreferences, lookbehind, balancing groups, or regex recursion. It only uses lookahead and standard backreferences, which are very widely supported. I was inspired to operate under these limitations due to Regex Golf, which uses the ECMAScript regex engine.

The way this 50 byte regex works is conceptually rather simple, and completely different than all the other submitted solutions to this puzzle. It was surprising to discover that this kind of mathematical logic was expressible in a regex.

      \2                     \4  \5
^((?=(xx+?)\2+$)((?=\2+$)(?=(x+)(\4+)$)\5){4})*x?$

(Capture groups are labeled above the regex)

The regex can be generalized to any power simply by replacing the 4 in {4} with the desired power.

Try it online!

It works by repeatedly dividing away the smallest fourth power of a prime that the current value is divisible by. As such the quotient at each step is always a fourth power, iff the original value was a fourth power. A final quotient of 1 indicates that the original value was indeed a fourth power; this completes the match. Zero is also matched.

First it uses a lazy capture group \2 to capture the number's smallest factor larger than 1. As such, this factor is guaranteed to be prime. For example, with 1296 (6^4) it will initially capture \2 = 2.

Then, at the beginning of a loop that is repeated 4 times, it tests to see if the current number is divisible by \2, with (?=\2+$). The first time through this loop, this test is useless, but its purpose will become apparent later.

Next inside this inner loop, it uses the greedy capture group \4 to capture the number's largest factor smaller than itself: (?=(x+)(\4+)$). In effect this divides the number by its smallest prime factor, \2; for example, 1296 will initially be captured as \4 = 1296/2 = 648. Note that the division of the current number by \2 is implicit. While it is possible to explicitly divide the current number by a number contained in a capture group (which I only discovered four days after posting this answer), doing this would make for a slower and harder-to-understand regex, and it is not necessary, since a number's smallest factor larger than 1 will always match up with its largest factor smaller than itself (such that their product is equal to the number itself).

Since this kind of regex can only "eat away" from the string (making it smaller) by leaving a result at the end of the string, we need to "move" the result of the division to the end of the string. This is done by capturing the result of subtraction (the current number minus \4), into the capture group \5, and then, outside the lookahead, matching a portion of the beginning of the current number corresponding to \5. This leaves the remaining unprocessed string at the end matching \4 in length.

Now it loops back to the beginning of the inner loop, where it becomes apparent why there is a test for divisibility by the prime factor. We have just divided by the number's smallest prime factor; if the number is still divisible by that factor, it means the original number might be divisible by the fourth power of that factor. The first time this test is done it is useless, but the next 3 times, it determines if the result of implicitly dividing by \2 is still divisible by \2. If it is still divisible by \2 at the beginning of each iteration of the loop, then this proves that each iteration divided the number by \2.

In our example, with an input of 1296, this will loop through as follows:

\2 = 2
\4 = 1296/2 = 648
\4 = 648/2 = 324
\4 = 324/2 = 162
\4 = 162/2 = 81

Now the regex can loop back to the first step; this is what the final * does. In this example, 81 will become the new number; the next loop will go as follows:

\2 = 3
\4 = 81/3 = 27
\4 = 27/3 = 9
\4 = 9/3 = 3
\4 = 3/3 = 1

It will now loop back to the first step again, with 1 as the new number.

The number 1 cannot be divided by any prime, which would make it a non-match by (?=(xx+?)\2+$), so it exits the top-level loop (the one with * at the end). It now reaches the x?$. This can only match zero or one. The current number at this point will be 0 or 1 if and only if the original number was a perfect fourth power; if it is 0 at this point, it means that the top-level loop never matched anything, and if it is 1, it means the top-level loop divided a perfect fourth power down until it wasn't divisible by anything anymore (or it was 1 in the first place, meaning the top-level loop never matched anything).

It's also possible to solve this in 49 bytes by doing repeated explicit division (which is also generalized for all powers – replace the desired power minus one into the {3}), but this method is far, far slower, and an explanation of the algorithm it uses is beyond the scope of this Answer:

^((x+)((\2(x+))(?=(\4*)\2*$)\4*(?=\5$\6)){3})?x?$

Try it online!