How to determine if a number is a prime with regex?

I will explain the regex part outside of primality testing: the following regex, given a String s which consists of repeating String t, finds t.

    System.out.println(
        "MamamiaMamamiaMamamia".replaceAll("^(.*)\\1+$", "$1")
    ); // prints "Mamamia"

The way it works is that the regex captures (.*) into \1, and then sees if there's \1+ following it. Using the ^ and $ ensures that a match must be of the whole string.

So, in a way, we're given String s, which is a "multiple" of String t, and the regex will find such t (the longest possible, since the (.*) captured into \1 is greedy).
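To make the "longest possible t" point concrete, here is a small sketch. The class and method names are made up for illustration, and the reluctant variant `(.+?)` is my own addition for contrast; it is not part of the regex above.

```java
final class RepeatedUnitDemo {
    // Greedy group: the longest t such that s is t repeated two or more times.
    static String longestUnit(String s) {
        return s.replaceAll("^(.*)\\1+$", "$1");
    }

    // Reluctant group (illustrative variation): the shortest non-empty such t.
    static String shortestUnit(String s) {
        return s.replaceAll("^(.+?)\\1+$", "$1");
    }

    public static void main(String[] args) {
        System.out.println(longestUnit("aaaa"));   // prints "aa"
        System.out.println(shortestUnit("aaaa"));  // prints "a"
        System.out.println(longestUnit("abcabc")); // prints "abc"
        System.out.println(longestUnit("abcd"));   // no match, so prints "abcd" unchanged
    }
}
```

When the String is not a repetition of anything, the regex simply does not match and replaceAll returns the input untouched.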

Once you understand why this regex works, then (ignoring the first alternative in OP's regex for now) explaining how it's used for primality testing is simple.

  • To test primality of n, first generate a String of length n (filled with the same char)
  • The regex captures a String of some length (say k) into \1, and tries to match \1+ to the rest of the String
    • If there is a match, then n is a proper multiple of k; with k >= 2 (as OP's regex requires), n is therefore not prime.
    • If there's no match, then no such k exists that divides n, and n is therefore a prime
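The steps above can be mirrored with plain arithmetic. The following is only a sketch of what the backtracking amounts to, not a description of how the regex engine is implemented; the class and method names are hypothetical. It uses just the second alternative of OP's regex, since the first one is being ignored here.

```java
final class DivisorSearchDemo {
    // Arithmetic version of the search: is there a block length k >= 2
    // such that two or more copies of the block exactly fill length n?
    static boolean hasRepeatedBlock(int n) {
        for (int k = 2; 2 * k <= n; k++) {
            if (n % k == 0) {
                return true; // n = k * (n / k) with both factors >= 2
            }
        }
        return false;
    }

    // The same question, asked through the regex on a String of length n.
    static boolean regexFindsBlock(int n) {
        return new String(new char[n]).matches("(..+?)\\1+");
    }

    public static void main(String[] args) {
        for (int n = 0; n <= 100; n++) {
            if (hasRepeatedBlock(n) != regexFindsBlock(n)) {
                throw new AssertionError("mismatch at n=" + n);
            }
        }
        System.out.println("loop and regex agree for n = 0..100");
    }
}
```

Note that both versions answer "no" for n = 0 and n = 1, which is why the full regex needs the extra first alternative.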

How does .?|(..+?)\1+ match prime numbers?

Actually, it doesn't! It matches a String whose length is NOT prime!

  • .? : The first part of the alternation matches a String of length 0 or 1 (NOT prime by definition)
  • (..+?)\1+ : The second part of the alternation, a variation of the regex explained above, matches a String of length n that is "a multiple" of a String of length k >= 2 (i.e. n is a composite, NOT a prime).
    • Note that the reluctant modifier ? is actually not needed for correctness, but it may help speed up the process by trying smaller k first
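A quick way to convince yourself of the note about the reluctant modifier: compare the regex with and without the ?. This is a sketch; the greedy pattern and all names below are mine, introduced only for the comparison.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GreedyVsReluctantDemo {
    static final Pattern RELUCTANT = Pattern.compile(".?|(..+?)\\1+");
    static final Pattern GREEDY = Pattern.compile(".?|(..+)\\1+"); // same regex without the ?

    // Length k of the block captured by group 1, or -1 when the second
    // alternative did not produce the match.
    static int capturedLength(Pattern p, int n) {
        Matcher m = p.matcher(new String(new char[n]));
        return (m.matches() && m.group(1) != null) ? m.group(1).length() : -1;
    }

    public static void main(String[] args) {
        // Both variants give the same match / no-match verdict.
        for (int n = 0; n <= 200; n++) {
            String s = new String(new char[n]);
            if (RELUCTANT.matcher(s).matches() != GREEDY.matcher(s).matches()) {
                throw new AssertionError("verdicts differ at n=" + n);
            }
        }
        // They differ only in which k is found first.
        System.out.println(capturedLength(RELUCTANT, 15)); // prints 3
        System.out.println(capturedLength(GREEDY, 15));    // prints 5
    }
}
```

For n = 15 the reluctant version stops at the smallest divisor (3), while the greedy one works its way down from long blocks and stops at 5.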

Note the ! boolean complement operator in the return statement: it negates the result of matches. n is prime exactly when the regex DOESN'T match! It's double-negative logic, so no wonder it's kind of confusing!


Simplification

Here's a simple rewriting of the code to make it more readable:

public static boolean isPrime(int n) {
    String lengthN = new String(new char[n]);
    boolean isNotPrimeN = lengthN.matches(".?|(..+?)\\1+");
    return !isNotPrimeN;
}

The above is essentially the same as the original Java code, but broken apart into multiple statements with assignments to local variables to make the logic easier to understand.
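As a usage sketch, the method can be wrapped in a small class to list primes. The wrapper class and the primesUpTo helper are hypothetical names added for the demo; isPrime is the rewritten method from above.

```java
import java.util.ArrayList;
import java.util.List;

final class RegexPrimeListing {
    static boolean isPrime(int n) {
        String lengthN = new String(new char[n]);
        boolean isNotPrimeN = lengthN.matches(".?|(..+?)\\1+");
        return !isNotPrimeN;
    }

    // Collects every n in 0..limit for which the regex does NOT match.
    static List<Integer> primesUpTo(int limit) {
        List<Integer> primes = new ArrayList<>();
        for (int n = 0; n <= limit; n++) {
            if (isPrime(n)) {
                primes.add(n);
            }
        }
        return primes;
    }

    public static void main(String[] args) {
        System.out.println(primesUpTo(30)); // prints [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
    }
}
```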

We can also simplify the regex, using explicit repetition counts, as follows:

boolean isNotPrimeN = lengthN.matches(".{0,1}|(.{2,})\\1+");

Again, given a String of length n, filled with the same char,

  • .{0,1} checks if n = 0,1, NOT prime
  • (.{2,})\1+ checks if n is a proper multiple of k >= 2, NOT prime

With the exception of the reluctant modifier ? on the + inside group 1 (omitted for clarity), the above regex is equivalent to the original.
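If you'd rather check that claim than take it on faith, a brute-force comparison over a range of lengths is enough for a sanity check (a sketch with made-up names; it tests a finite range, it doesn't prove equivalence):

```java
final class SimplifiedRegexCheck {
    static final String ORIGINAL = ".?|(..+?)\\1+";
    static final String SIMPLIFIED = ".{0,1}|(.{2,})\\1+";

    // True when both regexes give the same verdict for a String of length n.
    static boolean sameVerdict(int n) {
        String s = new String(new char[n]);
        return s.matches(ORIGINAL) == s.matches(SIMPLIFIED);
    }

    public static void main(String[] args) {
        for (int n = 0; n <= 300; n++) {
            if (!sameVerdict(n)) {
                throw new AssertionError("regexes disagree at n=" + n);
            }
        }
        System.out.println("both regexes agree for n = 0..300");
    }
}
```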


More fun regex

The following regex uses a similar technique; it should be educational:

System.out.println(
    "OhMyGod=MyMyMyOhGodOhGodOhGod"
        .replaceAll("^(.+)(.+)(.+)=(\\1|\\2|\\3)+$", "$1! $2! $3!")
); // prints "Oh! My! God!"

See also

  • Regular expressions: who's greedier

/^1?$|^(11+?)\1+$/

Apply to numbers after conversion to base 1 (1=1, 2=11, 3=111, ...). Non-primes will match this. If it doesn't match, it is prime.
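In Java this could look roughly like the sketch below. The /.../ delimiters are Perl/JavaScript notation and are dropped here; the class and helper names are hypothetical. Because the pattern carries its own ^ and $ anchors, find() is sufficient.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

final class UnaryPrimeDemo {
    static final Pattern NON_PRIME = Pattern.compile("^1?$|^(11+?)\\1+$");

    // Base-1 representation: n copies of the character '1'.
    static String toUnary(int n) {
        char[] ones = new char[n];
        Arrays.fill(ones, '1');
        return new String(ones);
    }

    // Prime exactly when the non-prime pattern does not match.
    static boolean isPrime(int n) {
        return !NON_PRIME.matcher(toUnary(n)).find();
    }

    public static void main(String[] args) {
        System.out.println(toUnary(3)); // prints "111"
        System.out.println(isPrime(7)); // prints "true"
        System.out.println(isPrime(9)); // prints "false"
    }
}
```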

Explanation here.


You said you understand this part, but just to emphasize, the String generated has a length equal to the number supplied. So the string has three characters if and only if n == 3.

.?

The first part of the regex says, "any character, zero or one times". So it asks whether the string has zero or one characters or, per what I mentioned above, whether n == 0 || n == 1. If we have the match, then return the negation of that. This corresponds with the fact that zero and one are NOT prime.

(..+?)\\1+

The second part of the regex is a little trickier, relying on groups and backreferences. A group is anything in parentheses, which will then be captured and stored by the regex engine for later use. A backreference is a matched group that is used later on in the same regex.

The group captures 1 character, then 1 or more of any character. (The + character means one or more, but ONLY of the previous character or group. So this is not "two or four or six etc. characters", but rather "two or three etc.") The +? is like +, but it tries to match as few characters as possible. A plain + first tries to gobble the whole string and then backtracks to shorter captures until the backreference part can match; the outcome is the same either way, but trying short groups first usually finds a divisor sooner.

The next part is the backreference: That same set of characters (two or more), appearing again. Said backreference appears one or more times.

So. The captured group corresponds to some number of characters k (from 2 onward). Counting the capture itself plus its repetitions, the group then appears some number of times m (also from 2 onward). If there IS a match, then n = k * m with both factors greater than or equal to 2... meaning you have a composite n. So again, return the negation of the successful match: n is NOT prime.

If no match can be found, then you can't write n as a product of two natural numbers greater than or equal to 2... and you have both a non-match and a prime, hence again the returning of the negation of the match result.

Do you see it now? It's unbelievably tricky (and computationally expensive!) but it's also kind of simple at the same time, once you get it. :-)

I can elaborate if you have further questions, like on how regex parsing actually works. But I'm trying to keep this answer simple for now (or as simple as it can ever be).


Nice regex trick (though very inefficient)... :)

The regex defines non-primes as follows:

N is not prime if and only if N<=1 OR N is divisible by some K with 1<K<N.

Instead of passing the usual decimal representation of N to the regex engine, it is fed with a sequence of length N, composed of a repeating character. The first part of the disjunction checks for N=0 or N=1, and the second one looks for a divisor K>1, using backreferences. It forces the regex engine to find some sub-sequence of at least two characters that can be repeated at least twice in order to form the whole sequence. If such a sub-sequence exists, it means that its length K divides N with 1<K<N, hence N is not prime.
