How to improve the performance of solutions to Project Euler (#39)?

We are challenged to determine "how fast MMa can get" and, in so doing, to suggest rules "to choose different programming styles." The original solution takes 116 seconds (on my machine). At the time the question was posted, the solution time had been reduced by a factor of 1000 (10 doublings of speed) to 0.124 seconds by suggestions from users in chat.

This solution takes 1300 microseconds (0.0013 seconds) on the same machine, for a further 100-fold speedup (another 7 doublings):

euler39[p_] := Commonest @ Flatten[Table[ l, 
      {m, 2, Sqrt[p/2]}, 
      {n, Select[Range[m - 1], GCD[m #, m + #] == 1 && OddQ[m - #] &]}, 
      {l, 2 m (m + n), p, 2 m (m + n)}]];
Timing[Table[euler39[1000], {i, 1, 1000}];]

{1.311, Null}
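
For reference, the call euler39[1000] by itself should return the commonest perimeter as a one-element list (consistent with the value found below):

euler39[1000]

{840}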

It scales nicely (changing Table to ParallelTable to double the speed on larger problems):

AbsoluteTiming[euler39p[10^8]]

{120.8409117, {77597520}}

That is almost linear performance.
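
The parallel variant euler39p is not spelled out above; presumably it is nothing more than euler39 with the outer Table replaced by ParallelTable, along the lines of this sketch:

(* sketch of the parallel version: identical to euler39 except that the
   outer loop over m is distributed across the available kernels *)
euler39p[p_] := Commonest @ Flatten[ParallelTable[ l, 
      {m, 2, Sqrt[p/2]}, 
      {n, Select[Range[m - 1], GCD[m #, m + #] == 1 && OddQ[m - #] &]}, 
      {l, 2 m (m + n), p, 2 m (m + n)}]];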

Note the simplicity of the basic operations: this program could be ported to any machine that can loop, add, multiply, and tally (the square root can be eliminated). I estimate that an efficient compiled implementation could perform the same calculation in just a few microseconds, using only about 500 bytes of RAM, for up to another nine doublings in speed.

This solution was obtained through a process that, in my experience, generalizes to almost all forms of scientific computation:

  1. The problem was analyzed theoretically to identify an efficient algorithm. Resulting speed: 0.062 seconds.

  2. A timing analysis identified an MMa post-processing bottleneck. Some tweaking of this improved the timing. Speed: 0.0036 seconds (3600 microseconds).

  3. In comments, J.M. and Simon Woods each suggested better MMa constructs, together reducing the execution time to 2400 microseconds.

  4. The MMa bottleneck was removed altogether by a re-examination of the algorithm and the data structure, achieving a final reduction to 1300 microseconds (and considerably less RAM usage).

Ultimately a speedup factor of 90,000 was achieved, and this was done solely by means of algorithmic improvements: none of it can be attributed to programming style. Better MMa programmers than I will doubtless be able to squeeze out most of the next nine speed doublings by compiling the code and making other optimizations, but--short of obtaining a direct $O(1)$ formula for the answer (which would more or less defeat the point of the exercise, namely to use the computer to investigate a problem rather than merely to implement a theory-derived solution)--no more real speedup is possible. Note that compilation would also take us out of the MMa way of doing computation and bring us down to the procedural level of C and other compiled code.

The important lesson of this experience is that algorithm design is paramount. Don't worry about programming style or tweaking code: use your mathematical and computer science knowledge to find a better algorithm; implement a prototype; profile it; and--always focusing on the algorithm--see what can be done to eliminate bottlenecks. In my experience, one rarely has to go beyond this stage.

Details of the story, as amended several times during the development of this solution, follow.


This problem invites us to learn a tiny bit of elementary number theory, in the expectation it can result in a substantial change in the algorithm: that's how to really speed up a computation. With its help we learn that these Pythagorean triples can be parameterized by integers $\lambda \gt 0$ and $m \gt n \gt 0$ with $m$ relatively prime to $n$. We may take $x = \lambda(2 m n)$, $y = \lambda(m^2-n^2)$, and $z = \lambda(m^2+n^2)$, whence the perimeter is $p = 2 \lambda m (m+n)$.
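
As a quick sanity check of this parameterization (my addition, not part of the original derivation), the Pythagorean identity and the perimeter formula can be verified symbolically:

(* verify that the parameterization yields Pythagorean triples with perimeter 2 lam m (m + n) *)
With[{x = lam (2 m n), y = lam (m^2 - n^2), z = lam (m^2 + n^2)},
 Simplify[{x^2 + y^2 == z^2, x + y + z == 2 lam m (m + n)}]]

{True, True}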

The restrictions imposed by $p\le 1000$ and the obvious fact that $p$ is even give the limits for a triple loop over the parameters, implemented in a Table command below. The rest can be done without much thought--inelegantly and slowly--with brute-force post-processing to avoid double counting $(x,y,z)$ and $(y,x,z)$ as solutions, to gather and count the solutions for each $p$, and to select the commonest one.

(Although a triple loop sounds awful--one's instinctive reaction is to recoil at what looks like an $O(p^3)$ algorithm--notice that $m$ cannot exceed $\sqrt{p/2}$ and $n$ must be smaller yet, leaving few options for $\lambda$ in general. This gives us something like an $O(p\,f(p))$ algorithm with $f$ slowly growing, which scales very well. This limitation in the loop lengths is the key to the speed of this approach.)

euler39[p_] := Module[{candidates, scores, best},
   candidates = 
    Flatten[Table[{Through[{Min, Max}[2 l m n, l (m^2 - n^2)]], 2 l m (m + n)}, 
       {m, 2, Floor[Sqrt[p/2]]},
       {n, 1, m - 1}, 
       {l, 1, If[GCD[m, n] > 1, 0,  Floor[p / (2 m (m + n))]]}], 2];
   scores = {Last[Last[#]], Length[#]} & /@ 
     DeleteDuplicates /@ 
      Gather[candidates[[Ordering[candidates[[;; , 2]]]]], Last[#1] == Last[#2] &];
   best = Max[scores[[;; , 2]]];
   Select[scores, Last[#] >= best &]
  ];

The amount of speedup is surprising. Accurate timing requires repetition because the calculation is so fast:

Timing[Table[euler39[1000], {i, 1, 1000}]]

{3.619, {{{840, 8}}, {{840, 8}}, ...

I.e., the time to solve the problem is $0.0036$ seconds or $1/17000$ minutes. This makes larger versions of the problem accessible (using ParallelTable instead of Table to exploit some extra cores in part of the algorithm):

euler39[5 10^6] // AbsoluteTiming

{55.1441541, {{4084080, 168}}}

Even accounting for the parallelization, the timing is scaling nicely: it appears to be acting like $O(p\log(p))$. The limiting factor in MMa is RAM: the program needed about 4 GB for this last calculation and attempted to claim almost 20 GB for euler39[10^7] (but failed due to lack of RAM on this machine). This, too, could be streamlined if necessary using a more compact data structure, and perhaps could allow arguments up to $10^8$ or so.

Perhaps a solution that is faster yet (for smaller values of $p$, anyway) can be devised by factoring $p$, looping over the factors $\lambda$, and memoizing values for smaller $p$. But, at $1/300$ of a second, we have already achieved a speedup of four orders of magnitude, so it doesn't seem worth the bother.

Remarkably, this is much faster than the built-in PowersRepresentations solution found by Simon Woods.


Edit

At this point, J.M. and Simon Woods weighed in with better MMa code, together speeding up the solution by 50% (see the comments). In pondering their achievement, and wondering how much further one could go, it became apparent that the bottleneck lay in the post-processing to remove duplicates. What if we could generate each solution exactly once? There would no longer be any need for a complicated data structure--we could just tally the number of times each perimeter was obtained--and no post-processing at all.

To assure no duplication, we need to check that when generating a triple $(x,y,z)$ with $x^2 + y^2 = z^2$ and perimeter $x+y+z=p$, we do not later also generate $(y,x,z)$: that's how the duplicates arise. The initial effort tracked possible duplicates by forcing $x\le y$. The improved idea is to look at parity.

The parameter $\lambda$ is intended to be the greatest common divisor of $x$, $y$, and $z$. When it is, $x=2 m n$ and $y = m^2-n^2$ must be relatively prime. Because $x$ is obviously even, $y$ must be odd: that uniquely determines which of these two numbers is $x$ and which is $y$. Therefore, we do not need to check for duplicates if, in the looping, we guarantee that $2 m n$ and $m^2-n^2$ are relatively prime. A quick way to check is that (a) $m n$ and $m+n$ are relatively prime and (b) $m$ and $n$ have opposite parity. Making this check is essentially all the work performed by the algorithm: the rest is just looping and counting.
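
As a check on this criterion (again my addition), one can confirm over a modest range of pairs that the quick test coincides exactly with $2 m n$ and $m^2-n^2$ being relatively prime:

(* the (coprimality, parity) test agrees with the defining condition GCD[2 m n, m^2 - n^2] == 1 *)
And @@ Flatten @ Table[
   (GCD[m n, m + n] == 1 && OddQ[m - n]) == (GCD[2 m n, m^2 - n^2] == 1),
   {m, 2, 50}, {n, 1, m - 1}]

True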

By eliminating the check for duplicates, the new solution doubled the speed once more, from 2400 microseconds to 1300 microseconds. Where does it spend its time? For an argument $p$ (such as $1000$),

  • Approximately $p/2$ calculations of a GCD (for the second loop over n).

  • A loop of length $p/(2 m (m+n))$ for each combination of $(m,n)$.

An easy upper bound for the total number of iterations is $\frac{p}{8}\log{p}$, demonstrating the $O(p\log{p})$ scaling. If we assume the GCD calculations take an average of $\log{p}$ arithmetic operations each, the total number of operations is less than $p\log{p}$ plus comparable loop overhead together with incrementing an array of counts. The post-processing would merely scan that array for the location of its maximum. At $3 \times 10^9$ operations per second and $p=10^3$, the timing for good compiled code would be 0.3 microseconds. Problems up to $p \approx 10^{10}$ could be handled in reasonable time (under a minute) and without extraordinary amounts of RAM.
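
To make this concrete, here is a minimal compiled sketch of that count-and-scan strategy (an illustration of the estimate above, not the code that produced the timings). It tallies perimeters in a flat integer array and then scans the array for the index holding the largest count; inside Compile some functions (GCD, for instance) may fall back to uncompiled evaluation, so the sketch shows the data flow rather than the microsecond-level performance:

(* count the triples for each perimeter in an integer array, then return the
   perimeter with the most triples *)
euler39compiled = Compile[{{p, _Integer}},
   Module[{counts = Table[0, {p}], q = 0, best = 1},
    Do[
     If[GCD[m n, m + n] == 1 && OddQ[m - n],
      q = 2 m (m + n);                                 (* primitive perimeter *)
      Do[counts[[k]] = counts[[k]] + 1, {k, q, p, q}]  (* count its multiples up to p *)
      ],
     {m, 2, Floor[Sqrt[p/2.]]}, {n, 1, m - 1}];
    Do[If[counts[[k]] > counts[[best]], best = k], {k, 2, p}];  (* scan for the maximum *)
    best]];

euler39compiled[1000]

840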


Here's an approach based on finding all right-angled triangles with a hypotenuse <= 500 and measuring the perimeters. The answer is the Commonest perimeter which is less than 1000. This runs in about 1 second.

rats[n_] := DeleteDuplicates[
 Cases[Divisors[n^2, GaussianIntegers -> True], 
 z_Complex /; Abs[z] == n :> Sort[{Re@z, Im@z, n}]]];

Commonest[Select[Flatten[Table[Total[rats[h], {2}], {h, 5, 500}]], 0 < # <= 1000 &]]

(* 840 *)
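
To see why this works: a Gaussian divisor $z$ of $n^2$ with $|z| = n$ satisfies $z \bar{z} = n^2$, so its real and imaginary parts are the legs of a right triangle with hypotenuse $n$. A small illustration (my addition) for $n = 25$:

(* 15 + 20 I has modulus 25 and divides 25^2 in the Gaussian integers,
   so {15, 20, 25} is a right-angled triangle *)
With[{z = 15 + 20 I}, {z Conjugate[z] == 25^2, 25^2/z}]

(* {True, 15 - 20 I} *)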

Update

It is a bit faster to use PowersRepresentations to get the right angled triangles. Here is the solution as a one-liner:

Commonest[Select[
 Flatten[# + Total[Rest@PowersRepresentations[#^2, 2, 2], {2}] & /@ Range[500]],
0 < # <= 1000 &]]
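
To see the pieces of the one-liner (again an added illustration): for a given hypotenuse, PowersRepresentations lists the ways of writing its square as a sum of two squares; Rest drops the trivial {0, h} representation, and adding h to each leg sum gives the perimeters. For h = 25:

PowersRepresentations[25^2, 2, 2]

(* {{0, 25}, {7, 24}, {15, 20}} *)

25 + Total[Rest @ PowersRepresentations[25^2, 2, 2], {2}]

(* {56, 60} *)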

The following is a very straightforward algorithm, implemented using an inner-loop vectorization technique; it is an adaptation of my reply to this blog post:

fn = Compile[{{p, _Integer}},
   Module[{al = Range[p], bl = 1, c = Range[p], d = Range[p],
     zeros = 0*Range[p], result = 0*Range[p], rctr = 0},
    For[bl = 1, bl <= p/2 + 1, bl++,
     c = p - al - bl;                          (* candidate hypotenuse for every first leg al *)
     d = UnitStep[-Abs[c*c - bl*bl - al*al]];  (* 1 exactly where al, bl, c form a right triangle *)
     If[d != zeros,
      result[[++rctr]] = First@c[[Pick[Range[Length[d]], d, 1]]];
      ]
     ];
    Union@Take[result, rctr]]                  (* distinct hypotenuses found for perimeter p *)
   ];

This function returns a list of hypotenuse lengths for a given perimeter, for example:

fn[120]

(* {50,51,52} *)

What is remarkable about this function is that, due to the vectorized inner loop, compilation to C does not bring any speed enhancements.

Using the above function on my 6 cores, I get:

Position[#, Max[#]] &@ParallelMap[Length[fn[#]] &, Range[1000]] // AbsoluteTiming

(*
   {0.7812500,{{840}}}
*)

while on a single core I get it in about 3.2 seconds.