How to compile a procedural program effectively

In the Fortran version, the large loop sits inside the executable, and the executable is called only once. Just as important: the memory for result is allocated only once. To make the benchmark fair, you should compare against something like

f5 = Compile[{{ls, _Real, 1}},
  Module[{n = Length@ls, tp},
    tp = Table[0., {Length[ls]}];
    Do[
     tp[[1]] = ls[[2]];
     tp[[n]] = ls[[n - 1]];
     Do[tp[[i]] = ls[[i - 1]] + ls[[i + 1]], {i, 2, n - 1}];
     , {20000}];
    tp],
   CompilationTarget -> "C",
   RuntimeOptions -> "Speed"
];

On my machine, calling f5[ls] takes about 0.067 s, while Do[f4[ls];, {20000}] needs about 0.58 s. The Fortran variant (see below), compiled with gfortran -o bla test2.f90 -O3, needs about 0.012 s. Moreover, calling f6[ls] with the function defined below needs only 0.034 s, which is not too far from the Fortran timing. The only change in f6 is that elements are read through Compile`GetElement, an undocumented function that, as far as I know, skips the bounds check on each access.

f6 = Compile[{{ls, _Real, 1}},
   Module[{n = Length@ls, tp},
    tp = Table[0., {n}];
    Do[
     tp[[1]] = Compile`GetElement[ls, 2];
     tp[[n]] = Compile`GetElement[ls, n - 1];
     Do[
      tp[[i]] = Compile`GetElement[ls, i - 1] + Compile`GetElement[ls, i + 1],
      {i, 2, n - 1}], 
    {20000}];
    tp],
   CompilationTarget -> "C",
   RuntimeOptions -> "Speed"
   ];

test2.f90

program main
implicit none
integer,parameter :: N0 = 2000
integer i,j
real (kind=8) :: list(N0), result(N0),start,finish
do i = 1, N0
    list(i) = sin(i/real(N0))
end do
call CPU_TIME(start)
do j = 1, 20000
    result(1) = list(2)
    result(N0) = list(N0-1)
    do i=2,N0-1
        result(i) = list(i+1) + list(i-1)
    end do
end do
call CPU_TIME(finish)
write(*,*) result
write(*,*) finish-start
end program

PS: I have not used any parallelization so far. Even with the most naive way of doing that within Mathematica, I get the following time per call (on a quad-core CPU):

AbsoluteTiming[ParallelDo[f6[ls], {i, 1, $KernelCount}]][[1]]/($KernelCount)
(* 0.0125135 *)

and, when the number of jobs is increased further:

AbsoluteTiming[ParallelDo[f6[ls], {i, 1, $KernelCount 10}]][[ 1]]/($KernelCount 10)
(* 0.00855335 *)

Admittedly, it is not exactly fair to compare this to the unparallelized Fortran code. I added it only to show another way to speed up the Mathematica code.


This is an answer to Q1 and partly Q4, really. I can't test your Fortran version at the moment, but it would be an interesting comparison.

You can improve the performance of f4 relative to f1 and f2 by setting RuntimeOptions -> "Speed". Evidently, changing the runtime settings from their defaults (mainly "CatchMachineIntegerOverflow", it seems) affects the two functions differently.

For instance:

f2 = Compile[{{ls1, _Real, 1}}, 
   Append[Rest@ls1, 0.] + Prepend[Most@ls1, 0.], 
   CompilationTarget -> "C", RuntimeOptions -> "Speed"];

f4 = Compile[{{ls, _Real, 1}}, 
   Module[{n = Length@ls, tp = ls},
    tp[[1]] = ls[[2]];
    tp[[n]] = ls[[n - 1]];
    Do[tp[[i]] = ls[[i - 1]] + ls[[i + 1]], {i, 2, n - 1}];
    tp],
   CompilationTarget -> "C", RuntimeOptions -> "Speed"];

AbsoluteTiming[TimeConstrained[Do[f2[ls];, {20000}], 5]]
(* 0.152 seconds *)

AbsoluteTiming[TimeConstrained[Do[f4[ls];, {20000}], 5]]
(* 0.127 seconds *)
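To test the guess that "CatchMachineIntegerOverflow" accounts for most of the difference, one could switch off only that setting instead of using the full "Speed" preset. The sketch below assumes this; f4b is a hypothetical name introduced for illustration, and ls is assumed to be a 2000-element real list like the one in the Fortran program.

```mathematica
(* Assumption: test data matching the Fortran initialisation; replace with your own ls *)
ls = Table[Sin[i/2000.], {i, 1, 2000}];

(* Same body as f4, but only the integer-overflow check is disabled *)
f4b = Compile[{{ls, _Real, 1}},
   Module[{n = Length@ls, tp = ls},
    tp[[1]] = ls[[2]];
    tp[[n]] = ls[[n - 1]];
    Do[tp[[i]] = ls[[i - 1]] + ls[[i + 1]], {i, 2, n - 1}];
    tp],
   CompilationTarget -> "C",
   RuntimeOptions -> {"CatchMachineIntegerOverflow" -> False}];

AbsoluteTiming[Do[f4b[ls];, {20000}]]
```

If the timing of f4b lands close to that of the "Speed" version of f4, the overflow check is the dominant cost; if not, the other settings in the preset matter as well.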