Speed in Matlab vs. Julia vs. Fortran

What @Ian Bush says in his answer about the dual precision is correct. Moreover,

  • You will likely not need openmp for the kind of parallelization you have done in your code. The Fortran's intrinsic do concurrent() will automatically parallelize the loop for you (when the code is compiled with the parallel flag of the respective compiler).

  • Also, the where elsewhere construct is slow as it often requires the creation of a logical mask array and then applying it in a do-loop. You can use do concurrent() in place of where to both avoid the extra temporary array creation and parallelize the computation on multiple cores.

  • Also, when comparing 64bit precision numbers, it's good to make sure both values are the same type and kind to avoid an implicit type/kind conversion before the comparison is made.

  • Also, the calculation of a_grid(i) - a_grid/R in computing the c array is redundant and can be avoided in the subsequent line.

Here is the modified optimized parallel Fortran code without any OpenMP,

module mod_calc

    use iso_fortran_env, only: dp => real64

    implicit none
    integer, parameter :: len = 1001
    public :: dp, len

contains

    subroutine linspace(from, to, array)
        real(dp), intent(in) :: from, to
        real(dp), intent(out) :: array(:)
        real(dp) :: range
        integer :: n, i
        n = size(array)
        range = to - from
        if (n == 0) return
        if (n == 1) then
            array(1) = from
            return
        end if
        do concurrent(i=1:n)
            array(i) = from + range * (i - 1) / (n - 1)
        end do
    end subroutine

    subroutine calc_val() 

        implicit none

        real(dp) :: bbeta, sigma, R, zeta, tol, test
        real(dp) :: a_grid(len), V_mat(len,2), V_last(len,2), u(len,len,2), c(len,len,2)
        
        integer :: outeriter, i, j, k, sss, next_mat(2,len), fu
        character(len=*), parameter :: FILE_NAME = 'data.txt'   ! File name.

        call linspace(from=0._dp, to=100._dp, array=a_grid)

        bbeta = 0.98_dp
        sigma = 0.5_dp
        R = 1.0_dp/0.98_dp
        zeta = 1.0_dp - 1.0_dp/sigma

        tol = 1e-4_dp
        test = 1000.0_dp
        outeriter = 0

        do concurrent(i=1:len)
            c(1:len,i,2) = a_grid(i) - a_grid/R
            c(1:len,i,1) = c(1:len,i,2) + 20.0_dp
        end do

        u = c**zeta * 1.0_dp/zeta

        do concurrent(i=1:len, j=1:len, k=1:2)
            if (c(i,j,k)<=0._dp) u(i,j,k) = -1e6_dp
        end do

        V_mat = 0.0_dp
        next_mat = 0.0_dp

        do while (test>tol .and. outeriter<20000)

            outeriter = outeriter + 1
            V_last = V_mat

            do concurrent(i=1:len)
                V_mat(i,1)    = maxval(u(:,i,1) + bbeta*V_last(:, 2))
                next_mat(i,1) = maxloc(u(:,i,1) + bbeta*V_last(:, 2),1)
                V_mat(i,2)    = maxval(u(:,i,2) + bbeta*V_last(:, 1))
                next_mat(i,2) = maxloc(u(:,i,2) + bbeta*V_last(:, 1),1)
            end do

            test = maxval(abs(log(V_last/V_mat)))

         end do

         Write( *, * ) test

    end subroutine

end module mod_calc

program main
    use mod_calc
    implicit none
    integer:: clck_counts_beg,clck_rate,clck_counts_end
    call system_clock ( clck_counts_beg, clck_rate )
    call calc_val()
    call system_clock ( clck_counts_end, clck_rate )
    write (*, '("Time = ",f6.3," seconds.")')  (clck_counts_end - clck_counts_beg) / real(clck_rate)
end program main

Compiling your original code with /standard-semantics /F0x1000000000 /O3 /Qip /Qipo /Qunroll /Qunroll-aggressive /inline:all /Ob2 /Qparallel Intel Fortran compiler flags, yields the following timing,

original.exe
Time = 37.284 seconds.

compiling and running the parallel concurrent Fortran code in the above (on at most 4 cores, if any at all is used) yields,

concurrent.exe
Time =  0.149 seconds.

For comparison, this MATLAB's timing,

Value Function converged in 362 steps. 
Elapsed time is 3.575691 seconds.

One last tip: There are several vectorized array computations and loops in the above code that can still be merged together to even further improve the speed of your Fortran code. For example,

        u = c**zeta * 1.0_dp/zeta

        do concurrent(i=1:len, j=1:len, k=1:2)
            if (c(i,j,k)<=0._dp) u(i,j,k) = -1e6_dp
        end do

in the above code can be all merged with the do concurrent loop appearing before it,

        do concurrent(i=1:len)
            c(1:len,i,2) = a_grid(i) - a_grid/R
            c(1:len,i,1) = c(1:len,i,2) + 20.0_dp
        end do

If you decide to do so, then you can define an auxiliary variable inverse_zeta = 1.0_dp / zeta to use in the computation of u inside the loop instead of using * 1.0_dp / zeta, thus avoiding the extra division (which is more costly than multiplication), without degrading the readability of the code.


The reason the fortran is so slow is that it is using quadruple precision - I don't know Julia or Matlab but it looks as though double precision is being used in that case. Further as noted in the comments some of the loop orders are incorrect for Fortran, and also you are not consistent in your use of precision in the Fortran code, most of your constants are single precision. Correcting all these leads to the following:

  • Original: test = 9.83440674663232047922921588613472439E-0005 Time = 31.413 seconds.
  • Optimised: test = 9.8343643237979391E-005 Time = 0.912 seconds.

Note I have turned off parallelisation for these, all results are single threaded. Code is below:

module mod_calc
!!$    use omp_lib
    implicit none
!!$    integer, parameter :: dp = selected_real_kind(33,4931), len = 1001
    integer, parameter :: dp = selected_real_kind(15), len = 1001
    public :: dp, len

    contains

    subroutine linspace(from, to, array)
        real(dp), intent(in) :: from, to
        real(dp), intent(out) :: array(:)
        real(dp) :: range
        integer :: n, i
        n = size(array)
        range = to - from

        if (n == 0) return

        if (n == 1) then
            array(1) = from
            return
        end if

        do i=1, n
            array(i) = from + range * (i - 1) / (n - 1)
        end do
    end subroutine

    subroutine calc_val() 
        real(dp)::  bbeta, sigma, R, zeta, tol, test
        real(dp)::  a_grid(len), V_mat(len,2), V_last(len,2), &
                    u(len,len,2), c(len,len,2)
        
        integer :: outeriter, i, sss, next_mat(2,len), fu
        character(len=*), parameter :: FILE_NAME = 'data.txt'   ! File name.

        call linspace(from=0._dp, to=100._dp, array=a_grid)

        bbeta = 0.98_dp
        sigma = 0.5_dp
        R = 1.0_dp/0.98_dp
        zeta = 1.0_dp - 1.0_dp/sigma

        tol = 1e-4_dp
        test = 1000.0_dp
        outeriter = 0

        do i = 1,len
            c(:,i,1) = a_grid(i) - a_grid/R + 20.0_dp
            c(:,i,2) = a_grid(i) - a_grid/R
        end do

        u = c**zeta * 1.0_dp/zeta

        where (c<=0)
            u = -1e6_dp
        end where

        V_mat = 0.0_dp
        next_mat = 0.0_dp

        do while (test>tol .and. outeriter<20000)
            outeriter = outeriter+1
            V_last = V_mat

            !$OMP PARALLEL DEFAULT(NONE)          &
            !$OMP SHARED(V_mat, next_mat,V_last, u, bbeta)  &
            !$OMP PRIVATE(i)                      
            !$OMP DO SCHEDULE(static)
            do i=1,len 
                V_mat(i,1)    = maxval(u(:,i,1) + bbeta*V_last(:, 2))
                next_mat(i,1) = maxloc(u(:,i,1) + bbeta*V_last(:, 2),1)
                V_mat(i,2)    = maxval(u(:,i,2) + bbeta*V_last(:, 1))
                next_mat(i,2) = maxloc(u(:,i,2) + bbeta*V_last(:, 1),1)
            end do
            !$OMP END DO
            !$OMP END PARALLEL

            test = maxval(abs(log(V_last/V_mat)))
         end do
         Write( *, * ) test
    end subroutine
end module mod_calc

program main
    use mod_calc

    implicit none
    integer:: clck_counts_beg,clck_rate,clck_counts_end
    
!!$    call omp_set_num_threads(2)

    call system_clock ( clck_counts_beg, clck_rate )
    call calc_val()
    call system_clock ( clck_counts_end, clck_rate )
    write (*, '("Time = ",f6.3," seconds.")')  (clck_counts_end - clck_counts_beg) / real(clck_rate)
end program main

Compilation / linking:

ian@eris:~/work/stack$ gfortran --version
GNU Fortran (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

ian@eris:~/work/stack$ gfortran -Wall -Wextra -O3   jul.f90 
jul.f90:36:48:

         character(len=*), parameter :: FILE_NAME = 'data.txt'   ! File name.
                                                1
Warning: Unused parameter ‘file_name’ declared at (1) [-Wunused-parameter]
jul.f90:35:57:

         integer :: outeriter, i, sss, next_mat(2,len), fu
                                                         1
Warning: Unused variable ‘fu’ declared at (1) [-Wunused-variable]
jul.f90:35:36:

         integer :: outeriter, i, sss, next_mat(2,len), fu
                                    1
Warning: Unused variable ‘sss’ declared at (1) [-Wunused-variable]

Running:

ian@eris:~/work/stack$ ./a.out
   9.8343643237979391E-005
Time =  0.908 seconds.