Is the "*apply" family really not vectorized?

To me, vectorisation is primarily about making your code easier to write and easier to understand.

The goal of a vectorised function is to eliminate the book-keeping associated with a for loop. For example, instead of:

means <- numeric(length(mtcars))
for (i in seq_along(mtcars)) {
  means[i] <- mean(mtcars[[i]])
}
sds <- numeric(length(mtcars))
for (i in seq_along(mtcars)) {
  sds[i] <- sd(mtcars[[i]])
}

You can write:

means <- vapply(mtcars, mean, numeric(1))
sds   <- vapply(mtcars, sd, numeric(1))

That makes it easier to see what's the same (the input data) and what's different (the function you're applying).
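The same pattern extends to more than one summary per column. As a minimal sketch (the helper name `col_summary` is made up for illustration and is not part of the original answer), the third argument of vapply() acts as a template that every result must match, so a length-2 template gives a matrix with one column per variable:

```r
# Hypothetical helper: two summaries per column, returned as a named vector.
col_summary <- function(col) c(mean = mean(col), sd = sd(col))

# The template c(mean = 0, sd = 0) says: each call must return two doubles.
# vapply() stops with an error if any call returns something else.
stats <- vapply(mtcars, col_summary, c(mean = 0, sd = 0))

dim(stats)            # 2 rows (mean, sd), one column per mtcars variable
stats["mean", "mpg"]  # same value as mean(mtcars$mpg)
```

The input data and the function applied are still the only two things that vary; the loop counter and the pre-allocation have disappeared.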

A secondary advantage of vectorisation is that the for-loop is often written in C, rather than in R. This has substantial performance benefits, but I don't think it's the key property of vectorisation. Vectorisation is fundamentally about saving your brain, not saving the computer work.

I agree with Patrick Burns' view that it is rather loop hiding and not code vectorisation. Here's why:

Consider this C code snippet:

for (int i=0; i<n; i++)
  c[i] = a[i] + b[i];

What we want to do is quite clear. How the task is performed, or could be performed, is not. A for-loop is by default a serial construct; it says nothing about whether or how things can be done in parallel.

The most obvious way is that the code is run in a sequential manner. Load a[i] and b[i] on to registers, add them, store the result in c[i], and do this for each i.

However, modern processors have vector (SIMD) instruction sets that can operate on a vector of data within a single instruction when the same operation is applied to every element (e.g., adding two vectors as shown above). Depending on the processor/architecture, it might be possible to add, say, four numbers from a and b under the same instruction, instead of one at a time.

We would like to exploit Single Instruction Multiple Data (SIMD) and perform data-level parallelism, i.e., load 4 things at a time, add 4 things at a time, store 4 things at a time, for example. And this is code vectorisation.

Note that this is different from code parallelisation -- where multiple computations are performed concurrently.

It'd be great if the compiler identified such blocks of code and vectorised them automatically, which is a difficult task. Automatic code vectorisation is a challenging research topic in Computer Science. But over time, compilers have gotten better at it. Both GNU gcc and LLVM clang document their auto-vectorisation capabilities, and the LLVM documentation also includes some benchmarks compared against gcc and ICC (Intel C++ compiler).

gcc (I'm on v4.9) for example doesn't vectorise code automatically at -O2 level optimisation. So if we were to compile the code shown above at -O2, it would run sequentially.

We either need to add the flag -ftree-vectorize or change optimisation to level -O3. (Note that -O3 performs other additional optimisations as well.) The flag -fopt-info-vec is useful as it reports when a loop was successfully vectorised.

# compiling with -O2, -ftree-vectorize and  -fopt-info-vec
# test.c:32:5: note: loop vectorized
# test.c:32:5: note: loop versioned for vectorization because of possible aliasing
# test.c:32:5: note: loop peeled for vectorization to enhance alignment

This tells us that the function is vectorised. Here are the timings comparing both non-vectorised and vectorised versions on integer vectors of length 500 million:

x = sample(100L, 500e6L, TRUE)
y = sample(100L, 500e6L, TRUE)
z = vector("integer", 500e6L) # result vector

# non-vectorised, -O2
system.time(.Call("Csum", x, y, z))
#    user  system elapsed 
#   1.830   0.009   1.852

# vectorised using flags shown above at -O2
system.time(.Call("Csum", x, y, z))
#    user  system elapsed 
#   0.361   0.001   0.362

# both results are checked for identicalness, returns TRUE

This part can be safely skipped without losing continuity.

Compilers will not always have sufficient information to vectorise. We could use the OpenMP specification for parallel programming, which also provides a simd compiler directive to instruct compilers to vectorise the code. When vectorising code manually, it is essential to ensure that there are no memory overlaps, race conditions, etc.; otherwise the results will be incorrect.

#pragma omp simd
for (int i=0; i<n; i++)
  c[i] = a[i] + b[i];

By doing this, we specifically ask the compiler to vectorise it no matter what. We'll need to activate OpenMP extensions by using compile time flag -fopenmp. By doing that:

# timing with -O2 + OpenMP with simd
x = sample(100L, 500e6L, TRUE)
y = sample(100L, 500e6L, TRUE)
z = vector("integer", 500e6L) # result vector
system.time(.Call("Cvecsum", x, y, z))
#    user  system elapsed 
#   0.360   0.001   0.360

which is great! This was tested with gcc v6.2.0 and llvm clang v3.9.0 (both installed via homebrew, MacOS 10.12.3), both of which support OpenMP 4.0.

In this sense, even though the Wikipedia page on Array Programming mentions that languages which operate on entire arrays usually call such operations vectorised, it really is loop hiding IMO (unless it is actually vectorised).

In the case of R, even the C code behind rowSums() or colSums() doesn't exploit code vectorisation IIUC; it is just a loop in C. The same goes for lapply(). In the case of apply(), the loop is in R. All of these are therefore loop hiding.
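One rough way to see where each loop lives is to search the deparsed source of these functions. This is only a sketch: the helper `has_text` is hypothetical, and what the checks return depends on how a given R version implements these functions, so no output is shown.

```r
# Hypothetical helper: does the source of function `f` contain `text`?
# Whitespace is stripped first so that line breaks from deparse() don't matter.
has_text <- function(f, text) {
  src <- gsub("[[:space:]]+", "", paste(deparse(f), collapse = ""))
  grepl(text, src, fixed = TRUE)
}

has_text(apply,   "for(")                # is there an explicit R-level for loop?
has_text(lapply,  ".Internal(lapply(")   # is the loop handed to compiled code?
has_text(colSums, ".Internal(colSums(")  # likewise for colSums()
```

A call to .Internal() only says the loop is written in C; it says nothing about whether the compiler turned that loop into SIMD instructions.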

In short, wrapping an R function by:

just writing a for-loop in C != vectorising your code.
just writing a for-loop in R != vectorising your code.

Intel Math Kernel Library (MKL) for example implements vectorised forms of functions.

HTH

References:

Talk by James Reinders, Intel (this answer is mostly an attempt to summarise this excellent talk)

First of all, in your example you make tests on a "data.frame" which is not fair for colMeans, apply and "[.data.frame" since they have an overhead:

system.time(as.matrix(m))  #called by `colMeans` and `apply`
#   user  system elapsed 
#   1.03    0.00    1.05
system.time(for(i in 1:ncol(m)) m[, i])  #in the `for` loop
#   user  system elapsed 
#  12.93    0.01   13.07

On a matrix, the picture is a bit different:

mm = as.matrix(m)
system.time(colMeans(mm))
#   user  system elapsed 
#   0.01    0.00    0.01 
system.time(apply(mm, 2, mean))
#   user  system elapsed 
#   1.48    0.03    1.53 
system.time(for(i in 1:ncol(mm)) mean(mm[, i]))
#   user  system elapsed 
#   1.22    0.00    1.21

Regarding the main part of the question, the main difference between lapply/mapply/etc. and straightforward R loops is where the looping is done. As Roland notes, both C and R loops need to evaluate an R function in each iteration, which is the most costly part. The really fast C functions are those that do everything in C, so, I guess, this should be what "vectorised" is about?
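A pure-R sketch of that distinction, on made-up data: both versions below give the same answer, but the first evaluates an R closure once per element (even though vapply()'s loop itself is in C), while the second makes a single call whose per-element work happens entirely in compiled code.

```r
set.seed(1)
x <- as.list(runif(1e4))   # a list of length-1 numerics (illustrative data)

# Loop hidden in C, but an R function is still evaluated for every element.
neg_apply <- vapply(x, function(v) -v, numeric(1))

# Flatten once, then negate the whole vector in one call.
neg_vec <- -unlist(x)

identical(neg_apply, neg_vec)  # TRUE
```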

An example where we find the mean in each of a "list"s elements:

(EDIT May 11 '16 : I believe the example with finding the "mean" is not a good setup for the differences between evaluating an R function iteratively and compiled code, (1) because of the particularity of R's mean algorithm on "numeric"s over a simple sum(x) / length(x) and (2) it should make more sense to test on "list"s with length(x) >> lengths(x). So, the "mean" example is moved to the end and replaced with another.)

As a simple example, consider negating each length-1 element of a "list":

In a tmp.c file:

#include <R.h>
#define USE_RINTERNALS 
#include <Rinternals.h>
#include <Rdefines.h>

/* call a C function inside another */
double oppC(double x) { return(ISNAN(x) ? NA_REAL : -x); }
SEXP sapply_oppC(SEXP x)
{
    SEXP ans = PROTECT(allocVector(REALSXP, LENGTH(x)));
    for(int i = 0; i < LENGTH(x); i++) 
        REAL(ans)[i] = oppC(REAL(VECTOR_ELT(x, i))[0]);

    UNPROTECT(1);
    return(ans);
}

/* call an R function inside a C function;
 * will be used with 'f' as a closure and as a builtin */    
SEXP sapply_oppR(SEXP x, SEXP f)
{
    SEXP call = PROTECT(allocVector(LANGSXP, 2));
    SETCAR(call, install(CHAR(STRING_ELT(f, 0))));

    SEXP ans = PROTECT(allocVector(REALSXP, LENGTH(x)));     
    for(int i = 0; i < LENGTH(x); i++) { 
        SETCADR(call, VECTOR_ELT(x, i));
        REAL(ans)[i] = REAL(eval(call, R_GlobalEnv))[0];
    }

    UNPROTECT(2);
    return(ans);
}

And on the R side:

system("R CMD SHLIB /home/~/tmp.c")
dyn.load("/home/~/tmp.so")

with data:

set.seed(007)
myls = rep_len(as.list(c(NA, runif(3))), 1e7)

#a closure wrapper of `-`
oppR = function(x) -x

for_oppR = compiler::cmpfun(function(x, f)
{
    f = match.fun(f)  
    ans = numeric(length(x))
    for(i in seq_along(x)) ans[[i]] = f(x[[i]])
    return(ans)
})

Benchmarking:

#call a C function iteratively
system.time({ sapplyC =  .Call("sapply_oppC", myls) }) 
#   user  system elapsed 
#  0.048   0.000   0.047 

#evaluate an R closure iteratively
system.time({ sapplyRC =  .Call("sapply_oppR", myls, "oppR") }) 
#   user  system elapsed 
#  3.348   0.000   3.358 

#evaluate an R builtin iteratively
system.time({ sapplyRCprim =  .Call("sapply_oppR", myls, "-") }) 
#   user  system elapsed 
#  0.652   0.000   0.653 

#loop with a R closure
system.time({ forR = for_oppR(myls, "oppR") })
#   user  system elapsed 
#  4.396   0.000   4.409 

#loop with an R builtin
system.time({ forRprim = for_oppR(myls, "-") })
#   user  system elapsed 
#  1.908   0.000   1.913 

#for reference and testing 
system.time({ sapplyR = unlist(lapply(myls, oppR)) })
#   user  system elapsed 
#  7.080   0.068   7.170 
system.time({ sapplyRprim = unlist(lapply(myls, `-`)) }) 
#   user  system elapsed 
#  3.524   0.064   3.598 

all.equal(sapplyR, sapplyRprim)
#[1] TRUE 
all.equal(sapplyR, sapplyC)
#[1] TRUE
all.equal(sapplyR, sapplyRC)
#[1] TRUE
all.equal(sapplyR, sapplyRCprim)
#[1] TRUE
all.equal(sapplyR, forR)
#[1] TRUE
all.equal(sapplyR, forRprim)
#[1] TRUE

(The original example of finding the mean follows):

#all computations in C
all_C = inline::cfunction(sig = c(R_ls = "list"), body = '
    SEXP tmp, ans;
    PROTECT(ans = allocVector(REALSXP, LENGTH(R_ls)));

    double *ptmp, *pans = REAL(ans);

    for(int i = 0; i < LENGTH(R_ls); i++) {
        pans[i] = 0.0;

        PROTECT(tmp = coerceVector(VECTOR_ELT(R_ls, i), REALSXP));
        ptmp = REAL(tmp);

        for(int j = 0; j < LENGTH(tmp); j++) pans[i] += ptmp[j];

        pans[i] /= LENGTH(tmp);

        UNPROTECT(1);
    }

    UNPROTECT(1);
    return(ans);
')

#a very simple `lapply(x, mean)`
C_and_R = inline::cfunction(sig = c(R_ls = "list"), body = '
    SEXP call, ans, ret;

    PROTECT(call = allocList(2));
    SET_TYPEOF(call, LANGSXP);
    SETCAR(call, install("mean"));

    PROTECT(ans = allocVector(VECSXP, LENGTH(R_ls)));
    PROTECT(ret = allocVector(REALSXP, LENGTH(ans)));

    for(int i = 0; i < LENGTH(R_ls); i++) {
        SETCADR(call, VECTOR_ELT(R_ls, i));
        SET_VECTOR_ELT(ans, i, eval(call, R_GlobalEnv));
    }

    double *pret = REAL(ret);
    for(int i = 0; i < LENGTH(ans); i++) pret[i] = REAL(VECTOR_ELT(ans, i))[0];

    UNPROTECT(3);
    return(ret);
')                    

R_lapply = function(x) unlist(lapply(x, mean))                       

R_loop = function(x) 
{
    ans = numeric(length(x))
    for(i in seq_along(x)) ans[i] = mean(x[[i]])
    return(ans)
} 

R_loopcmp = compiler::cmpfun(R_loop)


set.seed(007); myls = replicate(1e4, runif(1e3), simplify = FALSE)
all.equal(all_C(myls), C_and_R(myls))
#[1] TRUE
all.equal(all_C(myls), R_lapply(myls))
#[1] TRUE
all.equal(all_C(myls), R_loop(myls))
#[1] TRUE
all.equal(all_C(myls), R_loopcmp(myls))
#[1] TRUE

microbenchmark::microbenchmark(all_C(myls), 
                               C_and_R(myls), 
                               R_lapply(myls), 
                               R_loop(myls), 
                               R_loopcmp(myls), 
                               times = 15)
#Unit: milliseconds
#            expr       min        lq    median        uq      max neval
#     all_C(myls)  37.29183  38.19107  38.69359  39.58083  41.3861    15
#   C_and_R(myls) 117.21457 123.22044 124.58148 130.85513 169.6822    15
#  R_lapply(myls)  98.48009 103.80717 106.55519 109.54890 116.3150    15
#    R_loop(myls) 122.40367 130.85061 132.61378 138.53664 178.5128    15
# R_loopcmp(myls) 105.63228 111.38340 112.16781 115.68909 128.1976    15
