Fast calculation of CDF / rolling join on multiple columns

Just a note on an alternative, and yet obvious, solution:

library(data.table)

set.seed(2016)
dt <- data.table(x=rnorm(20000), y=rnorm(20000), z=rnorm(20000))

system.time({
    ## work with a 3 x n matrix of points (rows are x, y, z)
    dt <- t(as.matrix(dt))

    ## grid of upper bounds; expand.grid varies its first argument (z) fastest
    bounds <- as.matrix(expand.grid(z=seq(-2,2,0.1),
                                    y=seq(-2,2,0.1),
                                    x=seq(-2,2,0.1)))

    ## reverse the columns to x, y, z so they line up with the rows of dt
    bounds <- bounds[,ncol(bounds):1]

    n_d <- ncol(bounds)

    ## for each bound, count the points that fall below it in every dimension
    x <- apply(bounds,
               1,
               function(b) sum(colSums(dt < b) == n_d))
})
})
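
If you want the counts lined up with their bounds (the same information the data.table answers return), you could, for instance, bind them back together; res and its column names below are just my own choices:

## rows of "bounds" have z varying fastest, then y, then x (expand.grid order)
res <- data.table(bounds, N.cum = x, CDF = x / ncol(dt))   # ncol(dt) = number of points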

On my machine, this solution takes roughly twice as long to compute as the JoeBase and OldBenDT solutions posted in the other answers. The main difference is memory usage: this approach is processor-bound rather than memory-bound.

I don't know of a precise way to compare memory usage in R, but memory.size(max=T) reported about 5 GB of memory used by those previous approaches (not the non-equi-join approach), versus only about 40 MB for the apply approach (note: I used 20,000 points in the sample distribution).

I think this has important implications for the scale of the computation you can execute.
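
One rough, built-in way to approximate peak memory for a single step (gc() only tracks R-level allocations, so treat the numbers as a lower bound):

gc(reset = TRUE)    # reset the "max used" counters
x <- apply(bounds, 1, function(b) sum(colSums(dt < b) == n_d))
gc()                # the "max used" column now reflects the peak for this step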


Update

Below, I have provided a general base R solution (it will work on non-uniform grids). It was faster than the fastest published solution provided by the OP (more on this later). As the OP intimates, generating the N.cum column is the real bottleneck, so I focused my efforts on that task alone (generating the CDF is trivial once N.cum is obtained).

JoeBase <- function(dtt, s) {
    ## columns of m are x, y, z
    m <- matrix(c(dtt$x, dtt$y, dtt$z), ncol = 3)
    ## dimensions are reversed because as.vector() later runs the first index fastest
    N.Cum <- array(vector(mode = "integer"), dim = rev(sapply(s, length)))
    for (i in seq_along(s[[1]])) {
        t1 <- m[,1] <= s[[1]][i]                 # rows satisfying the x bound
        for (j in seq_along(s[[2]])) {
            t2 <- t1 & (m[,2] <= s[[2]][j])      # ... and the y bound
            for (k in seq_along(s[[3]])) {
                ## ... and the z bound; sum() counts the TRUEs
                N.Cum[k,j,i] <- sum(t2 & (m[,3] <= s[[3]][k]))
            }
        }
    }
    as.vector(N.Cum)
}

The above algorithm takes advantage of vectorized operations, notably the creation and reuse of the logical vectors t1 and t2. These vectors are used to obtain the number of rows that meet the criteria for all 3 columns in the original data.table: we simply rely on R coercing a logical vector to an integer vector when sum is applied.
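
For example, the coercion at the heart of it is simply:

sum(c(TRUE, FALSE, TRUE, TRUE))
[1] 3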

Figuring out how to populate the 3-dimensional integer array N.Cum was a bit of a challenge, since it is later converted to a vector via as.vector. It took some trial and error to learn how as.vector behaves: R stores arrays in column-major order, so as.vector runs the first index fastest, which means the "first" and "last" dimensions must be permuted for the coercion to a vector to come out in the expected order (the first few times through, I had N.Cum[i,j,k] instead of N.Cum[k,j,i]).
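
A tiny illustration of that column-major behaviour:

a <- array(1:24, dim = c(2, 3, 4))
head(as.vector(a))    # the first index varies fastest
[1] 1 2 3 4 5 6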

First, let's test for equality:

library(data.table)
## Here is the function I used to test against. I included the generation
## of "bounds" and "bg" as "result" depends on both of these (N.B. "JoeBase" does not)
BenDT <- function(dt, s) {
    X <- data.table(X=s[[1]]); X[, x := X]
    Y <- data.table(Y=s[[2]]); Y[, y := Y]
    Z <- data.table(Z=s[[3]]); Z[, z := Z]

    dt <- X[dt, on="x", roll=-Inf, nomatch=0]
    dt <- Y[dt, on="y", roll=-Inf, nomatch=0]
    dt <- Z[dt, on="z", roll=-Inf, nomatch=0]
    bg <- dt[, .N, keyby=list(X, Y, Z)]

    bounds <- CJ(X=X$X, Y=Y$Y, Z=Z$Z)

    kl <- bg[bounds, on=c("X", "Y", "Z")]
    kl[is.na(N), N := 0]

    # Counting
    kl[, CountUntil.XgivenYZ := cumsum(N), by=list(Y, Z)]
    kl[, CountUntil.XYgivenZ := cumsum(CountUntil.XgivenYZ), by=list(X, Z)]
    kl[, CountUntil.XYZ := cumsum(CountUntil.XYgivenZ), by=list(X, Y)]

    # Cleanup
    setnames(kl, "CountUntil.XYZ", "N.cum")
    kl[, CDF := N.cum/nrow(dt)]
    kl
}

st <- list(seq(-2, 2, 0.1), seq(-2, 2, 0.1), seq(-2, 2, 0.1))
t1 <- BenDT(dt, st)
t2 <- JoeBase(dt, st)

all.equal(t1$N.cum, t2)
[1] TRUE

Now we test for speed. First, we compile the functions with cmpfun from the compiler package. The first benchmark reflects efficiency on smaller examples.

library(compiler)
library(microbenchmark)
c.JoeBase <- cmpfun(JoeBase)
c.BenDT <- cmpfun(BenDT)
c.OldBenDT <- cmpfun(OldBenDT)  ## The previous best solution that Ben contributed

st <- list(seq(-2, 2, 0.1), seq(-2, 2, 0.1), seq(-2, 2, 0.1))
microbenchmark(c.BenDT(dt, st), c.OldBenDT(dt, st), c.JoeBase(dt, st), times = 10)
Unit: milliseconds
              expr        min         lq       mean    median         uq        max neval cld
   c.BenDT(dt, st)   34.24872   34.78908   38.87775   37.4924   43.37179   46.12859    10 a  
c.OldBenDT(dt, st) 1485.68178 1532.35878 1607.96669 1593.9813 1619.58908 1845.75876    10  b 
 c.JoeBase(dt, st) 1880.71648 1962.38160 2049.43985 2007.4880 2169.93078 2281.02118    10   c

Below is the old test. When the number of bins increases, c.JoeBase really starts to dominate c.OldBenDT (over 5x faster).

st <- list(seq(-5, 5, 0.1), seq(-5, 5, 0.1), seq(-5, 5, 0.1))
microbenchmark(c.JoeBase(dt, st), c.OldBenDT(dt, st), times = 5)
Unit: seconds
              expr       min        lq      mean    median        uq       max neval cld
 c.JoeBase(dt, st)  23.50927  23.53809  29.61145  24.52748  30.81485  45.66759     5  a 
c.OldBenDT(dt, st) 110.60209 123.95285 133.74601 124.97929 125.96186 183.23394     5   b

After performing further tests, I have some misgivings about these results (@Ben has expressed a similar sentiment in the comments). I'm fairly sure c.JoeBase appears faster only because of the limitations of my old computer. As @stephematician pointed out in his answer, the original solution is memory-intensive: if you run system.time on c.OldBenDT, you will see that the majority of the time is spent in the system category, while its user time is comparable to the user time of c.JoeBase. My 6-year-old Mac has only 4 GB of RAM, and I surmise a lot of memory swapping is taking place in these operations. Observe:

## test with very tiny buckets (i.e. 0.025 instead of 0.1 above)
st <- list(seq(-1.5, 1.5, 0.025), seq(-1.5, 1.5, 0.025), seq(-1.5, 1.5, 0.025))  
system.time(c.JoeBase(dt, st))
  user  system elapsed 
36.407   4.748  41.170

system.time(c.OldBenDT(dt, st))
   user  system elapsed 
49.653  77.954 475.304

system.time(c.BenDT(dt, st))  ## Ben's new solution is lightning fast
  user  system elapsed 
 0.603   0.063   0.668

Regardless, @Ben's latest solution is far superior. Check out these new benchmarks:

st <- list(seq(-5, 5, 0.1), seq(-5, 5, 0.1), seq(-5, 5, 0.1))
microbenchmark(c.JoeBase(dt, st), BenDT(dt, st), times = 5)
Unit: milliseconds
             expr        min         lq       mean     median        uq        max neval cld
c.JoeBase(dt, st) 26517.0944 26855.7819 28341.5356 28403.7871 29926.213 30004.8018     5   b
    BenDT(dt, st)   342.4433   359.8048   400.3914   379.5319   423.336   496.8411     5  a

Yet another victory for data.table.


Figured it out.

# Step 1 - map each sample to the nearest X, Y, and Z at or above it. (In other words, bin the data.)

X <- data.table(X=seq(-2, 2, by=.1)); X[, x := X]
Y <- data.table(Y=seq(-2, 2, by=.1)); Y[, y := Y]
Z <- data.table(Z=seq(-2, 2, by=.1)); Z[, z := Z]

dt <- X[dt, on="x", roll=-Inf, nomatch=0]
dt <- Y[dt, on="y", roll=-Inf, nomatch=0]
dt <- Z[dt, on="z", roll=-Inf, nomatch=0]

# Step 2 - aggregate by unique (X, Y, Z) triplets and count the samples directly below each of these bounds.
bg <- dt[, .N, keyby=list(X, Y, Z)]

# Step 3 - get the count of samples directly below EVERY (X, Y, Z) bound
bounds <- CJ(X=X$X, Y=Y$Y, Z=Z$Z)
kl <- bg[bounds, on=c("X", "Y", "Z")]
kl[is.na(N), N := 0]

# Step 4 (the tricky part) - Consider a single (Y, Z) pair. X is in ascending order, so a cumsum
# of N for each (Y, Z) gives the count of x <= X | Y, Z. Now hold X and Z fixed: a cumsum of that
# result over Y (also ascending) gives the count of x <= X, y <= Y | Z. Continue the process for Z.
# (A toy 2-D illustration of this chained cumsum follows the code.)
kl[, CountUntil.XgivenYZ := cumsum(N), by=list(Y, Z)]
kl[, CountUntil.XYgivenZ := cumsum(CountUntil.XgivenYZ), by=list(X, Z)]
kl[, CountUntil.XYZ := cumsum(CountUntil.XYgivenZ), by=list(X, Y)]

# Cleanup
setnames(kl, "CountUntil.XYZ", "N.cum")
kl[, CDF := N.cum/nrow(dt)]
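
To see why the chained cumsum in Step 4 works, here is a toy 2-D illustration (the per-bin counts are made-up numbers, and toy/cX/cXY are just names I picked for the demo):

library(data.table)

## per-bin counts on a 2x2 grid of (X, Y) upper bounds
toy <- CJ(X = 1:2, Y = 1:2)
toy[, N := c(1, 3, 2, 4)]           # e.g. 2 samples fall in the (X=2, Y=1) bin

toy[, cX  := cumsum(N),  by = Y]    # count of x <= X, given the Y bin
toy[, cXY := cumsum(cX), by = X]    # count of x <= X and y <= Y
toy[]                               # cXY is 1, 4, 3, 10; e.g. 3 = bins (1,1)+(2,1), 10 = all four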

Generalization

For anyone who wants it, I generalized this to work with any number of variables and dumped the function into my R package, mltools.

For example, to solve this problem you can do

library(mltools)

bounds <- list(x=seq(-2, 2, by=.1), y=seq(-2, 2, by=.1), z=seq(-2, 2, by=.1))
empirical_cdf(x=dt, ubounds=bounds)
        x  y    z N.cum   CDF
    1: -2 -2 -2.0     0 0.000
    2: -2 -2 -1.9     0 0.000
    3: -2 -2 -1.8     0 0.000
    4: -2 -2 -1.7     0 0.000
    5: -2 -2 -1.6     0 0.000
   ---                       
68917:  2  2  1.6   899 0.899
68918:  2  2  1.7   909 0.909
68919:  2  2  1.8   917 0.917
68920:  2  2  1.9   924 0.924
68921:  2  2  2.0   929 0.929
