Fast linear regression by group

If all you want is coefficients, I'd just use user_id as a factor in the regression. Using @miles2know's simulated data code (though renaming since an object other than exp() sharing that name looks weird to me)

dat <- data.frame(id = rep(c("a","b","c"), each = 20),
                  x = rnorm(60,5,1.5),
                  y = rnorm(60,2,.2))

mod = lm(y ~ x:id + id + 0, data = dat)

We fit no global intercept (+ 0) so that the intercept for each id is the id coefficient, and no x by itself, so that the x:id interactions are the slopes for each id:

coef(mod)
#      ida      idb      idc    x:ida    x:idb    x:idc 
# 1.779686 1.893582 1.946069 0.039625 0.033318 0.000353 

So, for the a level of id, the ida coefficient, 1.78, is the intercept and the x:ida coefficient, 0.0396, is the slope.

I'll leave the gathering of these coefficients into appropriate columns of a data frame to you...

This solution should be very fast because you're not having to deal with subsets of data frames. It could probably be sped up even more with fastLm or such.

Note on scalability:

I did just try this on @nrussell's simulated full-size data and ran into memory allocation issues. Depending on how much memory you have it may not work in one go, but you could probably do it in batches of user ids. Some combination of his answer and my answer might be the fastest overall---or nrussell's might just be faster---expanding the user id factor into thousands of dummy variables might not be computationally efficient, as I've been waiting more than a couple minutes now for a run on just 5000 user ids.


You can just use the basic formulas for calculating slope and regression. lm does a lot of unnecessary things if all you care about are those two numbers. Here I use data.table for the aggregation, but you could do it in base R as well (or dplyr):

system.time(
  res <- DT[, 
    {
      ux <- mean(x)
      uy <- mean(y)
      slope <- sum((x - ux) * (y - uy)) / sum((x - ux) ^ 2)
      list(slope=slope, intercept=uy - slope * ux)
    }, by=user.id
  ]
)

Produces for 500K users ~30 obs each (in seconds):

 user  system elapsed 
 7.35    0.00    7.36 

Or about 15 microseconds per user.

Update: I ended up writing a bunch of blog posts that touch on this as well.

And to confirm this is working as expected:

> summary(DT[user.id==89663, lm(y ~ x)])$coefficients
             Estimate Std. Error   t value  Pr(>|t|)
(Intercept) 0.1965844  0.2927617 0.6714826 0.5065868
x           0.2021210  0.5429594 0.3722580 0.7120808
> res[user.id == 89663]
   user.id    slope intercept
1:   89663 0.202121 0.1965844

Data:

set.seed(1)
users <- 5e5
records <- 30
x <- runif(users * records)
DT <- data.table(
  x=x, y=x + runif(users * records) * 4 - 2, 
  user.id=sample(users, users * records, replace=T)
)

Update: As pointed out by Dirk, my original approach can be greatly improved upon by specifying x and Y directly rather than using the formula-based interface of fastLm, which incurs (a fairly significant) processing overhead. For comparison, using the original full size data set,

R> system.time({
  dt[,c("lm_b0", "lm_b1") := as.list(
    unname(fastLm(x, Y)$coefficients))
    ,by = "user_id"]
})
#  user  system elapsed 
#55.364   0.014  55.401 
##
R> system.time({
  dt[,c("lm_b0","lm_b1") := as.list(
    unname(fastLm(Y ~ x, data=.SD)$coefficients))
    ,by = "user_id"]
})
#   user  system elapsed 
#356.604   0.047 356.820

this simple change yields roughly a 6.5x speedup.


[Original approach]

There is probably some room for improvement, but the following took about 25 minutes on a Linux VM (2.6 GHz processor), running 64-bit R:

library(data.table)
library(RcppArmadillo)
##
dt[
  ,c("lm_b0","lm_b1") := as.list(
    unname(fastLm(Y ~ x, data=.SD)$coefficients)),
  by=user_id]
##
R> dt[c(1:2, 31:32, 61:62),]
   user_id   x         Y     lm_b0    lm_b1
1:       1 1.0 1674.8316 -202.0066 744.6252
2:       1 1.5  369.8608 -202.0066 744.6252
3:       2 1.0  463.7460 -144.2961 374.1995
4:       2 1.5  412.7422 -144.2961 374.1995
5:       3 1.0  513.0996  217.6442 261.0022
6:       3 1.5 1140.2766  217.6442 261.0022

Data:

dt <- data.table(
  user_id = rep(1:500000,each=30))
##
dt[, x := seq(1, by=.5, length.out=30), by = user_id]
dt[, Y := 1000*runif(1)*x, by = user_id]
dt[, Y := Y + rnorm(
  30, 
  mean = sample(c(-.05,0,0.5)*mean(Y),1), 
  sd = mean(Y)*.25), 
  by = user_id]

Tags:

R

Lm

Dplyr