Faster way to summarise variables based on column

na.omit does a ton of unnecessary checks and operations. Replacing it with a simple is.na call gives a decent speedup:

concat3 = function(x) {
  x = x[!is.na(x)]            # drop NAs directly, skipping na.omit's overhead
  if (length(x) == 0)
    NA_character_             # all-NA group -> NA
  else
    paste(x, collapse = ",")
}
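
A quick way to see that overhead in isolation (na.omit is an S3 generic and also builds an na.action attribute recording the dropped positions, none of which the plain is.na subset does) is a small sketch like this:

library(microbenchmark)
x <- sample(c(letters, NA), 1e6, replace = TRUE)
# compare na.omit against direct logical subsetting on the same vector
microbenchmark(na.omit(x), x[!is.na(x)], times = 10)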

Using docendo's data (but with strings instead of factors - factors slow all versions down):

microbenchmark(dt[, lapply(.SD, concat3), by = proid, .SDcols = -c("X4")],
               dt[, lapply(.SD, concat2), by = proid, .SDcols = -c("X4")],
               times = 5)
#Unit: milliseconds
#                                                       expr       min       lq     mean   median       uq      max neval
# dt[, lapply(.SD, concat3), by = proid, .SDcols = -c("X4")]  960.2475 1079.177 1251.545 1342.684 1402.571 1473.045     5
# dt[, lapply(.SD, concat2), by = proid, .SDcols = -c("X4")] 1718.8121 1892.696 2159.148 2171.772 2470.205 2542.253     5

Edit note: removed the original part of my answer that didn't address NA treatment and added a benchmark.

concat2 <- function(x) if(all(is.na(x))) NA_character_ else paste(na.omit(x), collapse = ",")
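
For example, NAs are dropped when other values are present, and only an all-NA group yields NA:

concat2(c("a", NA, "b"))
#[1] "a,b"
concat2(c(NA, NA))
#[1] NA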

Using data.table:

setDT(df)[, lapply(.SD, concat2), by = proid, .SDcols = -c("X4")]
#   proid    X1  X2  X3
#1:     1 zz,cd a,s e,f
#2:     2 ff,ta g,b z,h
#3:     3    NA   t   e

Using dplyr:

df %>% group_by(proid) %>% summarise_each(funs(concat2), -X4)
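
Note that summarise_each() and funs() have since been deprecated; on current dplyr (>= 1.0.0) the equivalent would be roughly:

df %>% group_by(proid) %>% summarise(across(-X4, concat2))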

Benchmark on smaller data than in the actual use case and not fully representative, so it's just meant to give an impression of how concat2 compares to concat:

library(microbenchmark)
library(dplyr)
library(data.table)

N <- 1e6
x <- c(letters, LETTERS)
df <- data.frame(
  proid = sample(1e4, N, TRUE),
  X1 = sample(c(x, NA), N, TRUE),
  X2 = sample(c(x, NA), N, TRUE),
  X3 = sample(c(x, NA), N, TRUE),
  X4 = sample(c(x, NA), N, TRUE)
)

dt <- as.data.table(df)
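
As noted in the first answer, factor columns slow all versions down. On R < 4.0.0, data.frame() converts the character columns above to factors by default, so to benchmark with strings you would convert them first, e.g.:

# convert factor columns (if any) to character, leaving proid untouched
df[] <- lapply(df, function(col) if (is.factor(col)) as.character(col) else col)
dt <- as.data.table(df)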

concat <- function(x) {
  x <- na.omit(x)
  if (length(x) == 0) {
    return(as.character(NA))
  } else {
    return(paste(x, collapse = ","))
  }
}

concat2 <- function(x) if(all(is.na(x))) NA_character_ else paste(na.omit(x), collapse = ",")

concat.dplyr <- function(){
  df %>% group_by(proid) %>% summarise_each(funs(concat), -X4)
}

concat2.dplyr <- function(){
  df %>% group_by(proid) %>% summarise_each(funs(concat2), -X4)
}

concat.data.table <- function(){
  dt[, lapply(.SD, concat), by = proid, .SDcols = -c("X4")]
}

concat2.data.table <- function(){
  dt[, lapply(.SD, concat2), by = proid, .SDcols = -c("X4")]
}


microbenchmark(concat.dplyr(), 
               concat2.dplyr(), 
               concat.data.table(), 
               concat2.data.table(),
               unit = "relative",
               times = 10L)
#Unit: relative
#                 expr      min       lq   median       uq      max neval
#       concat.dplyr() 1.058839 1.058342 1.083728 1.105907 1.080883    10
#      concat2.dplyr() 1.057991 1.065566 1.109099 1.145657 1.079201    10
#  concat.data.table() 1.024101 1.018443 1.093604 1.085254 1.066560    10
# concat2.data.table() 1.000000 1.000000 1.000000 1.000000 1.000000    10

Findings: data.table performs a little faster than dplyr on the sample data, and concat2 is a little faster than concat. However, the differences remain rather small on this sample data set.