Fast and efficient character DataFrame creation in Rcpp

Thanks for making a snapshot of the data available. (BTW, no point tar'ing a single file; you could just have xz'ed the csv file. Anyway.)

I get different results on my Ubuntu 20.04 box, which are closer to what I anticipated:

  • data.table::fread() is competitive as we expected (I am running data.table from git as there was a regression in the most recent release)
  • vroom and stringfish, once we force materialization to compare apples to apples rather than images of apples, are about the same
  • Rcpp is in the ballpark too but a little more variable

I capped it at 10 runs; the variability would probably come down with more runs, but caching influences it too.

In short: no clear winners, and surely no mandate to replace one of the (already known to be well-tuned) reference implementations.

edd@rob:~/git/stackoverflow/65043010(master)$ Rscript bm.R
Unit: seconds
  expr     min      lq    mean  median      uq     max neval cld
 fread 1.37294 1.51211 1.54004 1.55138 1.57639 1.62939    10   a
 vroom 1.44670 1.53659 1.62104 1.61172 1.61764 1.88921    10   a
 sfish 1.21609 1.57000 1.57635 1.60180 1.63933 1.72975    10   a
 rcpp1 1.44111 1.45354 1.61275 1.55190 1.60535 2.15847    10   a
 rcpp2 1.47902 1.57970 1.75067 1.60114 1.64857 2.75851    10   a
edd@rob:~/git/stackoverflow/65043010(master)$ 

Code for top-level script

suppressMessages({
    library(data.table)
    library(Rcpp)
    library(vroom)
    library(stringfish)
    library(microbenchmark)
})

vroomread <- function(csvfile) {
    a <- vroom(csvfile, col_types = "cc", progress = FALSE)
    vroom:::vroom_materialize(a, TRUE)  # force the ALTREP columns into regular ones
}
sfread <- function(csvfile) {
    a <- sf_readLines(csvfile)
    a <- a[-1]  # drop the header line so row 1 is real data
    data.table::data.table(uns = sf_substr(a, 1, 81),
                           sol = sf_substr(a, 83, 163))
}

sourceCpp("rcppfuncs.cpp")


csvfile <- "sudoku_100k.csv"
microbenchmark(fread=fread(csvfile),
               vroom=vroomread(csvfile),
               sfish=sfread(csvfile),
               rcpp1=setalloccol(read_to_df_ifstream(csvfile)),
               rcpp2=setalloccol(read_to_df_ifstream_charvector(csvfile)),
               times=10)

Code for Rcpp script

#include <Rcpp.h>
#include <fstream>

//[[Rcpp::export]]
Rcpp::DataFrame read_to_df_ifstream(std::string filename) {
  const int n_lines = 1000000;
  std::ifstream file(filename, std::ifstream::in);

  std::string line;
  // burn the header
  std::getline(file, line);

  std::vector<std::string> a, b;
  a.reserve(n_lines);
  b.reserve(n_lines);

  while (std::getline(file, line)) {
    a.push_back(line.substr(0, 81));   // substr(pos, count): 81 unsolved chars
    b.push_back(line.substr(82, 81));  // 81 solved chars after the comma
  }

  Rcpp::List df(2);
  df.names() = Rcpp::CharacterVector::create("unsolved", "solved");

  df["unsolved"] = a;
  df["solved"] = b;

  df.attr("class") = Rcpp::CharacterVector::create("data.table", "data.frame");

  return df;
}

//[[Rcpp::export]]
Rcpp::DataFrame read_to_df_ifstream_charvector(std::string filename) {
  const int n_lines = 1000000;
  std::ifstream file(filename, std::ifstream::in);

  std::string line;
  // burn the header
  std::getline(file, line);

  Rcpp::CharacterVector a(n_lines), b(n_lines);

  int l = 0;
  while (std::getline(file, line)) {
    a(l) = line.substr(0, 81);   // substr(pos, count), not (pos, end)
    b(l) = line.substr(82, 81);
    l++;
  }

  Rcpp::List df(2);
  df.names() = Rcpp::CharacterVector::create("unsolved", "solved");

  df["unsolved"] = a;
  df["solved"] = b;

  df.attr("class") = Rcpp::CharacterVector::create("data.table", "data.frame");

  return df;
}

This is not really a proper answer to my question, rather some thoughts I did not want to go to waste, plus some benchmarks. Maybe useful to someone who faces a similar issue.

To recall, the basic idea is to read 1 million rows of two 81-character strings into an R object (preferably a data.frame, data.table, or tibble). For the benchmarks I have used the 1 million sudoku dataset by Kyubyong Park.
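For reference, this is the layout that all the hard-coded offsets below rely on: an 18-byte header line, then fixed-width rows of 81 characters, a comma, and another 81 characters (164 bytes per line including the newline). Schematically:

quizzes,solutions
<81 chars unsolved>,<81 chars solved>
<81 chars unsolved>,<81 chars solved>
...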

I structured the answer into two parts: 1) using other R packages and 2) using Rcpp/C++ and C to work at a lower level.

Surprisingly, for character data, specialised packages such as stringi, stringfish, or vroom are really efficient and beat (my) lower-level C++/C code.

Important to note is that some packages use ALTREP (see for example François' take on them here), which means that the data is not materialized in R until needed. I.e., loading the data using vroom takes less than 1 second, but the first operations (which need to materialize the data) take much longer... To work around this, I either force materialization by putting the data into a data.table or use an internal function of vroom to force it.
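A minimal sketch to make the deferral visible (same file as below; nchar() is just a convenient way to touch every element and thereby force the column):

file <- "sudokus/sudoku_1m.csv"
tictoc::tic()
a <- vroom::vroom(file, col_types = "cc", progress = FALSE)
tictoc::toc()  # well under a second: the strings are not materialized yet

tictoc::tic()
invisible(nchar(a[[1]]))  # touching every element forces materialization
tictoc::toc()  # this is where the bulk of the time goes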

1) R packages

data.table and fread - 75 secs

Mainly as a baseline. (The 75 seconds are probably inflated by the fread regression in the then-current data.table release that the other answer mentions.)

file <- "sudokus/sudoku_1m.csv"
tictoc::tic()
dt <- data.table::fread(file, colClasses = "character")
tictoc::toc()
#> 75.296 sec elapsed

Vroom with materialization - 19 secs

Note that vroom uses ALTREP, forcing materialization to level the playing field!

file <- "sudokus/sudoku_1m.csv"
tictoc::tic()
a <- vroom::vroom(file, col_types = "cc", progress = FALSE)
# internal function that materializes the ALTREP data
df <- vroom:::vroom_materialize(a, TRUE)
tictoc::toc()
#> 19.926 sec elapsed
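Alternatively, without reaching for the internal function, putting the columns into a data.table forces materialization as well. A sketch (I am assuming the column names quizzes and solutions from the original csv):

a <- vroom::vroom(file, col_types = "cc", progress = FALSE)
dt <- data.table::data.table(uns = a$quizzes, sol = a$solutions)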

Stringfish - 19 secs

Stringfish uses ALTREP, so reading the data and taking the substrings takes less than one second. Materialization takes the rest, similar to vroom.

library(stringfish)
file <- "sudokus/sudoku_1m.csv"
tictoc::tic()
a <- sf_readLines(file)
a <- a[-1]  # drop the header line, as in the stringi version below

dt <- data.table::data.table(
  uns = sf_substr(a, 1, 81),
  sol = sf_substr(a, 83, 163)
)
tictoc::toc()
#> 19.698 sec elapsed

Stringi - 22 secs

Note that the conversion to data.table takes virtually no time.

tictoc::tic()
a <- stringi::stri_read_lines(file)
# discard header
a <- a[-1]

dt <- data.table::data.table(
  uns = stringi::stri_sub(a, 1, 81),
  sol = stringi::stri_sub(a, 83, 163)
)
tictoc::toc() 
#> 22.409 sec elapsed

2) C and C++ functions

Rcpp with ifstream read to STL first - 22 secs

//[[Rcpp::export]]
Rcpp::DataFrame read_to_df_ifstream(std::string filename) {
  const int n_lines = 1000000;
  std::ifstream file(filename);

  std::string line;
  // burn the header
  std::getline(file, line);

  std::vector<std::string> a, b;
  a.reserve(n_lines);
  b.reserve(n_lines);

  while (std::getline(file, line)) {
    a.push_back(line.substr(0, 81));   // substr(pos, count): 81 unsolved chars
    b.push_back(line.substr(82, 81));  // 81 solved chars after the comma
  }

  Rcpp::List df(2);
  df.names() = Rcpp::CharacterVector::create("unsolved", "solved");

  df["unsolved"] = a;
  df["solved"] = b;

  df.attr("class") = Rcpp::CharacterVector::create("data.table", "data.frame");

  return df;
}

/*** R
tictoc::tic()
file <- "sudokus/sudoku_1m.csv"
raw <- read_to_df_ifstream(file)
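# setalloccol() gives the manually-classed list the over-allocated
# column slots a genuine data.table carries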
dt <- data.table::setalloccol(raw)
tictoc::toc()
#> 22.098 sec elapsed
*/

Rcpp with ifstream read directly to Rcpp::CharacterVector - 21 secs

//[[Rcpp::export]]
Rcpp::DataFrame read_to_df_ifstream_charvector(std::string filename) {
  const int n_lines = 1000000;
  std::ifstream file(filename);

  std::string line;
  // burn the header
  std::getline(file, line);

  Rcpp::CharacterVector a(n_lines), b(n_lines);

  int l = 0;
  while (std::getline(file, line)) {
    a(l) = line.substr(0, 81);   // substr(pos, count), not (pos, end)
    b(l) = line.substr(82, 81);
    l++;
  }

  Rcpp::List df(2);
  df.names() = Rcpp::CharacterVector::create("unsolved", "solved");

  df["unsolved"] = a;
  df["solved"] = b;

  df.attr("class") = Rcpp::CharacterVector::create("data.table", "data.frame");

  return df;
}

/*** R
tictoc::tic()
file <- "sudokus/sudoku_1m.csv"
raw <- read_to_df_ifstream_charvector(file)
dt <- data.table::setalloccol(raw)
tictoc::toc()
#> 21.436 sec elapsed
*/

Rcpp with buffer - 75 secs

This is basically the initial approach I chose, as outlined in the question above. I am not really sure why it is slower than the others...

//[[Rcpp::export]]
Rcpp::DataFrame read_to_df_buffer(std::string filename) {
  const int max_buffer_size = 1e8;
  const int header_size = 18; // only fixed in this example...
  const int n_lines = 1000000;

  FILE* infile;
  infile = fopen(filename.c_str(), "r");
  if (infile == NULL) Rcpp::stop("File Error!\n");

  fseek(infile, 0L, SEEK_END);
  int64_t file_size = ftell(infile);
  fseek(infile, 0L, SEEK_SET);

  // initiate the buffers
  char* buffer;
  int64_t buffer_size = sizeof(char) * max_buffer_size > file_size
    ? file_size : max_buffer_size;
  buffer = (char*) malloc(buffer_size);

  // skip the header...
  int64_t this_buffer_size = fread(buffer, 1, header_size, infile);

  // a holds the first part (quizzes or unsolved) b holds solution/solved
  std::vector<std::string> a, b;
  a.resize(n_lines);
  b.resize(n_lines);

  const int line_length = 2 * 82; // 2 times 81 digits plus one , or newline
  int l = 0;
  // fill the buffer
  int current_pos = ftell(infile);
  int next_buffer_size = file_size - current_pos > buffer_size
    ? buffer_size : file_size - current_pos;

  while ((this_buffer_size = fread(buffer, 1, next_buffer_size, infile)) > 0) {
    // read a buffer from current_pos to ftell(infile)
    Rcpp::checkUserInterrupt();
    int i = 0;
    while (i + line_length <= this_buffer_size) {
      a[l] = std::string(buffer + i, buffer + i + 81);
      i += 82;
      b[l] = std::string(buffer + i, buffer + i + 81);
      i += 82;
      l++;
    }

    if (i == 0) break;
    if (i != this_buffer_size) {
      // file pointer reset by i - this_buffer_size (offset to end of buffer)
      fseek(infile, i - this_buffer_size, SEEK_CUR);
    }
    // determine the next buffer size. If the buffer is too large, take only whats
    // needed
    current_pos = ftell(infile);
    next_buffer_size = file_size - current_pos > buffer_size
      ? buffer_size : file_size - current_pos;
  }

  free(buffer);
  fclose(infile);

  Rcpp::DataFrame df = Rcpp::DataFrame::create(
    Rcpp::Named("unsolved") = a,
    Rcpp::Named("solved") = b,
    Rcpp::Named("stringsAsFactors") = false
  );
  return df;
}

/*** R
tictoc::tic()
file <- "sudokus/sudoku_1m.csv"
raw <- read_to_df_buffer(file)
tictoc::toc()
#> 75.915 sec elapsed
*/

Using R's C API - 125 secs

Not sure why this is not faster, probably because my C code is not efficient... If you have any improvements, I'll gladly update the timings.

The mkChar() function creates a CHARSXP which can be inserted into a character vector (STRSXP). Note that most R strings are stored in a global cache (see also section 1.10 of R Internals); maybe some speedup could be gained by circumventing the cache, but I am not sure how to do that or whether it is wise at all...

Preferably, I would like to pre-allocate 1 million CHARSXPs of length 81, memcpy() the values from the C buffer, and SET_STRING_ELT() them into the vector. No idea how to do that, though; a partial step in that direction using mkCharLen() is sketched after the timings below.

See also:

  • https://cran.r-project.org/doc/manuals/r-release/R-ints.html
  • http://adv-r.had.co.nz/C-interface.html
  • https://github.com/hadley/r-internals/
read_to_list_sexp <- inline::cfunction(c(fname = "character"), '
  const char * filename = CHAR(asChar(fname));

  FILE* infile;
  infile = fopen(filename, "r");
  if (infile == NULL) error("File cannot be opened");

  fseek(infile, 0L, SEEK_END);
  int64_t file_size = ftell(infile);
  fseek(infile, 0L, SEEK_SET);

  const int n_lines = 1000000;

  SEXP uns = PROTECT(allocVector(STRSXP, n_lines));
  SEXP sol = PROTECT(allocVector(STRSXP, n_lines));

  char * line = NULL;
  size_t len = 0;
  ssize_t read;

  int l = 0;

  char char_array[82];
  char_array[81] = 0;
  // skip header
  read = getline(&line, &len, infile);

  while ((read = getline(&line, &len, infile)) != -1) {
    memcpy(char_array, line, 81);
    SET_STRING_ELT(uns, l, mkChar(char_array));

    memcpy(char_array, line + 82, 81);
    SET_STRING_ELT(sol, l, mkChar(char_array));

    l++;
    if (l == n_lines) break;
  }
  free(line);  // getline() allocates the line buffer; release it
  fclose(infile);

  SEXP res = PROTECT(allocVector(VECSXP, 2));

  SET_VECTOR_ELT(res, 0, uns);
  SET_VECTOR_ELT(res, 1, sol);

  UNPROTECT(3);
  return res;
')

file <- "sudokus/sudoku_1m.csv"
tictoc::tic()
a <- read_to_list_sexp(file)
df <- data.table::as.data.table(a)
tictoc::toc()
#> 125.514 sec elapsed
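For what it is worth, mkCharLen() builds a CHARSXP directly from a pointer and a length, which removes the intermediate char_array and the memcpy() calls from the loop above. A sketch of that variant (I have not timed it, so no claims about speed):

read_to_list_mkcharlen <- inline::cfunction(c(fname = "character"), '
  const char * filename = CHAR(asChar(fname));

  FILE* infile = fopen(filename, "r");
  if (infile == NULL) error("File cannot be opened");

  const int n_lines = 1000000;

  SEXP uns = PROTECT(allocVector(STRSXP, n_lines));
  SEXP sol = PROTECT(allocVector(STRSXP, n_lines));

  char * line = NULL;
  size_t len = 0;

  int l = 0;

  // skip header
  getline(&line, &len, infile);

  while (getline(&line, &len, infile) != -1) {
    // mkCharLen() needs no null terminator, so the line buffer
    // can be used directly at both offsets
    SET_STRING_ELT(uns, l, mkCharLen(line, 81));
    SET_STRING_ELT(sol, l, mkCharLen(line + 82, 81));

    l++;
    if (l == n_lines) break;
  }
  free(line);
  fclose(infile);

  SEXP res = PROTECT(allocVector(VECSXP, 2));
  SET_VECTOR_ELT(res, 0, uns);
  SET_VECTOR_ELT(res, 1, sol);

  UNPROTECT(3);
  return res;
')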
