Fast and efficient character DataFrame creation in Rcpp

Thanks for making a snapshot of the data available. (BTW, no point tar'ing a single file; you could just have xz'ed the csv file. Anyway.)

I get different results on my Ubuntu 20.04 box, which are closer to what I anticipated:

  • data.table::fread() is competitive as we expected (I am running data.table from git as there was a regression in the most recent release)
  • vroom and stringfish, once we force materialization to compare apples to apples rather than images of apples, are about the same
  • Rcpp is in the ballpark too but a little more variable

I capped it at 10 runs; the variability would probably come down with more runs, but caching influences it too.

In short: no clear winners, and surely no mandate to replace one of the (already known to be well-tuned) reference implementations.

edd@rob:~/git/stackoverflow/65043010(master)$ Rscript bm.R
Unit: seconds
  expr     min      lq    mean  median      uq     max neval cld
 fread 1.37294 1.51211 1.54004 1.55138 1.57639 1.62939    10   a
 vroom 1.44670 1.53659 1.62104 1.61172 1.61764 1.88921    10   a
 sfish 1.21609 1.57000 1.57635 1.60180 1.63933 1.72975    10   a
 rcpp1 1.44111 1.45354 1.61275 1.55190 1.60535 2.15847    10   a
 rcpp2 1.47902 1.57970 1.75067 1.60114 1.64857 2.75851    10   a
edd@rob:~/git/stackoverflow/65043010(master)$ 

Code for top-level script

suppressMessages({
    library(data.table)
    library(Rcpp)
    library(vroom)
    library(stringfish)
    library(microbenchmark)
})

vroomread <- function(csvfile) {
    a <- vroom(csvfile, col_types = "cc", progress = FALSE)
    vroom:::vroom_materialize(a, TRUE)  # force the ALTREP columns into regular ones
}
sfread <- function(csvfile) {
    a <- sf_readLines(csvfile)
    a <- a[-1]  # drop the header line so row 1 is real data
    data.table::data.table(uns = sf_substr(a, 1, 81),
                           sol = sf_substr(a, 83, 163))
}

sourceCpp("rcppfuncs.cpp")


csvfile <- "sudoku_100k.csv"
microbenchmark(fread=fread(csvfile),
               vroom=vroomread(csvfile),
               sfish=sfread(csvfile),
               rcpp1=setalloccol(read_to_df_ifstream(csvfile)),
               rcpp2=setalloccol(read_to_df_ifstream_charvector(csvfile)),
               times=10)

Code for Rcpp script

#include <Rcpp.h>
#include <fstream>

//[[Rcpp::export]]
Rcpp::DataFrame read_to_df_ifstream(std::string filename) {
  const int n_lines = 1000000;
  std::ifstream file(filename, std::ifstream::in);

  std::string line;
  // burn the header
  std::getline(file, line);

  std::vector<std::string> a, b;
  a.reserve(n_lines);
  b.reserve(n_lines);

  while (std::getline(file, line)) {
    a.push_back(line.substr(0, 81));   // substr(pos, count): 81 unsolved chars
    b.push_back(line.substr(82, 81));  // 81 solved chars after the comma
  }

  Rcpp::List df(2);
  df.names() = Rcpp::CharacterVector::create("unsolved", "solved");

  df["unsolved"] = a;
  df["solved"] = b;

  df.attr("class") = Rcpp::CharacterVector::create("data.table", "data.frame");

  return df;
}

//[[Rcpp::export]]
Rcpp::DataFrame read_to_df_ifstream_charvector(std::string filename) {
  const int n_lines = 1000000;
  std::ifstream file(filename, std::ifstream::in);

  std::string line;
  // burn the header
  std::getline(file, line);

  Rcpp::CharacterVector a(n_lines), b(n_lines);

  int l = 0;
  while (std::getline(file, line)) {
    a(l) = line.substr(0, 81);   // substr(pos, count), not (pos, end)
    b(l) = line.substr(82, 81);
    l++;
  }

  Rcpp::List df(2);
  df.names() = Rcpp::CharacterVector::create("unsolved", "solved");

  df["unsolved"] = a;
  df["solved"] = b;

  df.attr("class") = Rcpp::CharacterVector::create("data.table", "data.frame");

  return df;
}

This is not really a proper answer to my question, rather some thoughts I did not want to go to waste, plus some benchmarks. Maybe useful to someone who faces a similar issue.

To recall, the basic idea is to read 1 million rows of two 81-character strings into an R object (preferably a data.frame, data.table, or tibble). For the benchmarks I have used the 1 million sudoku dataset by Kyubyong Park.
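For reference, this is the layout that all the hard-coded offsets below rely on: an 18-byte header line, then fixed-width rows of 81 characters, a comma, and another 81 characters (164 bytes per line including the newline). Schematically:

quizzes,solutions
<81 chars unsolved>,<81 chars solved>
<81 chars unsolved>,<81 chars solved>
...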

I structured the answer into two parts: 1) using other R packages and 2) using Rcpp/C++ and C to work at a lower level.

Surprisingly, for character data, specialised packages such as stringi, stringfish, or vroom are really efficient and beat (my) lower-level C++/C code.

Important to note is that some packages use ALTREP (see for example François' take on them here), which means that the data is not materialized in R until needed. I.e., loading the data using vroom takes less than 1 second, but the first operations (which need to materialize the data) take much longer... To work around this, I either force materialization by putting the data into a data.table or use an internal function of vroom to force it.
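A minimal sketch to make the deferral visible (same file as below; nchar() is just a convenient way to touch every element and thereby force the column):

file <- "sudokus/sudoku_1m.csv"
tictoc::tic()
a <- vroom::vroom(file, col_types = "cc", progress = FALSE)
tictoc::toc()  # well under a second: the strings are not materialized yet

tictoc::tic()
invisible(nchar(a[[1]]))  # touching every element forces materialization
tictoc::toc()  # this is where the bulk of the time goes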

1) R packages

data.table and fread - 75 secs

Mainly as a baseline. (The 75 seconds are probably inflated by the fread regression in the then-current data.table release that the other answer mentions.)

file <- "sudokus/sudoku_1m.csv"
tictoc::tic()
dt <- data.table::fread(file, colClasses = "character")
tictoc::toc()
#> 75.296 sec elapsed

Vroom with materialization - 19 secs

Note that vroom uses ALTREP, forcing materialization to level the playing field!

file <- "sudokus/sudoku_1m.csv"
tictoc::tic()
a <- vroom::vroom(file, col_types = "cc", progress = FALSE)
# internal function that materializes the ALTREP data
df <- vroom:::vroom_materialize(a, TRUE)
tictoc::toc()
#> 19.926 sec elapsed
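Alternatively, without reaching for the internal function, putting the columns into a data.table forces materialization as well. A sketch (I am assuming the column names quizzes and solutions from the original csv):

a <- vroom::vroom(file, col_types = "cc", progress = FALSE)
dt <- data.table::data.table(uns = a$quizzes, sol = a$solutions)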

Stringfish - 19 secs

Stringfish uses ALTREP, so reading the data and taking the substrings takes less than one second. Materialization takes the rest, similar to vroom.

library(stringfish)
file <- "sudokus/sudoku_1m.csv"
tictoc::tic()
a <- sf_readLines(file)
a <- a[-1]  # drop the header line, as in the stringi version below

dt <- data.table::data.table(
  uns = sf_substr(a, 1, 81),
  sol = sf_substr(a, 83, 163)
)
tictoc::toc()
#> 19.698 sec elapsed

Stringi - 22 secs

Note that the conversion to data.table takes virtually no time.

tictoc::tic()
a <- stringi::stri_read_lines(file)
# discard header
a <- a[-1]

dt <- data.table::data.table(
  uns = stringi::stri_sub(a, 1, 81),
  sol = stringi::stri_sub(a, 83, 163)
)
tictoc::toc() 
#> 22.409 sec elapsed

2) C and C++ functions

Rcpp with ifstream read to STL first - 22 secs

//[[Rcpp::export]]
Rcpp::DataFrame read_to_df_ifstream(std::string filename) {
  const int n_lines = 1000000;
  std::ifstream file(filename);

  std::string line;
  // burn the header
  std::getline(file, line);

  std::vector<std::string> a, b;
  a.reserve(n_lines);
  b.reserve(n_lines);

  while (std::getline(file, line)) {
    a.push_back(line.substr(0, 81));   // substr(pos, count): 81 unsolved chars
    b.push_back(line.substr(82, 81));  // 81 solved chars after the comma
  }

  Rcpp::List df(2);
  df.names() = Rcpp::CharacterVector::create("unsolved", "solved");

  df["unsolved"] = a;
  df["solved"] = b;

  df.attr("class") = Rcpp::CharacterVector::create("data.table", "data.frame");

  return df;
}

/*** R
tictoc::tic()
file <- "sudokus/sudoku_1m.csv"
raw <- read_to_df_ifstream(file)
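# setalloccol() gives the manually-classed list the over-allocated
# column slots a genuine data.table carries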
dt <- data.table::setalloccol(raw)
tictoc::toc()
#> 22.098 sec elapsed
*/

Rcpp with ifstream read directly to Rcpp::CharacterVector - 21 secs

//[[Rcpp::export]]
Rcpp::DataFrame read_to_df_ifstream_charvector(std::string filename) {
  const int n_lines = 1000000;
  std::ifstream file(filename);

  std::string line;
  // burn the header
  std::getline(file, line);

  Rcpp::CharacterVector a(n_lines), b(n_lines);

  int l = 0;
  while (std::getline(file, line)) {
    a(l) = line.substr(0, 81);   // substr(pos, count), not (pos, end)
    b(l) = line.substr(82, 81);
    l++;
  }

  Rcpp::List df(2);
  df.names() = Rcpp::CharacterVector::create("unsolved", "solved");

  df["unsolved"] = a;
  df["solved"] = b;

  df.attr("class") = Rcpp::CharacterVector::create("data.table", "data.frame");

  return df;
}

/*** R
tictoc::tic()
file <- "sudokus/sudoku_1m.csv"
raw <- read_to_df_ifstream_charvector(file)
dt <- data.table::setalloccol(raw)
tictoc::toc()
#> 21.436 sec elapsed
*/

Rcpp with buffer - 75 secs

This is basically the initial approach I chose, as outlined in the question above. I am not really sure why it is slower than the others...

//[[Rcpp::export]]
Rcpp::DataFrame read_to_df_buffer(std::string filename) {
  const int max_buffer_size = 1e8;
  const int header_size = 18; // only fixed in this example...
  const int n_lines = 1000000;

  FILE* infile;
  infile = fopen(filename.c_str(), "r");
  if (infile == NULL) Rcpp::stop("File Error!\n");

  fseek(infile, 0L, SEEK_END);
  int64_t file_size = ftell(infile);
  fseek(infile, 0L, SEEK_SET);

  // initiate the buffers
  char* buffer;
  int64_t buffer_size = sizeof(char) * max_buffer_size > file_size
    ? file_size : max_buffer_size;
  buffer = (char*) malloc(buffer_size);

  // skip the header...
  int64_t this_buffer_size = fread(buffer, 1, header_size, infile);

  // a holds the first part (quizzes or unsolved) b holds solution/solved
  std::vector<std::string> a, b;
  a.resize(n_lines);
  b.resize(n_lines);

  const int line_length = 2 * 82; // 2 times 81 digits plus one , or newline
  int l = 0;
  // fill the buffer
  int current_pos = ftell(infile);
  int next_buffer_size = file_size - current_pos > buffer_size
    ? buffer_size : file_size - current_pos;

  while ((this_buffer_size = fread(buffer, 1, next_buffer_size, infile)) > 0) {
    // read a buffer from current_pos to ftell(infile)
    Rcpp::checkUserInterrupt();
    int i = 0;
    while (i + line_length <= this_buffer_size) {
      a[l] = std::string(buffer + i, buffer + i + 81);
      i += 82;
      b[l] = std::string(buffer + i, buffer + i + 81);
      i += 82;
      l++;
    }

    if (i == 0) break;
    if (i != this_buffer_size) {
      // file pointer reset by i - this_buffer_size (offset to end of buffer)
      fseek(infile, i - this_buffer_size, SEEK_CUR);
    }
    // determine the next buffer size. If the buffer is too large, take only whats
    // needed
    current_pos = ftell(infile);
    next_buffer_size = file_size - current_pos > buffer_size
      ? buffer_size : file_size - current_pos;
  }

  free(buffer);
  fclose(infile);

  Rcpp::DataFrame df = Rcpp::DataFrame::create(
    Rcpp::Named("unsolved") = a,
    Rcpp::Named("solved") = b,
    Rcpp::Named("stringsAsFactors") = false
  );
  return df;
}

/*** R
tictoc::tic()
file <- "sudokus/sudoku_1m.csv"
raw <- read_to_df_buffer(file)
tictoc::toc()
#> 75.915 sec elapsed
*/

Using R's C API - 125 secs

Not sure why this is not faster, probably because my C code is not efficient... If you have any improvements, I'll gladly update the timings.

The mkChar() function creates a CHARSXP which can be inserted into a character vector (STRSXP). Note that most R strings are stored in a global cache (see also section 1.10 of R Internals); maybe some speedup could be gained by circumventing the cache, but I am not sure how to do that or whether it is wise at all...

Preferably, I would like to pre-allocate 1 million CHARSXPs of length 81, memcpy() the values from the C buffer, and SET_STRING_ELT() them into the vector. No idea how to do that, though; a partial step in that direction using mkCharLen() is sketched after the timings below.

See also:

  • https://cran.r-project.org/doc/manuals/r-release/R-ints.html
  • http://adv-r.had.co.nz/C-interface.html
  • https://github.com/hadley/r-internals/
read_to_list_sexp <- inline::cfunction(c(fname = "character"), '
  const char * filename = CHAR(asChar(fname));

  FILE* infile;
  infile = fopen(filename, "r");
  if (infile == NULL) error("File cannot be opened");

  fseek(infile, 0L, SEEK_END);
  int64_t file_size = ftell(infile);
  fseek(infile, 0L, SEEK_SET);

  const int n_lines = 1000000;

  SEXP uns = PROTECT(allocVector(STRSXP, n_lines));
  SEXP sol = PROTECT(allocVector(STRSXP, n_lines));

  char * line = NULL;
  size_t len = 0;
  ssize_t read;

  int l = 0;

  char char_array[82];
  char_array[81] = 0;
  // skip header
  read = getline(&line, &len, infile);

  while ((read = getline(&line, &len, infile)) != -1) {
    memcpy(char_array, line, 81);
    SET_STRING_ELT(uns, l, mkChar(char_array));

    memcpy(char_array, line + 82, 81);
    SET_STRING_ELT(sol, l, mkChar(char_array));

    l++;
    if (l == n_lines) break;
  }
  free(line);  // getline() allocates the line buffer; release it
  fclose(infile);

  SEXP res = PROTECT(allocVector(VECSXP, 2));

  SET_VECTOR_ELT(res, 0, uns);
  SET_VECTOR_ELT(res, 1, sol);

  UNPROTECT(3);
  return res;
')

file <- "sudokus/sudoku_1m.csv"
tictoc::tic()
a <- read_to_list_sexp(file)
df <- data.table::as.data.table(a)
tictoc::toc()
#> 125.514 sec elapsed
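For what it is worth, mkCharLen() builds a CHARSXP directly from a pointer and a length, which removes the intermediate char_array and the memcpy() calls from the loop above. A sketch of that variant (I have not timed it, so no claims about speed):

read_to_list_mkcharlen <- inline::cfunction(c(fname = "character"), '
  const char * filename = CHAR(asChar(fname));

  FILE* infile = fopen(filename, "r");
  if (infile == NULL) error("File cannot be opened");

  const int n_lines = 1000000;

  SEXP uns = PROTECT(allocVector(STRSXP, n_lines));
  SEXP sol = PROTECT(allocVector(STRSXP, n_lines));

  char * line = NULL;
  size_t len = 0;

  int l = 0;

  // skip header
  getline(&line, &len, infile);

  while (getline(&line, &len, infile) != -1) {
    // mkCharLen() needs no null terminator, so the line buffer
    // can be used directly at both offsets
    SET_STRING_ELT(uns, l, mkCharLen(line, 81));
    SET_STRING_ELT(sol, l, mkCharLen(line + 82, 81));

    l++;
    if (l == n_lines) break;
  }
  free(line);
  fclose(infile);

  SEXP res = PROTECT(allocVector(VECSXP, 2));
  SET_VECTOR_ELT(res, 0, uns);
  SET_VECTOR_ELT(res, 1, sol);

  UNPROTECT(3);
  return res;
')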
