Get the number of lines in a text file using R

I found this easy way using the R.utils package:

library(R.utils)
sapply(myfiles, countLines)

Here is how it works: countLines() scans the file in chunks and counts line separators, so the whole file is never loaded into memory at once.
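
A minimal self-contained sketch (the paths here are hypothetical; myfiles is just a character vector of file paths):

library(R.utils)

# hypothetical file paths; countLines() returns a line count per file
myfiles <- c("data1.csv", "data2.csv")
sapply(myfiles, countLines)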


If you:

  • still want to avoid the system call that a system2("wc"… call will cause
  • are on BSD/Linux or OS X (I didn't test the following on Windows)
  • don't mind using a full filename path
  • are comfortable using the inline package

then the following should be about as fast as you can get (it's pretty much the 'line count' portion of wc in an inline R C function):

library(inline)

wc.code <- "
uintmax_t linect = 0;
uintmax_t tlinect = 0;

int fd, len;
u_char *p;

struct statfs fsb;

/* reuse a buffer sized to the filesystem's preferred I/O block size */
static off_t buf_size = SMALL_BUF_SIZE;
static u_char small_buf[SMALL_BUF_SIZE];
static u_char *buf = small_buf;

PROTECT(f = AS_CHARACTER(f));

if ((fd = open(CHAR(STRING_ELT(f, 0)), O_RDONLY, 0)) >= 0) {

  /* fall back to the small buffer size if the filesystem can't be queried */
  if (fstatfs(fd, &fsb)) {
    fsb.f_iosize = SMALL_BUF_SIZE;
  }

  if (fsb.f_iosize != buf_size) {
    if (buf != small_buf) {
      free(buf);
    }
    if (fsb.f_iosize == SMALL_BUF_SIZE || !(buf = malloc(fsb.f_iosize))) {
      buf = small_buf;
      buf_size = SMALL_BUF_SIZE;
    } else {
      buf_size = fsb.f_iosize;
    }
  }

  /* read the file in filesystem-sized chunks */
  while ((len = read(fd, buf, buf_size))) {

    if (len == -1)
      break;  /* read error; the fd is closed once, below */

    /* count newline bytes in this chunk */
    for (p = buf; len--; ++p)
      if (*p == '\\n')
        ++linect;
  }

  tlinect += linect;

  (void)close(fd);

}
SEXP result;
PROTECT(result = NEW_INTEGER(1));
INTEGER(result)[0] = tlinect; /* truncated to int for the R return value */
UNPROTECT(2);
return(result);
";

setCMethod("wc",
           signature(f="character"), 
           wc.code,
           includes=c("#include <stdlib.h>", 
                      "#include <stdio.h>",
                      "#include <sys/param.h>",
                      "#include <sys/mount.h>",
                      "#include <sys/stat.h>",
                      "#include <ctype.h>",
                      "#include <err.h>",
                      "#include <errno.h>",
                      "#include <fcntl.h>",
                      "#include <locale.h>",
                      "#include <stdint.h>",
                      "#include <string.h>",
                      "#include <unistd.h>",
                      "#include <wchar.h>",
                      "#include <wctype.h>",
                      "#define SMALL_BUF_SIZE (1024 * 8)"),
           language="C",
           convention=".Call")

wc("FULLPATHTOFILE")

It'd be better as a package, since it has to compile the first time through. But it's here for reference if you really do need speed. For a 189,955-line file I had lying around, I get (mean values from a bunch of runs):

   user  system elapsed 
  0.007   0.003   0.010 

You can count the number of newline characters (\n) in a file; this also works for \r\n line endings on Windows, since each \r\n contains a \n. This will give you a correct answer iff:

  1. There is a newline char at the end of the last line (BTW, read.csv gives a warning if this doesn't hold); you can test this with the check sketched after this list
  2. The table does not contain a newline character in the data (e.g. within quotes)
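
To test condition 1, one hedged option is to inspect the file's final byte (seek() works on binary file connections; the filename matches the example below):

f <- "filename.csv"
con <- file(f, open = "rb")
seek(con, file.info(f)$size - 1)   # jump to the last byte
last <- readBin(con, "raw", 1)
close(con)
last == as.raw(10L)                # TRUE if the file ends with '\n'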

It will suffice to read the file in chunks. Below I use a chunk (temporary buffer) size of 65536 bytes:

f <- file("filename.csv", open = "rb")   # binary mode: no encoding conversion
nlines <- 0L
while (length(chunk <- readBin(f, "raw", 65536)) > 0) {
   nlines <- nlines + sum(chunk == as.raw(10L))   # count 0x0A ('\n') bytes
}
print(nlines)
close(f)
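
The same loop wrapped up as a reusable helper (a sketch; the function name is mine, and on.exit() ensures the connection is closed even on error):

count_nl <- function(path, chunk_size = 65536L) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  n <- 0L
  # stream the file chunk by chunk, counting newline bytes
  while (length(chunk <- readBin(con, "raw", chunk_size)) > 0) {
    n <- n + sum(chunk == as.raw(10L))
  }
  n
}

count_nl("filename.csv")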

Benchmarks on a ca. 512 MB ASCII text file with 12,101,000 lines, on Linux:

  • readBin: ca. 2.4 s.

  • @luis_js's wc-based solution: 0.1 s.

  • read.delim: 39.6 s.

  • EDIT: reading the file line by line with readLines: 32.0 s, using the loop below.

f <- file("/tmp/test.txt", open = "r")
nlines <- 0L
while (length(l <- readLines(f, 128)) > 0) nlines <- nlines + length(l)
close(f)
