How can I speed up line by line reading of an ASCII file? (C++)

The C++ and C libraries read stuff off the disk equally fast and are already buffered to compensate for the disk I/O lag. You are not going to make it faster by adding more buffering.

The biggest difference is that C++ streams do a load of manipulation based on the locale: character conversions, punctuation, etc.

As a result the C libraries will be faster.
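For reference, the two styles of line-by-line reading being compared look roughly like this (a minimal sketch; the file name, buffer size, and function names are placeholders, not from the question):

#include <cstdio>
#include <fstream>
#include <string>

// C stdio: read line by line with fgets into a fixed buffer.
void read_lines_c(char const* filename)
{
    FILE* file = fopen(filename, "r");
    if (file == NULL) { return; }

    char line[4096];
    while ( fgets(line, sizeof(line), file) != NULL )
    {
        // ... process line ...
    }

    fclose(file);
}

// C++ streams: read line by line with std::getline into a std::string.
void read_lines_cpp(char const* filename)
{
    std::ifstream file(filename);

    std::string line;
    while ( std::getline(file, line) )
    {
        // ... process line ...
    }
}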

Replaced Dead Link

For some reason the linked question was deleted. So I am moving the relevant information here. The linked question was about hidden features in C++.


Though not technically part of the STL, the streams library is part of the standard C++ library.

For streams:

Locales.

Very few people actually bother to learn how to correctly set and/or manipulate the locale of a stream.

The second coolest thing is the iterator templates.
Most useful for me are the stream iterators, which basically turn streams into very basic containers that can then be used in conjunction with the standard algorithms.
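For example, here is a small sketch (the file name is only an illustration) of a stream iterator feeding a standard algorithm:

#include <algorithm>
#include <fstream>
#include <iostream>
#include <iterator>
#include <string>
#include <vector>

int main()
{
    std::ifstream file("dictionary.txt");

    // The stream behaves like a sequence of whitespace-separated words,
    // which a standard algorithm can consume directly.
    std::vector<std::string> words;
    std::copy(std::istream_iterator<std::string>(file),
              std::istream_iterator<std::string>(),
              std::back_inserter(words));

    std::cout << words.size() << " words read\n";
    return 0;
}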

Examples:

  • Did you know that locales will change the '.' in a decimal number to any other character automatically?
  • Did you know that locales will add a ',' every third digit to make numbers easier to read? (A small sketch of both of these follows the list of examples below.)
  • Did you know that locales can be used to manipulate the text on the way through (e.g. conversion from UTF-16 to UTF-8 when writing to a file)?

etc.

Examples:

  • Adding comma for every three digits
  • Using space as the separator
  • Set the decimal separator
  • Simple output filter
  • Set the current locale
  • Count number of characters sent to output
  • Indent every line
  • UTF-16 (stream) -> UTF-16 (Internal) Converter (untested)
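As a rough illustration of the decimal-separator and grouping points above (this is my own sketch, not one of the original linked examples), a custom std::numpunct facet imbued into a stream could look like this:

#include <iostream>
#include <locale>
#include <string>

// A numpunct facet that uses ',' as the decimal point and groups
// the integer digits in threes with '.' -- purely for illustration.
struct MyPunct : std::numpunct<char>
{
    char do_decimal_point() const   { return ','; }
    char do_thousands_sep() const   { return '.'; }
    std::string do_grouping() const { return "\3"; }
};

int main()
{
    std::cout.imbue(std::locale(std::cout.getloc(), new MyPunct));

    std::cout << 1234567 << "\n";               // prints 1.234.567
    std::cout << std::fixed << 1234.5 << "\n";  // prints 1.234,500000
    return 0;
}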

Normally, you could get better performance by increasing the buffer size.

Right after building the ifstream, you can set its internal buffer using:

char LocalBuffer[4096]; // buffer

std::ifstream wordListFile("dictionary.txt");

wordListFile.rdbuf()->pubsetbuf(LocalBuffer, 4096);

Note: rdbuf's result is guaranteed not to be null if the construction of the ifstream succeeded.

Depending on the memory available, you are strongly encouraged to grow the buffer if possible in order to limit interaction with the HDD and the number of system calls.

I've performed some simple measurements using a little benchmark of my own; you can find the code below (and I am interested in critiques):

gcc 3.4.2 on SLES 10 (sp 3)
C : 9.52725e+06
C++: 1.11238e+07
difference: 1.59655e+06

Which gives a slowdown of a whopping 17%.

This takes into account:

  • automatic memory management (no buffer overflow)
  • automatic resources management (no risk to forget to close the file)
  • handling of locale

So, we can argue that streams are slow... but please, don't just throw in a random piece of code and complain it's slow; optimization is hard work.


Corresponding code, where benchmark is a little utility of my own which measures the time of a repeated execution (here launched for 50 iterations) using gettimeofday.

#include <fstream>
#include <iostream>
#include <iomanip>

#include <cmath>
#include <cstdio>
#include <cstdlib>
#include <string>

#include "benchmark.h"

struct CRead
{
  CRead(char const* filename): _filename(filename) {}

  void operator()()
  {
    FILE* file = fopen(_filename, "r");
    if (file == NULL) { return; }

    int count = 0;
    // note: "%s" has no field width, so an over-long word could overflow _buffer
    while ( fscanf(file, "%s", _buffer) == 1 ) { ++count; }

    fclose(file);
  }

  char const* _filename;
  char _buffer[1024];
};

struct CppRead
{
  CppRead(char const* filename): _filename(filename), _buffer() {}

  enum { BufferSize = 16184 };

  void operator()()
  {
    std::ifstream file(_filename);
    // enlarge the internal buffer right after construction, before any reads
    file.rdbuf()->pubsetbuf(_buffer, BufferSize);

    int count = 0;
    std::string s;
    while ( file >> s ) { ++count; }
  }

  char const* _filename;
  char _buffer[BufferSize];
};


int main(int argc, char* argv[])
{
  size_t iterations = 1;
  if (argc > 1) { iterations = atoi(argv[1]); }

  char const* filename = "largefile.txt";

  CRead cread(filename);
  CppRead cppread(filename);

  double ctime = benchmark(cread, iterations);
  double cpptime = benchmark(cppread, iterations);

  std::cout << "C  : " << ctime << "\n"
               "C++: " << cpptime << "\n";

  return 0;
}
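The benchmark.h utility itself is not shown; here is a minimal sketch of what such a helper might look like, assuming it returns the total elapsed wall-clock time in microseconds measured with gettimeofday (a hypothetical reconstruction, not the original):

// benchmark.h -- hypothetical reconstruction, not the original utility.
#ifndef BENCHMARK_H
#define BENCHMARK_H

#include <cstddef>
#include <sys/time.h>

// Runs f() `iterations` times and returns the elapsed wall-clock time
// in microseconds, measured with gettimeofday.
template <typename Functor>
double benchmark(Functor& f, std::size_t iterations)
{
    timeval start, stop;
    gettimeofday(&start, NULL);

    for (std::size_t i = 0; i != iterations; ++i) { f(); }

    gettimeofday(&stop, NULL);

    return (stop.tv_sec  - start.tv_sec) * 1e6
         + (stop.tv_usec - start.tv_usec);
}

#endif // BENCHMARK_H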

Quick profiling on my system (linux-2.6.37, gcc-4.5.2, compiled with -O3) shows that I/O is not the bottleneck. Whether using fscanf into a char array followed by dict.insert() or operator>> as in your exact code, it takes about the same time (155 - 160 ms to read a 240k word file).

Replacing gcc's std::unordered_set with std::vector<std::string> in your code drops the execution time to 45 ms (fscanf) - 55 ms (operator>>) for me. Try profiling I/O and set insertion separately.
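One way to do that separation (a sketch only; the file name and containers mirror the question, and it needs a compiler that provides std::unordered_set, e.g. GCC with -std=c++0x):

#include <sys/time.h>

#include <fstream>
#include <iostream>
#include <string>
#include <unordered_set>
#include <vector>

static double now_us()
{
    timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec * 1e6 + tv.tv_usec;
}

int main()
{
    // Phase 1: pure I/O -- read every word into a vector.
    double t0 = now_us();
    std::ifstream file("dictionary.txt");
    std::vector<std::string> words;
    std::string word;
    while ( file >> word ) { words.push_back(word); }
    double t1 = now_us();

    // Phase 2: pure insertion -- build the set from the in-memory words.
    std::unordered_set<std::string> dict(words.begin(), words.end());
    double t2 = now_us();

    std::cout << "read:   " << (t1 - t0) / 1000 << " ms\n"
              << "insert: " << (t2 - t1) / 1000 << " ms\n";
    return 0;
}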


Reading the whole file in one go into memory and then operating on it would probably be faster, as it avoids repeatedly going back to the disk to read another chunk.
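A common way to do that (a sketch, not taken from any of the answers above) is to slurp the file into a single std::string and then parse the in-memory copy:

#include <fstream>
#include <sstream>
#include <string>

// Reads the whole file into memory in one go; parsing then happens on
// the in-memory copy instead of hitting the stream chunk by chunk.
std::string slurp(char const* filename)
{
    std::ifstream file(filename, std::ios::binary);
    std::ostringstream contents;
    contents << file.rdbuf();   // single bulk copy of the stream buffer
    return contents.str();
}

The result can then be wrapped in a std::istringstream and parsed with >> or std::getline exactly as before.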

Is 0.25s actually a problem? If you're not intending to load much larger files, is there any need to make it faster if that makes the code less readable?