Parse very large CSV files with C++

I am assuming you are using only one thread.

Multithreading can speed up your processing.

Your best result so far is 40 seconds, so let's stick to that.

I have assumed that you first read the whole file and then process it: about 7 seconds for reading, 33 seconds for processing.

First of all, you can divide your file into chunks of, say, 50 MB. That means you can start processing after reading the first 50 MB of the file; you do not need to wait until the whole file is read. Reading one chunk takes about 0.35 seconds (so now it is 0.35 + 33 seconds of processing, roughly 34 seconds in total).

When you use multithreading, you can process multiple chunks at a time. That can theoretically speed up processing by up to the number of your cores. Let's say you have 4 cores: that's 33/4 ≈ 8.25 seconds.

So I think with 4 cores you can get the total down to about 9 seconds.

Look at QThreadPool and QRunnable, or QtConcurrent. I would prefer QThreadPool.

Divide the task into parts:

  1. First, try to loop over the file and divide it into chunks, doing nothing else with them yet.
  2. Then create a "ChunkProcessor" class which can process one such chunk.
  3. Make "ChunkProcessor" a subclass of QRunnable (and of QObject, since QRunnable by itself cannot emit signals) and execute your processing in the reimplemented run() function.
  4. When you have chunks and a QThreadPool-compatible class that can process them, you can pass instances into QThreadPool::start().

It could look like this

loopoverfile {
  whenever chunk is ready {
     ChunkProcessor *chunkprocessor = new ChunkProcessor(chunk);
     // connect before start() so the finished() signal cannot be missed
     connect(chunkprocessor, SIGNAL(finished(std::shared_ptr<ProcessedData>)), this, SLOT(readingFinished(std::shared_ptr<ProcessedData>)));
     QThreadPool::globalInstance()->start(chunkprocessor);
  }
}

You can use std::shared_ptr to pass the processed data, so you do not need QMutex or anything similar, and you avoid serialization problems caused by multiple threads accessing the same resource.

Note: in order to use a custom type in signal/slot arguments across threads, you have to register it before use:

qRegisterMetaType<std::shared_ptr<ProcessedData>>("std::shared_ptr<ProcessedData>");

Edit (based on discussion; my answer was not clear about this): It does not matter which disk you use or how fast it is. Reading is a single-threaded operation. This solution was suggested only because the read took 7 seconds, and again, it does not matter which disk it is; 7 seconds is what counts. The only purpose is to start processing as soon as possible, rather than waiting until reading is finished.

You can use:

QByteArray data = file.readAll();

Or you can use the principal idea (I do not know why it takes 7 seconds to read, or what is behind it):

 QFile file("in.txt");
 if (!file.open(QIODevice::ReadOnly | QIODevice::Text))
   return;

 QByteArray* data = new QByteArray;
 int count = 0;
 while (!file.atEnd()) {
   ++count;
   data->append(file.readLine());
   if ( count > 10000 ) {
     ChunkProcessor *chunkprocessor = new ChunkProcessor(data); // takes ownership of data
     // connect before start() so the finished() signal cannot be missed
     connect(chunkprocessor, SIGNAL(finished(std::shared_ptr<ProcessedData>)), this, SLOT(readingFinished(std::shared_ptr<ProcessedData>)));
     QThreadPool::globalInstance()->start(chunkprocessor);
     data = new QByteArray;
     count = 0;
   }
 }
 // do not forget the last, partially filled chunk
 if (!data->isEmpty()) {
   ChunkProcessor *chunkprocessor = new ChunkProcessor(data);
   connect(chunkprocessor, SIGNAL(finished(std::shared_ptr<ProcessedData>)), this, SLOT(readingFinished(std::shared_ptr<ProcessedData>)));
   QThreadPool::globalInstance()->start(chunkprocessor);
 } else {
   delete data;
 }

One file, one thread, reading almost as fast as reading line by line "without" interruption. What you do with the data is another problem, but it has nothing to do with I/O; the data is already in memory. The only remaining concern would be a 5 GB file and the amount of RAM on the machine.

It is a very simple solution: all you need is to subclass QRunnable, reimplement the run() function, emit a signal when it is finished, pass the processed data using a shared pointer, and in the main thread join that data into one structure or whatever you need. A simple, thread-safe solution.


I would propose a multi-threaded suggestion with a slight variation: one thread is dedicated to reading the file in chunks of a predefined (configurable) size, and it keeps feeding data to a set of processing threads (more than one, based on CPU cores). Let us say that the configuration looks like this:

chunk size = 50 MB
Disk Thread = 1
Process Threads = 5

  1. Create a class for reading data from the file. It holds a data structure used to communicate with the process threads; for example, this structure contains the starting and ending offset of the read buffer for each process thread. For reading the file data, the reader class holds 2 buffers, each of chunk size (50 MB in this case).
  2. Create a process class which holds (shared) pointers to the read buffers and to the offsets data structure.
  3. Now create a driver (probably the main thread) which creates all the threads, waits for their completion, and handles the signals.
  4. The reader thread is invoked with the reader class, reads 50 MB of data, and creates an offsets data structure object based on the number of threads. In this case t1 handles 0 - 10 MB, t2 handles 10 - 20 MB, and so on. Once ready, it notifies the processor threads. It then immediately reads the next chunk from disk and waits for the completion notification from the process threads.
  5. On notification, the processor threads read data from the buffer and process it. Once done, each notifies the reader thread of its completion and waits for the next chunk.
  6. This continues until the whole file is read and processed. Then the reader thread notifies the main thread about completion, which sends PROCESS_COMPLETION, upon which all threads exit; or the main thread chooses to process the next file in the queue.

Note that the offsets are used for ease of explanation; mapping offsets to line delimiters needs to be handled programmatically.