Why is Keras LSTM on CPU three times faster than GPU?

I had a similar problem today and found two things that may be helpful to others (this is a regression problem on a data set with ~2.1MM rows, running on a machine with 4 P100 GPUs):

  1. Using the CuDNNLSTM layer instead of the LSTM layer on a GPU machine reduced the fit time from ~13500 seconds to ~400 seconds per epoch.
  2. Increasing the batch size (~500 to ~4700) reduced it to ~130 seconds per epoch.

Increasing the batch size increased the loss and val_loss, so you'll need to make a decision about the trade-offs you want to make; a sketch of both changes follows below.
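For reference, a minimal sketch of both changes, assuming a Keras 2.x setup with the TensorFlow backend; the unit count, window length, and dummy data below are placeholders, not the model described above:

import numpy as np
from keras.models import Sequential
from keras.layers import CuDNNLSTM, Dense

timesteps, n_features = 30, 16                                    # hypothetical window length and feature count
X_train = np.random.rand(10000, timesteps, n_features).astype('float32')
y_train = np.random.rand(10000, 1).astype('float32')              # dummy regression target

model = Sequential()
# CuDNNLSTM is a drop-in replacement for LSTM here; it only runs on a GPU with the TF backend.
model.add(CuDNNLSTM(128, input_shape=(timesteps, n_features)))
model.add(Dense(1))
model.compile(loss='mse', optimizer='adam')

# A larger batch keeps the GPU busier; ~500 -> ~4700 was the change described in point 2.
model.fit(X_train, y_train, epochs=10, batch_size=4700, validation_split=0.1)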


Guessing it's just a different, better implementation and, if the implementation is different, you shouldn't expect identical results.

In general, efficiently implementing an algorithm on a GPU is hard, and getting maximum performance requires architecture-specific implementations. Therefore, it wouldn't be surprising if an implementation specific to Nvidia's GPUs outperformed a general GPU implementation. It also wouldn't be surprising that Nvidia would sink significantly more resources into accelerating their code for their GPUs than would a team working on a general LSTM implementation.

The other possibility is that the data type used on the backend has changed from double- to single- or even half-precision float. The smaller data types mean you can crunch more numbers faster at the cost of accuracy. For NN applications this is often acceptable because no individual number needs to be especially accurate for the net to produce acceptable results.
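If you want to check which precision the Keras backend is actually using (this is an assumption about how you'd investigate it, not something the answer above verifies), the backend exposes the default float type:

from keras import backend as K

print(K.floatx())        # Keras defaults to 'float32', i.e. single precision

# Optionally force a lower precision before building the model; accuracy may suffer.
K.set_floatx('float16')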


In Keras, the fast LSTM implementation is the cuDNN-backed CuDNNLSTM layer:

from keras.layers import CuDNNLSTM  # available in Keras 2.x
model.add(CuDNNLSTM(units, input_shape=(X_train.shape[1], X_train.shape[2]), return_sequences=True))

It can only be run on the GPU with the TensorFlow backend.
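If the same script also has to run on CPU-only machines, one option (a sketch, assuming a TensorFlow 1.x backend where tf.test.is_gpu_available() exists) is to pick the layer class at runtime and fall back to the plain LSTM:

import numpy as np
import tensorflow as tf
from keras.models import Sequential
from keras.layers import LSTM, CuDNNLSTM

X_train = np.random.rand(1000, 30, 16).astype('float32')  # dummy (samples, timesteps, features)
units = 64                                                 # placeholder layer size

# Use the cuDNN-backed layer only when a GPU is actually visible (TF 1.x API).
RNN = CuDNNLSTM if tf.test.is_gpu_available() else LSTM

model = Sequential()
model.add(RNN(units, input_shape=(X_train.shape[1], X_train.shape[2]), return_sequences=True))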