Keras inconsistent prediction time

While I can't explain the inconsistencies in execution time, I can recommend that you try to convert your model to TensorFlow Lite to speed up predictions on single data records or small batches.

I ran a benchmark on this model:

import tensorflow as tf

model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(384, activation='elu', input_shape=(256,)),
    tf.keras.layers.Dense(384, activation='elu'),
    tf.keras.layers.Dense(256, activation='elu'),
    tf.keras.layers.Dense(128, activation='elu'),
    tf.keras.layers.Dense(32, activation='tanh')
])

The prediction times for single records were:

  1. model.predict(input): 18 ms
  2. model(input): 1.3 ms
  3. Model converted to TensorFlow Lite: 43 µs

The time to convert the model was 2 seconds.
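Per-call timings like the ones above can be gathered with a small harness. The following is only a minimal sketch, not the benchmark code actually used for these numbers; `time_call` and its `warmup` and `repeats` parameters are hypothetical names introduced for illustration.

```python
import statistics
import time


def time_call(fn, *args, warmup=10, repeats=1000):
    """Return the median wall-clock time of fn(*args), in seconds.

    Hypothetical helper for illustration; not part of Keras or TensorFlow.
    """
    # Warm-up calls let one-off costs (tracing, caching, allocation) settle
    for _ in range(warmup):
        fn(*args)

    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        samples.append(time.perf_counter() - start)

    # The median is less sensitive to occasional spikes than the mean
    return statistics.median(samples)
```

Assuming a Keras model `model` and a single-record input `x`, you would compare `time_call(model.predict, x)` with `time_call(model, x)`, and do the same for the predict method of the converted model.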

The class below shows how to convert and use the model and provides a predict method like the Keras model. Note that it would need to be modified for use with models that don’t just have a single 1-D input and a single 1-D output.

import numpy as np
import tensorflow as tf


class LiteModel:

    @classmethod
    def from_file(cls, model_path):
        return cls(tf.lite.Interpreter(model_path=model_path))

    @classmethod
    def from_keras_model(cls, kmodel):
        converter = tf.lite.TFLiteConverter.from_keras_model(kmodel)
        tflite_model = converter.convert()
        return cls(tf.lite.Interpreter(model_content=tflite_model))

    def __init__(self, interpreter):
        self.interpreter = interpreter
        # Tensors must be allocated before they can be set or read
        self.interpreter.allocate_tensors()
        input_det = self.interpreter.get_input_details()[0]
        output_det = self.interpreter.get_output_details()[0]
        self.input_index = input_det["index"]
        self.output_index = output_det["index"]
        self.input_shape = input_det["shape"]
        self.output_shape = output_det["shape"]
        self.input_dtype = input_det["dtype"]
        self.output_dtype = output_det["dtype"]

    def predict(self, inp):
        inp = inp.astype(self.input_dtype)
        count = inp.shape[0]
        out = np.zeros((count, self.output_shape[1]), dtype=self.output_dtype)
        for i in range(count):
            self.interpreter.set_tensor(self.input_index, inp[i:i+1])
            self.interpreter.invoke()  # run inference on the record just set
            out[i] = self.interpreter.get_tensor(self.output_index)[0]
        return out

    def predict_single(self, inp):
        """ Like predict(), but only for a single record. The input data can be a Python list. """
        inp = np.array([inp], dtype=self.input_dtype)
        self.interpreter.set_tensor(self.input_index, inp)
        self.interpreter.invoke()  # run inference before reading the output
        out = self.interpreter.get_tensor(self.output_index)
        return out[0]

The complete benchmark code and a plot can be found here:

TF2 generally exhibits poor and bug-like memory management in several instances I've encountered - brief description here and here. With prediction in particular, the most performant feeding method is via model(x) directly - see here, and its linked discussions.

In a nutshell: model(x) goes through its __call__ method (which it inherits from base_layer.Layer), whereas predict(), predict_classes(), etc. run a dedicated loop function selected via _select_training_loop(). Each path uses different data pre- and post-processing methods suited to different use cases, and model(x) in 2.1 was designed specifically to yield the fastest small-model / small-batch (and maybe any-size) performance; it is also the fastest option in 2.0.

Quoting a TensorFlow dev from linked discussions:

You can predict the output using model call, not model predict, i.e., calling model(x) would make this much faster because there are no "conversion to dataset" part, and also it's directly calling a cached tf.function.

Note: this should be less of an issue in 2.1, and especially 2.2, but test each method anyway. Also, I realize this doesn't directly answer your question about the time spikes; I suspect they're related to Eager caching mechanisms, but the surest way to find out is with TF Profiler, which is broken in 2.1.

Update: regarding the increasing spikes, this is possibly GPU throttling. You've done ~1000 iterations; try 10,000 instead, and the increase should eventually stop. As you noted in your comments, this doesn't occur with model(x), which makes sense, as one less GPU step is involved ("conversion to dataset").
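To check whether the growth really levels off over a longer run, you can record every call's latency and compare averages over consecutive chunks. This is a rough sketch under my own assumptions: `latency_trend` and its parameters are hypothetical names, and `fn` stands for whatever zero-argument prediction call you are timing.

```python
import statistics
import time


def latency_trend(fn, iterations=10_000, chunk=1_000):
    """Time each call to fn() and return the mean latency per chunk, in seconds.

    Hypothetical helper for illustration; fn is any zero-argument callable.
    """
    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)

    # One mean per consecutive chunk of calls; if the later values
    # plateau, the increase has stopped
    return [
        statistics.fmean(samples[i:i + chunk])
        for i in range(0, len(samples), chunk)
    ]
```

For example, `latency_trend(lambda: model.predict(x))` against `latency_trend(lambda: model(x))`, with `model` and `x` being your own model and input, would show whether only the predict() path drifts upward.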

Update2: you could bug the devs here about it if you face this issue; it's mostly me singing there