# Keras inconsistent prediction time

While I can't explain the inconsistencies in execution time, I can recommend that you try to convert your model to TensorFlow Lite to speed up predictions on single data records or small batches.

I ran a benchmark on this model:

```
import tensorflow as tf

model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(384, activation='elu', input_shape=(256,)),
    tf.keras.layers.Dense(384, activation='elu'),
    tf.keras.layers.Dense(256, activation='elu'),
    tf.keras.layers.Dense(128, activation='elu'),
    tf.keras.layers.Dense(32, activation='tanh'),
])
```

The prediction times for single records were:

- `model.predict(input)`: 18 ms
- `model(input)`: 1.3 ms
- Model converted to TensorFlow Lite: 43 µs

The time to convert the model was 2 seconds.
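For reference, these timings can be reproduced in spirit with a simple wall-clock loop. This is only a minimal sketch, not the full benchmark linked below; the input shape and iteration count are illustrative, and `model` is the Sequential model above:

```
import time

import numpy as np

# Hypothetical single record matching the model's (256,) input shape.
x = np.random.random((1, 256)).astype(np.float32)

def time_call(fn, n=100):
    """Mean wall-clock time of fn() over n calls, after one warm-up."""
    fn()  # warm-up so one-time tracing/compilation isn't counted
    start = time.perf_counter()
    for _ in range(n):
        fn()
    return (time.perf_counter() - start) / n

print("model.predict(x):", time_call(lambda: model.predict(x)))
print("model(x):        ", time_call(lambda: model(x)))
```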

The class below shows how to convert and use the model, and it provides a `predict` method like the Keras model. Note that it would need to be modified for use with models that don't just have a single 1-D input and a single 1-D output.

```
import numpy as np
import tensorflow as tf


class LiteModel:

    @classmethod
    def from_file(cls, model_path):
        return LiteModel(tf.lite.Interpreter(model_path=model_path))

    @classmethod
    def from_keras_model(cls, kmodel):
        converter = tf.lite.TFLiteConverter.from_keras_model(kmodel)
        tflite_model = converter.convert()
        return LiteModel(tf.lite.Interpreter(model_content=tflite_model))

    def __init__(self, interpreter):
        self.interpreter = interpreter
        self.interpreter.allocate_tensors()
        input_det = self.interpreter.get_input_details()[0]
        output_det = self.interpreter.get_output_details()[0]
        self.input_index = input_det["index"]
        self.output_index = output_det["index"]
        self.input_shape = input_det["shape"]
        self.output_shape = output_det["shape"]
        self.input_dtype = input_det["dtype"]
        self.output_dtype = output_det["dtype"]

    def predict(self, inp):
        """Like the Keras predict(): takes a 2-D array of records, returns a 2-D array."""
        inp = inp.astype(self.input_dtype)
        count = inp.shape[0]
        out = np.zeros((count, self.output_shape[1]), dtype=self.output_dtype)
        for i in range(count):
            self.interpreter.set_tensor(self.input_index, inp[i:i+1])
            self.interpreter.invoke()
            out[i] = self.interpreter.get_tensor(self.output_index)[0]
        return out

    def predict_single(self, inp):
        """Like predict(), but only for a single record. The input can be a Python list."""
        inp = np.array([inp], dtype=self.input_dtype)
        self.interpreter.set_tensor(self.input_index, inp)
        self.interpreter.invoke()
        out = self.interpreter.get_tensor(self.output_index)
        return out[0]
```
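Usage then mirrors the Keras API; for example, with the Sequential `model` defined above (the batch here is illustrative):

```
import numpy as np

lmodel = LiteModel.from_keras_model(model)

batch = np.random.random((10, 256)).astype(np.float32)
preds = lmodel.predict(batch)             # shape (10, 32), like model.predict
single = lmodel.predict_single(batch[0])  # shape (32,), for one record
```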

The complete benchmark code and a plot can be found here: https://medium.com/@micwurm/using-tensorflow-lite-to-speed-up-predictions-a3954886eb98

TF2 generally exhibits poor and bug-like memory management in several instances I've encountered - brief description here and here. With prediction in particular, the most performant feeding method is via `model(x)` directly - see here, and its linked discussions.

In a nutshell: `model(x)` acts via its `__call__` method (which it inherits from `base_layer.Layer`), whereas `predict()`, `predict_classes()`, etc. involve a dedicated loop function via `_select_training_loop()`; each utilizes different data pre- and post-processing methods suited for different use cases, and `model(x)` in 2.1 was designed specifically to yield the fastest small-model / small-batch (and maybe any-size) performance (and it's still the fastest in 2.0).
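In code, the two call styles look like this; a sketch assuming the 256-input model from earlier, where the main practical difference is that `model(x)` returns a `tf.Tensor` rather than a NumPy array:

```
import numpy as np

x = np.random.random((1, 256)).astype(np.float32)

# predict(): full pipeline, including the "conversion to dataset" step
y1 = model.predict(x)          # returns a NumPy array

# __call__(): direct execution through base_layer.Layer.__call__
y2 = model(x, training=False)  # returns a tf.Tensor
y2 = y2.numpy()                # convert back to NumPy if needed
```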

Quoting a TensorFlow dev from linked discussions:

> You can predict the output using model call, not model predict, i.e., calling `model(x)` would make this much faster because there is no "conversion to dataset" part, and also it's directly calling a cached `tf.function`.
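To make that caching explicit, you can wrap the call in `tf.function` yourself; a sketch (the wrapper name `infer` is mine, not from the discussion) where the first call traces the graph and later calls reuse it:

```
import numpy as np
import tensorflow as tf

@tf.function
def infer(x):
    return model(x, training=False)

x = tf.constant(np.random.random((1, 256)).astype(np.float32))
_ = infer(x)  # first call traces and caches the graph (slow)
y = infer(x)  # later calls reuse the cached graph (fast)
```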

*Note*: this should be less of an issue in 2.1, and especially 2.2 - but test each method anyway. Also, I realize this doesn't directly answer your question about the time spikes; I suspect it's related to Eager caching mechanisms, but the surest way to determine the cause is via the `TF Profiler`, which is broken in 2.1.

**Update**: regarding the *increasing* spikes - possibly GPU throttling; you've done ~1000 iters, try 10,000 instead - eventually, the increase should stop. As you noted in your comments, this doesn't occur with `model(x)`; that makes sense, as one less GPU step is involved ("conversion to dataset").
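One rough way to test the throttling hypothesis is to record per-call latency over 10,000 iterations and compare early versus late averages; a sketch, assuming the same `model` and a fixed single-record `x` as above:

```
import time

times = []
for _ in range(10_000):
    t0 = time.perf_counter()
    model.predict(x)
    times.append(time.perf_counter() - t0)

# If throttling (or warm-up) is the cause, the trend should flatten
# and the late average should stop growing relative to the early one.
print("first 100 calls:", sum(times[:100]) / 100)
print("last 100 calls: ", sum(times[-100:]) / 100)
```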

**Update 2**: you could bug the devs here about it if you face this issue; it's mostly me singing there.