Understanding tf.contrib.lite.TFLiteConverter quantization parameters

What is happening when only post_training_quantize = True is set? i.e. why does the 1st case work fine, but the second doesn't?

In TF 1.14, this seems to just quantize the weights stored on disk, in the .tflite file. This does not, by itself, set the inference mode to quantized inference.

That is, you can have a tflite model whose inference type is float32 but whose weights are quantized (using post_training_quantize=True), for the sake of a smaller file on disk and faster loading of the model at runtime.

How to estimate mean, std and range parameters for second case?

The documentation is confusing to many. Let me explain what I concluded after some research:

  1. Unfortunately, quantization parameters/stats have 3 equivalent forms/representations across the TF library and documentation:
    • A) (mean, std_dev)
    • B) (zero_point, scale)
    • C) (min,max)
  2. Conversion from B) to A):
    • std_dev = 1.0 / scale
    • mean = zero_point
  3. Conversion from C) to A):
    • mean = 255.0*min / (min - max)
    • std_dev = 255.0 / (max - min)
    • Explanation: quantization stats are the parameters used for mapping the range (0,255) to an arbitrary range of real values. Since real_value = (quantized_value - mean) / std_dev, you can start from the 2 equations: min * std_dev + mean = 0 and max * std_dev + mean = 255, then follow the math to reach the above conversion formulas
  4. Conversion from A) to C):
    • min = - mean / std_dev
    • max = (255 - mean) / std_dev
  5. The names "mean" and "std_dev" are confusing and are largely seen as misnomers.
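The conversions above can be sketched in a few lines of Python. This is only an illustration, assuming the convention real_value = (quantized_value - mean) / std_dev and a uint8 range of 0 to 255; the function names are mine, not part of the TF API.

```python
def stats_from_range(real_min, real_max, qmax=255.0):
    """(min, max) -> (mean, std_dev)."""
    std_dev = qmax / (real_max - real_min)
    # Written as 0.0 - x so that a zero mean comes out as 0.0, not -0.0.
    mean = 0.0 - real_min * std_dev
    return mean, std_dev


def range_from_stats(mean, std_dev, qmax=255.0):
    """(mean, std_dev) -> (min, max): the real values that 0 and qmax map to."""
    return (0.0 - mean) / std_dev, (qmax - mean) / std_dev


def stats_from_scale(scale, zero_point):
    """(scale, zero_point) -> (mean, std_dev)."""
    return zero_point, 1.0 / scale


# Round trip for a few common input ranges.
for lo, hi in [(0.0, 255.0), (-1.0, 1.0), (0.0, 1.0)]:
    mean, std_dev = stats_from_range(lo, hi)
    print((lo, hi), "->", (mean, std_dev), "->", range_from_stats(mean, std_dev))
```

Going from a range to stats and back should return the range you started with, which is a quick way to check which convention a given formula is using.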

To answer your question: if your input image has:

  • range (0,255) then mean = 0, std_dev = 1
  • range (-1,1) then mean = 127.5, std_dev = 127.5
  • range (0,1) then mean = 0, std_dev = 255

Looks like in the second case model inference is faster; does that depend on the fact that the model input is uint8?

Yes, possibly. However, quantized models are typically slower unless you make use of the vectorized instructions of your specific hardware. TFLite is optimized to run those specialized instructions on ARM processors. As of TF 1.14 or 1.15, if you are running this on your local x86 machine (Intel or AMD), I'd be surprised if the quantized model runs faster. [Update: It's on TFLite's roadmap to add first-class support for x86 vectorized instructions to make quantized inference faster than float]

What does 'quantization': (0.0, 0) mean in the 1st case, and 'quantization': (0.003921568859368563, 0), 'quantization': (0.7843137383460999, 128) in the 2nd case?

Here the format is quantization: (scale, zero_point)

In your first case, you only activated post_training_quantize=True, and this doesn't make the model run quantized inference, so there is no need to transform the inputs or the outputs from float to uint8. Thus quantization stats here are essentially null, which is represented as (0,0).

In the second case, you activated quantized inference by providing inference_type = tf.contrib.lite.constants.QUANTIZED_UINT8. So you have quantization parameters for both input and output, which are needed to transform your float input to uint8 on the way in to the model, and the uint8 output to a float output on the way out.

  • At input, do the transformation: uint8_array = (float_array * std_dev) + mean
  • At output, do the transformation: float_array = (uint8_array.astype(np.float32) - mean) / std_dev
  • Note the .astype(np.float32): it is necessary in Python to get a correct calculation, because arithmetic done directly on uint8 values wraps around
  • Note that other texts may use scale instead of std_dev (scale = 1.0 / std_dev), so the multiplications will become divisions and vice versa.
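Here is a minimal NumPy sketch of those two transformations, assuming the convention real_value = (quantized_value - mean) / std_dev. The function names are hypothetical, and the rounding and clipping to 0..255 are my own additions to keep the cast to uint8 well defined.

```python
import numpy as np


def quantize_input(float_array, mean, std_dev):
    """float -> uint8 on the way in to the model."""
    q = np.round(np.asarray(float_array, dtype=np.float32) * std_dev + mean)
    # Clip before casting so out-of-range values saturate instead of wrapping.
    return np.clip(q, 0, 255).astype(np.uint8)


def dequantize_output(uint8_array, mean, std_dev):
    """uint8 -> float on the way out of the model."""
    # Cast first: subtracting on uint8 directly would wrap around.
    return (np.asarray(uint8_array).astype(np.float32) - mean) / std_dev


# Example with an input in the range (0, 1), i.e. mean = 0, std_dev = 255.
x = np.array([0.0, 0.5, 1.0], dtype=np.float32)
q = quantize_input(x, mean=0.0, std_dev=255.0)
x_back = dequantize_output(q, mean=0.0, std_dev=255.0)
print(q, x_back)
```

The round trip is only exact up to one quantization step (1 / std_dev), which is the precision you give up by going to uint8.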

Another confusing thing here is that, even though during conversion you specify quantized_input_stats = (mean, std_dev), get_output_details will return quantization: (scale, zero_point). Not only is the form different (scale vs std_dev), but the order is different too!

Now, to understand the quantization parameter values you got for the input and output, let's use the formulas above to deduce the range of real values (min, max) of your inputs and outputs. Using the above formulas we get:

  • Input range: min = 0, max = 1 (you specified this yourself by providing quantized_input_stats = {input_node_names[0]: (0.0, 255.0)}, i.e. (mean, std_dev))
  • Output range: min = -100.39, max = 99.6
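The arithmetic behind those two ranges can be checked with a short script. It assumes real_value = (quantized_value - zero_point) * scale and plugs in the (scale, zero_point) pairs from your second case; real_range is just a helper name I made up.

```python
def real_range(scale, zero_point, qmin=0, qmax=255):
    """Real values that the smallest and largest uint8 values map to."""
    return (qmin - zero_point) * scale, (qmax - zero_point) * scale


input_range = real_range(0.003921568859368563, 0)
output_range = real_range(0.7843137383460999, 128)

print("input range: ", input_range)
print("output range:", output_range)
```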

1) See the documentation. In short, this technique allows you to get a quantized uint8 graph whose accuracy is close to that of the original one, without requiring further training of the quantized model. However, the speed is noticeably lower than if conventional quantization were used.

2) If your model was trained with normalized [-1.0, 1.0] input, you should set converter.quantized_input_stats = {input_node_names[0]: (128, 127)}, and after that the quantization of the input tensor will be close to (0.0079, 128), i.e. scale = 1 / 127 and zero_point = 128. mean is the integer value from 0 to 255 that maps to floating point 0.0f. std_dev is 255 / (float_max - float_min). This will fix one possible problem.

3) Uint8 neural network inference is about 2 times faster (depending on the device) than float32 inference.