Relationship between loss and accuracy

Yes, this is possible.

To give an intuitive example of why this can happen, suppose that your classifier outputs roughly the same probability for classes A and B, with class A having the highest probability overall. In this setting, a minimal change to the model's parameters might make B the most probable class. The cross-entropy loss would change only minimally, since it depends directly on the probability distribution, but the effect on the accuracy would be clearly visible, because accuracy depends on the argmax of the output probability distribution.

In conclusion, minimizing the cross-entropy loss does not always improve the accuracy, mainly because cross-entropy is a smooth function, while accuracy is non-smooth.
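A small numerical sketch can make this concrete. The probabilities below are made up for illustration (they do not come from any real model), and the helper names `cross_entropy` and `accuracy` are hypothetical. Before the update, all three examples are barely correct. After it, two examples become confidently correct and one flips to barely wrong, so the mean loss drops while the accuracy drops too.

```python
import math

def cross_entropy(probs, labels):
    """Mean negative log-probability assigned to the true class."""
    return -sum(math.log(p[y]) for p, y in zip(probs, labels)) / len(labels)

def accuracy(probs, labels):
    """Fraction of examples whose argmax equals the true class."""
    hits = 0
    for p, y in zip(probs, labels):
        predicted = max(range(len(p)), key=lambda i: p[i])
        hits += int(predicted == y)
    return hits / len(labels)

# Three examples, all truly of class 0 (toy data, assumed for illustration).
labels = [0, 0, 0]

# Before the update: every example is barely classified correctly.
before = [[0.51, 0.49], [0.51, 0.49], [0.51, 0.49]]

# After the update: one example flips to barely wrong,
# the other two become confidently correct.
after = [[0.49, 0.51], [0.95, 0.05], [0.95, 0.05]]

for name, probs in (("before", before), ("after", after)):
    print(name, "loss:", round(cross_entropy(probs, labels), 3),
          "accuracy:", round(accuracy(probs, labels), 3))
```

The loss rewards the extra confidence on the two correct examples more than it penalizes the small shift across the decision boundary, whereas accuracy only registers the flip.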


It is possible to get decreasing loss together with decreasing accuracy, but a model that behaves this way is far from being a good one. This problem can be resolved to some extent by using batch normalization at every conv layer of the model.


There are at least 5 reasons that might cause such behavior:

  1. Outliers: imagine that you have 10 identical images, 9 of which belong to class A and one to class B. In this case, the model will start to assign a high probability of class A to this example because of the majority of examples. But then a signal from the outlier might destabilize the model and make the accuracy decrease. In theory, the model should stabilize at assigning a score of 90% to class A, but this might take many epochs.

    Solutions: To deal with such examples, I advise you to use gradient clipping (you may add this option in your optimizer). To check whether this phenomenon occurs, look at the distribution of your losses (the losses of individual examples from the training set) and look for outliers.

  2. Bias: Now imagine that you have 10 identical images, but 5 of them are assigned class A and 5 class B. In this case, the model will try to assign an approximately 50%-50% distribution over these two classes. Your model can then achieve at most 50% accuracy here, by choosing one of the two valid classes.

    Solution: Try to increase the model capacity. Very often you have a set of really similar images, and adding expressive power might help to discriminate between similar examples. Beware of overfitting, though. Another solution is to try this strategy in your training. To check whether this phenomenon occurs, check the distribution of the losses of individual examples. If the distribution is skewed toward higher values, you are probably suffering from bias.

  3. Class imbalance: Now imagine that 90% of your images belong to class A. In an early stage of training, your model mainly concentrates on assigning this class to almost all examples. This might drive individual losses to really high values and destabilize your model by making the predicted distribution more unstable.

    Solution: once again, gradient clipping. Second, patience: try simply training your model for more epochs. The model should learn more subtle patterns in a later phase of training. And of course, try class balancing by assigning either sample_weights or class_weights. To check whether this phenomenon occurs, check your class distribution.

  4. Too strong regularization: if you set your regularization to be too strict, the training process concentrates mainly on shrinking the norm of your weights rather than on actually learning interesting insights.

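    Solution: add categorical_crossentropy as a metric and observe whether it is also decreasing. The reported loss includes the regularization penalty, so it can decrease even when the cross-entropy term does not. If the metric is not decreasing, your regularization is too strict; try assigning a smaller weight penalty.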

  5. Bad model design: such behavior might be caused by a wrong model design. There are several good practices that you might apply to improve your model:

    Batch Normalization - with this technique you prevent radical changes in the network's inner activations. This makes training much more stable and efficient. With a small batch size, this might also act as a genuine form of regularization for your model.

    Gradient clipping - this makes your model training much more stable and efficient.

    Reduce bottleneck effect - read this fantastic paper and check whether your model might suffer from the bottleneck problem.

    Add auxiliary classifiers - if you are training your network from scratch, this should make your features much more meaningful and your training faster and more efficient.
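Three of the checks and remedies from the list above can be sketched in plain Python: inspecting per-example losses (items 1 and 2), deriving class weights from the class distribution (item 3), and clipping a gradient by its norm (items 1, 3 and 5). This is only an illustration under stated assumptions. The helper names, the toy data and the inverse-frequency weighting formula are my own choices, not something prescribed above, and in practice you would normally use your framework's built-in options instead.

```python
import math
from collections import Counter

def per_example_losses(probs, labels):
    """Cross-entropy of each example separately (not averaged)."""
    return [-math.log(p[y]) for p, y in zip(probs, labels)]

def balanced_class_weights(labels):
    """Weight each class inversely to its frequency: n / (k * count).

    This particular formula is an assumption; other schemes exist.
    """
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * count) for cls, count in counts.items()}

def clip_by_norm(grad, max_norm):
    """Rescale a gradient vector so its L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)
    scale = max_norm / norm
    return [g * scale for g in grad]

# Toy data (hypothetical): 9 examples of class 0 and 1 of class 1,
# with the model predicting 90% for class 0 on every example.
labels = [0] * 9 + [1]
probs = [[0.9, 0.1] for _ in labels]

losses = per_example_losses(probs, labels)
weights = balanced_class_weights(labels)
clipped = clip_by_norm([3.0, 4.0], max_norm=1.0)

print("largest individual loss:", round(max(losses), 3))
print("class weights:", weights)
print("clipped gradient:", clipped)
```

In this toy setup the single class-1 example has by far the largest individual loss, which is the kind of outlier the loss distribution check is meant to reveal, and the minority class receives the larger weight.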