What to do when Seq2Seq network repeats words over and over in output?

This type of repetition is called "text degeneration".

There is a great paper from 2019 that analyses this phenomenon: The Curious Case of Neural Text Degeneration by Ari Holtzman et al. from the Allen Institute for Artificial Intelligence.

The repetition may come from the type of text search (text sampling) used on the decoder side. Many people implement this simply by taking the most probable next word proposed by the model (argmax over the softmax of the last layer) or by so-called beam search. In fact, beam search is the industry standard today.
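To make the greedy (argmax) strategy concrete, here is a minimal PyTorch-style sketch. The `model(src, ys)` call signature, `bos_id`/`eos_id` and shapes are assumptions standing in for your own Seq2Seq decoder, not a specific library API:

```python
import torch

def greedy_decode(model, src, bos_id, eos_id, max_len=50):
    """Greedy decoding: at every step keep only the single most probable
    next token (argmax over the softmax). This is the strategy that most
    easily falls into repetition loops."""
    ys = torch.tensor([[bos_id]])                    # running output sequence
    for _ in range(max_len):
        logits = model(src, ys)                      # assumed shape: (1, seq_len, vocab)
        next_id = logits[0, -1].argmax().item()      # most probable next token only
        ys = torch.cat([ys, torch.tensor([[next_id]])], dim=1)
        if next_id == eos_id:
            break
    return ys.squeeze(0).tolist()
```

Beam search does the same thing but keeps the `b` best partial sequences at each step instead of just one; it still maximises sequence probability, which is exactly what the paper argues leads to degeneration.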

Here is an example of beam search output from the paper:

Continuation (BeamSearch, b=10):

"The unicorns were able to communicate with each other, they said unicorns. a statement that the unicorns. Professor of the Department of Los Angeles, the most important place the world to be recognition of the world to be a of the world to be a of the world to be a of the world to be a of the world to be a of the world to be a of the world to be a of the world to be a of the…

As you can see, there is a great deal of repetition.

According to the paper, this curious case may be explained by the fact that each repeated sequence of words has a higher probability than the sequence without the next repetition (the paper includes a figure showing the per-token probability of a phrase increasing with every repetition).
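If you want to check this effect yourself, you can score a repeated and a non-repeated continuation and compare their average per-token log-probabilities. The sketch below is just an illustration and assumes the HuggingFace Transformers GPT-2 model; any autoregressive model that returns a cross-entropy loss would work the same way:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def avg_logprob(text):
    """Average per-token log-probability the model assigns to `text`."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # the returned loss is the mean cross-entropy, i.e. -mean log p(token)
        loss = model(ids, labels=ids).loss
    return -loss.item()

print(avg_logprob("I don't know. I don't know. I don't know. I don't know."))
print(avg_logprob("I don't know what you are talking about right now."))
```

Consistent with the paper's figure, the heavily repeated continuation often receives a surprisingly high score, which is why likelihood-maximising decoders keep choosing it.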

The paper proposes some workarounds based on how the decoder samples words, most notably nucleus (top-p) sampling. It definitely requires more study, but this is the best explanation we have today.
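A minimal sketch of the nucleus (top-p) filtering step is below; the threshold `p=0.9` and the function name are my own choices, not something prescribed by the paper:

```python
import torch

def nucleus_sample(logits, p=0.9):
    """Nucleus (top-p) sampling on a 1-D tensor of next-token logits:
    keep the smallest set of tokens whose cumulative probability reaches p,
    then sample the next token from that truncated distribution."""
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # drop tokens whose preceding cumulative mass already reaches p
    outside_nucleus = cumulative - sorted_probs >= p
    sorted_probs[outside_nucleus] = 0.0
    sorted_probs /= sorted_probs.sum()            # renormalise the nucleus
    choice = torch.multinomial(sorted_probs, num_samples=1)
    return sorted_idx[choice].item()
```

If you generate with HuggingFace's `generate()`, the same idea is available via `do_sample=True, top_p=0.9`; temperature and top-k sampling are related alternatives the paper also discusses.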

The other possibility is that your model still needs more training. In many cases I have seen similar behaviour when I had a big training set and the model still couldn't generalise well over the whole diversity of the data. To test this hypothesis, try training on a smaller dataset and see if it generalises (produces meaningful results).
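A quick way to run this sanity check in PyTorch is to train on a slice of your data first; the `TensorDataset` here is only a dummy stand-in for your real training set:

```python
import torch
from torch.utils.data import TensorDataset, Subset

# dummy stand-in for your real training data
full_train_dataset = TensorDataset(torch.randn(10000, 32),
                                   torch.randint(0, 100, (10000,)))

# train on a small slice first; if the model still cannot produce
# meaningful output here, the problem is probably not a lack of data
small_train = Subset(full_train_dataset, range(1000))
```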

But even if your model generalises well enough, that doesn't mean you will never face the repetition pattern. Unless you change the sampling strategy of the decoder, it is a common scenario.


If you train on small data, then try decreasing the number of parameters, e.g. the number of neurons in each layer.

In my experience, when the network outputs the same word all the time, a significant decrease of the learning rate helps.