Optimizer and scheduler for BERT fine-tuning

I think it is hardly possible to give a 100% perfect answer, but you can certainly get inspiration from the way other scripts do it. The best place to start is the examples/ directory of the huggingface repository itself, where you can, for example, find this excerpt:

if (step + 1) % args.gradient_accumulation_steps == 0:
    if args.fp16:
        torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), args.max_grad_norm)
    else:
        torch.nn.utils.clip_grad_norm_(model.parameters(), args.max_grad_norm)

    optimizer.step()
    scheduler.step()  # Update learning rate schedule
    model.zero_grad()
    global_step += 1

If we look at the surrounding parts, this is updating the LR schedule every time an optimizer step is taken, i.e. once every gradient_accumulation_steps batches (which amounts to once per backward pass when accumulation is 1). In the same example, you can also look at the default value for warmup_steps, which is 0. From my understanding, warmup is not strictly required when fine-tuning, but I am less certain about this aspect and would check other scripts as well.
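
For reference, here is a minimal sketch of how I would wire up the optimizer and scheduler, using torch.optim.AdamW and placeholder values for the learning rate and total number of steps; the actual example script does a bit more (e.g. grouping parameters for weight decay), which I leave out here:

import torch
from transformers import BertForSequenceClassification, get_linear_schedule_with_warmup

model = BertForSequenceClassification.from_pretrained("bert-base-uncased")

num_training_steps = 1000  # placeholder: the total number of optimizer updates (see below)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=0,                     # same as the example's default
    num_training_steps=num_training_steps,
)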


Here you can see a visualization of the learning rate changes produced by get_linear_schedule_with_warmup.
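
A quick way to reproduce such a plot yourself is a sketch along these lines (the warmup/total step counts and the base learning rate are arbitrary values picked for illustration):

import torch
import matplotlib.pyplot as plt
from transformers import get_linear_schedule_with_warmup

dummy_model = torch.nn.Linear(1, 1)                      # any parameters will do here
optimizer = torch.optim.AdamW(dummy_model.parameters(), lr=2e-5)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=100, num_training_steps=1000
)

lrs = []
for _ in range(1000):
    lrs.append(scheduler.get_last_lr()[0])               # learning rate at this step
    optimizer.step()
    scheduler.step()

plt.plot(lrs)
plt.xlabel("training step")
plt.ylabel("learning rate")
plt.show()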

Referring to this comment: warmup steps is a parameter used to keep the learning rate low at the start of training, so that suddenly exposing the model to a new data set does not pull it too far away from what it has already learned.

By default, the number of warmup steps is 0.

After the warmup you take bigger steps, because you are probably not near a minimum yet. But as you approach the minimum, the schedule lowers the learning rate again, so you take smaller steps and converge to it.
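
In other words, the schedule scales the base learning rate by a multiplier that ramps up from 0 to 1 during warmup and then decays linearly back to 0. Roughly, the lambda behind get_linear_schedule_with_warmup looks like this (check the library source for the exact implementation):

def lr_lambda(current_step, num_warmup_steps, num_training_steps):
    # Ramp the multiplier from 0 to 1 over the warmup phase ...
    if current_step < num_warmup_steps:
        return current_step / max(1, num_warmup_steps)
    # ... then decay it linearly from 1 to 0 over the remaining steps.
    return max(
        0.0,
        (num_training_steps - current_step) / max(1, num_training_steps - num_warmup_steps),
    )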

Also, note that the number of training steps is the number of batches per epoch times the number of epochs, not just the number of epochs. So num_training_steps = N_EPOCHS + 1 is not correct, unless your batch size is equal to the size of the training set.
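
For example, with a hypothetical training set of 1,000 examples and a batch size of 32:

import torch
from torch.utils.data import DataLoader, TensorDataset

train_dataset = TensorDataset(torch.zeros(1000, 8))      # hypothetical 1,000-example dataset
train_dataloader = DataLoader(train_dataset, batch_size=32)

N_EPOCHS = 3
num_training_steps = len(train_dataloader) * N_EPOCHS    # batches per epoch * epochs
print(num_training_steps)                                # 32 * 3 = 96, not N_EPOCHS + 1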

You call scheduler.step() every batch, right after optimizer.step(), to update the learning rate.
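
Putting it together, a minimal sketch of the loop order, assuming model, optimizer, scheduler, N_EPOCHS and a train_dataloader that yields dicts of model inputs (all placeholder names, and no gradient accumulation):

for epoch in range(N_EPOCHS):
    for batch in train_dataloader:
        outputs = model(**batch)                          # forward pass
        loss = outputs.loss
        loss.backward()                                   # backward pass
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()                                  # update the weights
        scheduler.step()                                  # then update the learning rate
        optimizer.zero_grad()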