
Transformer: Optimizer and Regularization

Optimizer

  • Adam optimizer with \(\beta_1 = 0.9\), \(\beta_2 = 0.98\), and \(\epsilon = 10^{-9}\)
  • Varying learning rate:

    \[lrate = d_{model}^{-0.5} \cdot \min\!\left(step\_num^{-0.5},\ step\_num \cdot warmup\_steps^{-1.5}\right)\]
    • This corresponds to increasing the learning rate linearly for the first \(warmup\_steps\) training steps, and decreasing it thereafter proportionally to the inverse square root of the step number.
    • They used \(warmup\_steps = 4000\).
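
As a rough illustration of the schedule, here is a minimal Python sketch (the function name and the `d_model = 512` default are my own choices for the example, not the paper's code):

```python
def transformer_lrate(step_num, d_model=512, warmup_steps=4000):
    """Learning rate at a given training step (illustrative sketch)."""
    step_num = max(step_num, 1)  # guard against step 0
    return d_model ** -0.5 * min(step_num ** -0.5,
                                 step_num * warmup_steps ** -1.5)

# The rate rises linearly until warmup_steps, then decays as 1/sqrt(step):
print(transformer_lrate(4000))   # peak, about 7e-4 for d_model = 512
print(transformer_lrate(16000))  # about half the peak (1/sqrt decay)
```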

Regularization

  • Dropout: They applied dropout to the output of each sub-layer, before it is added to the sub-layer input and normalized (a minimal sketch of this pattern follows the list).
  • For the base model, they used a dropout rate of 0.1.
  • Label Smoothing: They used label smoothing of value \(\epsilon_{ls} = 0.1\).
  • This hurts perplexity, as the model learns to be more unsure, but it improves accuracy and BLEU score (see the label-smoothing sketch below).
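
For concreteness, here is a PyTorch-style sketch of the residual-plus-dropout pattern described above. The class name and the post-norm ordering are taken from the description in the bullet, not from the authors' code, so treat it as an assumption-laden illustration:

```python
import torch.nn as nn

class SublayerConnection(nn.Module):
    """Residual connection around a sub-layer, with dropout on its output."""

    def __init__(self, d_model=512, dropout=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        # Dropout is applied to the sub-layer output, which is then added to
        # the sub-layer input and normalized: LayerNorm(x + Dropout(sublayer(x)))
        return self.norm(x + self.dropout(sublayer(x)))
```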

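A small sketch of label smoothing with \(\epsilon_{ls} = 0.1\) (the function name and tensor shapes are my own assumptions; PyTorch is used only for illustration):

```python
import torch
import torch.nn.functional as F

def label_smoothing_loss(logits, target, eps_ls=0.1):
    # logits: (batch, vocab_size) raw scores; target: (batch,) class indices.
    # The true class gets probability 1 - eps_ls; the remaining eps_ls is
    # spread uniformly over the other vocab_size - 1 classes.
    vocab_size = logits.size(-1)
    log_probs = F.log_softmax(logits, dim=-1)
    smooth = torch.full_like(log_probs, eps_ls / (vocab_size - 1))
    smooth.scatter_(1, target.unsqueeze(1), 1.0 - eps_ls)
    # Cross-entropy between the smoothed targets and the model's distribution.
    return -(smooth * log_probs).sum(dim=-1).mean()
```

Because the target distribution is no longer one-hot, the minimum achievable loss is positive, which is why reported perplexity gets worse even as accuracy and BLEU improve.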
Further Reading

  • Attention is All You Need
  • The Annotated Transformer: nicely explained code and concepts.
  • The Illustrated Transformer