Transformer: Optimizer and Regularization
Optimizer
- Adam optimizer with \(\beta_1 = 0.9\), \(\beta_2 = 0.98\), and \(\epsilon = 10^{-9}\).
Varied Learning Rate:
\[lrate = d_{model}^{-0.5} \cdot \min\left(step\_num^{-0.5},\; step\_num \cdot warmup\_steps^{-1.5}\right)\]
- This corresponds to increasing the learning rate linearly for the first \(warmup\_steps\) training steps, and decreasing it thereafter proportionally to the inverse square root of the step number.
- They used \(warmup\_steps = 4000\). A sketch of the setup follows below.
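A minimal PyTorch sketch of this optimizer and schedule, assuming `torch.optim.Adam` and a `LambdaLR` wrapper; the stand-in model and training loop are illustrative, not the paper's code:

```python
import torch

def noam_lrate(step, d_model=512, warmup_steps=4000):
    """Learning rate from the formula above; guards against step = 0."""
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

model = torch.nn.Linear(512, 512)  # stand-in for a full Transformer
optimizer = torch.optim.Adam(model.parameters(), lr=1.0,
                             betas=(0.9, 0.98), eps=1e-9)
# LambdaLR multiplies the base lr (1.0 here) by the schedule's value each step.
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=noam_lrate)

for step in range(1, 101):
    loss = model(torch.randn(8, 512)).pow(2).mean()  # dummy loss
    loss.backward()
    optimizer.step()
    scheduler.step()        # advance the per-step learning-rate schedule
    optimizer.zero_grad()
```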
Regularization
- Dropout: They applied dropout to the output of each sub-layer, before it is added to the sub-layer input and normalized.
- For the base model, they used a dropout rate of 0.1 (see the sketch after this list).
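A rough PyTorch sketch of where that dropout sits in the residual connection; the class name and arrangement follow the description above and are not taken from any particular codebase:

```python
import torch.nn as nn

class SublayerConnection(nn.Module):
    """Residual connection around a sub-layer: dropout is applied to the
    sub-layer output before the residual add and layer normalization."""
    def __init__(self, d_model=512, p_drop=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(p_drop)

    def forward(self, x, sublayer):
        # LayerNorm(x + Dropout(Sublayer(x))), matching the description above
        return self.norm(x + self.dropout(sublayer(x)))
```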
- Label Smoothing: They used label smoothing of value \(\epsilon_{ls} = 0.1\).
- This hurts perplexity, as the model learns to be more unsure, but improves accuracy and BLEU score.
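For illustration, a sketch using PyTorch's built-in label-smoothing option on cross-entropy loss; this is one way to realize \(\epsilon_{ls} = 0.1\), not necessarily how the paper implemented it:

```python
import torch
import torch.nn as nn

# label_smoothing spreads 0.1 of the target probability mass uniformly
# over all classes, taking it from the gold class (PyTorch >= 1.10).
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

logits = torch.randn(4, 10)           # (batch, vocab) dummy logits
targets = torch.tensor([1, 3, 5, 7])  # gold token indices
loss = criterion(logits, targets)
```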
Further Reading
- Attention is All You Need
- The Annotated Transformer: code and concepts nicely explained.
- The Illustrated Transformer