Transformer: Optimizer and Regularization
Optimizer
- Adam optimizer with \(\beta_1 = 0.9\), \(\beta_2 = 0.98\), and \(\epsilon = 10^{-9}\).
Varied Learning Rate:
\[lrate = d_{model}^{-0.5} \cdot \min\left(step\_num^{-0.5},\; step\_num \cdot warmup\_steps^{-1.5}\right)\]
- This corresponds to increasing the learning rate linearly for the first \(warmup\_steps\) training steps, and decreasing it thereafter proportionally to the inverse square root of the step number.
- They used \(warmup\_steps = 4000\). A sketch of the setup follows below.
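A minimal PyTorch sketch of this optimizer and schedule, assuming `torch.optim.Adam` and a `LambdaLR` wrapper; the stand-in model and training loop are illustrative, not the paper's code:

```python
import torch

def noam_lrate(step, d_model=512, warmup_steps=4000):
    """Learning rate from the formula above; guards against step = 0."""
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

model = torch.nn.Linear(512, 512)  # stand-in for a full Transformer
optimizer = torch.optim.Adam(model.parameters(), lr=1.0,
                             betas=(0.9, 0.98), eps=1e-9)
# LambdaLR multiplies the base lr (1.0 here) by the schedule's value each step.
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=noam_lrate)

for step in range(1, 101):
    loss = model(torch.randn(8, 512)).pow(2).mean()  # dummy loss
    loss.backward()
    optimizer.step()
    scheduler.step()        # advance the per-step learning-rate schedule
    optimizer.zero_grad()
```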
Regularization
- Dropout: They applied dropout to the output of each sub-layer, before it is added to the sub-layer input and normalized.
- For the base model, they used a dropout rate of 0.1 (see the sketch after this list).
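A rough PyTorch sketch of where that dropout sits in the residual connection; the class name and arrangement follow the description above and are not taken from any particular codebase:

```python
import torch.nn as nn

class SublayerConnection(nn.Module):
    """Residual connection around a sub-layer: dropout is applied to the
    sub-layer output before the residual add and layer normalization."""
    def __init__(self, d_model=512, p_drop=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(p_drop)

    def forward(self, x, sublayer):
        # LayerNorm(x + Dropout(Sublayer(x))), matching the description above
        return self.norm(x + self.dropout(sublayer(x)))
```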
- Label Smoothing: They used label smoothing of value \(\epsilon_{ls} = 0.1\).
- This hurts perplexity, as the model learns to be more unsure, but improves accuracy and BLEU score.
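For illustration, a sketch using PyTorch's built-in label-smoothing option on cross-entropy loss; this is one way to realize \(\epsilon_{ls} = 0.1\), not necessarily how the paper implemented it:

```python
import torch
import torch.nn as nn

# label_smoothing spreads 0.1 of the target probability mass uniformly
# over all classes, taking it from the gold class (PyTorch >= 1.10).
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

logits = torch.randn(4, 10)           # (batch, vocab) dummy logits
targets = torch.tensor([1, 3, 5, 7])  # gold token indices
loss = criterion(logits, targets)
```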
Further Reading
- Attention is All You Need
- The Annotated Transformer: code and concepts nicely explained.
- The Illustrated Transformer