Llama

Papers:

Llama 2

  • Tokenization: Byte-Pair Encoding (BPE) algorithm
  • Trained on a vast dataset of 2 trillion tokens
  • Architecture: Transformer
    • pre-normalization with RMSNorm
    • SwiGLU activation function
    • Rotary Positional Embedding
    • KV Cache
  • Training:
    • AdamW optimizer with a cosine learning rate schedule, a warm-up period of 2,000 steps, and decay of the final learning rate to 10% of the peak learning rate (see the sketch after this list).
    • Weight decay of 0.1 and gradient clipping.
  • Fine-tuning:
    • Supervised Fine-Tuning (SFT) and
    • Reinforcement Learning with Human Feedback (RLHF) components.
      1. Proximal Policy Optimization (PPO)
      2. Rejection Sampling Fine-Tuning
    • Ghost Attention (GAtt)
    • Meta acknowledged the issue of context loss in multi-turn conversations and addressed it with the Ghost Attention (GAtt) method.
    • This method involves artificially concatenating the instruction to all user messages in the conversation.
    • Subsequently, Meta used the latest RLHF (Reinforcement Learning with Human Feedback) model to sample from this augmented dataset. This produced context-rich dialogues and corresponding samples, which were used to fine-tune the model, somewhat similar in spirit to Rejection Sampling. The result showed improved attention to the instruction compared with the existing model. It's worth noting that this approach was evaluated specifically on the 70B models.
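
The learning-rate schedule above is easy to sketch: a linear warm-up for 2,000 steps, followed by a cosine decay from the peak learning rate down to 10% of the peak. The peak learning rate and total step count below are placeholder values for illustration, not Meta's actual training configuration.

```python
import math

def lr_at_step(step: int,
               peak_lr: float = 3.0e-4,     # placeholder peak (matches the 7B/13B rows in the table below)
               warmup_steps: int = 2000,
               total_steps: int = 100_000,  # placeholder total number of training steps
               min_ratio: float = 0.10) -> float:
    """Linear warm-up to peak_lr, then cosine decay down to min_ratio * peak_lr."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps                   # linear warm-up
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))        # goes from 1 to 0 over training
    return peak_lr * (min_ratio + (1.0 - min_ratio) * cosine)  # decays peak_lr -> 0.1 * peak_lr

for step in (0, 1_000, 2_000, 50_000, 100_000):
    print(step, f"{lr_at_step(step):.2e}")
```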

Normalization Layer: RMSNorm

Let's take two hidden layers, L1 and L2. Both learn based on their input data and weights.

Need:

  • L2 learns from the output of L1 and its own weights.
  • The output of L1 depends on the input data and L1's weights.
  • If the output of L1 is not normalized, it can be very large or very small.
  • L2 then has to learn from an arbitrary range of values, which makes its learning difficult and slow.

Shapes:

  • \(output = X W^T + b\)
  • Shape of \(X = (batch\_size, n\_features) = (10, 512)\)
  • Shape of \(W = (n\_neurons, n\_features) = (5, 512)\), so \(W^T = (512, 5)\)
  • Shape of \(output = (batch\_size, n\_neurons) = (10, 5)\) → these 5 neurons become the features for the next layer.
  • Shape of \(b = (n\_neurons) = (5)\) → one bias per neuron, broadcast to (10, 5), i.e., to all 10 samples.
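
A minimal NumPy sketch of these shapes, reusing the example sizes above (10 samples, 512 input features, 5 neurons):

```python
import numpy as np

batch_size, n_features, n_neurons = 10, 512, 5

X = np.random.randn(batch_size, n_features)  # (10, 512): one row per sample
W = np.random.randn(n_neurons, n_features)   # (5, 512): one weight row per neuron
b = np.zeros(n_neurons)                      # (5,): one bias per neuron

output = X @ W.T + b  # (10, 512) @ (512, 5) -> (10, 5); b is broadcast across the batch
print(output.shape)   # (10, 5)
```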

(Yet to complete the above example with the RMSNorm layer.)
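
In the meantime, here is a minimal PyTorch sketch of an RMSNorm layer, following the standard formulation (each sample is scaled by the inverse root mean square of its features, then by a learned gain); applying it to the (10, 5) output above would give L2 inputs in a stable range:

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root-mean-square normalization: no mean subtraction, unlike LayerNorm."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learnable gain, one per feature

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # RMS over the feature dimension; eps keeps the division numerically stable
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return (x / rms) * self.weight

norm = RMSNorm(dim=5)                  # 5 features, as in the shapes example
normalized = norm(torch.randn(10, 5))  # (10, 5) in, (10, 5) out
print(normalized.shape)                # torch.Size([10, 5])
```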

Transformer vs Llama

|                   | Transformer        | Llama 2            |
|-------------------|--------------------|--------------------|
| Norm Layer        | LayerNorm          | RMSNorm            |
| Order of layers   | Attention --> Norm | Norm --> Attention |
| Position Encoding | Sinusoidal         | Rotary             |
| Activation        | ReLU               | SwiGLU             |
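
To make the activation-function row concrete, here is a minimal sketch of a SwiGLU feed-forward block of the kind used in Llama, which replaces the ReLU FFN of the original Transformer: the output is down(SiLU(gate(x)) * up(x)). The gate/up/down names and the hidden size are illustrative, not taken from this document.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    """FFN(x) = down( SiLU(gate(x)) * up(x) ), in place of down(ReLU(up(x)))."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.gate = nn.Linear(dim, hidden_dim, bias=False)  # gating branch, passed through SiLU (Swish)
        self.up = nn.Linear(dim, hidden_dim, bias=False)    # linear branch
        self.down = nn.Linear(hidden_dim, dim, bias=False)  # projection back to the model dimension

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.gate(x)) * self.up(x))

ffn = SwiGLUFeedForward(dim=4096, hidden_dim=11008)  # 7B-style sizes, for illustration
out = ffn(torch.randn(2, 16, 4096))                  # (batch, seq_len, dim)
print(out.shape)                                     # torch.Size([2, 16, 4096])
```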

Comparison of Llama 1, Llama 2, and Original Transformer Architectures

| Model                | Model Size (Params) | Dimension | n Heads | n Layers | Learning Rate | Batch Size | n Tokens | Context Length |
|----------------------|---------------------|-----------|---------|----------|---------------|------------|----------|----------------|
| Llama 1              | 7B                  | 4096      | 32      | 32       | 3.0e-4        | 4M         | 1.0T     | 2k             |
|                      | 13B                 | 5120      | 40      | 40       | 3.0e-4        | 4M         | 1.0T     | 2k             |
|                      | 33B                 | 6656      | 52      | 60       | 1.5e-4        | 4M         | 1.4T     | 2k             |
|                      | 65B                 | 8192      | 64      | 80       | 1.5e-4        | 4M         | 1.4T     | 2k             |
| Original Transformer | Base (65M)          | 512       | 8       | 6        |               |            |          |                |
|                      | Big (213M)          | 1024      | 16      | 6        |               |            |          |                |
| Llama 2              | 7B                  |           |         |          | 3.0e-4        |            | 2.0T     | 4k             |
|                      | 13B                 |           |         |          | 3.0e-4        |            | 2.0T     | 4k             |
|                      | 34B                 |           |         |          | 1.5e-4        |            | 2.0T     | 4k             |
|                      | 70B                 |           |         |          | 1.5e-4        |            | 2.0T     | 4k             |

* n Tokens: Number of tokens in the training dataset