Llama

Papers:

Llama 2

  • Tokenization: Byte-Pair Encoding (BPE) algorithm
  • Trained on a vast dataset of 2 trillion tokens
  • Architecture: Transformer
    • pre-normalization with RMSNorm
    • SwiGLU activation function
    • Rotary Positional Embedding
    • KV Cache
  • Training:
    • AdamW optimizer with a cosine learning rate schedule, a warm-up period of 2,000 steps, and decay of the final learning rate to 10% of the peak learning rate (see the sketch after this list).
    • Weight decay of 0.1 and gradient clipping.
  • Fine-tuning:
    • Supervised Fine-Tuning (SFT) and
    • Reinforcement Learning with Human Feedback (RLHF) components.
      1. Proximal Policy Optimization (PPO)
      2. Rejection Sampling Fine-Tuning
    • Ghost Attention (GAtt)
    • Meta acknowledged the issue of context loss in multi-turn conversations and addressed it with the Ghost Attention (GAtt) method.
    • This method involves artificially concatenating the instruction to all user messages in the conversation.
    • Subsequently, Meta used the latest RLHF (Reinforcement Learning with Human Feedback) model to sample from this augmented dataset. This produced context-rich dialogues and corresponding samples, which were used to fine-tune the model, somewhat similar in spirit to Rejection Sampling. The result showed improved attention to the instruction compared with the existing model. It's worth noting that this approach was evaluated specifically on the 70B models.
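
The learning-rate schedule above is easy to sketch: a linear warm-up for 2,000 steps, followed by a cosine decay from the peak learning rate down to 10% of the peak. The peak learning rate and total step count below are placeholder values for illustration, not Meta's actual training configuration.

```python
import math

def lr_at_step(step: int,
               peak_lr: float = 3.0e-4,     # placeholder peak (matches the 7B/13B rows in the table below)
               warmup_steps: int = 2000,
               total_steps: int = 100_000,  # placeholder total number of training steps
               min_ratio: float = 0.10) -> float:
    """Linear warm-up to peak_lr, then cosine decay down to min_ratio * peak_lr."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps                   # linear warm-up
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))        # goes from 1 to 0 over training
    return peak_lr * (min_ratio + (1.0 - min_ratio) * cosine)  # decays peak_lr -> 0.1 * peak_lr

for step in (0, 1_000, 2_000, 50_000, 100_000):
    print(step, f"{lr_at_step(step):.2e}")
```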

Normalization Layer: RMSNorm

Let's take two hidden layers, L1 and L2. Both learn based on their input data and weights.

Need:

  • L2 learns from the output of L1 and its own weights.
  • The output of L1 depends on the input data and L1's weights.
  • If the output of L1 is not normalized, it can be very large or very small.
  • L2 then has to learn from an arbitrary range of values, which makes its learning difficult and slow.

Shapes:

  • \(output = X W^T + b\)
  • Shape of \(X = (batch\_size, n\_features) = (10, 512)\)
  • Shape of \(W = (n\_neurons, n\_features) = (5, 512)\), so \(W^T = (512, 5)\)
  • Shape of \(output = (batch\_size, n\_neurons) = (10, 5)\) → these 5 neurons become the features for the next layer.
  • Shape of \(b = (n\_neurons) = (5)\) → one bias per neuron, broadcast to (10, 5), i.e., to all 10 samples.
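
A minimal NumPy sketch of these shapes, reusing the example sizes above (10 samples, 512 input features, 5 neurons):

```python
import numpy as np

batch_size, n_features, n_neurons = 10, 512, 5

X = np.random.randn(batch_size, n_features)  # (10, 512): one row per sample
W = np.random.randn(n_neurons, n_features)   # (5, 512): one weight row per neuron
b = np.zeros(n_neurons)                      # (5,): one bias per neuron

output = X @ W.T + b  # (10, 512) @ (512, 5) -> (10, 5); b is broadcast across the batch
print(output.shape)   # (10, 5)
```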

(Yet to complete the above example with the RMSNorm layer.)
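
In the meantime, here is a minimal PyTorch sketch of an RMSNorm layer, following the standard formulation (each sample is scaled by the inverse root mean square of its features, then by a learned gain); applying it to the (10, 5) output above would give L2 inputs in a stable range:

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root-mean-square normalization: no mean subtraction, unlike LayerNorm."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learnable gain, one per feature

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # RMS over the feature dimension; eps keeps the division numerically stable
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return (x / rms) * self.weight

norm = RMSNorm(dim=5)                  # 5 features, as in the shapes example
normalized = norm(torch.randn(10, 5))  # (10, 5) in, (10, 5) out
print(normalized.shape)                # torch.Size([10, 5])
```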

Transformer vs Llama

|                   | Transformer        | Llama 2            |
|-------------------|--------------------|--------------------|
| Norm Layer        | LayerNorm          | RMSNorm            |
| Order of layers   | Attention --> Norm | Norm --> Attention |
| Position Encoding | Sinusoidal         | Rotary             |
| Activation        | ReLU               | SwiGLU             |
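
To make the activation-function row concrete, here is a minimal sketch of a SwiGLU feed-forward block of the kind used in Llama, which replaces the ReLU FFN of the original Transformer: the output is down(SiLU(gate(x)) * up(x)). The gate/up/down names and the hidden size are illustrative, not taken from this document.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    """FFN(x) = down( SiLU(gate(x)) * up(x) ), in place of down(ReLU(up(x)))."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.gate = nn.Linear(dim, hidden_dim, bias=False)  # gating branch, passed through SiLU (Swish)
        self.up = nn.Linear(dim, hidden_dim, bias=False)    # linear branch
        self.down = nn.Linear(hidden_dim, dim, bias=False)  # projection back to the model dimension

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.gate(x)) * self.up(x))

ffn = SwiGLUFeedForward(dim=4096, hidden_dim=11008)  # 7B-style sizes, for illustration
out = ffn(torch.randn(2, 16, 4096))                  # (batch, seq_len, dim)
print(out.shape)                                     # torch.Size([2, 16, 4096])
```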

Comparison of Llama 1, Llama 2, and Original Transformer Architectures

| Model                | Model Size (Params) | Dimension | n Heads | n Layers | Learning Rate | Batch Size | n Tokens | Context Length |
|----------------------|---------------------|-----------|---------|----------|---------------|------------|----------|----------------|
| Llama 1              | 7B                  | 4096      | 32      | 32       | 3.0e-4        | 4M         | 1.0T     | 2k             |
|                      | 13B                 | 5120      | 40      | 40       | 3.0e-4        | 4M         | 1.0T     | 2k             |
|                      | 33B                 | 6656      | 52      | 60       | 1.5e-4        | 4M         | 1.4T     | 2k             |
|                      | 65B                 | 8192      | 64      | 80       | 1.5e-4        | 4M         | 1.4T     | 2k             |
| Original Transformer | Base (65M)          | 512       | 8       | 6        |               |            |          |                |
|                      | Big (213M)          | 1024      | 16      | 6        |               |            |          |                |
| Llama 2              | 7B                  |           |         |          | 3.0e-4        |            | 2.0T     | 4k             |
|                      | 13B                 |           |         |          | 3.0e-4        |            | 2.0T     | 4k             |
|                      | 34B                 |           |         |          | 1.5e-4        |            | 2.0T     | 4k             |
|                      | 70B                 |           |         |          | 1.5e-4        |            | 2.0T     | 4k             |

* n Tokens: Number of tokens in the training dataset