Position Encoding Methods in Transformers
Why do we need Position Encodings in Transformers?
In the Transformer architecture, the model processes all input tokens in parallel. This parallelism is a significant advantage over sequential models like RNNs and LSTMs, but it also means the model has no inherent understanding of the order of the input.
To address this, the Transformer adds Positional Encodings to the input embeddings to provide information about the position of each token in the sequence. This allows the model to learn the positions of tokens and capture sequential information.
- Transformer models are order-invariant by default.
- Preserving order information requires adding positional embeddings.
- Two main types: absolute and relative positional embeddings.
Absolute Positional Embeddings
\[f_{t:t\in\{q,k,v\}}(x_i, i) = W_{t:t\in\{q,k,v\}}(x_i + p_i)\]
Where:
- \(f_{t:t\in\{q,k,v\}}\) is the function that produces the query, key, or value vector.
- \(x_i\) is the input embedding for token \(i\).
- \(p_i\) is the positional encoding for token \(i\).
- \(W_{t:t\in\{q,k,v\}}\) is the weight matrix for the query, key, or value projection.
- \(i\) is the token index (i.e., the position of the token in the input sequence).
The equation above shows how the positional encoding \(p_i\) is added to the input embedding \(x_i\) before the linear transformations that produce the query, key, and value vectors in the self-attention mechanism. Because the position is injected at this step, the model can learn the position of each token in the input sequence and capture sequential information.
Absolute positional embeddings represent each word's absolute position with a vector, which is added to the word's embedding to encode where the word occurs in the sentence. This is the most common way of adding positional information to the input embeddings in Transformer models.
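To make this concrete, here is a minimal PyTorch-style sketch of the addition step; the tensor shapes and variable names are illustrative assumptions, not taken from any particular implementation:

```python
import torch
import torch.nn as nn

d_model, seq_len = 512, 10                     # illustrative sizes

x = torch.randn(1, seq_len, d_model)           # input embeddings x_i, shape (batch, seq, d_model)
p = torch.randn(1, seq_len, d_model)           # positional encodings p_i (learned or sinusoidal)

W_q = nn.Linear(d_model, d_model, bias=False)  # W_q
W_k = nn.Linear(d_model, d_model, bias=False)  # W_k
W_v = nn.Linear(d_model, d_model, bias=False)  # W_v

h = x + p                                      # add position information to the embeddings
q, k, v = W_q(h), W_k(h), W_v(h)               # f_t(x_i, i) = W_t (x_i + p_i) for t in {q, k, v}
```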
There are two main ways to generate absolute positional embeddings:
Learned Positional Embeddings (a minimal sketch follows the comparison below):
- An embedding layer learns a vector for each position.
- The position embeddings are updated during training.
- Captures position information, but requires additional parameters.
- The maximum sequence length is bounded by max_seq_len.
Sinusoidal Positional Encodings: (Explained in the next section)
- Fixed sinusoidal functions used to encode positions.
- Captures position information without additional parameters.
- Used in the original Transformer model.
Empirical comparison:
- Both methods perform similarly in real models.
- Shared issue: absolute embeddings do not clearly distinguish or relate distant positions, which motivates relative approaches.
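As referenced above, here is a minimal sketch of learned positional embeddings as a PyTorch-style module; the class name and sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn

class LearnedPositionalEmbedding(nn.Module):
    """A trainable lookup table of position vectors, bounded by max_seq_len."""
    def __init__(self, max_seq_len: int, d_model: int):
        super().__init__()
        self.pos_emb = nn.Embedding(max_seq_len, d_model)  # adds max_seq_len * d_model parameters

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); seq_len must not exceed max_seq_len
        positions = torch.arange(x.size(1), device=x.device)
        return x + self.pos_emb(positions)                 # p_i is updated by backpropagation

x = torch.randn(2, 16, 64)
out = LearnedPositionalEmbedding(max_seq_len=512, d_model=64)(x)
```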
Types of Position Encodings
There are several types of Positional Encodings used in Transformers. Some of the most common types include:
Sine and Cosine Positional Encodings
The most common type of Positional Encoding used in Transformers is the Sine and Cosine Positional Encoding. This encoding method uses a combination of sine and cosine functions to encode the position of each token in the input sequence.
The formula for the Sine and Cosine Positional Encoding is as follows:
\[PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right)\]
\[PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)\]
where:
- \(PE_{(pos, 2i)}\) and \(PE_{(pos, 2i+1)}\) are the positional encodings for position \(pos\) at dimensions \(2i\) and \(2i+1\), respectively.
- \(pos\) is the position of the token in the input sequence.
- \(i\) indexes the dimension pairs of the positional encoding.
- \(d_{model}\) is the embedding dimension of the model.
- \(10000\) is a constant that scales the frequencies of the positional encodings.
- The sine function is used for even dimensions, and the cosine function is used for odd dimensions.
- The positional encodings are added to the input embeddings to provide information about the position of each token in the input sequence.
- The model learns to attend to the positional encodings during training to capture the sequential information in the input sequence.
- The Sine and Cosine Positional Encoding is used in the original Transformer model and has been widely adopted in subsequent Transformer architectures.
- It is effective at capturing the relative positions of tokens and has been shown to improve the performance of Transformer models on a variety of tasks. (A minimal implementation sketch follows this list.)
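The table defined by the formula above can be computed directly. Here is a minimal PyTorch sketch; the function name and sizes are illustrative assumptions:

```python
import torch

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """Builds the (max_len, d_model) table of sine/cosine positional encodings."""
    position = torch.arange(max_len).unsqueeze(1)     # pos, shape (max_len, 1)
    two_i = torch.arange(0, d_model, 2)               # even dimension indices 2i
    div_term = torch.pow(10000.0, two_i / d_model)    # 10000^(2i / d_model)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position / div_term)      # even dimensions: sine
    pe[:, 1::2] = torch.cos(position / div_term)      # odd dimensions: cosine
    return pe

pe = sinusoidal_positional_encoding(max_len=128, d_model=64)
# Each row is added to the token embedding at that position; no parameters are learned.
```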
Relative Positional Encodings
Paper: Self-Attention with Relative Position Representations by Shaw et al. (2018), Google.
Relative positional encodings are another type of positional encoding used in Transformers. They capture the relative positions of tokens in the input sequence rather than their absolute positions, which lets the model reason directly about how far apart tokens are while still capturing sequential information.
Key ideas behind relative positional encodings:
- They represent positional offsets between tokens rather than absolute positions.
- There are many ways to implement them, for example:
- Learned relative embeddings (as in Shaw et al.).
- Bias terms added in the self-attention mechanism (as in T5).
The T5 model uses a learned bias to represent relative distances between tokens. In simplified form, the attention logit between a query at position \(i\) and a key at position \(j\) becomes
\[e_{ij} = q_i \cdot k_j + b_{r(j-i)}\]
where \(b\) is a learned scalar bias (one per attention head) indexed by the bucketed relative distance \(r(j-i)\).
T5 Bias Relative Embeddings:
- Advantages:
- Captures relative positions between tokens.
- More efficient than absolute embeddings.
- Challenges with relative embeddings (a simplified code sketch follows this list):
- Slower computation: the self-attention mechanism needs extra steps, because the positional bias matrix must be added to the query-key score matrix.
- The relative embeddings change for each token pair.
- This makes it difficult to use the key-value cache effectively.
- As a result, they are not commonly used in the largest language models.
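As an illustration of the bias idea (not the exact T5 implementation, which also buckets distances and learns a separate bias per attention head), here is a simplified single-head PyTorch sketch:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def attention_with_relative_bias(q, k, v, rel_bias: nn.Embedding):
    """Scaled dot-product attention with a learned bias added per relative distance.

    q, k, v: (batch, seq_len, d_k); rel_bias maps a relative offset to a scalar bias.
    """
    seq_len, d_k = q.size(1), q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5          # content term, shape (batch, seq, seq)

    # Relative offset j - i for every query/key pair, shifted to non-negative indices.
    offsets = torch.arange(seq_len)[None, :] - torch.arange(seq_len)[:, None] + (seq_len - 1)
    scores = scores + rel_bias(offsets).squeeze(-1)        # add b_{j-i} to each attention logit

    return F.softmax(scores, dim=-1) @ v

seq_len, d_k = 8, 32
rel_bias = nn.Embedding(2 * seq_len - 1, 1)                # one scalar bias per possible offset
q = k = v = torch.randn(1, seq_len, d_k)
out = attention_with_relative_bias(q, k, v, rel_bias)
```

Because the bias depends on query position as well as key position, the extra addition has to be recomputed as the sequence grows, which is part of why key-value caching is harder than with absolute embeddings.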
Rotary Positional Embeddings
Paper: RoFormer: Enhanced Transformer with Rotary Position Embedding by Su et al. (2021)
- The Transformer architecture has been remarkably stable since the "Attention Is All You Need" paper in 2017, with minimal architectural changes for several years.
- One notable improvement, RoPE (rotary positional embeddings), was introduced in the RoFormer paper and has been quickly adopted by many language models.
RoPE:
- Proposes rotating word vectors instead of adding positional vectors to them.
- The rotation angle depends on the word's absolute position.
- Preserves the advantages of both absolute and relative positional embeddings.
Rotary positional embeddings are designed so that if both words shift position by the same amount (for example, when words are added to the beginning of the sentence), their two vectors are rotated by the exact same additional angle, and the angle between them is preserved. The dot product between the two vectors therefore remains the same as long as the distance between the two words stays the same. Rotary positional embeddings thus combine the advantages of absolute and relative positional embeddings.
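A brief sketch of why this holds in the two-dimensional case (this derivation is not spelled out above; \(R_\theta\) denotes rotation by angle \(\theta\), and the result follows from the orthogonality of rotation matrices): rotating a query at position \(m\) and a key at position \(n\) gives
\[\langle R_{m\theta}\, q,\; R_{n\theta}\, k \rangle = q^\top R_{m\theta}^\top R_{n\theta}\, k = q^\top R_{(n-m)\theta}\, k = \langle q,\; R_{(n-m)\theta}\, k \rangle,\]
so the attention score depends only on the relative offset \(n - m\), even though each vector was rotated according to its absolute position.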
Matrix Formulation
- The rotation can be expressed as a matrix multiplication, but doing it this way is impractical due to the computational cost.
- In practice it is implemented with element-wise vector operations, which is simpler and more efficient (a sketch follows).
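Here is a minimal sketch of applying the rotation with vector operations, following the common "rotate-half" formulation; exact conventions (such as interleaving of dimension pairs) differ between implementations, and the names and sizes here are illustrative assumptions:

```python
import torch

def rotate_half(x: torch.Tensor) -> torch.Tensor:
    """Maps the two halves of the last dimension (x1, x2) to (-x2, x1)."""
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotates each query or key vector by an angle that grows with its position.

    x: (batch, seq_len, d), with d even.
    """
    seq_len, d = x.size(1), x.size(-1)
    inv_freq = 1.0 / base ** (torch.arange(0, d, 2) / d)          # per-pair rotation frequencies
    angles = torch.arange(seq_len)[:, None] * inv_freq[None, :]   # (seq_len, d/2)
    cos = torch.cat((angles.cos(), angles.cos()), dim=-1)         # (seq_len, d)
    sin = torch.cat((angles.sin(), angles.sin()), dim=-1)
    return x * cos + rotate_half(x) * sin                         # element-wise rotation, no matmul

q = torch.randn(1, 16, 64)
q_rot = apply_rope(q)   # the same function is applied to keys before computing attention scores
```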