Gall’s Law
A complex system that works is invariably found to have evolved from a simple
system that worked
John Gall
This post walks you through the step-by-step discovery of state-of-the-art positional encoding in transformer models. We will achieve
this by iteratively improving our approach to encoding position, arriving at Rotary Positional Encoding (RoPE), which is used in the latest Llama 3.2 release and most modern transformers. This post intends to limit the mathematical knowledge required to follow along, but some basic linear algebra, trigonometry and an understanding of self attention is expected.
Problem Statement
You shall know a word by the company it keeps
John Rupert Firth
As with all problems, it is best to start by understanding exactly what we are trying to achieve. The self attention mechanism in transformers is used to understand relationships
between tokens in a sequence. Self attention is a set operation, which
means it is permutation equivariant. If we do not
enrich self attention with positional information, many important relationships
cannot be determined.
This is best demonstrated by example.
Motivating Example
Consider this sentence with the same word in different positions:
$$\text{The dog chased another dog}$$
Intuitively, “dog” refers to two different entities. Let’s see what happens if we first tokenize them, map to the real token embeddings of Llama 3.2 1B and pass them through torch.nn.MultiheadAttention.
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

model_id = "meta-llama/Llama-3.2-1B"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

text = "The dog chased another dog"
tokens = tok(text, return_tensors="pt")["input_ids"]
embeddings = model.embed_tokens(tokens)
hdim = embeddings.shape[-1]

W_q = nn.Linear(hdim, hdim, bias=False)
W_k = nn.Linear(hdim, hdim, bias=False)
W_v = nn.Linear(hdim, hdim, bias=False)
mha = nn.MultiheadAttention(embed_dim=hdim, num_heads=4, batch_first=True)

# Random initialization; no training is needed to demonstrate the effect
with torch.no_grad():
    for param in mha.parameters():
        nn.init.normal_(param, std=0.1)

output, _ = mha(W_q(embeddings), W_k(embeddings), W_v(embeddings))

# Indices 2 and 5 are the two "dog" tokens (index 0 is the BOS token)
dog1_out = output[0, 2]
dog2_out = output[0, 5]
print(f"Dog output identical?: {torch.allclose(dog1_out, dog2_out, atol=1e-6)}")
As we can see, without any positional information, the output of a (multi-headed) self attention operation is identical for the same token in
different positions, despite the tokens clearly representing distinct entities. Let's begin designing a method of enhancing self attention with positional information, so that it can determine relationships between words based on
their positions.
How should an ideal positional encoding scheme behave?
Desirable Properties
Let’s try and define some desirable properties that will make the optimization
process as easy as possible.
Property 1 – Unique encoding for each position (across sequences)
Each position needs a unique encoding that remains consistent regardless of sequence length – a token at position 5 should have the same encoding whether the current sequence is of length 10 or 10,000.
Property 2 – Linear relation between two encoded positions
The relationship between positions should be mathematically simple. If we know the encoding for position $p$, it should be straightforward to compute the encoding for position $p+k$, making it easier for the model to learn positional patterns.
If you think about how we represent numbers on a number line, it’s easy to understand that 5 is 2 steps away from 3, or that 10 is 5 steps from 15. The same intuitive relationship should exist in our encodings.
Property 3 – Generalizes to longer sequences than those encountered in training
To increase our models’ utility in the real world, they should generalize outside
their training distribution. Therefore, our encoding scheme needs to be
adaptable enough to handle unexpected input lengths, without
violating any of our other desirable properties.
Property 4 – Generated by a deterministic process the model can learn
It would be ideal if our positional encodings could be drawn from a
deterministic process. This should allow the model to learn the mechanism
behind our encoding scheme efficiently.
Property 5 – Extensible to multiple dimensions
With multimodal models becoming the norm, it is crucial that our positional
encoding scheme can naturally extend from 1D to $n$D. This will allow models to consume data like images or brain scans, which are 2D and 4D respectively.
Now we know the ideal properties (henceforth referred to as $Pr_n$), let's start designing and iterating on our encoding scheme.
Integer Position Encoding
The first approach that may jump to mind is simply to add the integer value of the token position to each component of the token embedding, with values ranging from $0 \rightarrow L$, where $L$ is the
length of our current sequence.
In the above animation, we create our positional encoding vector for the token “chased” from the index and add it to our token embedding. The embedding values here are a subset of the real values from Llama 3.2 1B. We can observe that they're clustered around 0. This
is desirable to avoid vanishing or exploding gradients during training and therefore is something we'd like to maintain throughout the model.
It’s clear that our current naïve approach is going to cause problems. The magnitude of the position value
vastly exceeds the actual values of our input. This means the signal-to-noise
ratio is very low, and it’s hard for the model to separate the semantic
information from the positional information.
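To make the scale mismatch concrete, here's a minimal sketch. It uses made-up random embeddings with a small standard deviation (standing in for the near-zero Llama values we saw above) rather than real model weights:

```python
import torch

torch.manual_seed(0)
seq_len, dim = 16, 8

# Stand-in token embeddings, clustered near zero like real LLM embeddings
embeddings = torch.randn(seq_len, dim) * 0.02

# Integer position encoding: add the raw position index to every component
positions = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)  # shape (seq_len, 1)
encoded = embeddings + positions

print(embeddings.abs().mean())  # ~0.016: the semantic signal
print(positions.abs().mean())   # 7.5: the positional term dwarfs it
```

Even at sequence length 16 the positional term is orders of magnitude larger than the embedding values it is added to, and at realistic context lengths the mismatch is far worse.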
With this new knowledge, a natural follow-on might be to normalize the position value by $\frac{1}{N}$. This constrains the values between 0 and 1, but introduces another problem. If we choose $N$ to be the length of the current sequence, then the position values will be completely different for each sequence of differing lengths, violating $Pr_1$.
Is there a better way to ensure our numbers are between 0 and 1?
If we thought really hard about this for a while, we might come up with switching
from decimal to binary numbers.
Binary Position Encoding
Instead of adding our (potentially normalized) integer position to each
component of the embedding, we could instead convert it into its binary
representation and s t r e t c h our value out to match our embedding dimension, as demonstrated below.
We’ve converted the position of interest (252) into its binary representation
(11111100) and added each bit to the corresponding component of the
token embedding. The least significant bit (LSB) will cycle between 0 and 1 for every
subsequent token, whilst the most significant bit (MSB) will cycle every $2^{n-1}$ tokens, where $n$ is the number of bits.
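As a small illustrative sketch (not taken from the post's animations), converting a position into a bit vector spread across the embedding dimension might look like this, with the least significant bit in the first component:

```python
import torch

def binary_position_encoding(pos: int, dim: int) -> torch.Tensor:
    """Spread the binary representation of `pos` across `dim` components, LSB first."""
    bits = [(pos >> i) & 1 for i in range(dim)]
    return torch.tensor(bits, dtype=torch.float32)

# 252 in binary is 11111100
print(binary_position_encoding(252, 8))  # tensor([0., 0., 1., 1., 1., 1., 1., 1.])
```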
You can see the positional encoding vector for different indices in the animation below (^1).
We’ve solved the value range problem, and we now have unique encodings that are
consistent across different sequence lengths. What happens if we plot a low-dimensional version of our token embedding and visualize the addition of our binary positional vector for different values?
We can see that the result is very “jumpy” (as we might expect from the
discrete nature of binary). The optimization process likes smooth, continuous and
predictable changes. Do we know any functions with similar value ranges that are smooth and continuous?
If we looked around a little, we might notice that both $\sin$ and $\cos$ fit the bill!
Sinusoidal Positional Encoding
The above animation visualizes our position embedding if each component is
alternately drawn from $\sin$ and $\cos$ with gradually increasing
wavelengths. If you compare it with the previous animation, you'll notice a striking similarity!
We’ve now arrived at Sinusoidal embeddings; originally defined in the Attention is all you need paper.
Let’s look at the equations:
$$PE_{(pos,2i)} = \sin\left(\frac{pos}{10000^{2i/d}}\right)$$
$$PE_{(pos,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d}}\right)$$
where $pos$ is the token's position index, $i$ is the component index
in the positional encoding vector, and $d$ is the model dimension. $10{,}000$ is the base wavelength (henceforth referred to as $\theta$), which we stretch or compress as a function of the component index. I encourage you to plug in some realistic values to get a feel for this
geometric progression.
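If you'd rather plug values in with code than by hand, here is a minimal reference implementation of the equations above (a sketch assuming an even model dimension $d$):

```python
import torch

def sinusoidal_position_encoding(seq_len: int, d: int, theta: float = 10000.0) -> torch.Tensor:
    """PE[pos, 2i] = sin(pos / theta^(2i/d)), PE[pos, 2i+1] = cos(pos / theta^(2i/d))."""
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)   # (seq_len, 1)
    two_i = torch.arange(0, d, 2, dtype=torch.float32)              # 2i = 0, 2, ..., d-2
    angles = pos / theta ** (two_i / d)                             # (seq_len, d/2)
    pe = torch.zeros(seq_len, d)
    pe[:, 0::2] = torch.sin(angles)   # even components
    pe[:, 1::2] = torch.cos(angles)   # odd components
    return pe

pe = sinusoidal_position_encoding(seq_len=128, d=64)
print(pe.shape, pe.min().item(), pe.max().item())   # values stay within [-1, 1]
```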
There are a few parts of this equation that are confusing at first glance. How did the
authors choose $10{,}000$? Why are we using $\sin$ and $\cos$ for even and odd components respectively?
It seems that using $10{,}000$ for the base wavelength was determined experimentally (^2). Deciphering the usage of both $\sin$ and $\cos$ is more involved, but crucial
for our iterative approach to understanding. The key here is our desire for a linear relation between two encoded positions ($Pr_2$). To understand how using $\sin$ and $\cos$ in tandem produces this linear relation, we will have to dive into some trigonometry.
Consider a sequence of sine and cosine pairs, each associated with a frequency $\omega_i$. Our goal is to find a linear transformation matrix $\mathbf{M}$ that can shift these sinusoidal functions by a fixed offset $k$:
$$\mathbf{M} \cdot \begin{bmatrix} \sin(\omega_i p) \\ \cos(\omega_i p) \end{bmatrix} = \begin{bmatrix} \sin(\omega_i(p + k)) \\ \cos(\omega_i(p + k)) \end{bmatrix}$$
The frequencies $\omega_i$ follow a geometric progression that decreases with dimension index $i$, defined as:
$$\omega_i = \frac{1}{10000^{2i/d}}$$
To find this transformation matrix, we can express it as a general 2×2 matrix with unknown coefficients $u_1$, $v_1$, $u_2$, and $v_2$:
$$\begin{bmatrix} u_1 & v_1 \\ u_2 & v_2 \end{bmatrix} \cdot \begin{bmatrix} \sin(\omega_i p) \\ \cos(\omega_i p) \end{bmatrix} = \begin{bmatrix} \sin(\omega_i(p+k)) \\ \cos(\omega_i(p+k)) \end{bmatrix}$$
By applying the trigonometric addition theorem to the right-hand side, we can expand this into:
$$\begin{bmatrix} u_1 & v_1 \\ u_2 & v_2 \end{bmatrix} \cdot \begin{bmatrix} \sin(\omega_i p) \\ \cos(\omega_i p) \end{bmatrix} = \begin{bmatrix} \sin(\omega_i p)\cos(\omega_i k) + \cos(\omega_i p)\sin(\omega_i k) \\ \cos(\omega_i p)\cos(\omega_i k) - \sin(\omega_i p)\sin(\omega_i k) \end{bmatrix}$$
This expansion gives us a system of two equations by matching coefficients:
$$\begin{aligned}
u_1\sin(\omega_i p) + v_1\cos(\omega_i p) &= \cos(\omega_i k)\sin(\omega_i p) + \sin(\omega_i k)\cos(\omega_i p) \\
u_2\sin(\omega_i p) + v_2\cos(\omega_i p) &= -\sin(\omega_i k)\sin(\omega_i p) + \cos(\omega_i k)\cos(\omega_i p)
\end{aligned}$$
By comparing terms with $\sin(\omega_i p)$ and $\cos(\omega_i p)$ on both sides, we can solve for the unknown coefficients:
$$\begin{aligned}
u_1 &= \cos(\omega_i k) & v_1 &= \sin(\omega_i k) \\
u_2 &= -\sin(\omega_i k) & v_2 &= \cos(\omega_i k)
\end{aligned}$$
These solutions give us our final transformation matrix $\mathbf{M_k}$:
$$\mathbf{M_k} = \begin{bmatrix} \cos(\omega_i k) & \sin(\omega_i k) \\ -\sin(\omega_i k) & \cos(\omega_i k) \end{bmatrix}$$
If you've done any game programming before, you might notice that the
result of our derivation is oddly familiar. That's right, it's the Rotation Matrix! (^3)
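We can sanity-check the derivation numerically. The following sketch (with arbitrary values chosen for $\omega_i$, $p$ and $k$) confirms that $\mathbf{M_k}$ maps the encoding at position $p$ to the encoding at position $p + k$:

```python
import math
import torch

omega, p, k = 0.1, 7.0, 5.0   # arbitrary frequency, position and offset

# The derived transformation matrix M_k (a rotation by angle omega * k)
M_k = torch.tensor([[ math.cos(omega * k), math.sin(omega * k)],
                    [-math.sin(omega * k), math.cos(omega * k)]])

enc_p        = torch.tensor([math.sin(omega * p), math.cos(omega * p)])            # encoding at p
enc_p_plus_k = torch.tensor([math.sin(omega * (p + k)), math.cos(omega * (p + k))])  # encoding at p + k

print(torch.allclose(M_k @ enc_p, enc_p_plus_k))  # True: a fixed linear map shifts the position by k
```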
So the encoding scheme designed by Noam Shazeer in Attention is all you need was already encoding relative position as a rotation back in 2017! It took another 4 years to go from Sinusoidal Encoding to RoPE, despite rotations already being on the table…
Absolute vs Relative Position Encoding
With the knowledge in hand that rotations are important here, let’s
return to our motivating example and try to discover some intuitions for our next iteration.
$$\begin{aligned}
&\hspace{0.7em}0 \hspace{1.4em} 1 \hspace{2em} 2 \hspace{2.6em} 3 \hspace{2.4em} 4\\
&\text{The dog chased another dog} \\
\\
&\hspace{0.3em}\text{-2} \hspace{1.4em} \text{-1} \hspace{1.7em} 0 \hspace{2.6em} 1 \hspace{2.4em} 2\\
&\text{The dog chased another dog}
\end{aligned}$$
Above we can see the absolute positions of our tokens, and the relative
positions from “chased” to every other token. With Sinusoidal Encoding, we
generated a separate vector which represents the absolute position,
and using some trigonometric trickery we were able to encode relative positions.
When we’re trying to understand these sentences, does it matter that this word is the 2157th word in this blog post? Or do we care about its relationship to the words around it? The absolute position of a word rarely matters for meaning – what matters is how words relate to each other.
Positional encoding in context
From this point on, it’s key to consider positional encoding in the context of
self attention. To reiterate, the self-attention mechanism enables the model to weigh the importance of different elements in an input sequence and dynamically adjust their influence on the output.
$$\text{Attn}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
In all our previous iterations, we've generated a separate positional encoding
vector and added it to our token embedding prior to our $Q$, $K$ and $V$ projections.
By adding the positional information directly to our token embedding, we are
polluting the semantic information with the positional information. We should
be attempting to encode the positional information without modifying the norm of the embedding. Shifting from an additive to a multiplicative encoding is the
key.
Using the dictionary analogy, when looking up a word (query) in our dictionary (keys), nearby words should have more influence than distant ones. The influence of one token upon another is determined by the $QK^T$ dot product – so that's exactly where we should focus our positional encoding!
$$\vec{a} \cdot \vec{b} = |\vec{a}| |\vec{b}| \cos \theta$$
The geometric interpretation of the dot product shown above gives us a magnificent insight.
We can modulate the result of our dot product of two vectors purely by
increasing or decreasing the angle between them. Furthermore, by rotating the
vector, we have absolutely zero impact on the norm of the vector, which encodes
the semantic information of our token.
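A tiny sketch (with arbitrary 2D vectors) makes both points concrete: rotating a query leaves its norm untouched while changing its dot product with a key.

```python
import math
import torch

def rotate_2d(v: torch.Tensor, angle: float) -> torch.Tensor:
    """Rotate a 2D vector by `angle` radians."""
    R = torch.tensor([[math.cos(angle), -math.sin(angle)],
                      [math.sin(angle),  math.cos(angle)]])
    return R @ v

q = torch.tensor([1.0, 2.0])
k = torch.tensor([0.5, -1.0])
q_rot = rotate_2d(q, angle=0.3)

print(torch.allclose(q.norm(), q_rot.norm()))  # True: the semantic "magnitude" is preserved
print((q @ k).item(), (q_rot @ k).item())      # the attention score changes with the angle
```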
So now that we know where to focus our attention, and have seen from another angle why a
rotation might be a sensible “channel” in which to encode our positional
information, let's put it all together!
Rotary Positional Encoding
Rotary Positional Encoding, or RoPE, was defined in the
RoFormer paper (Jianlin Su designed it independently on his blog here and here).
While it may seem like voodoo if you skip to the end result, by thinking about Sinusoidal Encoding in the
context of self attention (and more specifically dot products), we can see how
it all comes together.
Much like in Sinusoidal Encoding, we decompose our vectors $\mathbf{q}$ and $\mathbf{k}$ (instead of the pre-projection $\mathbf{x}$) into 2D pairs/chunks. Rather than encoding absolute position directly by adding a vector we drew from sinusoidal functions of slowly decreasing frequencies, we cut to the chase and encode relative position by multiplying each pair with the rotation matrix.
Let $\mathbf{q}$ or $\mathbf{k}$ be our input vector at position $p$. We create a block diagonal matrix
where $\mathbf{M_i}$ is the corresponding rotation matrix for that component
pair's desired rotation:
$$R(\mathbf{q}, p) = \begin{pmatrix} \mathbf{M_1} & & & \\ & \mathbf{M_2} & & \\ & & \ddots & \\ & & & \mathbf{M_{d/2}} \end{pmatrix} \begin{pmatrix} q_1 \\ q_2 \\ \vdots \\ q_d \end{pmatrix}$$
Much the same as Sinusoidal Encoding, $\mathbf{M_i}$ is simply:
$$\mathbf{M_i} = \begin{bmatrix} \cos(\omega_i p) & \sin(\omega_i p) \\ -\sin(\omega_i p) & \cos(\omega_i p) \end{bmatrix}$$
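As a deliberately naive sketch (hypothetical code, for illustration only), we can build this block diagonal matrix explicitly and check RoPE's headline property: after rotating $\mathbf{q}$ and $\mathbf{k}$ by their own positions, their dot product depends only on the relative offset.

```python
import math
import torch

def rope_rotation_matrix(p: int, d: int, theta: float = 10000.0) -> torch.Tensor:
    """Block diagonal matrix of 2x2 rotation blocks M_i for position p."""
    R = torch.zeros(d, d)
    for i in range(d // 2):
        omega_i = 1.0 / theta ** (2 * i / d)          # frequency for the i-th pair
        c, s = math.cos(omega_i * p), math.sin(omega_i * p)
        R[2 * i:2 * i + 2, 2 * i:2 * i + 2] = torch.tensor([[c, s], [-s, c]])
    return R

q, k = torch.randn(16), torch.randn(16)

# Positions (5, 9) and (105, 109) share the same offset of 4...
score_a = (rope_rotation_matrix(5, 16) @ q) @ (rope_rotation_matrix(9, 16) @ k)
score_b = (rope_rotation_matrix(105, 16) @ q) @ (rope_rotation_matrix(109, 16) @ k)
print(torch.allclose(score_a, score_b, atol=1e-4))    # ...so the attention scores match
```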
In practice, we don’t use a matrix multiplication to compute RoPE as it would be
computationally inefficient with such a sparse matrix. Instead, we can directly apply the rotations to pairs of elements independently, taking advantage of the regular pattern in the computation:
$$R_{\Theta,p}^d q = \begin{pmatrix}
q_1 \\ q_2 \\ q_3 \\ q_4 \\ \vdots \\ q_{d-1} \\ q_d
\end{pmatrix} \otimes \begin{pmatrix}
\cos p\theta_1 \\ \cos p\theta_1 \\ \cos p\theta_2 \\ \cos p\theta_2 \\ \vdots \\ \cos p\theta_{d/2} \\ \cos p\theta_{d/2}
\end{pmatrix} + \begin{pmatrix}
-q_2 \\ q_1 \\ -q_4 \\ q_3 \\ \vdots \\ -q_d \\ q_{d-1}
\end{pmatrix} \otimes \begin{pmatrix}
\sin p\theta_1 \\ \sin p\theta_1 \\ \sin p\theta_2 \\ \sin p\theta_2 \\ \vdots \\ \sin p\theta_{d/2} \\ \sin p\theta_{d/2}
\end{pmatrix}$$
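A sketch of this element-wise form for a single vector might look like the following (a minimal illustration; production implementations such as Llama's batch this over heads and positions and may lay the pairs out differently):

```python
import torch

def apply_rope(x: torch.Tensor, p: int, theta: float = 10000.0) -> torch.Tensor:
    """Element-wise RoPE on a single vector x at position p, using interleaved pairs."""
    d = x.shape[-1]
    freqs = 1.0 / theta ** (torch.arange(0, d, 2, dtype=torch.float32) / d)  # theta_1 ... theta_{d/2}
    cos = torch.repeat_interleave(torch.cos(p * freqs), 2)  # [cos p*theta_1, cos p*theta_1, ...]
    sin = torch.repeat_interleave(torch.sin(p * freqs), 2)
    pairs = x.reshape(-1, 2)
    x_rot = torch.stack((-pairs[:, 1], pairs[:, 0]), dim=-1).reshape(-1)  # (-x2, x1, -x4, x3, ...)
    return x * cos + x_rot * sin

q = torch.randn(16)
q_rope = apply_rope(q, p=7)
print(torch.allclose(q.norm(), q_rope.norm()))  # True: the rotations never touch the norm
```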
That's all there is to it! By artfully applying our rotations to 2D chunks of $\mathbf{q}$ and $\mathbf{k}$ prior to their dot product, and switching from additive to
multiplicative, we can gain a big performance boost in evaluations (^4).
Extending RoPE to $n$-Dimensions
We've explored the 1D case for RoPE and by this point I hope you've gained an
intuitive understanding of an admittedly unintuitive component of transformers.
Finally, let's explore extending it to higher dimensions, such as images.
A natural first intuition could be to directly use the $\begin{bmatrix} x \\ y \end{bmatrix}$ coordinate pairs from the image. This might seem intuitive, after all, we were almost arbitrarily pairing up our components previously. However, this would be a mistake!
In the 1D case, we encode the relative position $m - n$ through a rotation of pairs
of values from our input vector. For 2D data, we need to encode both horizontal and vertical relative positions, say $m - n$ and $i - j$, independently. RoPE's brilliance lies in how it handles multiple dimensions. Instead of trying to encode all positional information in a single rotation, we pair components within the same dimension and rotate those, otherwise we would be intermixing the $x$ and $y$ offset information. By handling each dimension independently, we maintain the natural structure of the space. This can generalize to as many dimensions as required!
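To make this concrete, here is a sketch of one common way to do it for 2D data, sometimes called axial 2D RoPE (an illustration of the idea, not the exact scheme of any particular model): split the channels into two halves, rotate one half using the row index and the other using the column index.

```python
import torch

def rope_1d(x: torch.Tensor, pos: int, theta: float = 10000.0) -> torch.Tensor:
    """1D RoPE on interleaved pairs, as in the sketch in the previous section."""
    d = x.shape[-1]
    freqs = 1.0 / theta ** (torch.arange(0, d, 2, dtype=torch.float32) / d)
    cos = torch.repeat_interleave(torch.cos(pos * freqs), 2)
    sin = torch.repeat_interleave(torch.sin(pos * freqs), 2)
    pairs = x.reshape(-1, 2)
    x_rot = torch.stack((-pairs[:, 1], pairs[:, 0]), dim=-1).reshape(-1)
    return x * cos + x_rot * sin

def rope_2d(x: torch.Tensor, row: int, col: int) -> torch.Tensor:
    """Axial 2D RoPE: one half of the channels encodes the row offset, the other the column offset."""
    half = x.shape[-1] // 2
    return torch.cat((rope_1d(x[:half], row), rope_1d(x[half:], col)))

q = torch.randn(32)                 # e.g. a query for an image patch at (row=3, col=11)
q_2d = rope_2d(q, row=3, col=11)
```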
The future of positional encoding
Is RoPE the final incarnation of positional encoding? This recent paper from DeepMind deeply analyses RoPE and highlights some fundamental problems. TLDR: RoPE isn't a perfect solution; the models mostly focus on the lower frequencies, and removing the rotation for a certain percentage of the lowest frequencies improves performance on Gemma 2B!
I anticipate some future breakthroughs, perhaps taking inspiration from
signal processing with ideas like wavelets or hierarchical implementations. As models
are increasingly quantized for deployment, I’d also expect to see some
innovation in encoding schemes that remain robust under low-precision arithmetic.
Conclusion
Positional encoding has been, and continues to be, treated as an afterthought in
transformers. I believe we should view it differently – self attention has an
Achilles heel that has been repeatedly patched.
I hope this blog post showed you that you too could have discovered state-of-the-art
positional encoding, despite it being unintuitive at first. In a follow-up
post I'd love to explore practical implementation details for RoPE in order to
maximise performance.
This post was originally published here.
References
(^1): Binary and Sinusoidal animations are reproductions of animations contained
in this video.
(^2): Using $\theta = 10000$ gives us $2\pi \cdot 10000$ unique positions, or a
theoretical upper bound on the context length of ~63,000.
(^3): Pieces of this post are based on this fantastic
post
by Amirhossein Kazemnejad.
(^4): For empirical evidence, see this great post by EleutherAI.