Causal self-attention is the mechanism underpinning most of the advances in AI since 2017. In this article, I will step through the computation and hopefully build a better intuition of how it works.
At a high level, the function

$$\text{SelfAttention}(Q, K, V) = \text{softmax}\left(\text{mask}\left(\frac{QK^T}{\sqrt{d}}\right)\right)V$$

takes one sequence and transforms it into another. A sequence is a list of token embeddings, a tensor of shape $(n, d)$, where $n$ is the input sequence length and $d$ is the embedding dimension. Each row of this matrix corresponds to one input token, which is represented as a $d$-dimensional vector.
So why, then, are there 3 inputs to $\text{SelfAttention}$? This is because, in the Transformer architecture, the input sequence is projected by 3 different linear layers. If $X$ is the input sequence,

$$Q = XW_Q, \qquad K = XW_K, \qquad V = XW_V$$

where $W_Q, W_K, W_V$ are learned $d \times d$ weight matrices. So, $Q$, $K$, and $V$ are simply different representations of the same input sequence.
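To make the shapes concrete, here is a minimal NumPy sketch of the three projections (the sizes and variable names are illustrative, not from any particular library):

```python
import numpy as np

n, d = 4, 8                      # toy sequence length and embedding dimension
rng = np.random.default_rng(0)

X = rng.normal(size=(n, d))      # input sequence: one d-dimensional embedding per token
W_Q = rng.normal(size=(d, d))    # learned projection matrices (random stand-ins here)
W_K = rng.normal(size=(d, d))
W_V = rng.normal(size=(d, d))

Q = X @ W_Q                      # queries, shape (n, d)
K = X @ W_K                      # keys,    shape (n, d)
V = X @ W_V                      # values,  shape (n, d)
```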
Let’s compute $\text{SelfAttention}(Q, K, V)$ step-by-step. First, we do $QK^T$, which is an $(n \times d)$ by $(d \times n)$ matrix product, resulting in an $(n \times n)$ output. What does this do?
Each entry $(QK^T)_{ij}$ is a scalar ($q_i \cdot k_j$), and it is the vector dot-product between $q_i$ (the $i$-th row of $Q$) and $k_j$ (the $j$-th row of $K$). If we remember the formula

$$a \cdot b = \|a\|\,\|b\|\cos\theta$$

we see that the dot-product is positive when $\theta$, the angle between $a$ and $b$, is close to 0º and negative when the angle is close to 180º, i.e. when they point in opposite directions. We can interpret the dot product as a similarity metric, where positive values indicate similar vectors, and negative values indicate the opposite.
So our matrix $QK^T$ is filled with similarity scores between every pair of query and key tokens. The result is divided by $\sqrt{d}$ to prevent the variance from exploding for large embedding dimensions. See the Appendix for details.
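Continuing the NumPy sketch from above, the scaled score matrix is a single line:

```python
# Scaled similarity scores: S[i, j] = (q_i . k_j) / sqrt(d), shape (n, n)
S = Q @ K.T / np.sqrt(d)
```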
The next step is to apply the $\text{mask}$ function, which sets all values that are not in the lower-triangular section of the input matrix to $-\infty$.
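In the sketch, one common way to express this is with a lower-triangular boolean mask:

```python
# Causal mask: True on and below the diagonal, False for "future" positions
mask = np.tril(np.ones((n, n), dtype=bool))
S_masked = np.where(mask, S, -np.inf)   # future positions become -inf
```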
To this, we apply $\text{softmax}$, which converts each row of values in the matrix into a probability distribution. The $\text{softmax}$ function is defined as a mapping from $\mathbb{R}^n \to \mathbb{R}^n$, where the $i$-th output element is given by

$$\text{softmax}(x)_i = \frac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}}$$
Two things to note here:
- The sum of all output elements is $1$, as is expected for a probability distribution
- If an input element is $-\infty$, then its output is $e^{-\infty} = 0$, so masked positions receive zero probability
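A row-wise softmax for the sketch might look like this (subtracting the row maximum is a standard numerical-stability trick and does not change the result):

```python
def softmax(x, axis=-1):
    # Subtracting the row max keeps exp() from overflowing; -inf entries map to exp(-inf) = 0
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

A = softmax(S_masked, axis=-1)   # each row sums to 1; masked entries are exactly 0
```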
After applying the $\text{softmax}$ function to the masked similarity scores, we obtain the attention matrix:

$$A = \text{softmax}\left(\text{mask}\left(\frac{QK^T}{\sqrt{d}}\right)\right) =
\begin{pmatrix}
A_{11} & 0 & \cdots & 0 \\
A_{21} & A_{22} & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
A_{n1} & A_{n2} & \cdots & A_{nn}
\end{pmatrix}$$

Where the entries are defined as:

$$A_{ij} = \frac{e^{q_i \cdot k_j / \sqrt{d}}}{\sum_{j' \le i} e^{q_i \cdot k_{j'} / \sqrt{d}}} \quad \text{for } j \le i, \qquad A_{ij} = 0 \quad \text{otherwise.}$$
The resulting matrix $A$ has $n$ rows, each a probability distribution of length $n$. The final step is to multiply our value matrix by these probability distributions to give us our new sequence:

$$\text{out}_i = \sum_{j=1}^{n} A_{ij}\, v_j$$
Note that $A_{ij}$ is a scalar, and $v_j$ is a $d$-dimensional embedding vector. Visually, we observe that SelfAttention is selectively combining value tokens, weighted by a probability distribution generated by how well the queries and keys attend to each other, i.e. have a large inner product. We also see that the output token at index $i$ depends only on the input tokens with index $j \le i$, due to the causal mask we applied earlier. This is based on the causal assumption, that an output token does not depend on future tokens, which is required when training autoregressive (i.e. next-token prediction) models.
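Putting the pieces of the sketch together (and reusing the `softmax` helper above), the whole computation fits in a few lines:

```python
def causal_self_attention(Q, K, V):
    """Single-head causal self-attention, as described above (NumPy sketch)."""
    n, d = Q.shape
    S = Q @ K.T / np.sqrt(d)                     # scaled similarity scores, (n, n)
    mask = np.tril(np.ones((n, n), dtype=bool))  # causal (lower-triangular) mask
    S = np.where(mask, S, -np.inf)
    A = softmax(S, axis=-1)                      # rows are probability distributions
    return A @ V                                 # weighted combination of value tokens, (n, d)

out = causal_self_attention(Q, K, V)             # the transformed sequence
```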
Hopefully you found this helpful!
Appendix
Why Scale by $\sqrt{d}$?
We do this to keep the variance from exploding as $d$ increases.
Assume that the components $q_i$ and $k_i$ are i.i.d. with zero mean and unit variance. Let’s compute the mean and variance of the unscaled dot product $q \cdot k = \sum_{i=1}^{d} q_i k_i$.
The mean is trivially zero:

$$\mathbb{E}[q \cdot k] = \sum_{i=1}^{d} \mathbb{E}[q_i k_i] = \sum_{i=1}^{d} \mathbb{E}[q_i]\,\mathbb{E}[k_i] = 0$$
And the variance is:

$$\text{Var}(q \cdot k) = \mathbb{E}\left[(q \cdot k)^2\right] - \mathbb{E}[q \cdot k]^2 = \mathbb{E}\left[\left(\sum_{i=1}^{d} q_i k_i\right)^2\right] = \sum_{i=1}^{d} \sum_{j=1}^{d} \mathbb{E}[q_i k_i q_j k_j]$$

because

$$\left(\sum_{i=1}^{d} q_i k_i\right)^2 = \sum_{i=1}^{d} \sum_{j=1}^{d} q_i k_i q_j k_j$$

which is $\mathbb{E}[q_i]\,\mathbb{E}[k_i]\,\mathbb{E}[q_j]\,\mathbb{E}[k_j] = 0$ for $i \ne j$ (since $q$ and $k$ are i.i.d.). For $i = j$,

$$\mathbb{E}[q_i^2 k_i^2] = \mathbb{E}[q_i^2]\,\mathbb{E}[k_i^2] = 1$$

since $\mathbb{E}[q_i^2] = \text{Var}(q_i) + \mathbb{E}[q_i]^2 = 1$. Summing the $d$ diagonal terms gives $\text{Var}(q \cdot k) = d$.
So if we scale by $\frac{1}{\sqrt{d}}$, our new variance is

$$\text{Var}\left(\frac{q \cdot k}{\sqrt{d}}\right) = \frac{1}{d}\,\text{Var}(q \cdot k) = \frac{d}{d} = 1,$$
as desired.
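A quick numerical sanity check of this argument (the sample sizes are arbitrary; only the zero-mean, unit-variance i.i.d. assumption matters):

```python
import numpy as np

d, samples = 512, 10_000
rng = np.random.default_rng(0)
q = rng.normal(size=(samples, d))        # i.i.d. zero-mean, unit-variance components
k = rng.normal(size=(samples, d))

dots = np.sum(q * k, axis=1)             # unscaled dot products q . k
print(dots.var())                        # ~ d (around 512)
print((dots / np.sqrt(d)).var())         # ~ 1 after scaling by 1/sqrt(d)
```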
Multi-Head Attention
Most modern systems use multi-head attention, which computes $\text{SelfAttention}$ in parallel over several “heads”. We usually let $d_k = d / h$, where $h$ is the number of heads and $d_k$ is the per-head embedding dimension.
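A sketch of how the heads can be computed with plain reshaping, reusing the `softmax` helper from earlier. The output projection `W_O`, which mixes the concatenated heads back together, is a standard part of multi-head attention but is not covered above; it is included here for completeness:

```python
def multi_head_causal_attention(X, W_Q, W_K, W_V, W_O, h):
    """Multi-head causal self-attention via reshaping (NumPy sketch, d divisible by h)."""
    n, d = X.shape
    d_k = d // h

    def split_heads(M):
        # (n, d) -> (h, n, d_k): each head gets its own d_k-dimensional slice
        return M.reshape(n, h, d_k).transpose(1, 0, 2)

    Q, K, V = split_heads(X @ W_Q), split_heads(X @ W_K), split_heads(X @ W_V)

    S = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)   # per-head scores, (h, n, n)
    mask = np.tril(np.ones((n, n), dtype=bool))   # same causal mask for every head
    S = np.where(mask, S, -np.inf)
    A = softmax(S, axis=-1)
    heads = A @ V                                 # per-head outputs, (h, n, d_k)

    # Concatenate the heads back into shape (n, d) and apply the output projection
    out = heads.transpose(1, 0, 2).reshape(n, d)
    return out @ W_O
```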