The Mechanics of Causal Self-Attention


Causal self-attention is the mechanism underpinning many of the advances in AI since 2017. In this article, I will step through the computation and hopefully build a better intuition of how it works.

$$ \text{SelfAttention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax}\left( \text{mask} \left(\frac{\mathbf{Q} \mathbf{K}^T}{\sqrt{d}}\right) \right) \mathbf{V} $$

At a high level, this function takes one sequence and transforms it into another. A sequence is a list of token embeddings, a tensor of shape $L \times d$, where $L$ is the input sequence length and $d$ is the embedding dimension. Each row of this matrix corresponds to one input token, which is represented as a $d$-dimensional vector.

So why, then, are there three inputs to $\text{SelfAttention}$? In the Transformer architecture, the input sequence is projected by three different $d \times d$ linear layers. If $\mathbf{X}$ is the input sequence,

$$ \mathbf{Q} = \mathbf{X}\mathbf{W_Q}, \mathbf{K} = \mathbf{X}\mathbf{W_K}, \mathbf{V} = \mathbf{X}\mathbf{W_V} $$

where each $\mathbf{W}$ is a learned $d \times d$ matrix. So $\mathbf{Q},\mathbf{K},\mathbf{V}$ are simply three different learned representations of the same input sequence.
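
To make the shapes concrete, here is a minimal NumPy sketch of these projections. The values of $L$ and $d$ and the random weight matrices are placeholders chosen purely for illustration; in a real model the $\mathbf{W}$ matrices are learned.

```python
import numpy as np

rng = np.random.default_rng(0)
L, d = 4, 8                    # toy sequence length and embedding dimension

X = rng.normal(size=(L, d))    # input sequence: one d-dimensional embedding per token

# Projection matrices, each d x d (random placeholders standing in for learned weights)
W_Q = rng.normal(size=(d, d))
W_K = rng.normal(size=(d, d))
W_V = rng.normal(size=(d, d))

Q = X @ W_Q   # (L, d)
K = X @ W_K   # (L, d)
V = X @ W_V   # (L, d)
```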

Let’s compute $\text{SelfAttention}$ step-by-step. First, we compute $\mathbf{Q}\mathbf{K}^T$, an $L \times d$ by $d \times L$ matrix multiplication, which produces an $L \times L$ output. What does this do?

$$ \begin{align*} \mathbf{Q} \mathbf{K}^T = \begin{bmatrix} \mathbf{q}_1 \\ \mathbf{q}_2 \\ \vdots \\ \mathbf{q}_L \end{bmatrix} \begin{bmatrix} \mathbf{k}_1^T & \mathbf{k}_2^T & \cdots & \mathbf{k}_L^T \end{bmatrix} = \begin{bmatrix} \mathbf{q}_1 \mathbf{k}_1^T & \mathbf{q}_1 \mathbf{k}_2^T & \cdots & \mathbf{q}_1 \mathbf{k}_L^T \\ \mathbf{q}_2 \mathbf{k}_1^T & \mathbf{q}_2 \mathbf{k}_2^T & \cdots & \mathbf{q}_2 \mathbf{k}_L^T \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{q}_L \mathbf{k}_1^T & \mathbf{q}_L \mathbf{k}_2^T & \cdots & \mathbf{q}_L \mathbf{k}_L^T \end{bmatrix} \end{align*} $$

Each entry $\mathbf{q}_i \mathbf{k}^T_j$ is a scalar ($1 \times d$ times $d \times 1$): the vector dot product between $\mathbf{q}_i$ and $\mathbf{k}_j$. If we recall the formula

$$ \mathbf{a} \cdot \mathbf{b} = \|\mathbf{a}\| \|\mathbf{b}\| \cos \theta $$

we see that the dot product is positive when $\theta$, the angle between $\mathbf{a}$ and $\mathbf{b}$, is less than 90° (the vectors point in roughly the same direction) and negative when it is greater than 90° (they point in roughly opposite directions). We can therefore interpret the dot product as a similarity metric, where positive values indicate similar vectors and negative values indicate dissimilar ones.
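
As a quick numerical check of this interpretation (with arbitrarily chosen vectors):

```python
import numpy as np

a = np.array([1.0, 0.5])
b = np.array([0.9, 0.6])    # points in roughly the same direction as a
c = np.array([-1.0, -0.4])  # points in roughly the opposite direction

print(a @ b)   # positive: similar vectors
print(a @ c)   # negative: dissimilar vectors
```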

So our final $L \times L$ matrix is filled with similarity scores between every pair of query and key vectors. The result is divided by $\sqrt{d}$ to keep the variance of these scores from growing with the embedding dimension; see the Appendix for details.
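
Continuing the NumPy sketch from above, the scaled score matrix is a single matrix product:

```python
scores = (Q @ K.T) / np.sqrt(d)   # (L, L) matrix of scaled pairwise similarity scores
```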

The next step is to apply the $\text{mask}$ function, which sets all values that are not in the lower-triangular section of the input matrix to $-\infty$.

$$ \text{mask}\left(\frac{1}{\sqrt{d}} \mathbf{Q}\mathbf{K}^T\right) = \frac{1}{\sqrt{d}} \begin{bmatrix} \mathbf{q}_1 \mathbf{k}_1^T & -\infty & -\infty & \cdots & -\infty \\ \mathbf{q}_2 \mathbf{k}_1^T & \mathbf{q}_2 \mathbf{k}_2^T & -\infty & \cdots & -\infty \\ \mathbf{q}_3 \mathbf{k}_1^T & \mathbf{q}_3 \mathbf{k}_2^T & \mathbf{q}_3 \mathbf{k}_3^T & \cdots & -\infty \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \mathbf{q}_L \mathbf{k}_1^T & \mathbf{q}_L \mathbf{k}_2^T & \mathbf{q}_L \mathbf{k}_3^T & \cdots & \mathbf{q}_L \mathbf{k}_L^T \end{bmatrix} $$
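
In code, one way to apply the causal mask is to fill every entry strictly above the diagonal with $-\infty$ (continuing the same sketch):

```python
# Boolean mask that is True strictly above the diagonal, i.e. at "future" positions
future = np.triu(np.ones((L, L), dtype=bool), k=1)
masked_scores = np.where(future, -np.inf, scores)
```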

To this, we apply $\text{softmax}$, which converts each row of values in the matrix into a probability distribution. The function is defined as a mapping from $\mathbb R^L \to \mathbb R^L$, where the $i$th output element is given by

$$ \text{softmax}(\mathbf{x})_i = \frac{e^{x_i}}{\sum_{j=1}^L e^{x_j}} \quad \text{for } i = 1, 2, \ldots, L $$

Two things to note here:

  1. The sum of all output elements is $1$, as is expected for a probability distribution
  2. If an input element $x_i$ is $-\infty$, then $\text{softmax}(\mathbf{x})_i = 0$, since $e^{-\infty} = 0$
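
A row-wise softmax can be written as below. Subtracting each row's maximum before exponentiating is a standard numerical-stability trick and does not change the result, since softmax is invariant to adding a constant to every element of a row.

```python
def softmax_rows(x):
    # shift by the row max so np.exp never overflows; softmax is shift-invariant
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

S = softmax_rows(masked_scores)   # masked (-inf) entries become exactly 0
```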

After applying the $\text{softmax}$ function to the masked similarity scores, we obtain:

$$ \mathbf{S} = \text{softmax}\left(\text{mask}\left(\frac{1}{\sqrt{d}} \mathbf{Q} \mathbf{K}^T\right)\right) = \begin{bmatrix} S_{1,1} & 0 & 0 & \cdots & 0 \\ S_{2,1} & S_{2,2} & 0 & \cdots & 0 \\ S_{3,1} & S_{3,2} & S_{3,3} & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ S_{L,1} & S_{L,2} & S_{L,3} & \cdots & S_{L,L} \end{bmatrix} $$

where the entries $S_{i,j}$ are defined as:

$$ S_{i,j} = \frac{e^{\text{mask}\left(\frac{\mathbf{Q} \mathbf{K}^T}{\sqrt{d}}\right)_{i,j}}}{\sum_{k=1}^{L} e^{\text{mask}\left(\frac{\mathbf{Q} \mathbf{K}^T}{\sqrt{d}}\right)_{i,k}}} $$
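
Continuing the sketch, we can sanity-check that $\mathbf{S}$ has exactly this structure:

```python
print(np.allclose(S.sum(axis=-1), 1.0))   # every row sums to 1
print(np.allclose(np.triu(S, k=1), 0.0))  # entries above the diagonal are 0
```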

Each row of the resulting matrix $\mathbf{S}$ is a probability distribution of length $L$. The final step is to use these distributions to weight the rows of our value matrix $\mathbf{V}$, giving us our new sequence.

$$ \begin{align*} \text{SelfAttention}(\mathbf{Q},\mathbf{K},\mathbf{V}) &= \mathbf{S}\mathbf{V} \\ &= \begin{bmatrix} S_{1,1} & 0 & 0 & \cdots & 0 \\ S_{2,1} & S_{2,2} & 0 & \cdots & 0 \\ S_{3,1} & S_{3,2} & S_{3,3} & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ S_{L,1} & S_{L,2} & S_{L,3} & \cdots & S_{L,L} \end{bmatrix} \begin{bmatrix} \mathbf{V}_1 \\ \mathbf{V}_2 \\ \mathbf{V}_3 \\ \vdots \\ \mathbf{V}_L \end{bmatrix} \\ &= \begin{bmatrix} \begin{array}{l} S_{1,1} \mathbf{V}_1 \\ S_{2,1} \mathbf{V}_1 + S_{2,2} \mathbf{V}_2 \\ S_{3,1} \mathbf{V}_1 + S_{3,2} \mathbf{V}_2 + S_{3,3} \mathbf{V}_3 \\ \hspace{2.4cm} \vdots \\ S_{L,1} \mathbf{V}_1 + S_{L,2} \mathbf{V}_2 + \cdots + S_{L,L} \mathbf{V}_L \\ \end{array} \end{bmatrix} \end{align*} $$

Note that $S_{i,j}$ is a scalar, and $\mathbf{V}_k$ is a $1 \times d$ embedding vector. Visually, we can see that $\text{SelfAttention}$ selectively combines value tokens, weighted by a probability distribution that reflects how strongly the queries and keys attend to each other, i.e. how large their inner products are. We also see that the output token at index $i$ depends only on the input tokens at indices $\le i$, due to the causal mask we applied earlier. This encodes the causal assumption that an output token $\mathbf{O}_i$ does not depend on future tokens, which is required when training autoregressive (i.e. next-token prediction) models.
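
Putting all of the steps together, here is a minimal end-to-end NumPy sketch of causal self-attention. It is meant as an illustrative reference, not an optimized implementation.

```python
import numpy as np

def causal_self_attention(Q, K, V):
    """Q, K, V: arrays of shape (L, d). Returns the (L, d) output sequence."""
    L, d = Q.shape
    scores = (Q @ K.T) / np.sqrt(d)                      # (L, L) scaled similarities
    future = np.triu(np.ones((L, L), dtype=bool), k=1)   # True strictly above the diagonal
    scores = np.where(future, -np.inf, scores)           # causal mask
    scores -= scores.max(axis=-1, keepdims=True)         # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)       # row-wise softmax -> S
    return weights @ V                                   # each output row is a weighted sum of value rows

out = causal_self_attention(Q, K, V)   # using the Q, K, V from the earlier sketch
print(out.shape)                       # (L, d)
```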

Hopefully you found this helpful!

Appendix

Why Scale by $\sqrt{d}$?

We do this to keep the variance from exploding as $d$ increases.

Assume the components $\mathbf{q}_i, \mathbf{k}_i$ of the query and key vectors are i.i.d. with $\mathbf{q}_i, \mathbf{k}_i \sim \mathcal{N}(\mu = 0, \sigma^2 = 1)$. Let’s compute the mean and variance of the unscaled dot product $s = \mathbf{q} \cdot \mathbf{k}$.

The mean is trivially zero:

$$ \mathbb{E}[s] = \mathbb{E}\left[ \sum_{i=1}^d \mathbf{q}_i \mathbf{k}_i \right] = \sum_{i=1}^d \mathbb{E}[\mathbf{q}_i \mathbf{k}_i] = \sum_{i=1}^d \mathbb{E}[\mathbf{q}_i] \mathbb{E}[ \mathbf{k}_i] = 0 $$

And the variance is:

$$ \text{Var}(s) = \mathbb{E}[s^2] - (\mathbb{E}[s])^2 = \mathbb{E}[s^2] = d $$

because

$$ \mathbb{E}[s^2] = \mathbb{E}\left[ \sum_{i=1}^d \sum_{j=1}^d \mathbf{q}_i \mathbf{k}_i \mathbf{q}_j \mathbf{k}_j \right] = \sum_{i=1}^d \sum_{j=1}^d \mathbb{E}[\mathbf{q}_i \mathbf{k}_i \mathbf{q}_j \mathbf{k}_j] $$

For $i \ne j$, each term is $0$: the four factors are mutually independent with zero mean, so the expectation factorizes as $\mathbb{E}[\mathbf{q}_i] \mathbb{E}[\mathbf{k}_i] \mathbb{E}[\mathbf{q}_j] \mathbb{E}[\mathbf{k}_j] = 0$. For $i=j$, the surviving terms are

$$ \sum_{i=1}^d \mathbb{E}[\mathbf{q}_i^2 \mathbf{k}_i^2] = \sum_{i=1}^d \mathbb{E}[\mathbf{q}_i^2] \mathbb{E}[\mathbf{k}_i^2] = \sum_{i=1}^d 1 \cdot 1 = d $$

since $\mathbb{E}[\mathbf{q}_i^2] = \mathbb{E}[\mathbf{k}_i^2] = \sigma^2 = 1$.

So if we scale by $1/\sqrt{d}$, our new variance is

$$ \text{Var}\!\left(\frac{s}{\sqrt{d}}\right) = \frac{1}{d} \text{Var}(s) = 1 $$

as desired.
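
A quick Monte Carlo check of this result (the values of $d$ and the sample count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 512, 100_000

q = rng.normal(size=(n, d))
k = rng.normal(size=(n, d))
s = (q * k).sum(axis=1)          # n independent dot products of N(0, 1) vectors

print(s.var())                   # close to d = 512
print((s / np.sqrt(d)).var())    # close to 1 after scaling
```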

Multi-Head Attention

Most modern systems use multi-head attention, which computes $\text{SelfAttention}$ in parallel over several “heads”. We usually let $d_k = d_v = d_{\text{model}} / H$, where $H$ is the number of heads and $d_{\text{model}}$ is the embedding dimension (the $d$ used above).

$$ \begin{aligned} \mathbf{Q}_h &= \mathbf{X} \mathbf{W}^Q_h \quad &\mathbf{W}^Q_h \in \mathbb{R}^{d_{\text{model}} \times d_k} \\ \mathbf{K}_h &= \mathbf{X} \mathbf{W}^K_h \quad &\mathbf{W}^K_h \in \mathbb{R}^{d_{\text{model}} \times d_k} \\ \mathbf{V}_h &= \mathbf{X} \mathbf{W}^V_h \quad &\mathbf{W}^V_h \in \mathbb{R}^{d_{\text{model}} \times d_v} \end{aligned} $$

$$ \text{head}_h = \text{SelfAttention}(\mathbf{Q}_h, \mathbf{K}_h, \mathbf{V}_h) = \text{softmax}\left( \text{mask} \left( \frac{\mathbf{Q}_h \mathbf{K}_h^T}{\sqrt{d_k}} \right) \right) \mathbf{V}_h $$

$$ \begin{aligned} \text{MultiHead}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) &= \text{Concat}(\text{head}_1, \text{head}_2, \ldots, \text{head}_H)\,\mathbf{W}^O \quad &\mathbf{W}^O \in \mathbb{R}^{H d_v \times d_{\text{model}}} \end{aligned} $$

The concatenated heads are passed through a final output projection $\mathbf{W}^O$, which mixes information across heads and maps the result back to $d_{\text{model}}$ dimensions.
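
Here is a minimal sketch of multi-head causal attention that reuses the `causal_self_attention` function from above. Slicing the columns of full $d_{\text{model}} \times d_{\text{model}}$ projection matrices is just one convenient way to form the per-head projections; the weights are assumed to be given (learned in practice), and $d_{\text{model}}$ is assumed to be divisible by $H$.

```python
import numpy as np

def multi_head_causal_attention(X, W_Q, W_K, W_V, W_O, H):
    """X: (L, d_model). W_Q, W_K, W_V, W_O: (d_model, d_model). H: number of heads."""
    L, d_model = X.shape
    d_k = d_model // H
    heads = []
    for h in range(H):
        cols = slice(h * d_k, (h + 1) * d_k)          # columns belonging to head h
        Q_h = X @ W_Q[:, cols]                        # (L, d_k)
        K_h = X @ W_K[:, cols]                        # (L, d_k)
        V_h = X @ W_V[:, cols]                        # (L, d_k)
        heads.append(causal_self_attention(Q_h, K_h, V_h))
    return np.concatenate(heads, axis=-1) @ W_O       # concat heads, then output projection
```

Note that inside each head, `causal_self_attention` automatically scales by $\sqrt{d_k}$, since the per-head embedding dimension is $d_k$.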