AI architecture guide

Decoder-only Transformer, step by step

A static visual walkthrough for understanding how an LLM turns text into the next token. Read it left to right: tokens become vectors, vectors exchange context through attention, repeated blocks refine the state, and the final vector becomes vocabulary scores.

Mental model

The model keeps rewriting one vector per token.

At every layer, each token position has a numeric vector. Attention lets that vector pull information from its own position and from earlier tokens. The MLP then reshapes the vector's internal features. After many repeats, the vector at the last visible position is used to predict what comes next.

1. Weights are learned. Embedding tables, Q/K/V matrices, MLP matrices, and the output matrix are trained parameters.
2. Activations are temporary. Attention scores, hidden vectors, and logits are created for the current input while the model runs.
3. Generation is repeated. The model predicts one next token, appends it, and runs the same process again.
Example sequence: the focus token can only use itself and earlier context.

pos 1   <bos>     visible
pos 2   The       visible
pos 3   dog       visible
pos 4   chased    visible
pos 5   the       visible
pos 6   cat       visible
pos 7   because   current (focus)
pos 8   it        masked (future)

What happens to the focus token vector?

The same position keeps one hidden state, but every block rewrites it with more context.

start: learned embedding for "because"
attention: weighs every visible token (itself and the earlier ones)
MLP: reshapes internal features
output: scores likely next tokens
1. Tokens

Text is split into token IDs. The model sees numbers, not words.

2. Embeddings

Each ID becomes a learned vector, plus position information.

3. Q, K, V

Linear projections create search (Q), address (K), and content (V) vectors.

4. Attention

The focus token mixes useful information from prior tokens.

5. Residual + MLP

The update is added back, normalized, then transformed per token.

6. Repeat

Many layers refine the sequence representation.

7. Next token

The final vector becomes scores over the vocabulary.

Hard-coded walkthrough

Each stage as a visual card

The cards use the same example token, "because", so the path stays concrete.

Step 1 token_id + position -> x

Tokens become vectors

The token ID points into an embedding table. The model retrieves one learned vector and combines it with position information, so "because at position 7" is represented differently from the same token elsewhere.

Figure: token ID 391 selects a token embedding (x1, x2, ..., x8, ...), which is added to the embedding for position 7.
What to notice

The vector is not a dictionary definition. It is a learned numeric starting point that later layers keep rewriting.
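The lookup-and-add step can be sketched in a few lines. This is a minimal illustration, not a real model: the table sizes are hypothetical, random numbers stand in for trained values, and a learned absolute position table is assumed (the card only says "position information", and other schemes exist).

```python
import random

random.seed(0)

VOCAB_SIZE = 512     # hypothetical toy vocabulary size
MAX_POSITIONS = 16   # hypothetical maximum sequence length
D_MODEL = 8          # hypothetical vector width

def random_table(rows, cols):
    """Stand-in for a trained table; in a real model these values are learned."""
    return [[random.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]

token_embedding = random_table(VOCAB_SIZE, D_MODEL)
position_embedding = random_table(MAX_POSITIONS, D_MODEL)

def embed(token_id, position):
    """Look up the token row and the position row, then add them element-wise."""
    tok = token_embedding[token_id]
    pos = position_embedding[position]
    return [t + p for t, p in zip(tok, pos)]

BECAUSE_ID = 391  # the ID shown in the card; the token-to-ID mapping is illustrative

x_at_7 = embed(BECAUSE_ID, 7)
x_at_3 = embed(BECAUSE_ID, 3)
print(len(x_at_7))       # width of the vector, equal to D_MODEL
print(x_at_7 != x_at_3)  # same token at a different position gives a different vector
```

The token row is identical in both calls; only the position row differs, which is why the two starting vectors are not the same.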

Step 2 Q = XWq, K = XWk, V = XWv

The vector is projected into Q, K, and V

Each token vector is multiplied by three trained matrices. These create three views of the same token: what it is looking for, how other tokens can match it, and what information it can contribute.

Q (Query): what this position wants to find.
K (Key): how this position advertises itself.
V (Value): the content this position can pass forward.
What to notice

Q, K, and V are activations. The matrices Wq, Wk, and Wv are the learned model weights.
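A small worked example of Q = XWq, K = XWk, V = XWv may help. The matrices below are hand-picked, hypothetical values chosen so the arithmetic is easy to follow; trained matrices would look nothing like this.

```python
def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both given as lists of rows."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

# X: three token positions, each a vector of width 2 (activations for this input).
X = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]

# Hypothetical stand-ins for the learned weights.
Wq = [[1.0, 0.0], [0.0, 1.0]]   # identity
Wk = [[0.0, 1.0], [1.0, 0.0]]   # swaps the two features
Wv = [[2.0, 0.0], [0.0, 2.0]]   # doubles each feature

Q = matmul(X, Wq)
K = matmul(X, Wk)
V = matmul(X, Wv)
print(Q)  # [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(K)  # [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
print(V)  # [[2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
```

Feeding a different X through the same three matrices would give different Q, K, and V, which is the weights-versus-activations distinction in miniature.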

Step 3 softmax(QK^T / sqrt(d))

Attention chooses useful context

The focus query is compared with the keys of every visible position: its own and all earlier ones. Softmax turns the scores into weights that sum to 1, then those weights mix the value vectors into one context update for the focus token.

Figure: attention weights from the focus token "because" over The, dog, chased, the, cat, and because.
What to notice

Attention is not a permanent memory. It is computed fresh for the current prompt and current layer.
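The score, softmax, and mix steps for a single focus query can be sketched as follows. This is a single-head toy with hypothetical keys, values, and query; it assumes d in the formula is the width of the query and key vectors.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)                        # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for one query over the visible positions."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    # Weighted sum of the value vectors: one context update for the focus token.
    context = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
    return weights, context

# Hypothetical keys and values for three visible positions; the last is the focus itself.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
query = [0.0, 4.0]    # matches the second feature most strongly

weights, context = attend(query, keys, values)
print(weights)
print(context)
```

The first key shares nothing with the query, so it receives the smallest weight, and the context vector leans toward the second feature carried by the other two values.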

Step 4 x + Attention(Norm(x))

A causal mask blocks future tokens

Decoder-only LLMs predict the next token, so a position cannot inspect future positions. The focus token can use the left side of the sentence, but not tokens that have not been generated yet.

What to notice

The mask's lower-triangular shape is the reason a decoder can be trained to predict text from left to right: each row can see its own position and everything before it, and nothing after.
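The mask can be drawn and applied in a short sketch using the example sentence. The scores here are placeholders (all equal), and the masking is done by skipping blocked entries; a common alternative, not shown, is to set blocked scores to negative infinity before the softmax.

```python
import math

def causal_mask(n):
    """Row i may look at columns 0..i (itself and earlier positions) only."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, allowed):
    """Softmax over allowed entries; blocked entries get weight exactly 0."""
    visible = [s for s, ok in zip(scores, allowed) if ok]
    m = max(visible)
    exps = [math.exp(s - m) if ok else 0.0 for s, ok in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]

tokens = ["<bos>", "The", "dog", "chased", "the", "cat", "because", "it"]
mask = causal_mask(len(tokens))

# Print the lower triangle: "x" is visible, "." is masked.
for tok, row in zip(tokens, mask):
    print(f"{tok:>8} " + " ".join("x" if ok else "." for ok in row))

focus = tokens.index("because")       # 0-based index 6, position 7 in the cards
raw_scores = [0.0] * len(tokens)      # placeholder scores, all equal
weights = masked_softmax(raw_scores, mask[focus])
print(weights)
```

With equal scores, "because" spreads its attention evenly over the seven visible positions and gives "it" a weight of zero.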

Step 5 MLP(x) = gate * up -> down

Residual paths preserve and refine

The attention update is added back to the incoming vector. Then the MLP expands, gates, and compresses each token independently. Attention mixes across tokens; the MLP reshapes features inside each token.

Figure: incoming token state + attention update = refined state. The MLP panel is labeled "in" and "wide".
What to notice

Residual connections make deep stacks trainable because each layer can add a correction instead of replacing everything.
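The two residual additions and the gate * up -> down shape can be sketched for one token. Several things here are assumptions: the activation on the gate branch is SiLU (the card does not name one), the weights are hypothetical, the attention update is a placeholder, and normalization is left out to keep the example short.

```python
import math

def silu(v):
    """SiLU activation, v * sigmoid(v); assumed here, the card does not name one."""
    return v / (1.0 + math.exp(-v))

def linear(x, w):
    """Multiply a vector x (length n) by a matrix w (n rows, m columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]

def gated_mlp(x, w_gate, w_up, w_down):
    gate = [silu(g) for g in linear(x, w_gate)]  # expand to the wide size, then activate
    up = linear(x, w_up)                         # second expanded view of the same input
    hidden = [g * u for g, u in zip(gate, up)]   # gate * up
    return linear(hidden, w_down)                # compress back to the model width

# Hypothetical tiny weights: model width 2, wide size 4.
w_gate = [[0.5, -0.5, 1.0, 0.0], [0.0, 1.0, -1.0, 0.5]]
w_up = [[1.0, 0.0, 0.5, -1.0], [0.0, 1.0, 0.5, 1.0]]
w_down = [[0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [-0.1, 0.1]]

x = [1.0, 2.0]                 # incoming token state
attn_update = [0.5, -0.5]      # placeholder for the attention output
z = [a + b for a, b in zip(x, attn_update)]    # first residual add: refined state
mlp_update = gated_mlp(z, w_gate, w_up, w_down)
y = [a + b for a, b in zip(z, mlp_update)]     # second residual add
print(z)
print(y)
```

If the MLP contributed nothing (for example, an all-zero w_down), y would equal z, which is the sense in which each sublayer adds a correction rather than replacing the state.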

Step 6 block_1 -> block_2 -> ... -> block_n

The block repeats many times

A modern LLM stacks many decoder blocks. Every block repeats the same pattern, but each block has its own learned weights and can refine the representation in a different way.

Layer 1: local word and phrase patterns start to form.
Layer 8: relationships across the sentence become easier to represent.
Layer 16: features are repeatedly mixed, normalized, and refined.
Layer 32: the final state contains the context used for prediction.
What to notice

Depth is repeated refinement. The sequence is never collapsed into a single vector; every token position keeps its own state through every block.

Step 7 logits = h_last W_vocab

The final vector scores the vocabulary

The last-layer vector at the current position is multiplied by an output matrix. The result is one score (logit) per possible next token. A sampling or decoding policy then chooses the next token from those scores.

was     0.31
ran     0.19
barked  0.14
jumped  0.10
slept   0.07
What to notice

An LLM does not produce a whole answer in one pass. It predicts one token, appends it, and repeats the process.
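The score-then-choose step and the append-and-repeat loop can be sketched together. Everything below is a toy: the five-word vocabulary reuses the words from the card, W_vocab holds hypothetical values, toy_model is a stand-in for the whole network, and greedy decoding is just one possible policy.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

vocab = ["was", "ran", "barked", "jumped", "slept"]   # toy vocabulary

# Hypothetical output matrix: 2 hidden features -> 5 vocabulary scores.
W_vocab = [[2.0, 1.0, 0.5, 0.0, -1.0],
           [0.0, 0.5, 1.0, 0.5, 0.0]]

def logits_for(h_last):
    """logits = h_last W_vocab: one score per vocabulary entry."""
    return [sum(h_last[i] * W_vocab[i][j] for i in range(len(h_last)))
            for j in range(len(vocab))]

def greedy_next(h_last):
    """Greedy decoding: pick the highest-probability token."""
    probs = softmax(logits_for(h_last))
    best = max(range(len(vocab)), key=lambda j: probs[j])
    return vocab[best], probs

def toy_model(tokens):
    """Stand-in for the full network: returns a final vector for the last position."""
    return [float(len(tokens) % 2), 1.0]

tokens = ["it"]
for _ in range(3):                    # three passes, one new token per pass
    next_token, probs = greedy_next(toy_model(tokens))
    tokens.append(next_token)
print(tokens)
```

Each pass sees the full sequence so far, including the tokens it produced earlier, and emits exactly one more token.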

Compact formula

One decoder block

z = x + Attention(RMSNorm(x))

y = z + MLP(RMSNorm(z))

The same shape repeats many times. Normalization stabilizes the values, residual paths preserve information, and each sublayer adds a learned update.
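The two formulas translate almost line for line into code. In this sketch the RMSNorm omits the learned gain that real implementations usually include, and toy_attention and toy_mlp are hypothetical stand-ins for the learned sublayers, chosen only so the data flow is easy to trace.

```python
import math

def rms_norm(vec, eps=1e-6):
    """Scale a vector so its root-mean-square is about 1 (learned gain omitted)."""
    rms = math.sqrt(sum(v * v for v in vec) / len(vec) + eps)
    return [v / rms for v in vec]

def decoder_block(x, attention, mlp):
    """Pre-norm block: z = x + Attention(RMSNorm(x)); y = z + MLP(RMSNorm(z)).

    x is a list of per-position vectors. `attention` receives the whole
    normalized sequence because it mixes across positions; `mlp` receives
    one vector at a time.
    """
    normed = [rms_norm(v) for v in x]
    attn_out = attention(normed)
    z = [[a + b for a, b in zip(xi, ai)] for xi, ai in zip(x, attn_out)]
    y = [[a + b for a, b in zip(zi, mlp(rms_norm(zi)))] for zi in z]
    return y

def toy_attention(seq):
    """Each position averages itself and all earlier positions (uniform causal mix)."""
    out = []
    for i in range(len(seq)):
        visible = seq[: i + 1]
        out.append([sum(col) / len(visible) for col in zip(*visible)])
    return out

def toy_mlp(vec):
    """Per-position update: a small multiple of the input."""
    return [0.1 * v for v in vec]

x = [[3.0, 4.0], [1.0, 0.0], [0.0, 2.0]]
h = x
for _ in range(3):   # three stacked blocks; real blocks each have their own weights
    h = decoder_block(h, toy_attention, toy_mlp)
print(h)
```

The output has the same shape as the input, one vector per position, which is what allows the block to be stacked as many times as the model needs.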

Do not mix up

Weights vs. runtime values

Learned model weights    Runtime values
embeddings               hidden states
Wq, Wk, Wv               attention weights
MLP weights              logits
Big picture

Why the architecture works

Attention gives each position controlled access to prior context. The MLP changes the internal feature representation. Repeating the block lets simple token vectors become rich context-sensitive vectors that can predict the next token.