Modern language models like Phi-3 or LLaMA are often treated as black boxes — you feed them text and get intelligent answers. But beneath that, they are nothing more than massive, structured matrices of numbers (parameters) performing linear algebra operations in sequence. To truly understand how these models think, we must follow the journey of an input token through every stage of computation — from text to logits — and observe how parameters shape meaning.
1. Tokenization: Text to Numbers
Every input string is first broken into tokens — discrete integer IDs mapped by a vocabulary.
Example:
“Hello world” →
[15496, 995] (IDs from a GPT-2-style vocabulary; the exact values depend on the tokenizer)
Each token ID is an index into an embedding matrix E of shape (vocab_size, embedding_dim).
At the parameter level:
x₀ = E[token_id]
This vector x₀ (say 4096-dimensional) is the model’s numeric representation of the word.
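To make the lookup concrete, here is a minimal NumPy sketch of x₀ = E[token_id]. The sizes and random weights are toy placeholders rather than real model values, and the token IDs are just the example above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes for illustration only; a real model uses something like
# vocab_size >= 32,000 and embedding_dim = 3072 or 4096.
vocab_size, embedding_dim = 50_000, 16

# E is a learned parameter matrix with one row per vocabulary entry.
E = rng.standard_normal((vocab_size, embedding_dim), dtype=np.float32) * 0.02

token_ids = [15496, 995]   # "Hello world" under a GPT-2-style vocabulary
x0 = E[token_ids]          # shape (2, 16): one vector per token
print(x0.shape)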
2. Embedding and Positional Encoding
Language models are sequential, so they need to know where each token occurs.
A positional encoding (learned or sinusoidal) is added to each embedding:
x₀ = E[token_id] + P[position]
- E = learned token embeddings
- P = learned positional embeddings
Both are stored in the model’s parameters and updated during training.
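A minimal sketch of the combined lookup x₀ = E[token_id] + P[position], again with toy sizes and random stand-in weights. Note that some newer models (including Phi-3, discussed later) apply rotary position embeddings inside the attention computation instead of adding a learned table P.

```python
import numpy as np

rng = np.random.default_rng(1)
vocab_size, max_positions, embedding_dim = 50_000, 2_048, 16   # toy sizes

E = rng.standard_normal((vocab_size, embedding_dim), dtype=np.float32) * 0.02     # token embeddings
P = rng.standard_normal((max_positions, embedding_dim), dtype=np.float32) * 0.02  # positional embeddings

token_ids = [15496, 995]                   # "Hello world"
positions = np.arange(len(token_ids))      # [0, 1]
x0 = E[token_ids] + P[positions]           # x0 = E[token_id] + P[position]
```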
3. Transformer Layers: Parameterized Flow of Information
The input now flows through multiple identical Transformer blocks (e.g., 32–80 layers).
Each layer has two main parts:
- Multi-Head Self-Attention (MHSA)
- Feedforward Network (FFN)
Let’s zoom in to the parameter level.
(a) Self-Attention: The Dynamic Router
Each token embedding is linearly projected into three spaces using trainable matrices:
Q = xW_Q
K = xW_K
V = xW_V
Here:
- W_Q, W_K, W_V are parameter matrices (each of size d_model × d_head per head; concatenated across heads they span d_model × d_model).
- These matrices are what the model learns in order to detect relationships between tokens.
The attention weights are computed as:
A = softmax(QKᵀ / √d_head)
This determines how much each token should attend to others.
Then the weighted sum of values gives:
z = A × V
The combined attention output passes through another learned projection:
x₁ = zW_O
where W_O is the output projection matrix.
At this point, each token’s representation has mixed information from all other tokens, guided entirely by learned matrices W_Q, W_K, W_V, W_O.
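The sketch below implements a single attention head in NumPy with toy dimensions. The helper name self_attention is ours, and the causal mask used in real decoder models is omitted for brevity.

```python
import numpy as np

def self_attention(x, W_Q, W_K, W_V, W_O):
    """Single-head self-attention over a sequence x of shape (seq_len, d_model)."""
    Q, K, V = x @ W_Q, x @ W_K, x @ W_V          # project into query/key/value spaces
    d_head = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_head)           # (seq_len, seq_len) similarity scores
    A = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A = A / A.sum(axis=-1, keepdims=True)        # softmax over the key dimension
    z = A @ V                                    # weighted sum of values
    return z @ W_O                               # output projection back to d_model

d_model, d_head, seq_len = 64, 16, 3             # toy sizes, not real model dimensions
rng = np.random.default_rng(0)
x = rng.standard_normal((seq_len, d_model))
W_Q, W_K, W_V = (rng.standard_normal((d_model, d_head)) for _ in range(3))
W_O = rng.standard_normal((d_head, d_model))
x1 = self_attention(x, W_Q, W_K, W_V, W_O)       # shape (3, 64)
```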
4. Feedforward Network: Nonlinear Transformation
After attention, the model applies a two-layer MLP to each token independently:
h₁ = x₁W₁ + b₁
h₂ = GELU(h₁)
x₂ = h₂W₂ + b₂
- W₁ and W₂ are parameter matrices of large size (e.g., 4096 × 11008).
- This expands and then compresses the token's hidden representation, enabling nonlinear mixing of semantic features.
Each layer updates the representation:
x ← x + LayerNorm(x₂)
Residual connections ensure stable gradient flow and preserve information from earlier layers.
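A compact sketch of the FFN plus the residual update, using toy dimensions and random weights. It mirrors the x ← x + LayerNorm(x₂) form written above; production models usually apply the normalization before each sublayer (pre-norm), and the gamma/beta parameters are omitted here for brevity.

```python
import numpy as np

def gelu(x):
    # tanh approximation of the GELU activation
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def layer_norm(x, eps=1e-5):
    # gamma/beta scale-and-shift parameters omitted for brevity
    return (x - x.mean(axis=-1, keepdims=True)) / np.sqrt(x.var(axis=-1, keepdims=True) + eps)

def ffn_block(x1, W1, b1, W2, b2):
    h1 = x1 @ W1 + b1            # expand:  d_model -> d_ff
    h2 = gelu(h1)                # nonlinearity
    x2 = h2 @ W2 + b2            # compress: d_ff -> d_model
    return x1 + layer_norm(x2)   # residual update, as in the formula above

d_model, d_ff = 64, 172          # toy stand-ins for sizes like 4096 and 11008
rng = np.random.default_rng(1)
x1 = rng.standard_normal((3, d_model))
W1, b1 = rng.standard_normal((d_model, d_ff)) * 0.02, np.zeros(d_ff)
W2, b2 = rng.standard_normal((d_ff, d_model)) * 0.02, np.zeros(d_model)
x_new = ffn_block(x1, W1, b1, W2, b2)   # shape (3, 64)
```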
5. The Final Projection: Turning Thought into Words
After the last transformer block, we obtain a final hidden state h_final for each token.
To predict the next word, we project h_final back to vocabulary space, here using the transposed embedding matrix Eᵀ (some models instead learn a separate lm_head matrix, as we will see for Phi-3 below):
logits = h_final × Eᵀ
This gives one score per vocabulary token — the model’s belief in what comes next.
Applying softmax(logits) yields a probability distribution over all words.
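A minimal sketch of the output projection using a tied embedding matrix and toy sizes; the softmax helper is written out explicitly.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

rng = np.random.default_rng(2)
vocab_size, d_model = 32_000, 64                       # toy d_model; real models use 3072+
E = rng.standard_normal((vocab_size, d_model)) * 0.02  # (tied) embedding matrix

h_final = rng.standard_normal(d_model)   # hidden state at the last position
logits = h_final @ E.T                   # one score per vocabulary token
probs = softmax(logits)                  # probability distribution, sums to 1
```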
6. Sampling: Converting Probabilities to Output
Finally, the model samples (or picks) the next token:
next_token = argmax(softmax(logits))
or via stochastic sampling (temperature, top-k, nucleus sampling).
This new token becomes input for the next iteration — recursively generating text.
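Here is one way the decoding step might look. The helper sample_next_token is illustrative, combining greedy decoding, temperature, and top-k sampling; it is not any particular library's API.

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, top_k=50, rng=None):
    """Greedy decoding when temperature is 0, otherwise temperature + top-k sampling."""
    rng = rng or np.random.default_rng()
    if temperature == 0.0:
        return int(np.argmax(logits))
    scaled = logits / temperature
    top_ids = np.argsort(scaled)[-top_k:]          # indices of the k highest-scoring tokens
    top_logits = scaled[top_ids]
    probs = np.exp(top_logits - top_logits.max())
    probs /= probs.sum()                           # softmax restricted to the top-k set
    return int(rng.choice(top_ids, p=probs))

logits = np.random.default_rng(3).standard_normal(32_000)   # stand-in for real model output
next_token = sample_next_token(logits, temperature=0.8, top_k=50)
```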
7. Where the “Intelligence” Lives
Every “understanding” or “reasoning” capability of the model is encoded in the millions or billions of numbers inside:
- W_Q, W_K, W_V, W_O
- W₁, W₂
- E and P
Each parameter fine-tunes how inputs mix, how attention flows, and how representations evolve.
At scale, these matrices form a distributed semantic memory — not rules, but high-dimensional geometry learned from data.
8. Summary of the Flow
| Stage | Operation | Parameters | Output |
|---|---|---|---|
| 1 | Tokenization | Vocabulary | Token IDs |
| 2 | Embedding | E, P | Token vectors |
| 3 | Attention | W_Q, W_K, W_V, W_O | Contextual features |
| 4 | FFN | W₁, W₂ | Transformed semantics |
| 5 | Output | Eᵀ | Next-token logits |
Closing Thought
Understanding a model like Phi-3 or LLaMA at the parameter level reveals a simple but profound truth: these “intelligent” systems are deterministic numerical pipelines. The complexity and creativity we perceive are emergent properties of large-scale optimization in these matrices — a symphony of dot products and nonlinearities that together simulate reasoning.
In essence:
A language model doesn’t “know” words — it shapes probability landscapes where meaning naturally emerges through matrix multiplication.
Now let's make this concrete for Phi-3 (the same logic applies to any transformer model) and trace the journey of a single token through each stage, this time with the actual parameter shapes.
1. Token to Embedding
For Phi-3, d_model = 3072, so each token becomes a 3072-dimensional vector.
2. Positional Encoding
Phi-3 uses RoPE (Rotary Position Embedding), which doesn't add separate parameters. Instead, it rotates the embeddings based on position during the attention calculation.
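A rough NumPy sketch of the rotation RoPE applies to each head's query and key vectors. The helper rope_rotate is illustrative; real implementations differ in how they pair the dimensions (interleaved vs. split halves), so treat this as a schematic rather than Phi-3's exact code.

```python
import numpy as np

def rope_rotate(x, position, base=10000.0):
    """Rotate one head's query or key vector by position-dependent angles (no learned weights)."""
    d = x.shape[0]                                   # head dimension, must be even
    freqs = base ** (-np.arange(0, d, 2) / d)        # one frequency per dimension pair
    angles = position * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin           # 2-D rotation applied to each pair
    out[1::2] = x_even * sin + x_odd * cos
    return out

# Example: rotate a 96-dimensional head vector (Phi-3's d_head) at position 5.
q_head = np.random.default_rng(2).standard_normal(96)
q_rotated = rope_rotate(q_head, position=5)
```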
3. Attention Mechanism
For each attention layer (Phi-3 has 32 layers):
Parameters:
- W_Q [d_model × d_model] = [3072 × 3072]
- W_K [d_model × d_model] = [3072 × 3072]
- W_V [d_model × d_model] = [3072 × 3072]
- W_O [d_model × d_model] = [3072 × 3072] (output projection)
Operations:
- Multi-head split: Phi-3 has 32 attention heads, so d_head = 3072 / 32 = 96 dimensions per head.
- Attention scores: A = softmax(QKᵀ / √d_head), computed independently for each head.
- Concatenate heads and project: the 32 head outputs are concatenated back to 3072 dimensions and multiplied by W_O (see the shape sketch below).
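A shape-level sketch of the head split using the Phi-3 sizes quoted above, with a toy sequence length and random stand-in weights; each of the 32 slices then runs the attention computation from the earlier section.

```python
import numpy as np

d_model, n_heads = 3072, 32
d_head = d_model // n_heads                  # 3072 / 32 = 96
seq_len = 8                                  # toy sequence length

rng = np.random.default_rng(4)
x = rng.standard_normal((seq_len, d_model), dtype=np.float32)
W_Q = rng.standard_normal((d_model, d_model), dtype=np.float32) * 0.02   # W_K, W_V have the same shape

Q = x @ W_Q                                                        # (8, 3072)
Q_heads = Q.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)   # (32, 8, 96): one slice per head
print(Q_heads.shape)
```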
4. First Residual + LayerNorm
LayerNorm parameters:
- gamma [3072] (scale)
- beta [3072] (shift)
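A schematic layer_norm with explicit gamma and beta sized to d_model = 3072; the residual add is shown only to indicate where the normalization sits, and the helper is illustrative rather than Phi-3's exact normalization code.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

d_model = 3072
gamma = np.ones(d_model)     # learned per-feature scale
beta = np.zeros(d_model)     # learned per-feature shift

x = np.random.default_rng(5).standard_normal((4, d_model))   # 4 token positions
y = x + layer_norm(x, gamma, beta)   # residual add around the normalized output (schematic)
```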
5. Feed-Forward Network (FFN)
Parameters:
- W_1 [d_model × d_ff] = [3072 × 8192] (up-projection)
- W_2 [d_ff × d_model] = [8192 × 3072] (down-projection)
Phi-3 uses d_ff = 8192 (intermediate size).
Operations: up-project, apply the nonlinearity, and down-project, following the same pattern as the generic FFN described earlier.
Note: Phi-3 actually uses SwiGLU which requires TWO up-projections, so there's actually:
- W_gate [3072 × 8192]
- W_up [3072 × 8192]
- W_down [8192 × 3072]
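A sketch of the SwiGLU block with the three matrices listed above, using toy dimensions (comments note the real Phi-3 sizes); swiglu_ffn and silu are illustrative helper names, not Phi-3's actual code.

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))     # SiLU (swish) activation

def swiglu_ffn(x, W_gate, W_up, W_down):
    # Gate path and up path are mixed elementwise, then projected back down.
    return (silu(x @ W_gate) * (x @ W_up)) @ W_down

# Toy sizes; Phi-3 uses d_model = 3072 and d_ff = 8192.
d_model, d_ff = 64, 172
rng = np.random.default_rng(6)
W_gate = rng.standard_normal((d_model, d_ff)) * 0.02
W_up   = rng.standard_normal((d_model, d_ff)) * 0.02
W_down = rng.standard_normal((d_ff, d_model)) * 0.02

x = rng.standard_normal((4, d_model))     # 4 token positions
y = swiglu_ffn(x, W_gate, W_up, W_down)   # (4, 64)
```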
6. Second Residual + LayerNorm
Another set of gamma and beta parameters [3072 each].
7. Repeat 32 Times
Steps 3-6 repeat for all 32 transformer layers.
8. Final LayerNorm
Final gamma and beta parameters [3072 each].
9. Output Projection to Logits
Parameter:
- W_lm_head [d_model × vocab_size] = [3072 × 32064] (the Phi-3 vocabulary is ~32k tokens)
Operation:
logits = h_final × W_lm_head
Each position now has a 32064-dimensional vector of logits, one score per possible next token; applying softmax converts these scores into a probability distribution.
10. Sampling/Decoding
The logits are converted to probabilities with softmax and the next token is chosen greedily or by stochastic sampling (temperature, top-k, nucleus), exactly as described earlier.
Total Parameter Count Breakdown
For Phi-3 (3.8B parameters):
Total: ~3.8B parameters
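As a sanity check, here is a rough tally based on the shapes quoted in this walkthrough (SwiGLU's three FFN matrices, an untied lm_head, biases ignored, and RoPE contributing no parameters). The exact split in the released checkpoint may differ slightly, but the total lands at roughly 3.8B.

```python
# Back-of-the-envelope parameter tally from the shapes quoted above.
d_model, d_ff, n_layers, vocab = 3072, 8192, 32, 32064

embeddings = vocab * d_model                    # token embedding table E
attention  = n_layers * 4 * d_model * d_model   # W_Q, W_K, W_V, W_O per layer
ffn        = n_layers * 3 * d_model * d_ff      # W_gate, W_up, W_down per layer
norms      = (2 * n_layers + 1) * 2 * d_model   # gamma/beta for every norm
lm_head    = d_model * vocab                    # output projection W_lm_head

total = embeddings + attention + ffn + norms + lm_head
print(f"~{total / 1e9:.2f}B parameters")        # prints ~3.82B
```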
This is the complete journey from token ID to output logits at the parameter level!