Introduction
Modern AI — specifically large language models, image generators, and classification systems — is built on a surprisingly compact mathematical foundation. Everything from ChatGPT generating a sentence to Stable Diffusion rendering an image reduces, at the lowest level, to the same operations: matrix multiplication, nonlinear functions, and gradient descent. The sophistication is in how these primitives are composed and scaled.
This article develops the mathematics from first principles. We begin with linear algebra and probability theory, build through calculus-based optimization, construct the neural network formalism, derive backpropagation, and culminate in the attention mechanism and transformer architecture that underlies every major language model in 2026. No shortcuts are taken. Every equation is derived or explained from its components.
Prerequisites: comfort with high school algebra, basic calculus notation (derivatives, chain rule). Everything else is built up here.
Linear Algebra: The Language of AI
1.1 Vectors and Vector Spaces
A vector is an ordered list of numbers. In AI, vectors represent everything: a word, an image patch, a user preference, a hidden state. A vector x in n-dimensional real space is written:
x = [x₁, x₂, ..., xₙ]ᵀ ∈ ℝⁿ
The set ℝⁿ is a vector space: it is closed under addition and scalar multiplication, has a zero vector, and satisfies the eight vector space axioms. This algebraic structure is what makes linear algebra applicable to machine learning — neural network layers are linear maps between vector spaces.
1.1.1 The Dot Product and Cosine Similarity
The dot product of two vectors x, y ∈ ℝⁿ is:
x · y = xᵀy = Σᵢ xᵢyᵢ = ‖x‖‖y‖cos(θ)
where θ is the angle between the vectors. The dot product measures alignment: it is maximized when x and y point in the same direction (cos θ = 1), zero when they are perpendicular (cos θ = 0), and minimized when they are antiparallel (cos θ = -1). This geometric interpretation underlies attention mechanisms — the dot product between a query and a key measures how much they "attend" to each other.
Cosine similarity normalizes the dot product by the product of vector magnitudes:
cos_sim(x, y) = (x · y) / (‖x‖ · ‖y‖) ∈ [-1, 1]
Cosine similarity is used extensively in NLP for measuring semantic similarity between word embeddings: vectors representing words with similar meanings cluster together in embedding space and have cosine similarity close to 1.
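The definitions above are easy to check numerically. A minimal NumPy sketch (vectors chosen arbitrarily for illustration):

```python
import numpy as np

def cosine_similarity(x, y):
    # dot product normalized by the product of L2 norms
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

a = np.array([1.0, 2.0, 3.0])
parallel = cosine_similarity(a, 2 * a)      # same direction -> 1
orthogonal = cosine_similarity(np.array([1.0, 0.0]),
                               np.array([0.0, 1.0]))  # perpendicular -> 0
opposite = cosine_similarity(a, -a)         # antiparallel -> -1
```

Note that scaling a vector leaves its cosine similarity unchanged — only direction matters.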
1.2 Matrices and Linear Transformations
A matrix W ∈ ℝᵐˣⁿ represents a linear transformation from ℝⁿ to ℝᵐ. Matrix-vector multiplication:
y = Wx, where yᵢ = Σⱼ Wᵢⱼ xⱼ
This is the core operation of a neural network layer. Each row Wᵢ· is a weight vector; the output yᵢ is the dot product of the i-th row with the input x. The matrix W encodes what the layer has learned: which linear combinations of input features to compute.
1.2.1 Matrix Multiplication
For matrices A ∈ ℝᵐˣⁿ and B ∈ ℝⁿˣᵖ, their product C = AB ∈ ℝᵐˣᵖ has entries:
Cᵢⱼ = Σₖ Aᵢₖ Bₖⱼ = Aᵢ· · B·ⱼ
Matrix multiplication is not commutative (AB ≠ BA in general) but is associative: A(BC) = (AB)C. The computational cost of multiplying an m×n matrix by an n×p matrix is O(mnp) — the dominant cost in large neural networks. For a transformer with hidden dimension d = 4096 processing a sequence of 2048 tokens, a single matrix multiply costs O(2048 × 4096 × 4096) ≈ 34 billion operations.
1.2.2 Transpose and Symmetric Matrices
The transpose Wᵀ of a matrix W flips rows and columns: (Wᵀ)ᵢⱼ = Wⱼᵢ. Key properties: (AB)ᵀ = BᵀAᵀ, (Wᵀ)ᵀ = W. A symmetric matrix satisfies W = Wᵀ.
The attention weight matrix QKᵀ/√dₖ in transformers is computed as a matrix product involving a transpose — the query matrix Q ∈ ℝˢˣᵈ multiplied by the transposed key matrix Kᵀ ∈ ℝᵈˣˢ, producing an s×s attention weight matrix, where s is the sequence length.
1.3 Eigenvalues, Eigenvectors, and SVD
A vector v is an eigenvector of matrix A if:
Av = λv
where λ is the corresponding eigenvalue. Eigenvectors are directions that matrix A scales without rotating. The set of all eigenvalues is the spectrum of A; it characterizes the matrix's behavior: large eigenvalues amplify their corresponding directions, small eigenvalues attenuate them.
The Singular Value Decomposition (SVD) generalizes eigendecomposition to non-square matrices:
W = U Σ Vᵀ
where U ∈ ℝᵐˣᵐ and V ∈ ℝⁿˣⁿ are orthogonal matrices (rotations/reflections), and Σ ∈ ℝᵐˣⁿ is diagonal with non-negative entries σ₁ ≥ σ₂ ≥ ... ≥ σᵣ (the singular values). SVD is used in weight compression: the truncated decomposition Wₖ = Uₖ Σₖ Vₖᵀ, keeping only the top k singular values, is the best rank-k approximation to W in Frobenius norm (the Eckart–Young theorem) — the low-rank principle that motivates LoRA (Low-Rank Adaptation) fine-tuning.
1.4 Norms and Regularization
The L2 norm (Euclidean norm) of a vector x ∈ ℝⁿ:
‖x‖₂ = √(Σᵢ xᵢ²) = √(xᵀx)
The Frobenius norm of a matrix W ∈ ℝᵐˣⁿ:
‖W‖_F = √(Σᵢ Σⱼ Wᵢⱼ²) = √(tr(WᵀW))
Norms quantify the "size" of vectors and matrices. In optimization, adding a norm penalty to the loss function — regularization — penalizes large weights and prevents overfitting. L2 regularization (weight decay) adds λ‖W‖²F to the loss, causing weights to decay toward zero during training.
Probability Theory & Information Theory
2.1 Probability Distributions
A probability distribution P over a discrete sample space X assigns P(x) ≥ 0 to each outcome x ∈ X, with Σₓ P(x) = 1. Language modeling is precisely this: estimating a distribution over token sequences.
For continuous variables, the probability density function (PDF) f(x) satisfies f(x) ≥ 0 and ∫f(x)dx = 1. The probability of a range is P(a ≤ X ≤ b) = ∫ₐᵇ f(x)dx.
2.1.1 The Softmax Distribution
The softmax function converts a vector of real-valued logits z ∈ ℝⁿ into a probability distribution:
softmax(z)ᵢ = exp(zᵢ) / Σⱼ exp(zⱼ)
Properties: all outputs are positive, they sum to 1 (valid probability distribution), and large logits map to probabilities close to 1 while small logits map to near 0. Temperature scaling modifies this: softmax(z/T) — at T → 0 it approaches argmax (one-hot), at T → ∞ it approaches uniform. Language models sample from this distribution at each generation step.
The numerical stability issue: exp(z) overflows for large z. The standard fix — subtract the maximum before exponentiation:
softmax(z)ᵢ = exp(zᵢ - max(z)) / Σⱼ exp(zⱼ - max(z))
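The stable form and the temperature behavior can both be demonstrated in a few lines of NumPy — a sketch, with logits chosen large enough that the naive form would overflow:

```python
import numpy as np

def softmax(z, T=1.0):
    # subtract the max before exponentiating for numerical stability;
    # T is the sampling temperature
    z = np.asarray(z, dtype=np.float64) / T
    e = np.exp(z - z.max())
    return e / e.sum()

p = softmax([1000.0, 1001.0, 1002.0])   # naive exp(1000) would overflow
flat = softmax([1.0, 2.0, 3.0], T=10.0)   # high T -> near uniform
sharp = softmax([1.0, 2.0, 3.0], T=0.5)   # low T -> near one-hot
```

The max subtraction changes nothing mathematically — it multiplies numerator and denominator by the same constant exp(-max(z)).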
2.2 Information Theory
2.2.1 Shannon Entropy
The entropy H(P) of a distribution P measures its uncertainty — the average information content of an outcome:
H(P) = -Σₓ P(x) log₂ P(x) [bits]
Entropy is maximized when P is uniform (maximum uncertainty) and minimized (= 0) when P is deterministic. A fair coin has H = 1 bit; a six-sided die has H = log₂(6) ≈ 2.58 bits. Language entropy measures how many bits per token are needed to encode a language.
2.2.2 Cross-Entropy and KL Divergence
The cross-entropy H(P, Q) measures the expected code length when using distribution Q to encode samples from the true distribution P:
H(P, Q) = -Σₓ P(x) log Q(x)
This is the standard training loss for classification and language modeling: P is the true (one-hot) label distribution, Q is the model's predicted distribution. Minimizing cross-entropy is equivalent to maximizing the log-likelihood of the data under the model.
The KL divergence DKL(P‖Q) measures how much Q differs from P:
D_KL(P‖Q) = Σₓ P(x) log(P(x)/Q(x)) = H(P,Q) - H(P) ≥ 0
KL divergence is zero if and only if P = Q almost everywhere, and positive otherwise (Gibbs' inequality). It is not symmetric: DKL(P‖Q) ≠ DKL(Q‖P). It appears in RLHF training: the KL penalty term that prevents the fine-tuned model from deviating too far from the base model is DKL(πθ ‖ πref).
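The identities H(P,Q) = H(P) + D_KL(P‖Q) and the asymmetry of KL can be verified directly. A small NumPy sketch (distributions chosen arbitrarily, natural log for simplicity):

```python
import numpy as np

def cross_entropy(p, q):
    # expected code length (in nats) using q to encode samples from p
    return float(-(p * np.log(q)).sum())

def entropy(p):
    return cross_entropy(p, p)

def kl(p, q):
    # D_KL(P||Q) = H(P,Q) - H(P) >= 0, zero iff P == Q
    return cross_entropy(p, q) - entropy(p)

p = np.array([0.5, 0.5])
q = np.array([0.9, 0.1])
```

Evaluating kl(p, q) and kl(q, p) on these distributions gives different values, illustrating the asymmetry.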
2.2.3 Perplexity
Perplexity is the exponential of the cross-entropy loss — the standard metric for language model evaluation:
PPL(P, Q) = exp(H(P, Q)) = exp(-1/N Σₙ log Q(tₙ | t₁,...,tₙ₋₁))
Intuitively, perplexity is the effective vocabulary size the model is choosing from at each step. A perplexity of 20 means the model is, on average, as uncertain as if it were choosing uniformly from 20 equally likely options. State-of-the-art language models achieve perplexity of 3–8 on standard benchmarks.
2.3 Bayes' Theorem
Bayes' theorem is the foundation of probabilistic machine learning:
P(θ|D) = P(D|θ) · P(θ) / P(D)
where θ are model parameters, D is observed data, P(θ|D) is the posterior, P(D|θ) is the likelihood, P(θ) is the prior, and P(D) = ∫P(D|θ)P(θ)dθ is the marginal likelihood (evidence). Maximum Likelihood Estimation (MLE) maximizes P(D|θ), ignoring the prior. Maximum A Posteriori (MAP) estimation maximizes P(θ|D), equivalent to MLE with regularization proportional to -log P(θ).
Modern deep learning typically uses MLE/MAP rather than full Bayesian inference, because computing the posterior over billions of parameters is computationally intractable. Uncertainty estimation in LLMs remains an open research problem.
Calculus and Optimization
3.1 Derivatives and the Gradient
The derivative of a scalar function f: ℝ → ℝ at point x measures the instantaneous rate of change:
f'(x) = df/dx = lim_{h→0} [f(x+h) - f(x)] / h
For a function f: ℝⁿ → ℝ (as in a loss function), the gradient ∇f ∈ ℝⁿ is the vector of partial derivatives:
∇f(x) = [∂f/∂x₁, ∂f/∂x₂, ..., ∂f/∂xₙ]ᵀ
The gradient points in the direction of steepest ascent. Moving in the negative gradient direction therefore descends the loss function — the basis of gradient descent.
3.1.1 The Jacobian and Hessian
For a vector-valued function f: ℝⁿ → ℝᵐ, the Jacobian J ∈ ℝᵐˣⁿ contains all first-order partial derivatives: Jᵢⱼ = ∂fᵢ/∂xⱼ. The Jacobian of a layer's output with respect to its input is central to backpropagation.
The Hessian H ∈ ℝⁿˣⁿ of a scalar function contains all second-order partial derivatives: Hᵢⱼ = ∂²f/(∂xᵢ ∂xⱼ). The Hessian characterizes the curvature of the loss landscape: a positive definite Hessian implies a local minimum, negative definite a local maximum, and indefinite a saddle point.
3.1.2 The Chain Rule
The chain rule is the mathematical foundation of backpropagation. For composite functions g(f(x)):
d/dx g(f(x)) = g'(f(x)) · f'(x)
For vector-valued functions, the multivariate chain rule uses Jacobians:
∂z/∂x = (∂z/∂y) · (∂y/∂x) = J_g · J_f
A neural network is a composition of functions: L(θ) = loss(f_L(... f₂(f₁(x; W₁); W₂) ...; W_L)). The chain rule applied recursively through this composition computes ∂L/∂Wₗ for every layer l — this recursive application is backpropagation.
3.2 Gradient Descent and Its Variants
3.2.1 Stochastic Gradient Descent
Vanilla gradient descent updates parameters θ in the negative gradient direction:
θ_{t+1} = θ_t - η · ∇_θ L(θ_t)
where η is the learning rate. Computing the true gradient over the full dataset is expensive — O(N) forward passes per step. Stochastic Gradient Descent (SGD) approximates it using a mini-batch B ⊂ D of size b:
θ_{t+1} = θ_t - η · (1/b) Σ_{x∈B} ∇_θ ℓ(f(x; θ), y)
The mini-batch gradient is an unbiased estimator of the full gradient: E[gB] = ∇L. Its variance decreases with batch size b: Var(gB) = σ²/b.
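The SGD update is short enough to implement end to end. A minimal NumPy sketch on a toy least-squares problem (data, learning rate, and batch size are illustrative choices, not recommendations):

```python
import numpy as np

rng = np.random.default_rng(0)
# toy linear regression: y = X w_true + small noise
w_true = np.array([2.0, -3.0])
X = rng.normal(size=(1000, 2))
y = X @ w_true + 0.01 * rng.normal(size=1000)

w = np.zeros(2)
eta, b = 0.1, 32                            # learning rate, batch size
for step in range(500):
    idx = rng.integers(0, len(X), size=b)   # sample a mini-batch
    Xb, yb = X[idx], y[idx]
    grad = 2 / b * Xb.T @ (Xb @ w - yb)     # gradient of mean squared error
    w -= eta * grad                         # SGD update
```

Each step uses only 32 of the 1000 examples, yet the iterates converge to w_true — the noisy mini-batch gradient is unbiased, so the noise averages out over steps.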
3.2.2 Momentum
Momentum accumulates a velocity vector v in the direction of persistent gradient components:
v_{t+1} = β · v_t + (1-β) · ∇_θ L(θ_t)
θ_{t+1} = θ_t - η · v_{t+1}
With β ∈ [0.9, 0.99], momentum effectively averages gradients over 1/(1-β) steps — 10 steps at β=0.9, 100 steps at β=0.99. This accelerates convergence along flat dimensions and dampens oscillation along steep dimensions.
3.2.3 Adam: Adaptive Moment Estimation
Adam (Kingma & Ba, 2015) maintains per-parameter adaptive learning rates by tracking both the first moment (mean) and second moment (uncentered variance) of gradients:
m_t = β₁ · m_{t-1} + (1-β₁) · g_t
v_t = β₂ · v_{t-1} + (1-β₂) · g_t²
m̂_t = m_t / (1-β₁ᵗ), v̂_t = v_t / (1-β₂ᵗ)
θ_{t+1} = θ_t - η · m̂_t / (√v̂_t + ε)
Typical hyperparameters: β₁ = 0.9, β₂ = 0.999, ε = 10⁻⁸, η = 3×10⁻⁴. The effective learning rate per parameter is η/√v̂_t — parameters with high gradient variance get smaller effective steps; parameters with low gradient variance get larger steps.
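The four update equations translate directly into code. A self-contained NumPy sketch, applied to the toy objective f(x) = x² (the learning rate here is raised above the typical value purely to speed up the toy example):

```python
import numpy as np

def adam_step(theta, g, m, v, t, eta=3e-4, b1=0.9, b2=0.999, eps=1e-8):
    # one Adam update; t is the 1-indexed step count (for bias correction)
    m = b1 * m + (1 - b1) * g            # first moment (EMA of gradients)
    v = b2 * v + (1 - b2) * g**2         # second moment (EMA of squared gradients)
    m_hat = m / (1 - b1**t)              # bias correction for zero initialization
    v_hat = v / (1 - b2**t)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# minimize f(x) = x^2 starting from x = 1
theta, m, v = np.array([1.0]), np.zeros(1), np.zeros(1)
for t in range(1, 5001):
    g = 2 * theta                        # gradient of x^2
    theta, m, v = adam_step(theta, g, m, v, t, eta=0.01)
```

Note the bias correction: because m and v start at zero, the raw EMAs underestimate the moments early in training; dividing by (1−βᵗ) corrects this.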
3.2.4 Learning Rate Scheduling
The learning rate η is not typically held constant. Common schedules include linear warmup + cosine decay (used in most transformer pretraining), cosine annealing with restarts, and polynomial decay. The warmup phase is critical for large models: at initialization, gradient estimates are noisy and a large initial learning rate can destabilize training.
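Linear warmup followed by cosine decay is simple to express in closed form. A sketch, with step counts and learning-rate values chosen only for illustration:

```python
import math

def lr_schedule(step, warmup_steps=2000, total_steps=100_000,
                peak_lr=3e-4, min_lr=3e-5):
    # linear warmup to peak_lr, then cosine decay down to min_lr
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

The schedule ramps linearly from 0 to the peak over the warmup window, then follows a half-cosine from the peak down to the floor over the remaining steps.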
Neural Network Architecture
4.1 The Feedforward Layer
A single neural network layer computes a linear transformation followed by a nonlinear activation function σ:
h = σ(Wx + b)
where W ∈ ℝᵐˣⁿ is the weight matrix, b ∈ ℝᵐ is the bias vector, x ∈ ℝⁿ is the input, and h ∈ ℝᵐ is the hidden state. Without σ, stacking layers would be equivalent to a single linear transformation. The activation function breaks this linearity, making deep networks capable of representing exponentially more complex functions.
4.2 Activation Functions
4.2.1 ReLU and Variants
ReLU(x) = max(0, x) = x · 𝟙[x > 0]
Properties: computationally trivial, gradient is either 0 (x < 0) or 1 (x > 0), produces sparse activations, does not saturate for positive values. Problem: "dying ReLU" — neurons with x < 0 have zero gradient and cannot recover.
GELU (Gaussian Error Linear Unit), used in GPT and BERT:
GELU(x) = x · Φ(x) = x · P(Z ≤ x) where Z ~ N(0,1)
SwiGLU, used in LLaMA and PaLM, is a gated variant:
SwiGLU(x, W, V) = Swish(xW) ⊙ (xV), where Swish(u) = u · σ(u)
4.2.2 Sigmoid and Tanh
σ(x) = 1/(1 + e⁻ˣ) ∈ (0,1)
tanh(x) = (eˣ - e⁻ˣ)/(eˣ + e⁻ˣ) ∈ (-1,1)
Both suffer from vanishing gradients in deep networks: for |x| >> 0, the gradient σ'(x) = σ(x)(1-σ(x)) → 0, causing gradients to vanish exponentially as they propagate backward. This was the primary obstacle to training deep networks before ReLU and residual connections.
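The exponential decay is easy to quantify. A quick NumPy sketch (the input value 5.0 and the depth of 10 are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1 - s)            # maximum is 0.25, attained at x = 0

# gradient factor contributed by one saturated sigmoid unit
per_unit = sigmoid_grad(5.0)      # ~0.0066 when |x| >> 0
# after backpropagating through 10 such units, the signal shrinks as a product
after_10_layers = per_unit ** 10
```

Even at the best case (x = 0), each sigmoid multiplies the gradient by at most 0.25, so a 10-layer chain attenuates it by at least 0.25¹⁰ ≈ 10⁻⁶; saturated units are far worse.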
4.3 Multilayer Perceptrons (MLPs)
An L-layer MLP computes:
h₀ = x
hₗ = σ(Wₗ hₗ₋₁ + bₗ) for l = 1,...,L-1
ŷ = softmax(W_L h_{L-1} + b_L)
The Universal Approximation Theorem (Hornik, 1989) states that a single hidden layer MLP with sufficient width can approximate any continuous function on a compact domain to arbitrary precision. However, deep networks are exponentially more parameter-efficient than shallow ones for many function classes.
4.4 Backpropagation
Backpropagation computes ∂L/∂W for every weight matrix W in a neural network, using the chain rule applied backward through the computational graph.
Define the error signals (upstream gradients):
δₗ = ∂L/∂zₗ where zₗ = Wₗhₗ₋₁ + bₗ
The backpropagation recurrence:
δ_L = ∂L/∂z_L
δₗ = (Wₗ₊₁ᵀ δₗ₊₁) ⊙ σ'(zₗ)
∂L/∂Wₗ = δₗ hₗ₋₁ᵀ
∂L/∂bₗ = δₗ
```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def relu_grad(z):
    return (z > 0).astype(z.dtype)   # σ'(z) for ReLU: 1 where z > 0, else 0

# Forward pass: cache activations and pre-activations for the backward pass
def forward(x, layers):
    activations = [x]
    pre_activations = []
    for W, b in layers:
        z = W @ activations[-1] + b
        pre_activations.append(z)
        activations.append(relu(z))
    return activations, pre_activations

# Backward pass: apply the recurrence δₗ = (Wₗ₊₁ᵀ δₗ₊₁) ⊙ σ'(zₗ)
def backward(loss_grad, activations, pre_acts, layers):
    delta = loss_grad
    grads = []
    for l in reversed(range(len(layers))):
        W, b = layers[l]
        dW = delta @ activations[l].T    # ∂L/∂W = δₗ hₗ₋₁ᵀ
        db = delta.sum(axis=1)           # ∂L/∂b = δₗ (summed over the batch)
        grads.insert(0, (dW, db))
        if l > 0:
            delta = (W.T @ delta) * relu_grad(pre_acts[l - 1])
    return grads
```
Why Backpropagation Is Efficient
A naive approach would perturb each weight by ε and measure (L(W + εeᵢⱼ) − L(W))/ε. For n parameters, this requires n forward passes — O(n²) total work. Backpropagation computes all n gradients in a single backward pass — O(n) total — by reusing intermediate computations via dynamic programming.
4.5 The Vanishing and Exploding Gradient Problem
The error signal δₗ involves a product of L−l Jacobian matrices. If the spectral radius ρ(J) < 1, gradients vanish exponentially; if ρ(J) > 1, they explode. Solutions include residual connections (h = F(x) + x), gradient clipping (rescale g when ‖g‖ > γ), layer normalization, and careful initialization (He initialization: W ~ N(0, 2/nᵢₙ)).
Normalization Layers
5.1 Batch Normalization
μ_B = (1/m) Σᵢ xᵢ, σ²_B = (1/m) Σᵢ (xᵢ - μ_B)²
x̂ᵢ = (xᵢ - μ_B) / √(σ²_B + ε)
yᵢ = γ · x̂ᵢ + β
γ and β are learned parameters that allow the network to undo the normalization if optimal. BatchNorm is problematic for language models because sequences have variable length and batch statistics are unstable for small batches.
5.2 Layer Normalization
μₓ = (1/H) Σⱼ xⱼ, σ²ₓ = (1/H) Σⱼ (xⱼ - μₓ)²
x̂ⱼ = (xⱼ - μₓ) / √(σ²ₓ + ε), yⱼ = γⱼ x̂ⱼ + βⱼ
Layer norm statistics are computed per example, not per batch — they are independent of batch size and behave identically at training and inference time. This makes LayerNorm the standard normalization in transformers. In Pre-LN transformers (GPT-2 and later), LayerNorm is applied before the attention and MLP sublayers, producing more stable training dynamics at scale.
5.3 RMSNorm
RMSNorm(x) = x / RMS(x) · γ, where RMS(x) = √(1/H Σⱼ xⱼ²)
RMSNorm is ~10% faster than LayerNorm and produces comparable results. It is used in LLaMA, Mistral, and Gemma architectures. The intuition: the re-centering step in LayerNorm has minimal effect on model quality; the dominant benefit comes from the re-scaling that controls activation magnitude.
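The only difference between the two normalizations is the re-centering step. A side-by-side NumPy sketch (the input vector is arbitrary):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # normalize each example to zero mean and unit variance, then rescale
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def rms_norm(x, gamma, eps=1e-5):
    # no re-centering: only rescale by the root-mean-square of activations
    rms = np.sqrt((x**2).mean(axis=-1, keepdims=True) + eps)
    return gamma * x / rms

x = np.array([[1.0, 2.0, 3.0, 4.0]])
ln = layer_norm(x, np.ones(4), np.zeros(4))
rn = rms_norm(x, np.ones(4))
```

LayerNorm output has zero mean and unit variance per example; RMSNorm output has unit RMS but keeps the input's mean direction.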
Embeddings & Positional Encoding
6.1 Token Embeddings
A token embedding matrix E ∈ ℝ^(|V|×d) maps each discrete token in a vocabulary V to a dense d-dimensional vector:
e_t = E[t] ∈ ℝᵈ
The embedding dimension d is typically 512–4096 for modern models. The embedding matrix is learned jointly with the rest of the model — nearby tokens in embedding space have similar contextual usage patterns. The Word2Vec discovery that arithmetic in embedding space captures semantic relationships (king - man + woman ≈ queen) is an emergent property of training on large corpora.
Weight tying: in many models, the output projection matrix Wₒᵤₜ is shared with Eᵀ, reducing the parameter count by |V|×d with no empirical degradation.
6.2 Positional Encoding
Transformers have no intrinsic ordering — attention is permutation-equivariant. Positional encoding injects position information.
6.2.1 Sinusoidal Positional Encoding
PE(pos, 2i) = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))
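The two formulas vectorize naturally. A NumPy sketch (the sequence length and dimension below are arbitrary):

```python
import numpy as np

def sinusoidal_pe(max_pos, d):
    # even dims get sin, odd dims get cos, with geometrically spaced frequencies
    pos = np.arange(max_pos)[:, None]        # (max_pos, 1)
    i = np.arange(d // 2)[None, :]           # (1, d/2)
    angles = pos / 10000 ** (2 * i / d)      # (max_pos, d/2)
    pe = np.zeros((max_pos, d))
    pe[:, 0::2] = np.sin(angles)             # even indices
    pe[:, 1::2] = np.cos(angles)             # odd indices
    return pe

pe = sinusoidal_pe(128, 64)
```

Position 0 encodes to sin(0) = 0 on even dimensions and cos(0) = 1 on odd ones, and every entry stays in [−1, 1], so the encoding can simply be added to the token embeddings.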
6.2.2 Rotary Position Embeddings (RoPE)
RoPE (Su et al., 2021), used in LLaMA, Mistral, and GPT-NeoX, encodes position by rotating query and key vectors in 2D subspaces:
q_m = R_m q, k_n = R_n k
(R_m q)ᵀ(R_n k) = qᵀ R_{m-n}ᵀ k = f(q, k, m-n)
The dot product depends only on the relative position m−n, not on absolute positions. This relative-position property makes RoPE better suited than sinusoidal encoding for tasks sensitive to relative position and underlies context-length extension techniques.
The Attention Mechanism
7.1 Scaled Dot-Product Attention
The attention mechanism computes weighted sums of values, where weights are determined by query-key similarity:
Attention(Q, K, V) = softmax(QKᵀ / √dₖ) · V
where Q ∈ ℝˢˣᵈᵏ (queries), K ∈ ℝˢˣᵈᵏ (keys), V ∈ ℝˢˣᵈᵥ (values). The scaling factor 1/√dₖ prevents the dot products from growing too large — without it, softmax saturates into near-one-hot distributions, causing vanishing gradients.
The attention weight matrix A = softmax(QKᵀ/√dₖ) ∈ ℝˢˣˢ has entry Aᵢⱼ representing how much position i attends to position j. The output is a weighted sum of values: outputᵢ = Σⱼ Aᵢⱼ Vⱼ. This allows each position to aggregate information from any other position in the sequence — the defining capability of transformers.
7.2 Multi-Head Attention
Multi-head attention runs h parallel attention operations in lower-dimensional subspaces:
headᵢ = Attention(QWᵢ^Q, KWᵢ^K, VWᵢ^V)
MultiHead(Q,K,V) = Concat(head₁,...,headₕ) Wᴼ
Different heads learn different attention patterns: some attend to recent positions (local context), others to syntactic structure, others to semantic relationships. With h=12 heads and d=768 (BERT-base), each head operates in a 64-dimensional subspace.
7.3 Causal (Masked) Attention
A = softmax((QKᵀ/√dₖ) + M), M_ij = 0 if j≤i, -∞ if j>i
Adding -∞ to masked positions before softmax sets exp(-∞) = 0, zeroing those attention weights. The resulting attention is lower-triangular: position i attends to positions 1,...,i but not i+1,...,s.
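Scaling, masking, and softmax combine into a few lines. A single-head NumPy sketch (random Q, K, V; a batched, multi-head version adds only bookkeeping):

```python
import numpy as np

def causal_attention(Q, K, V):
    # scaled dot-product attention with a lower-triangular (causal) mask
    s, dk = Q.shape
    scores = Q @ K.T / np.sqrt(dk)                     # (s, s) similarity logits
    mask = np.triu(np.ones((s, s), dtype=bool), k=1)   # True above the diagonal
    scores = np.where(mask, -np.inf, scores)           # block future positions
    scores = scores - scores.max(axis=-1, keepdims=True)  # stable softmax
    weights = np.exp(scores)                           # exp(-inf) = 0
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(5, 8)) for _ in range(3))
out, A = causal_attention(Q, K, V)
```

Position 0 can only attend to itself, so its attention row is exactly [1, 0, ..., 0]; every row of A is a valid probability distribution over positions 1..i.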
7.4 Computational Complexity
Standard attention is O(s²d) in time and O(s²) in memory due to the s×s attention matrix. Efficient variants include FlashAttention (tiling for 2-4× speedup, exact), sparse attention (O(s·w) for window size w), linear attention (O(s·d²)), and Multi-Query Attention (shares K and V across heads, reducing KV cache size by factor h).
The Transformer Architecture
8.1 The Transformer Block
A transformer layer consists of two sublayers: multi-head self-attention and a position-wise MLP, each with residual connection and layer normalization:
x' = x + MultiHead(LN(x), LN(x), LN(x))
x'' = x' + MLP(LN(x'))
The MLP sublayer:
MLP(x) = W₂ · GELU(W₁x + b₁) + b₂
where W₁ ∈ ℝ^(4d×d) and W₂ ∈ ℝ^(d×4d). For d=4096, this MLP has ~134 million parameters per layer.
8.2 Parameter Count
| Component | Parameters | Notes |
|---|---|---|
| Token embedding | d × |V| | Weight-tied with output |
| Positional embedding | d × s_max | Absent in RoPE models |
| Attention (per layer) | 4d² | Q,K,V,O projections |
| MLP (per layer) | 8d² | SwiGLU: 12d² |
| LayerNorm (per layer) | 4d | Negligible |
| Total (L layers) | ≈ 12Ld² | For large d |
For GPT-3 (L=96, d=12288): 12 × 96 × 12288² ≈ 175 billion parameters.
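The table's ≈12Ld² rule is a one-line calculation. A sketch (the round 50,000-token vocabulary is an assumption for illustration; GPT-3's actual vocabulary is slightly larger):

```python
def transformer_params(L, d, vocab=50_000):
    # per layer: 4d^2 (attention Q,K,V,O projections) + 8d^2 (MLP with 4d hidden)
    per_layer = 4 * d**2 + 8 * d**2
    embed = vocab * d          # embedding matrix (tied with the output projection)
    return L * per_layer + embed

gpt3_approx = transformer_params(L=96, d=12288)   # ≈ 175 billion
```

The 12Ld² term dominates: the embedding matrix contributes well under 1% of GPT-3's parameters.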
8.3 The Residual Stream
The residual connection pattern creates the residual stream — a d-dimensional vector flowing through the entire network, additively modified by each layer:
x₀ = Embed(tokens) + PE
x'ₗ = xₗ₋₁ + Attnₗ(xₗ₋₁),  xₗ = x'ₗ + MLPₗ(x'ₗ)
logits = Unembed(LN(x_L))
The mechanistic interpretability view treats the residual stream as a communication channel: attention heads and MLP layers read from it and write additive updates, enabling interpretability research into information flow.
Training Dynamics at Scale
9.1 The Loss Landscape
The loss function of a neural network, L: ℝⁿ → ℝ with n ~ 10⁹–10¹² parameters, is highly non-convex, and optimizing it is intractable in the worst case. Yet large overparameterized networks are relatively easy to train in practice. Contributing factors include overparameterization (most local minima are close to the global optimum in loss), the NTK regime (infinite-width networks train like kernel regression), and implicit regularization (SGD tends to find flat minima that generalize well).
9.2 The Chinchilla Scaling Laws
Hoffmann et al. (2022) derived empirical scaling laws relating compute budget C, model size N, and dataset size D:
L(N, D) = E + A/Nᵅ + B/Dᵝ
where E ≈ 1.69 (the irreducible entropy of text), α ≈ 0.34, and β ≈ 0.28. The key finding: for fixed compute C ≈ 6ND, the optimal trade-off is N_opt ∝ C^0.5 and D_opt ∝ C^0.5 — model size and dataset size should be scaled in equal proportion. This implies GPT-3 (175B parameters, 300B tokens) was significantly undertrained.
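The equal-scaling result is often summarized as a tokens-per-parameter rule of thumb of roughly 20, which follows from Chinchilla's own configuration (70B parameters, 1.4T tokens). A sketch using that illustrative constant, not the paper's fitted coefficients:

```python
def chinchilla_optimal(C, tokens_per_param=20):
    # compute-optimal split under C = 6 N D with D/N held at a fixed ratio,
    # so N_opt and D_opt both scale as C^0.5
    N = (C / (6 * tokens_per_param)) ** 0.5   # parameters
    D = tokens_per_param * N                  # training tokens
    return N, D

# Chinchilla's approximate budget: 6 * 70e9 * 1.4e12 ≈ 5.88e23 FLOPs
N, D = chinchilla_optimal(5.88e23)
```

Doubling the compute budget under this rule increases both N and D by √2, rather than putting all the extra compute into a bigger model.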
9.3 Mixed Precision Training
Modern models train in mixed precision: forward pass in float16 for memory efficiency, weight updates in float32 for numerical stability.
```python
import torch

scaler = torch.cuda.amp.GradScaler()   # loss scaling for fp16 gradients

optimizer.zero_grad()
# Forward pass in fp16
with torch.autocast(device_type='cuda', dtype=torch.float16):
    logits = model(input_ids)
    loss = cross_entropy(logits, targets)
# Backward pass: gradients in fp16, scaled up to avoid underflow
scaler.scale(loss).backward()
# Gradient unscaling + clipping in fp32
scaler.unscale_(optimizer)
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
# Weight update in fp32 (master weights)
scaler.step(optimizer)
scaler.update()
```
BFloat16 is preferred over float16: same 8-bit exponent as float32 (matching dynamic range) but only 7 mantissa bits, avoiding overflow/underflow without loss scaling.
9.4 Distributed Training
Training a 70B model requires ~140 GB GPU memory for weights alone. Strategies include data parallelism (replicate model, sync gradients), tensor parallelism (split weight matrices across GPUs), pipeline parallelism (assign layers to different GPUs), and ZeRO (partition optimizer states, gradients, and weights across workers, enabling arbitrarily large models).
Language Modeling & Generation
10.1 The Autoregressive Objective
L(θ) = -E_{x~D}[ Σₜ log P_θ(xₜ | x_{<t}) ]
At each position t, the model predicts a distribution over the full vocabulary V given all preceding tokens. This simple next-token prediction objective, applied at enormous scale, is sufficient to learn rich representations of language, factual knowledge, and reasoning.
10.2 Decoding Strategies
10.2.1 Temperature Scaling
P(tᵢ) = exp(zᵢ/T) / Σⱼ exp(zⱼ/T)
T < 1 sharpens (more deterministic), T > 1 flattens (more random). T = 0 reduces to argmax (greedy decoding).
10.2.2 Top-k Sampling
P'(tᵢ) = P(tᵢ) / Σ_{j∈top-k} P(tⱼ) for i ∈ top-k
10.2.3 Nucleus (Top-p) Sampling
V_p is the smallest set of tokens (taken in order of decreasing probability) whose cumulative probability reaches p: Σ_{v∈V_p} P(v) ≥ p. The surviving probabilities are renormalized: P'(tᵢ) = P(tᵢ) / Σ_{v∈V_p} P(v).
Top-p is adaptive: more tokens when uncertain, fewer when confident. Typical settings: temperature 0.7–1.0, top-p 0.9–0.95.
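Temperature, top-k, and top-p compose into a single sampling function. A NumPy sketch (the function name and its composition order are illustrative choices; production samplers differ in details):

```python
import numpy as np

def sample_token(logits, T=1.0, top_k=None, top_p=None, rng=None):
    # temperature -> top-k filter -> top-p (nucleus) filter -> sample
    rng = rng if rng is not None else np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64) / T
    probs = np.exp(logits - logits.max())      # numerically stable softmax
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]            # indices by descending probability
    keep = np.ones(len(probs), dtype=bool)
    if top_k is not None:
        keep[order[top_k:]] = False            # drop all but the k most likely
    if top_p is not None:
        cum = np.cumsum(probs[order])
        cut = np.searchsorted(cum, top_p) + 1  # smallest prefix with mass >= p
        keep[order[cut:]] = False
    probs = np.where(keep, probs, 0.0)
    probs /= probs.sum()                       # renormalize the survivors
    return int(rng.choice(len(probs), p=probs))
```

With a sharply peaked distribution, both top_k=1 and a moderate top_p reduce to greedy decoding, since the single most likely token already carries almost all of the mass.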
10.2.4 Beam Search
score(x_{1:t}) = (1/t) Σᵢ log P(xᵢ|x_{<i})
Beam search maximizes probability directly. Preferred for translation and structured outputs; sampling is preferred for open-ended generation.
10.3 The Key-Value Cache
During autoregressive inference, the KV cache stores previously computed K and V tensors, making each generation step O(sd) instead of O(s²d), at the cost of O(2Lsd) memory per sequence — for a 70B-class model at long context lengths, this can reach hundreds of gigabytes in float16. KV cache compression is an active research area critical for long-context inference.
```python
import torch

kv_cache = {}

def generate_step(input_token, kv_cache, model, layer_idx):
    # project the single new token to query/key/value
    q, k, v = model.qkv_proj(model.embed(input_token))
    if layer_idx in kv_cache:
        # append this step's K/V to the cached tensors from earlier steps
        K = torch.cat([kv_cache[layer_idx][0], k], dim=1)
        V = torch.cat([kv_cache[layer_idx][1], v], dim=1)
    else:
        K, V = k, v
    kv_cache[layer_idx] = (K, V)
    # attend from the new query over all cached keys and values
    output = torch.nn.functional.scaled_dot_product_attention(q, K, V)
    return output, kv_cache
```
Reinforcement Learning from Human Feedback
11.1 The Three-Stage Pipeline
RLHF aligns pretrained language models with human preferences through three stages: supervised fine-tuning (SFT), reward model training, and RL optimization.
11.2 The Reward Model
A reward model Rφ: (prompt, completion) → ℝ is trained on pairwise human preference data:
L_RM(φ) = -E[ log σ(R_φ(x,y_w) - R_φ(x,y_l)) ]
11.3 PPO Optimization
The language model πθ is optimized to maximize expected reward while staying close to the reference policy:
max_θ E[ R_φ(x,y) ] - β · D_KL(π_θ ‖ π_ref)
L_PPO = E_t[ min(rₜ Aₜ, clip(rₜ, 1−ε, 1+ε) Aₜ) ], where rₜ = π_θ(yₜ|x) / π_old(yₜ|x) is the probability ratio and Aₜ is the advantage estimate.
The KL term prevents reward hacking — the model diverging to exploit weaknesses in the reward model. PPO's clipped surrogate objective prevents excessively large policy updates that would destabilize training.
11.4 Direct Preference Optimization (DPO)
DPO (Rafailov et al., 2023) bypasses the reward model entirely:
L_DPO(θ) = -E[ log σ(β log π_θ(y_w|x)/π_ref(y_w|x) - β log π_θ(y_l|x)/π_ref(y_l|x)) ]
DPO optimizes the same objective as RLHF under the Bradley–Terry preference model, but is simpler, more stable, and requires no separate reward model. Many modern fine-tuning pipelines use DPO or variants (IPO, KTO) instead of PPO.
Summary: Mathematical Building Blocks
| Concept | Core Equation | Role in AI |
|---|---|---|
| Matrix multiply | y = Wx + b | Every linear layer, attention projection |
| Softmax | exp(z)/Σexp(z) | Token probabilities, attention weights |
| Cross-entropy | H(P,Q) = -Σ P log Q | Training loss for classification/LM |
| Chain rule | ∂z/∂x = (∂z/∂y)(∂y/∂x) | Backpropagation through layers |
| Adam update | θ ← θ − η·m̂/(√v̂ + ε) | Parameter updates during training |
| Dot-product attention | softmax(QKᵀ/√dₖ)V | Core of transformer self-attention |
| Layer norm | (x-μ)/σ · γ + β | Stabilizes activations per sample |
| RoPE | (Rₘq)ᵀ(Rₙk)=f(q,k,m-n) | Relative positional encoding |
| KL divergence | Σ P log(P/Q) | RLHF regularization, VAE objective |
| Perplexity | exp(H(P,Q)) | Language model evaluation metric |
| SVD | W=UΣVᵀ | Weight compression, LoRA fine-tuning |
| PPO objective | E[min(r·A, clip(r)·A)] | RLHF policy optimization |
References
Vaswani, A., et al. (2017). Attention is all you need. NeurIPS 2017.
He, K., et al. (2016). Deep residual learning for image recognition. CVPR 2016.
Ba, J. L., et al. (2016). Layer normalization. arXiv:1607.06450.
Kingma, D. P., & Ba, J. (2015). Adam: A method for stochastic optimization. ICLR 2015.
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation 9(8).
Ioffe, S., & Szegedy, C. (2015). Batch normalization. ICML 2015.
Elhage, N., et al. (2021). A mathematical framework for transformer circuits. Transformer Circuits Thread.
Hoffmann, J., et al. (2022). Training compute-optimal large language models. arXiv:2203.15556.
Su, J., et al. (2021). RoFormer: Enhanced transformer with rotary position embedding. arXiv:2104.09864.
Dao, T., et al. (2022). FlashAttention: Fast and memory-efficient exact attention. NeurIPS 2022.
Rafailov, R., et al. (2023). Direct preference optimization. NeurIPS 2023.
Zhang, B., & Sennrich, R. (2019). Root mean square layer normalization. NeurIPS 2019.
Schulman, J., et al. (2017). Proximal policy optimization algorithms. arXiv:1707.06347.
Ouyang, L., et al. (2022). Training language models to follow instructions with human feedback. NeurIPS 2022.
Kirchenbauer, J., et al. (2023). A watermark for large language models. ICML 2023.
Rajbhandari, S., et al. (2020). ZeRO: Memory optimizations toward training trillion parameter models. SC 2020.