Introduction
Modern AI — specifically large language models, image generators, and classification systems — is built on a surprisingly compact mathematical foundation. Everything from ChatGPT generating a sentence to Stable Diffusion rendering an image reduces, at the lowest level, to the same operations: matrix multiplication, nonlinear functions, and gradient descent. The sophistication is in how these primitives are composed and scaled.
This article develops the mathematics from first principles. We begin with linear algebra and probability theory, build through calculus-based optimization, construct the neural network formalism, derive backpropagation, and culminate in the attention mechanism and transformer architecture that underlies every major language model in 2026. No shortcuts are taken. Every equation is derived or explained from its components.
Prerequisites: comfort with high school algebra, basic calculus notation (derivatives, chain rule). Everything else is built up here.
Linear Algebra: The Language of AI
1.1 Vectors and Vector Spaces
A vector is an ordered list of numbers. In AI, vectors represent everything: a word, an image patch, a user preference, a hidden state. A vector x in n-dimensional real space is written:
x = [x₁, x₂, ..., xₙ]ᵀ ∈ ℝⁿ
The set ℝⁿ is a vector space: it is closed under addition and scalar multiplication, has a zero vector, and satisfies the eight vector space axioms. This algebraic structure is what makes linear algebra applicable to machine learning — neural network layers are linear maps between vector spaces.
1.1.1 The Dot Product and Cosine Similarity
The dot product of two vectors x, y ∈ ℝⁿ is:
x · y = xᵀy = Σᵢ xᵢyᵢ = ‖x‖‖y‖cos(θ)
where θ is the angle between the vectors. The dot product measures alignment: it is maximized when x and y point in the same direction (cos θ = 1), zero when they are perpendicular (cos θ = 0), and minimized when they are antiparallel (cos θ = -1). This geometric interpretation underlies attention mechanisms — the dot product between a query and a key measures how much they "attend" to each other.
Cosine similarity normalizes the dot product by the product of vector magnitudes:
cos_sim(x, y) = (x · y) / (‖x‖ · ‖y‖) ∈ [-1, 1]
Cosine similarity is used extensively in NLP for measuring semantic similarity between word embeddings: vectors representing words with similar meanings cluster together in embedding space and have cosine similarity close to 1.
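The definitions above are easy to check numerically. A minimal NumPy sketch (vectors chosen arbitrarily for illustration):

```python
import numpy as np

def cosine_similarity(x, y):
    # dot product normalized by the product of L2 norms
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

a = np.array([1.0, 2.0, 3.0])
parallel = cosine_similarity(a, 2 * a)      # same direction -> 1
orthogonal = cosine_similarity(np.array([1.0, 0.0]),
                               np.array([0.0, 1.0]))  # perpendicular -> 0
opposite = cosine_similarity(a, -a)         # antiparallel -> -1
```

Note that scaling a vector leaves its cosine similarity unchanged — only direction matters.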
1.2 Matrices and Linear Transformations
A matrix W ∈ ℝᵐˣⁿ represents a linear transformation from ℝⁿ to ℝᵐ. Matrix-vector multiplication:
y = Wx, where yᵢ = Σⱼ Wᵢⱼ xⱼ
This is the core operation of a neural network layer. Each row Wᵢ· is a weight vector; the output yᵢ is the dot product of the i-th row with the input x. The matrix W encodes what the layer has learned: which linear combinations of input features to compute.
1.2.1 Matrix Multiplication
For matrices A ∈ ℝᵐˣⁿ and B ∈ ℝⁿˣᵖ, their product C = AB ∈ ℝᵐˣᵖ has entries:
Cᵢⱼ = Σₖ Aᵢₖ Bₖⱼ = Aᵢ· · B·ⱼ
Matrix multiplication is not commutative (AB ≠ BA in general) but is associative: A(BC) = (AB)C. The computational cost of multiplying an m×n matrix by an n×p matrix is O(mnp) — the dominant cost in large neural networks. For a transformer with hidden dimension d = 4096 processing a sequence of 2048 tokens, a single matrix multiply costs O(2048 × 4096 × 4096) ≈ 34 billion operations.
1.2.2 Transpose and Symmetric Matrices
The transpose Wᵀ of a matrix W flips rows and columns: (Wᵀ)ᵢⱼ = Wⱼᵢ. Key properties: (AB)ᵀ = BᵀAᵀ, (Wᵀ)ᵀ = W. A symmetric matrix satisfies W = Wᵀ.
The attention weight matrix QKᵀ/√dₖ in transformers is computed as a matrix product involving a transpose — the query matrix Q ∈ ℝˢˣᵈ multiplied by the transposed key matrix Kᵀ ∈ ℝᵈˣˢ, producing an s×s attention weight matrix, where s is the sequence length.
1.3 Eigenvalues, Eigenvectors, and SVD
A vector v is an eigenvector of matrix A if:
Av = λv
where λ is the corresponding eigenvalue. Eigenvectors are directions that matrix A scales without rotating. The set of all eigenvalues is the spectrum of A; it characterizes the matrix's behavior: large eigenvalues amplify their corresponding directions, small eigenvalues attenuate them.
The Singular Value Decomposition (SVD) generalizes eigendecomposition to non-square matrices:
W = U Σ Vᵀ
where U ∈ ℝᵐˣᵐ and V ∈ ℝⁿˣⁿ are orthogonal matrices (rotations/reflections), and Σ ∈ ℝᵐˣⁿ is diagonal with non-negative entries σ₁ ≥ σ₂ ≥ ... ≥ σᵣ (the singular values). SVD is used in weight compression: the truncated decomposition Wₖ = Uₖ Σₖ Vₖᵀ, keeping only the top k singular values, is the best rank-k approximation to W in Frobenius norm (the Eckart–Young theorem) — the low-rank principle that motivates LoRA (Low-Rank Adaptation) fine-tuning.
1.4 Norms and Regularization
The L2 norm (Euclidean norm) of a vector x ∈ ℝⁿ:
‖x‖₂ = √(Σᵢ xᵢ²) = √(xᵀx)
The Frobenius norm of a matrix W ∈ ℝᵐˣⁿ:
‖W‖_F = √(Σᵢ Σⱼ Wᵢⱼ²) = √(tr(WᵀW))
Norms quantify the "size" of vectors and matrices. In optimization, adding a norm penalty to the loss function — regularization — penalizes large weights and prevents overfitting. L2 regularization (weight decay) adds λ‖W‖²F to the loss, causing weights to decay toward zero during training.
Probability Theory & Information Theory
2.1 Probability Distributions
A probability distribution P over a discrete sample space X assigns P(x) ≥ 0 to each outcome x ∈ X, with Σₓ P(x) = 1. Language modeling is precisely this: estimating a distribution over token sequences.
For continuous variables, the probability density function (PDF) f(x) satisfies f(x) ≥ 0 and ∫f(x)dx = 1. The probability of a range is P(a ≤ X ≤ b) = ∫ₐᵇ f(x)dx.
2.1.1 The Softmax Distribution
The softmax function converts a vector of real-valued logits z ∈ ℝⁿ into a probability distribution:
softmax(z)ᵢ = exp(zᵢ) / Σⱼ exp(zⱼ)
Properties: all outputs are positive, they sum to 1 (valid probability distribution), and large logits map to probabilities close to 1 while small logits map to near 0. Temperature scaling modifies this: softmax(z/T) — at T → 0 it approaches argmax (one-hot), at T → ∞ it approaches uniform. Language models sample from this distribution at each generation step.
The numerical stability issue: exp(z) overflows for large z. The standard fix — subtract the maximum before exponentiation:
softmax(z)ᵢ = exp(zᵢ - max(z)) / Σⱼ exp(zⱼ - max(z))
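The stable form and the temperature behavior can both be demonstrated in a few lines of NumPy — a sketch, with logits chosen large enough that the naive form would overflow:

```python
import numpy as np

def softmax(z, T=1.0):
    # subtract the max before exponentiating for numerical stability;
    # T is the sampling temperature
    z = np.asarray(z, dtype=np.float64) / T
    e = np.exp(z - z.max())
    return e / e.sum()

p = softmax([1000.0, 1001.0, 1002.0])   # naive exp(1000) would overflow
flat = softmax([1.0, 2.0, 3.0], T=10.0)   # high T -> near uniform
sharp = softmax([1.0, 2.0, 3.0], T=0.5)   # low T -> near one-hot
```

The max subtraction changes nothing mathematically — it multiplies numerator and denominator by the same constant exp(-max(z)).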
2.2 Information Theory
2.2.1 Shannon Entropy
The entropy H(P) of a distribution P measures its uncertainty — the average information content of an outcome:
H(P) = -Σₓ P(x) log₂ P(x) [bits]
Entropy is maximized when P is uniform (maximum uncertainty) and minimized (= 0) when P is deterministic. A fair coin has H = 1 bit; a six-sided die has H = log₂(6) ≈ 2.58 bits. Language entropy measures how many bits per token are needed to encode a language.
2.2.2 Cross-Entropy and KL Divergence
The cross-entropy H(P, Q) measures the expected code length when using distribution Q to encode samples from the true distribution P:
H(P, Q) = -Σₓ P(x) log Q(x)
This is the standard training loss for classification and language modeling: P is the true (one-hot) label distribution, Q is the model's predicted distribution. Minimizing cross-entropy is equivalent to maximizing the log-likelihood of the data under the model.
The KL divergence DKL(P‖Q) measures how much Q differs from P:
D_KL(P‖Q) = Σₓ P(x) log(P(x)/Q(x)) = H(P,Q) - H(P) ≥ 0
KL divergence is zero if and only if P = Q almost everywhere, and positive otherwise (Gibbs' inequality). It is not symmetric: DKL(P‖Q) ≠ DKL(Q‖P). It appears in RLHF training: the KL penalty term that prevents the fine-tuned model from deviating too far from the base model is DKL(πθ ‖ πref).
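The identities H(P,Q) = H(P) + D_KL(P‖Q) and the asymmetry of KL can be verified directly. A small NumPy sketch (distributions chosen arbitrarily, natural log for simplicity):

```python
import numpy as np

def cross_entropy(p, q):
    # expected code length (in nats) using q to encode samples from p
    return float(-(p * np.log(q)).sum())

def entropy(p):
    return cross_entropy(p, p)

def kl(p, q):
    # D_KL(P||Q) = H(P,Q) - H(P) >= 0, zero iff P == Q
    return cross_entropy(p, q) - entropy(p)

p = np.array([0.5, 0.5])
q = np.array([0.9, 0.1])
```

Evaluating kl(p, q) and kl(q, p) on these distributions gives different values, illustrating the asymmetry.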
2.2.3 Perplexity
Perplexity is the exponential of the cross-entropy loss — the standard metric for language model evaluation:
PPL(P, Q) = exp(H(P, Q)) = exp(-1/N Σₙ log Q(tₙ | t₁,...,tₙ₋₁))
Intuitively, perplexity is the effective vocabulary size the model is choosing from at each step. A perplexity of 20 means the model is, on average, as uncertain as if it were choosing uniformly from 20 equally likely options. State-of-the-art language models achieve perplexity of 3–8 on standard benchmarks.
2.3 Bayes' Theorem
Bayes' theorem is the foundation of probabilistic machine learning:
P(θ|D) = P(D|θ) · P(θ) / P(D)
where θ are model parameters, D is observed data, P(θ|D) is the posterior, P(D|θ) is the likelihood, P(θ) is the prior, and P(D) = ∫P(D|θ)P(θ)dθ is the marginal likelihood (evidence). Maximum Likelihood Estimation (MLE) maximizes P(D|θ), ignoring the prior. Maximum A Posteriori (MAP) estimation maximizes P(θ|D), equivalent to MLE with regularization proportional to -log P(θ).
Modern deep learning typically uses MLE/MAP rather than full Bayesian inference, because computing the posterior over billions of parameters is computationally intractable. Uncertainty estimation in LLMs remains an open research problem.
Calculus and Optimization
3.1 Derivatives and the Gradient
The derivative of a scalar function f: ℝ → ℝ at point x measures the instantaneous rate of change:
f'(x) = df/dx = lim_{h→0} [f(x+h) - f(x)] / h
For a function f: ℝⁿ → ℝ (as in a loss function), the gradient ∇f ∈ ℝⁿ is the vector of partial derivatives:
∇f(x) = [∂f/∂x₁, ∂f/∂x₂, ..., ∂f/∂xₙ]ᵀ
The gradient points in the direction of steepest ascent. Moving in the negative gradient direction therefore descends the loss function — the basis of gradient descent.
3.1.1 The Jacobian and Hessian
For a vector-valued function f: ℝⁿ → ℝᵐ, the Jacobian J ∈ ℝᵐˣⁿ contains all first-order partial derivatives: Jᵢⱼ = ∂fᵢ/∂xⱼ. The Jacobian of a layer's output with respect to its input is central to backpropagation.
The Hessian H ∈ ℝⁿˣⁿ of a scalar function contains all second-order partial derivatives: Hᵢⱼ = ∂²f/(∂xᵢ ∂xⱼ). The Hessian characterizes the curvature of the loss landscape: a positive definite Hessian implies a local minimum, negative definite a local maximum, and indefinite a saddle point.
3.1.2 The Chain Rule
The chain rule is the mathematical foundation of backpropagation. For composite functions g(f(x)):
d/dx g(f(x)) = g'(f(x)) · f'(x)
For vector-valued functions, the multivariate chain rule uses Jacobians:
∂z/∂x = (∂z/∂y) · (∂y/∂x) = J_g · J_f
A neural network is a composition of functions: L(θ) = loss(f_L(... f₂(f₁(x; W₁); W₂) ...; W_L)). The chain rule applied recursively through this composition computes ∂L/∂Wₗ for every layer l — this recursive application is backpropagation.
3.2 Gradient Descent and Its Variants
3.2.1 Stochastic Gradient Descent
Vanilla gradient descent updates parameters θ in the negative gradient direction:
θ_{t+1} = θ_t - η · ∇_θ L(θ_t)
where η is the learning rate. Computing the true gradient over the full dataset is expensive — O(N) forward passes per step. Stochastic Gradient Descent (SGD) approximates it using a mini-batch B ⊂ D of size b:
θ_{t+1} = θ_t - η · (1/b) Σ_{x∈B} ∇_θ ℓ(f(x; θ), y)
The mini-batch gradient is an unbiased estimator of the full gradient: E[gB] = ∇L. Its variance decreases with batch size b: Var(gB) = σ²/b.
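The SGD update is short enough to implement end to end. A minimal NumPy sketch on a toy least-squares problem (data, learning rate, and batch size are illustrative choices, not recommendations):

```python
import numpy as np

rng = np.random.default_rng(0)
# toy linear regression: y = X w_true + small noise
w_true = np.array([2.0, -3.0])
X = rng.normal(size=(1000, 2))
y = X @ w_true + 0.01 * rng.normal(size=1000)

w = np.zeros(2)
eta, b = 0.1, 32                            # learning rate, batch size
for step in range(500):
    idx = rng.integers(0, len(X), size=b)   # sample a mini-batch
    Xb, yb = X[idx], y[idx]
    grad = 2 / b * Xb.T @ (Xb @ w - yb)     # gradient of mean squared error
    w -= eta * grad                         # SGD update
```

Each step uses only 32 of the 1000 examples, yet the iterates converge to w_true — the noisy mini-batch gradient is unbiased, so the noise averages out over steps.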
3.2.2 Momentum
Momentum accumulates a velocity vector v in the direction of persistent gradient components:
v_{t+1} = β · v_t + (1-β) · ∇_θ L(θ_t)
θ_{t+1} = θ_t - η · v_{t+1}
With β ∈ [0.9, 0.99], momentum effectively averages gradients over 1/(1-β) steps — 10 steps at β=0.9, 100 steps at β=0.99. This accelerates convergence along flat dimensions and dampens oscillation along steep dimensions.
3.2.3 Adam: Adaptive Moment Estimation
Adam (Kingma & Ba, 2015) maintains per-parameter adaptive learning rates by tracking both the first moment (mean) and second moment (uncentered variance) of gradients:
m_t = β₁ · m_{t-1} + (1-β₁) · g_t
v_t = β₂ · v_{t-1} + (1-β₂) · g_t²
m̂_t = m_t / (1-β₁ᵗ), v̂_t = v_t / (1-β₂ᵗ)
θ_{t+1} = θ_t - η · m̂_t / (√v̂_t + ε)
Typical hyperparameters: β₁ = 0.9, β₂ = 0.999, ε = 10⁻⁸, η = 3×10⁻⁴. The effective learning rate per parameter is η/√v̂_t — parameters with high gradient variance get smaller effective steps; parameters with low gradient variance get larger steps.
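The four update equations translate directly into code. A self-contained NumPy sketch, applied to the toy objective f(x) = x² (the learning rate here is raised above the typical value purely to speed up the toy example):

```python
import numpy as np

def adam_step(theta, g, m, v, t, eta=3e-4, b1=0.9, b2=0.999, eps=1e-8):
    # one Adam update; t is the 1-indexed step count (for bias correction)
    m = b1 * m + (1 - b1) * g            # first moment (EMA of gradients)
    v = b2 * v + (1 - b2) * g**2         # second moment (EMA of squared gradients)
    m_hat = m / (1 - b1**t)              # bias correction for zero initialization
    v_hat = v / (1 - b2**t)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# minimize f(x) = x^2 starting from x = 1
theta, m, v = np.array([1.0]), np.zeros(1), np.zeros(1)
for t in range(1, 5001):
    g = 2 * theta                        # gradient of x^2
    theta, m, v = adam_step(theta, g, m, v, t, eta=0.01)
```

Note the bias correction: because m and v start at zero, the raw EMAs underestimate the moments early in training; dividing by (1−βᵗ) corrects this.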
3.2.4 Learning Rate Scheduling
The learning rate η is not typically held constant. Common schedules include linear warmup + cosine decay (used in most transformer pretraining), cosine annealing with restarts, and polynomial decay. The warmup phase is critical for large models: at initialization, gradient estimates are noisy and a large initial learning rate can destabilize training.
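Linear warmup followed by cosine decay is simple to express in closed form. A sketch, with step counts and learning-rate values chosen only for illustration:

```python
import math

def lr_schedule(step, warmup_steps=2000, total_steps=100_000,
                peak_lr=3e-4, min_lr=3e-5):
    # linear warmup to peak_lr, then cosine decay down to min_lr
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

The schedule ramps linearly from 0 to the peak over the warmup window, then follows a half-cosine from the peak down to the floor over the remaining steps.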
Neural Network Architecture
4.1 The Feedforward Layer
A single neural network layer computes a linear transformation followed by a nonlinear activation function σ:
h = σ(Wx + b)
where W ∈ ℝᵐˣⁿ is the weight matrix, b ∈ ℝᵐ is the bias vector, x ∈ ℝⁿ is the input, and h ∈ ℝᵐ is the hidden state. Without σ, stacking layers would be equivalent to a single linear transformation. The activation function breaks this linearity, making deep networks capable of representing exponentially more complex functions.
4.2 Activation Functions
4.2.1 ReLU and Variants
ReLU(x) = max(0, x) = x · 𝟙[x > 0]
Properties: computationally trivial, gradient is either 0 (x < 0) or 1 (x > 0), produces sparse activations, does not saturate for positive values. Problem: "dying ReLU" — neurons with x < 0 have zero gradient and cannot recover.
GELU (Gaussian Error Linear Unit), used in GPT and BERT:
GELU(x) = x · Φ(x) = x · P(Z ≤ x) where Z ~ N(0,1)
SwiGLU, used in LLaMA and PaLM, is a gated variant:
SwiGLU(x, W, V) = Swish(xW) ⊙ (xV), where Swish(u) = u · σ(u)
4.2.2 Sigmoid and Tanh
σ(x) = 1/(1 + e⁻ˣ) ∈ (0,1)
tanh(x) = (eˣ - e⁻ˣ)/(eˣ + e⁻ˣ) ∈ (-1,1)
Both suffer from vanishing gradients in deep networks: for |x| >> 0, the gradient σ'(x) = σ(x)(1-σ(x)) → 0, causing gradients to vanish exponentially as they propagate backward. This was the primary obstacle to training deep networks before ReLU and residual connections.
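The exponential decay is easy to quantify. A quick NumPy sketch (the input value 5.0 and the depth of 10 are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1 - s)            # maximum is 0.25, attained at x = 0

# gradient factor contributed by one saturated sigmoid unit
per_unit = sigmoid_grad(5.0)      # ~0.0066 when |x| >> 0
# after backpropagating through 10 such units, the signal shrinks as a product
after_10_layers = per_unit ** 10
```

Even at the best case (x = 0), each sigmoid multiplies the gradient by at most 0.25, so a 10-layer chain attenuates it by at least 0.25¹⁰ ≈ 10⁻⁶; saturated units are far worse.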
4.3 Multilayer Perceptrons (MLPs)
An L-layer MLP computes:
h₀ = x
hₗ = σ(Wₗ hₗ₋₁ + bₗ) for l = 1,...,L-1
ŷ = softmax(W_L h_{L-1} + b_L)
The Universal Approximation Theorem (Hornik, 1989) states that a single hidden layer MLP with sufficient width can approximate any continuous function on a compact domain to arbitrary precision. However, deep networks are exponentially more parameter-efficient than shallow ones for many function classes.
4.4 Backpropagation
Backpropagation computes ∂L/∂W for every weight matrix W in a neural network, using the chain rule applied backward through the computational graph.
Define the error signals (upstream gradients):
δₗ = ∂L/∂zₗ where zₗ = Wₗhₗ₋₁ + bₗ
The backpropagation recurrence:
δ_L = ∂L/∂z_L
δₗ = (Wₗ₊₁ᵀ δₗ₊₁) ⊙ σ'(zₗ)
∂L/∂Wₗ = δₗ hₗ₋₁ᵀ
∂L/∂bₗ = δₗ
```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def relu_grad(z):
    return (z > 0).astype(z.dtype)   # σ'(z) for ReLU: 1 where z > 0, else 0

# Forward pass: cache activations and pre-activations for the backward pass
def forward(x, layers):
    activations = [x]
    pre_activations = []
    for W, b in layers:
        z = W @ activations[-1] + b
        pre_activations.append(z)
        activations.append(relu(z))
    return activations, pre_activations

# Backward pass: apply the recurrence δₗ = (Wₗ₊₁ᵀ δₗ₊₁) ⊙ σ'(zₗ)
def backward(loss_grad, activations, pre_acts, layers):
    delta = loss_grad
    grads = []
    for l in reversed(range(len(layers))):
        W, b = layers[l]
        dW = delta @ activations[l].T    # ∂L/∂W = δₗ hₗ₋₁ᵀ
        db = delta.sum(axis=1)           # ∂L/∂b = δₗ (summed over the batch)
        grads.insert(0, (dW, db))
        if l > 0:
            delta = (W.T @ delta) * relu_grad(pre_acts[l - 1])
    return grads
```
Why Backpropagation Is Efficient
A naive approach would perturb each weight by ε and measure (L(W + εeᵢⱼ) − L(W))/ε. For n parameters, this requires n forward passes — O(n²) total work. Backpropagation computes all n gradients in a single backward pass — O(n) total — by reusing intermediate computations via dynamic programming.
4.5 The Vanishing and Exploding Gradient Problem
The error signal δₗ involves a product of L−l Jacobian matrices. If the spectral radius ρ(J) < 1, gradients vanish exponentially; if ρ(J) > 1, they explode. Solutions include residual connections (h = F(x) + x), gradient clipping (rescale g when ‖g‖ > γ), layer normalization, and careful initialization (He initialization: W ~ N(0, 2/nᵢₙ)).
Normalization Layers
5.1 Batch Normalization
μ_B = (1/m) Σᵢ xᵢ, σ²_B = (1/m) Σᵢ (xᵢ - μ_B)²
x̂ᵢ = (xᵢ - μ_B) / √(σ²_B + ε)
yᵢ = γ · x̂ᵢ + β
γ and β are learned parameters that allow the network to undo the normalization if optimal. BatchNorm is problematic for language models because sequences have variable length and batch statistics are unstable for small batches.
5.2 Layer Normalization
μₓ = (1/H) Σⱼ xⱼ, σ²ₓ = (1/H) Σⱼ (xⱼ - μₓ)²
x̂ⱼ = (xⱼ - μₓ) / √(σ²ₓ + ε), yⱼ = γⱼ x̂ⱼ + βⱼ
Layer norm statistics are computed per example, not per batch — they are independent of batch size and behave identically at training and inference time. This makes LayerNorm the standard normalization in transformers. In Pre-LN transformers (GPT-2 and later), LayerNorm is applied before the attention and MLP sublayers, producing more stable training dynamics at scale.
5.3 RMSNorm
RMSNorm(x) = x / RMS(x) · γ, where RMS(x) = √(1/H Σⱼ xⱼ²)
RMSNorm is ~10% faster than LayerNorm and produces comparable results. It is used in LLaMA, Mistral, and Gemma architectures. The intuition: the re-centering step in LayerNorm has minimal effect on model quality; the dominant benefit comes from the re-scaling that controls activation magnitude.
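The only difference between the two normalizations is the re-centering step. A side-by-side NumPy sketch (the input vector is arbitrary):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # normalize each example to zero mean and unit variance, then rescale
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def rms_norm(x, gamma, eps=1e-5):
    # no re-centering: only rescale by the root-mean-square of activations
    rms = np.sqrt((x**2).mean(axis=-1, keepdims=True) + eps)
    return gamma * x / rms

x = np.array([[1.0, 2.0, 3.0, 4.0]])
ln = layer_norm(x, np.ones(4), np.zeros(4))
rn = rms_norm(x, np.ones(4))
```

LayerNorm output has zero mean and unit variance per example; RMSNorm output has unit RMS but keeps the input's mean direction.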
Embeddings & Positional Encoding
6.1 Token Embeddings
A token embedding matrix E ∈ ℝ^(|V|×d) maps each discrete token in a vocabulary V to a dense d-dimensional vector:
e_t = E[t] ∈ ℝᵈ
The embedding dimension d is typically 512–4096 for modern models. The embedding matrix is learned jointly with the rest of the model — nearby tokens in embedding space have similar contextual usage patterns. The Word2Vec discovery that arithmetic in embedding space captures semantic relationships (king - man + woman ≈ queen) is an emergent property of training on large corpora.
Weight tying: in many models, the output projection matrix Wₒᵤₜ is shared with Eᵀ, reducing the parameter count by |V|×d with no empirical degradation.
6.2 Positional Encoding
Transformers have no intrinsic ordering — attention is permutation-equivariant. Positional encoding injects position information.
6.2.1 Sinusoidal Positional Encoding
PE(pos, 2i) = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))
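The two formulas vectorize naturally. A NumPy sketch (the sequence length and dimension below are arbitrary):

```python
import numpy as np

def sinusoidal_pe(max_pos, d):
    # even dims get sin, odd dims get cos, with geometrically spaced frequencies
    pos = np.arange(max_pos)[:, None]        # (max_pos, 1)
    i = np.arange(d // 2)[None, :]           # (1, d/2)
    angles = pos / 10000 ** (2 * i / d)      # (max_pos, d/2)
    pe = np.zeros((max_pos, d))
    pe[:, 0::2] = np.sin(angles)             # even indices
    pe[:, 1::2] = np.cos(angles)             # odd indices
    return pe

pe = sinusoidal_pe(128, 64)
```

Position 0 encodes to sin(0) = 0 on even dimensions and cos(0) = 1 on odd ones, and every entry stays in [−1, 1], so the encoding can simply be added to the token embeddings.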
6.2.2 Rotary Position Embeddings (RoPE)
RoPE (Su et al., 2021), used in LLaMA, Mistral, and GPT-NeoX, encodes position by rotating query and key vectors in 2D subspaces:
q_m = R_m q, k_n = R_n k
(R_m q)ᵀ(R_n k) = qᵀ R_{m-n}ᵀ k = f(q, k, m-n)
The dot product depends only on the relative position m−n, not on absolute positions. This relative-position property makes RoPE better suited than sinusoidal encoding for tasks sensitive to relative position and underlies context-length extension techniques.
The Attention Mechanism
7.1 Scaled Dot-Product Attention
The attention mechanism computes weighted sums of values, where weights are determined by query-key similarity:
Attention(Q, K, V) = softmax(QKᵀ / √dₖ) · V
where Q ∈ ℝˢˣᵈᵏ (queries), K ∈ ℝˢˣᵈᵏ (keys), V ∈ ℝˢˣᵈᵥ (values). The scaling factor 1/√dₖ prevents the dot products from growing too large — without it, softmax saturates into near-one-hot distributions, causing vanishing gradients.
The attention weight matrix A = softmax(QKᵀ/√dₖ) ∈ ℝˢˣˢ has entry Aᵢⱼ representing how much position i attends to position j. The output is a weighted sum of values: outputᵢ = Σⱼ Aᵢⱼ Vⱼ. This allows each position to aggregate information from any other position in the sequence — the defining capability of transformers.
7.2 Multi-Head Attention
Multi-head attention runs h parallel attention operations in lower-dimensional subspaces:
headᵢ = Attention(QWᵢ^Q, KWᵢ^K, VWᵢ^V)
MultiHead(Q,K,V) = Concat(head₁,...,headₕ) Wᴼ
Different heads learn different attention patterns: some attend to recent positions (local context), others to syntactic structure, others to semantic relationships. With h=12 heads and d=768 (BERT-base), each head operates in a 64-dimensional subspace.
7.3 Causal (Masked) Attention
A = softmax((QKᵀ/√dₖ) + M), M_ij = 0 if j≤i, -∞ if j>i
Adding -∞ to masked positions before softmax sets exp(-∞) = 0, zeroing those attention weights. The resulting attention is lower-triangular: position i attends to positions 1,...,i but not i+1,...,s.
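Scaling, masking, and softmax combine into a few lines. A single-head NumPy sketch (random Q, K, V; a batched, multi-head version adds only bookkeeping):

```python
import numpy as np

def causal_attention(Q, K, V):
    # scaled dot-product attention with a lower-triangular (causal) mask
    s, dk = Q.shape
    scores = Q @ K.T / np.sqrt(dk)                     # (s, s) similarity logits
    mask = np.triu(np.ones((s, s), dtype=bool), k=1)   # True above the diagonal
    scores = np.where(mask, -np.inf, scores)           # block future positions
    scores = scores - scores.max(axis=-1, keepdims=True)  # stable softmax
    weights = np.exp(scores)                           # exp(-inf) = 0
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(5, 8)) for _ in range(3))
out, A = causal_attention(Q, K, V)
```

Position 0 can only attend to itself, so its attention row is exactly [1, 0, ..., 0]; every row of A is a valid probability distribution over positions 1..i.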
7.4 Computational Complexity
Standard attention is O(s²d) in time and O(s²) in memory due to the s×s attention matrix. Efficient variants include FlashAttention (tiling for 2-4× speedup, exact), sparse attention (O(s·w) for window size w), linear attention (O(s·d²)), and Multi-Query Attention (shares K and V across heads, reducing KV cache size by factor h).
The Transformer Architecture
8.1 The Transformer Block
A transformer layer consists of two sublayers: multi-head self-attention and a position-wise MLP, each with residual connection and layer normalization:
x' = x + MultiHead(LN(x), LN(x), LN(x))
x'' = x' + MLP(LN(x'))
The MLP sublayer:
MLP(x) = W₂ · GELU(W₁x + b₁) + b₂
where W₁ ∈ ℝ^(4d×d) and W₂ ∈ ℝ^(d×4d). For d=4096, this MLP has ~134 million parameters per layer.
8.2 Parameter Count
| Component | Parameters | Notes |
|---|---|---|
| Token embedding | d × |V| | Weight-tied with output |
| Positional embedding | d × s_max | Absent in RoPE models |
| Attention (per layer) | 4d² | Q,K,V,O projections |
| MLP (per layer) | 8d² | SwiGLU: 12d² |
| LayerNorm (per layer) | 4d | Negligible |
| Total (L layers) | ≈ 12Ld² | For large d |
For GPT-3 (L=96, d=12288): 12 × 96 × 12288² ≈ 175 billion parameters.
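The table's ≈12Ld² rule is a one-line calculation. A sketch (the round 50,000-token vocabulary is an assumption for illustration; GPT-3's actual vocabulary is slightly larger):

```python
def transformer_params(L, d, vocab=50_000):
    # per layer: 4d^2 (attention Q,K,V,O projections) + 8d^2 (MLP with 4d hidden)
    per_layer = 4 * d**2 + 8 * d**2
    embed = vocab * d          # embedding matrix (tied with the output projection)
    return L * per_layer + embed

gpt3_approx = transformer_params(L=96, d=12288)   # ≈ 175 billion
```

The 12Ld² term dominates: the embedding matrix contributes well under 1% of GPT-3's parameters.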
8.3 The Residual Stream
The residual connection pattern creates the residual stream — a d-dimensional vector flowing through the entire network, additively modified by each layer:
x₀ = Embed(tokens) + PE
x'ₗ = xₗ₋₁ + Attnₗ(xₗ₋₁),  xₗ = x'ₗ + MLPₗ(x'ₗ)
logits = Unembed(LN(x_L))
The mechanistic interpretability view treats the residual stream as a communication channel: attention heads and MLP layers read from it and write additive updates, enabling interpretability research into information flow.
Training Dynamics at Scale
9.1 The Loss Landscape
The loss function of a neural network, L: ℝⁿ → ℝ with n ~ 10⁹–10¹² parameters, is highly non-convex, and optimizing it is intractable in the worst case. Yet large overparameterized networks are relatively easy to train in practice. Contributing factors include overparameterization (most local minima are close to the global optimum in loss), the NTK regime (infinite-width networks train like kernel regression), and implicit regularization (SGD tends to find flat minima that generalize well).
9.2 The Chinchilla Scaling Laws
Hoffmann et al. (2022) derived empirical scaling laws relating compute budget C, model size N, and dataset size D:
L(N, D) = E + A/Nᵅ + B/Dᵝ
where E ≈ 1.69 (the irreducible entropy of text), α ≈ 0.34, and β ≈ 0.28. The key finding: for fixed compute C ≈ 6ND, the optimal trade-off is N_opt ∝ C^0.5 and D_opt ∝ C^0.5 — model size and dataset size should be scaled in equal proportion. This implies GPT-3 (175B parameters, 300B tokens) was significantly undertrained.
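The equal-scaling result is often summarized as a tokens-per-parameter rule of thumb of roughly 20, which follows from Chinchilla's own configuration (70B parameters, 1.4T tokens). A sketch using that illustrative constant, not the paper's fitted coefficients:

```python
def chinchilla_optimal(C, tokens_per_param=20):
    # compute-optimal split under C = 6 N D with D/N held at a fixed ratio,
    # so N_opt and D_opt both scale as C^0.5
    N = (C / (6 * tokens_per_param)) ** 0.5   # parameters
    D = tokens_per_param * N                  # training tokens
    return N, D

# Chinchilla's approximate budget: 6 * 70e9 * 1.4e12 ≈ 5.88e23 FLOPs
N, D = chinchilla_optimal(5.88e23)
```

Doubling the compute budget under this rule increases both N and D by √2, rather than putting all the extra compute into a bigger model.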
9.3 Mixed Precision Training
Modern models train in mixed precision: forward pass in float16 for memory efficiency, weight updates in float32 for numerical stability.
```python
import torch

scaler = torch.cuda.amp.GradScaler()   # loss scaling for fp16 gradients

optimizer.zero_grad()
# Forward pass in fp16
with torch.autocast(device_type='cuda', dtype=torch.float16):
    logits = model(input_ids)
    loss = cross_entropy(logits, targets)
# Backward pass: gradients in fp16, scaled up to avoid underflow
scaler.scale(loss).backward()
# Gradient unscaling + clipping in fp32
scaler.unscale_(optimizer)
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
# Weight update in fp32 (master weights)
scaler.step(optimizer)
scaler.update()
```
BFloat16 is preferred over float16: same 8-bit exponent as float32 (matching dynamic range) but only 7 mantissa bits, avoiding overflow/underflow without loss scaling.
9.4 Distributed Training
Training a 70B model requires ~140 GB GPU memory for weights alone. Strategies include data parallelism (replicate model, sync gradients), tensor parallelism (split weight matrices across GPUs), pipeline parallelism (assign layers to different GPUs), and ZeRO (partition optimizer states, gradients, and weights across workers, enabling arbitrarily large models).
Language Modeling & Generation
10.1 The Autoregressive Objective
L(θ) = -E_{x~D}[ Σₜ log P_θ(xₜ | x_{<t}) ]
At each position t, the model predicts a distribution over the full vocabulary V given all preceding tokens. This simple next-token prediction objective, applied at enormous scale, is sufficient to learn rich representations of language, factual knowledge, and reasoning.
10.2 Decoding Strategies
10.2.1 Temperature Scaling
P(tᵢ) = exp(zᵢ/T) / Σⱼ exp(zⱼ/T)
T < 1 sharpens (more deterministic), T > 1 flattens (more random). T = 0 reduces to argmax (greedy decoding).
10.2.2 Top-k Sampling
P'(tᵢ) = P(tᵢ) / Σ_{j∈top-k} P(tⱼ) for i ∈ top-k
10.2.3 Nucleus (Top-p) Sampling
V_p is the smallest set of tokens (taken in order of decreasing probability) whose cumulative probability reaches p: Σ_{v∈V_p} P(v) ≥ p. The surviving probabilities are renormalized: P'(tᵢ) = P(tᵢ) / Σ_{v∈V_p} P(v).
Top-p is adaptive: more tokens when uncertain, fewer when confident. Typical settings: temperature 0.7–1.0, top-p 0.9–0.95.
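Temperature, top-k, and top-p compose into a single sampling function. A NumPy sketch (the function name and its composition order are illustrative choices; production samplers differ in details):

```python
import numpy as np

def sample_token(logits, T=1.0, top_k=None, top_p=None, rng=None):
    # temperature -> top-k filter -> top-p (nucleus) filter -> sample
    rng = rng if rng is not None else np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64) / T
    probs = np.exp(logits - logits.max())      # numerically stable softmax
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]            # indices by descending probability
    keep = np.ones(len(probs), dtype=bool)
    if top_k is not None:
        keep[order[top_k:]] = False            # drop all but the k most likely
    if top_p is not None:
        cum = np.cumsum(probs[order])
        cut = np.searchsorted(cum, top_p) + 1  # smallest prefix with mass >= p
        keep[order[cut:]] = False
    probs = np.where(keep, probs, 0.0)
    probs /= probs.sum()                       # renormalize the survivors
    return int(rng.choice(len(probs), p=probs))
```

With a sharply peaked distribution, both top_k=1 and a moderate top_p reduce to greedy decoding, since the single most likely token already carries almost all of the mass.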
10.2.4 Beam Search
score(x_{1:t}) = (1/t) Σᵢ log P(xᵢ|x_{<i})
Beam search maximizes probability directly. Preferred for translation and structured outputs; sampling is preferred for open-ended generation.
10.3 The Key-Value Cache
During autoregressive inference, the KV cache stores previously computed K and V tensors, making each generation step O(sd) instead of O(s²d), at the cost of O(2Lsd) memory per sequence — for a 70B-class model at long context lengths, this can reach hundreds of gigabytes in float16. KV cache compression is an active research area critical for long-context inference.
```python
import torch

kv_cache = {}

def generate_step(input_token, kv_cache, model, layer_idx):
    # project the single new token to query/key/value
    q, k, v = model.qkv_proj(model.embed(input_token))
    if layer_idx in kv_cache:
        # append this step's K/V to the cached tensors from earlier steps
        K = torch.cat([kv_cache[layer_idx][0], k], dim=1)
        V = torch.cat([kv_cache[layer_idx][1], v], dim=1)
    else:
        K, V = k, v
    kv_cache[layer_idx] = (K, V)
    # attend from the new query over all cached keys and values
    output = torch.nn.functional.scaled_dot_product_attention(q, K, V)
    return output, kv_cache
```
Reinforcement Learning from Human Feedback
11.1 The Three-Stage Pipeline
RLHF aligns pretrained language models with human preferences through three stages: supervised fine-tuning (SFT), reward model training, and RL optimization.
11.2 The Reward Model
A reward model Rφ: (prompt, completion) → ℝ is trained on pairwise human preference data:
L_RM(φ) = -E[ log σ(R_φ(x,y_w) - R_φ(x,y_l)) ]
11.3 PPO Optimization
The language model πθ is optimized to maximize expected reward while staying close to the reference policy:
max_θ E[ R_φ(x,y) ] - β · D_KL(π_θ ‖ π_ref)
L_PPO = E_t[ min(rₜ Aₜ, clip(rₜ, 1−ε, 1+ε) Aₜ) ], where rₜ = π_θ(yₜ|x) / π_old(yₜ|x) is the probability ratio and Aₜ is the advantage estimate.
The KL term prevents reward hacking — the model diverging to exploit weaknesses in the reward model. PPO's clipped surrogate objective prevents excessively large policy updates that would destabilize training.
11.4 Direct Preference Optimization (DPO)
DPO (Rafailov et al., 2023) bypasses the reward model entirely:
L_DPO(θ) = -E[ log σ(β log π_θ(y_w|x)/π_ref(y_w|x) - β log π_θ(y_l|x)/π_ref(y_l|x)) ]
DPO optimizes the same objective as RLHF under the Bradley–Terry preference model, but is simpler, more stable, and requires no separate reward model. Many modern fine-tuning pipelines use DPO or variants (IPO, KTO) instead of PPO.
Summary: Mathematical Building Blocks
| Concept | Core Equation | Role in AI |
|---|---|---|
| Matrix multiply | y = Wx + b | Every linear layer, attention projection |
| Softmax | exp(z)/Σexp(z) | Token probabilities, attention weights |
| Cross-entropy | H(P,Q) = -Σ P log Q | Training loss for classification/LM |
| Chain rule | ∂z/∂x = (∂z/∂y)(∂y/∂x) | Backpropagation through layers |
| Adam update | θ ← θ − η·m̂/(√v̂ + ε) | Parameter updates during training |
| Dot-product attention | softmax(QKᵀ/√dₖ)V | Core of transformer self-attention |
| Layer norm | (x-μ)/σ · γ + β | Stabilizes activations per sample |
| RoPE | (Rₘq)ᵀ(Rₙk)=f(q,k,m-n) | Relative positional encoding |
| KL divergence | Σ P log(P/Q) | RLHF regularization, VAE objective |
| Perplexity | exp(H(P,Q)) | Language model evaluation metric |
| SVD | W=UΣVᵀ | Weight compression, LoRA fine-tuning |
| PPO objective | E[min(r·A, clip(r)·A)] | RLHF policy optimization |
References
Vaswani, A., et al. (2017). Attention is all you need. NeurIPS 2017.
He, K., et al. (2016). Deep residual learning for image recognition. CVPR 2016.
Ba, J. L., et al. (2016). Layer normalization. arXiv:1607.06450.
Kingma, D. P., & Ba, J. (2015). Adam: A method for stochastic optimization. ICLR 2015.
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation 9(8).
Ioffe, S., & Szegedy, C. (2015). Batch normalization. ICML 2015.
Elhage, N., et al. (2021). A mathematical framework for transformer circuits. Transformer Circuits Thread.
Hoffmann, J., et al. (2022). Training compute-optimal large language models. arXiv:2203.15556.
Su, J., et al. (2021). RoFormer: Enhanced transformer with rotary position embedding. arXiv:2104.09864.
Dao, T., et al. (2022). FlashAttention: Fast and memory-efficient exact attention. NeurIPS 2022.
Rafailov, R., et al. (2023). Direct preference optimization. NeurIPS 2023.
Zhang, B., & Sennrich, R. (2019). Root mean square layer normalization. NeurIPS 2019.
Schulman, J., et al. (2017). Proximal policy optimization algorithms. arXiv:1707.06347.
Ouyang, L., et al. (2022). Training language models to follow instructions with human feedback. NeurIPS 2022.
Kirchenbauer, J., et al. (2023). A watermark for large language models. ICML 2023.
Rajbhandari, S., et al. (2020). ZeRO: Memory optimizations toward training trillion parameter models. SC 2020.