In a Mixture-of-Experts (MoE) model, a gating network sends each token to a small subset of "expert" sub-networks, so only part of the model activates per input. Shazeer et al. (2017) introduced the sparsely-gated MoE layer, and Fedus et al. (2021) simplified routing in the Switch Transformer by sending each token to a single expert, scaling to trillion-parameter models with roughly constant compute per token.
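
The routing step described above can be sketched in a few lines of plain Python. This is a minimal illustration of one common top-k gating variant, not the exact method of either paper: it omits the noise term and load-balancing losses used in practice, and the names `route_token`, `moe_layer`, `gate_weights`, and `experts` are hypothetical, chosen only for this example.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(token, gate_weights, k=2):
    """Return [(expert_index, weight), ...] for the k highest-scoring experts."""
    # One gating logit per expert: dot product of the token with that expert's gate vector.
    logits = [sum(t * w for t, w in zip(token, row)) for row in gate_weights]
    # Indices of the k largest logits, highest first.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Renormalise over the selected experts only, so their weights sum to 1.
    weights = softmax([logits[i] for i in top])
    return list(zip(top, weights))

def moe_layer(token, gate_weights, experts, k=2):
    """Weighted sum of the outputs of the selected experts; the others never run."""
    out = [0.0] * len(token)
    for idx, w in route_token(token, gate_weights, k):
        y = experts[idx](token)
        out = [o + w * yi for o, yi in zip(out, y)]
    return out

# Toy setup (assumed for illustration): four "experts" that just scale their input.
experts = [lambda x, s=s: [s * xi for xi in x] for s in (1.0, 2.0, 3.0, 4.0)]
gate_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
token = [2.0, 1.0]

print(route_token(token, gate_weights, k=2))
print(moe_layer(token, gate_weights, experts, k=1))
```

Setting `k=1` gives single-expert routing in the spirit of the Switch Transformer; larger `k` trades extra compute per token for a blend of several experts.
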
MoE lets a model hold a very large total parameter count while keeping per-token compute closer to that of a much smaller dense model, although all experts must still be kept in memory.
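
To make the gap between total capacity and per-token cost concrete, the following back-of-the-envelope sketch compares total and active parameter counts. The function name `moe_param_counts` and all the sizes used are hypothetical and do not describe any published model; the calculation also assumes every expert has the same size.

```python
def moe_param_counts(num_experts, params_per_expert, shared_params, k):
    """Return (total_params, active_params_per_token) for a simple MoE model.

    Assumes identical experts, with k of them active per token; shared_params
    covers everything that always runs (attention, embeddings, and so on).
    """
    total = shared_params + num_experts * params_per_expert
    active = shared_params + k * params_per_expert
    return total, active

# Hypothetical sizes: 64 experts of 100M parameters each, 500M shared, top-1 routing.
total, active = moe_param_counts(64, 100_000_000, 500_000_000, k=1)
print(total, active, total / active)
```

Under these assumed numbers the model stores 6.9 billion parameters but touches only 0.6 billion per token, a ratio of 11.5.
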