Perplexity is the exponential of a model's average per-token cross-entropy on a test set — intuitively, the effective number of equally likely choices the model is deciding among at each step. The measure traces to early speech-recognition research (Jelinek et al., 1977) and is treated as a core evaluation metric in Jurafsky & Martin's standard text.
Lower perplexity is better, but the metric measures only predictive fit on text; it does not directly capture helpfulness, factuality or safety, which is why it is paired with task benchmarks and human evaluation.
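
The definition can be made concrete with a minimal sketch. It assumes we already have the probability the model assigned to each token that actually occurred in the test text. The function name `perplexity` and the two probability lists are hypothetical, chosen only for illustration, and natural logarithms are used throughout.

```python
import math


def perplexity(token_probs):
    """Perplexity from the probabilities a model gave to each observed token.

    `token_probs` is a hypothetical input: one probability in (0, 1] per
    token of the test text, as assigned by the model being evaluated.
    """
    if not token_probs:
        raise ValueError("need at least one token probability")
    # Average negative log-likelihood per token (cross-entropy in nats).
    avg_cross_entropy = -sum(math.log(p) for p in token_probs) / len(token_probs)
    # Perplexity is the exponential of that average.
    return math.exp(avg_cross_entropy)


# A model that spreads probability evenly over 4 options at every step.
uniform = perplexity([0.25, 0.25, 0.25, 0.25])

# A model that is fairly confident about each observed token (made-up values).
confident = perplexity([0.9, 0.8, 0.95, 0.7])

print(f"uniform over 4 choices: {uniform:.3f}")
print(f"more confident model:   {confident:.3f}")
```

For the uniform case the result comes out at about 4, matching the reading of perplexity as the effective number of equally likely choices, while the more confident model scores lower. If the logarithm were taken in base 2 instead, the exponentiation would need to use base 2 as well to give the same value.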