BLEU (Bilingual Evaluation Understudy), introduced by Papineni et al. (2002), scores a machine translation against one or more human references by counting overlapping word n-grams, typically up to length 4, and applying a penalty to output that is too short. It was one of the first automatic metrics to correlate reasonably well with human judgement of translation quality, and it became a field standard.
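The computation described above can be illustrated with a minimal sketch. It assumes pre-tokenized input (lists of words), uniform weights over n-gram orders 1 to 4, and no smoothing; the function names `ngrams` and `bleu` are chosen here for illustration and are not taken from any library. It scores a single sentence, whereas BLEU as originally defined aggregates n-gram counts over a whole test corpus.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, references, max_n=4):
    """Sentence-level BLEU sketch: uniform weights, no smoothing (assumptions)."""
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        cand_counts = ngrams(candidate, n)
        total = sum(cand_counts.values())

        # For each n-gram, find its highest count in any single reference.
        max_ref_counts = Counter()
        for ref in references:
            for gram, count in ngrams(ref, n).items():
                max_ref_counts[gram] = max(max_ref_counts[gram], count)

        # Clip candidate counts so repeated words cannot inflate precision.
        clipped = sum(
            min(count, max_ref_counts[gram]) for gram, count in cand_counts.items()
        )

        # Without smoothing, any order with zero matches makes the score zero.
        if total == 0 or clipped == 0:
            return 0.0
        log_precision_sum += math.log(clipped / total) / max_n

    # Brevity penalty, using the reference length closest to the candidate's
    # (ties broken toward the shorter reference).
    c = len(candidate)
    r = min(
        (len(ref) for ref in references),
        key=lambda length: (abs(length - c), length),
    )
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)

    # Geometric mean of the clipped precisions, scaled by the penalty.
    return brevity_penalty * math.exp(log_precision_sum)


reference = "the cat sat on the mat".split()
print(bleu("the cat sat on the mat".split(), [reference]))
print(bleu("the cat".split(), [reference], max_n=2))
```

The second call shows the brevity penalty at work: every n-gram in the two-word candidate appears in the reference, so both precisions are perfect, yet the score is pulled well below 1 because the output is much shorter than the reference. Because this sketch applies no smoothing, any sentence with no matching 4-gram scores zero, which is one reason the metric is usually reported at corpus level.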
BLEU is fast and reproducible but blind to meaning and paraphrase, so it is used alongside, not instead of, human evaluation.