LLM Evaluation

Definition

LLM evaluation encompasses the methods and metrics used to measure model quality across dimensions including accuracy, safety, instruction following, and reasoning. Evaluation combines automated benchmarks (MMLU, HumanEval), reference-based metrics (BLEU, ROUGE), model-based judging (LLM-as-judge), and human preference studies.

Robust evaluation guides training decisions and catches capability regressions between model versions.
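
To make the reference-based family concrete, here is a minimal sketch of two simple scores: exact-match accuracy and a unigram-overlap F1 in the spirit of ROUGE-1. It is a simplified illustration, not the official BLEU or ROUGE implementation. The function names `unigram_f1` and `exact_match_accuracy` are hypothetical, and tokenization is assumed to be lowercase whitespace splitting with no stemming.

```python
from collections import Counter
from typing import Sequence


def unigram_f1(candidate: str, reference: str) -> float:
    """ROUGE-1-style F1 over lowercase whitespace tokens (assumed tokenization)."""
    cand_tokens = candidate.lower().split()
    ref_tokens = reference.lower().split()
    if not cand_tokens or not ref_tokens:
        return 0.0
    # Counter intersection keeps the minimum count per token (clipped overlap).
    overlap = sum((Counter(cand_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def exact_match_accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of predictions equal to their reference, ignoring case and outer whitespace."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(
        p.strip().lower() == r.strip().lower()
        for p, r in zip(predictions, references)
    )
    return hits / len(references)


if __name__ == "__main__":
    print(unigram_f1("the cat sat", "the cat sat on the mat"))
    print(exact_match_accuracy(["Paris", "4"], ["paris", "5"]))
```

Scores like these are cheap to compute but only reward surface overlap with a reference, which is one reason they are usually combined with model-based judging and human preference studies.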

