31. Evaluate answers, actions, and whole tasks
Measure factual quality, tool choice, task completion, safety, latency, cost, and user satisfaction. You will build eval sets, rubrics, golden tasks, judge models, and regression tests for agent releases.
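As a rough illustration of how golden tasks and regression tests can fit together, the sketch below scores an agent on two of the dimensions listed above (tool choice and task completion) and flags any metric that drops between a baseline release and a candidate. All names here (`GoldenTask`, `AgentResult`, `pass_rates`, `regressions`) are hypothetical, and the substring check is a simplifying assumption standing in for a real rubric or judge model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class GoldenTask:
    """Hypothetical golden task: a prompt plus what a correct run looks like."""
    prompt: str
    expected_tool: str    # tool the agent is expected to choose
    expected_answer: str  # substring a correct final answer must contain


@dataclass
class AgentResult:
    """Hypothetical record of one agent run."""
    tool_used: str
    answer: str


def score_task(task: GoldenTask, result: AgentResult) -> Dict[str, bool]:
    """Score one run. The substring match is an assumed stand-in for a rubric or judge."""
    return {
        "tool_choice": result.tool_used == task.expected_tool,
        "task_completion": task.expected_answer.lower() in result.answer.lower(),
    }


def pass_rates(tasks: List[GoldenTask],
               agent: Callable[[str], AgentResult]) -> Dict[str, float]:
    """Run the agent on every golden task and return the pass rate per metric."""
    totals = {"tool_choice": 0, "task_completion": 0}
    for task in tasks:
        for name, passed in score_task(task, agent(task.prompt)).items():
            totals[name] += int(passed)
    return {name: count / len(tasks) for name, count in totals.items()}


def regressions(baseline: Dict[str, float],
                candidate: Dict[str, float],
                tolerance: float = 0.0) -> List[str]:
    """List the metrics where the candidate is worse than baseline by more than tolerance."""
    return [name for name in baseline
            if candidate[name] < baseline[name] - tolerance]


if __name__ == "__main__":
    tasks = [
        GoldenTask("What is 2+2?", "calculator", "4"),
        GoldenTask("Capital of France?", "search", "Paris"),
    ]

    # Stub agents standing in for two releases of a real agent.
    def release_a(prompt: str) -> AgentResult:
        if "2+2" in prompt:
            return AgentResult("calculator", "The answer is 4.")
        return AgentResult("search", "Paris is the capital of France.")

    def release_b(prompt: str) -> AgentResult:
        return AgentResult("search", "I am not sure.")

    baseline = pass_rates(tasks, release_a)
    candidate = pass_rates(tasks, release_b)
    print("regressed metrics:", regressions(baseline, candidate))
```

A release gate could fail the build whenever `regressions` returns a non-empty list. Safety, latency, cost, and user satisfaction would need their own scorers and are left out of this sketch.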