41. Judge language model outputs with real evidence
Evaluate generated answers for accuracy, faithfulness, relevance, style, safety, and task success. You will build test sets, rubrics, human review loops, and automated judges, and you will learn to verify those judges rather than trust them blindly.
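
One way to avoid trusting an automated judge blindly is to measure how often it agrees with human reviewers on the same answers, corrected for chance. The sketch below computes Cohen's kappa for that purpose. It is only an illustration: the function name `cohen_kappa`, the `pass`/`fail` labels, and the sample data are assumptions made for this example, not material from the chapter.

```python
from collections import Counter


def cohen_kappa(human, judge):
    """Chance-corrected agreement between two equal-length label lists."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(human)

    # Fraction of items where the human and the judge gave the same label.
    observed = sum(h == j for h, j in zip(human, judge)) / n

    # Agreement expected by chance, from each rater's label frequencies.
    human_counts = Counter(human)
    judge_counts = Counter(judge)
    expected = sum(
        human_counts[label] * judge_counts[label] for label in human_counts
    ) / (n * n)

    if expected == 1.0:
        # Both raters used one identical label throughout; agreement is total.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical labels for eight answers, made up for illustration only.
human_labels = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "fail"]
judge_labels = ["pass", "pass", "fail", "fail", "fail", "pass", "pass", "fail"]

kappa = cohen_kappa(human_labels, judge_labels)
print(f"judge vs. human kappa: {kappa:.2f}")
```

Here the judge matches the human on 6 of 8 answers (75%), but half of that agreement would be expected by chance, so kappa is 0.5. A low kappa suggests the judge's prompt or rubric needs revision before its scores are used in place of human review.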