Category:Evaluation
From llmref.wiki
This category covers evaluation and benchmarking: how the quality of language models and retrieval systems is measured, and what threatens the integrity of that measurement. Topics include faithfulness and groundedness, benchmark contamination, LLM-as-judge, and golden datasets.
Pages in category "Evaluation"
The following 7 pages are in this category, out of 7 total.