Category:Evaluation

From llmref.wiki

This category covers evaluation and benchmarking: how the quality of language models and retrieval systems is measured, and what threatens the integrity of that measurement. Topics include faithfulness and groundedness, benchmark contamination, LLM-as-judge, and golden datasets.

Pages in category "Evaluation"

The following 7 pages are in this category, out of 7 total.