Category:Evaluation

From llmref.wiki

This category covers evaluation and benchmarking: how language-model and retrieval-system quality is measured, and the integrity threats to that measurement. Topics include faithfulness and groundedness, benchmark contamination, LLM-as-judge, and golden datasets.

Pages in category "Evaluation"

The following 7 pages are in this category, out of 7 total.

B

Benchmark contamination

F

G

Golden dataset

L

LLM-as-judge

M

Model card

R

Retrieval precision and recall

Retrieved from "https://llmref.wiki/index.php?title=Category:Evaluation&oldid=11"
