Scorers
87 built-in scorers across 12 categories for evaluating LLM outputs.
Using Scorers
Scorers are specified by name in your eval config file or SDK call.
{
  "name": "my-eval",
  "model": "gpt-4o",
  "prompt": "Answer: {{input}}",
  "scorers": ["exact-match", "faithfulness", "toxicity", "cost"],
  "cases": [
    { "input": "What is 2+2?", "expectedOutput": "4" }
  ]
}

Text Matching
Deterministic string comparison scorers.
exact-match: Exact string equality between output and expected
equals: Equality check with optional normalization
contains: Check if output contains expected string
contains-any: Check if output contains any of the expected strings
contains-all: Check if output contains all expected strings
icontains: Case-insensitive contains
icontains-any: Case-insensitive contains any
icontains-all: Case-insensitive contains all
starts-with: Check if output starts with expected prefix
ends-with: Check if output ends with expected suffix
regex-match: Match output against a regular expression
levenshtein: Levenshtein edit distance between output and expected
word-count: Check output word count against min/max bounds
length-check: Check output character length against bounds
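To illustrate what a distance-based scorer computes, here is a minimal sketch of the classic dynamic-programming Levenshtein edit distance (not the scorer's actual implementation):

```typescript
// Classic dynamic-programming Levenshtein edit distance: the minimum number
// of single-character insertions, deletions, and substitutions that turn
// string `a` into string `b`.
function levenshtein(a: string, b: string): number {
  const rows = a.length + 1;
  const cols = b.length + 1;
  // dp[i][j] = edit distance between a[0..i) and b[0..j)
  const dp: number[][] = Array.from({ length: rows }, () => new Array(cols).fill(0));
  for (let i = 0; i < rows; i++) dp[i][0] = i;
  for (let j = 0; j < cols; j++) dp[0][j] = j;
  for (let i = 1; i < rows; i++) {
    for (let j = 1; j < cols; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1,       // deletion
        dp[i][j - 1] + 1,       // insertion
        dp[i - 1][j - 1] + cost // substitution (or match)
      );
    }
  }
  return dp[rows - 1][cols - 1];
}
```

A raw distance is typically normalized into a [0, 1] score, e.g. `1 - distance / max(a.length, b.length)`.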
Semantic
Embedding-based and LLM-based semantic comparison.
similar: Fuzzy string similarity (cosine, Jaccard, etc.)
semantic-similarity: Cosine similarity between embedding vectors
embedding-distance: Distance between output and expected embeddings
select-best: LLM selects the best output from multiple candidates
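The semantic-similarity scorer reduces to cosine similarity between two embedding vectors (producing the embeddings is a separate step). A minimal sketch of that computation:

```typescript
// Cosine similarity between two equal-length embedding vectors:
// dot(a, b) / (|a| * |b|). Ranges from -1 to 1; 1 means identical direction.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("vectors must have equal length");
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```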
LLM-Based
LLM-as-judge scorers for complex quality assessments.
llm-grader: Custom LLM grading with your own rubric
g-eval: G-Eval framework for arbitrary evaluation criteria
faithfulness: Does the output faithfully represent the source?
relevance: Is the output relevant to the input query?
answer-relevance: Relevance of the answer to the specific question
factuality: Are the claims in the output factually correct?
hallucination: Does the output contain hallucinated information?
summarization: Quality of a summarization against the source
classifier: Classify output into custom categories
coherence: Logical flow and coherence of the output
fluency: Grammatical fluency and naturalness
completeness: Does the output fully address the query?
conciseness: Is the output appropriately concise?
readability: Readability score (Flesch-Kincaid, etc.)
arena-g-eval: Arena-style pairwise G-Eval comparison
general-task-completion: General task completion assessment
JSON & Structured
Validate structured output formats.
json-valid: Is the output valid JSON?
json-schema: Does the output match a JSON schema?
json-correctness: Semantic correctness of JSON output
contains-json: Does the output contain a JSON block?
contains-sql: Does the output contain SQL?
contains-xml: Does the output contain XML?
contains-html: Does the output contain HTML?
is-sql: Is the entire output valid SQL?
is-html: Is the entire output valid HTML?
is-valid-function-call: Is the output a valid function call?
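Conceptually, a validity check like json-valid is just a parse attempt mapped to a pass/fail score. A minimal sketch (not the scorer's actual implementation):

```typescript
// Returns 1 if the entire output parses as JSON, 0 otherwise.
function jsonValidScore(output: string): number {
  try {
    JSON.parse(output);
    return 1;
  } catch {
    return 0;
  }
}
```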
NLP Metrics
Traditional NLP evaluation metrics.
rouge-n: ROUGE-N overlap between output and reference
bleu: BLEU score for translation / generation quality
gleu: Google-BLEU variant score
meteor: METEOR score with synonymy and stemming
perplexity: Language model perplexity of the output
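As a concrete example of this family, ROUGE-N recall is the fraction of the reference's n-grams that also appear in the output, with clipped counts. A minimal unigram (ROUGE-1) sketch using naive whitespace tokenization, not the scorer's actual implementation:

```typescript
// ROUGE-1 recall: clipped unigram overlap divided by the number of
// unigrams in the reference. Tokenization is a naive whitespace split.
function rouge1Recall(candidate: string, reference: string): number {
  const tokenize = (s: string) => s.toLowerCase().split(/\s+/).filter(Boolean);
  const counts = (tokens: string[]) => {
    const m = new Map<string, number>();
    for (const t of tokens) m.set(t, (m.get(t) ?? 0) + 1);
    return m;
  };
  const cand = counts(tokenize(candidate));
  const ref = counts(tokenize(reference));
  let matched = 0, total = 0;
  for (const [token, refCount] of ref) {
    total += refCount;
    matched += Math.min(refCount, cand.get(token) ?? 0); // clip to candidate count
  }
  return total === 0 ? 0 : matched / total;
}
```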
MCP & Agentic
Scorers for multi-step agent and MCP tool use evaluation.
tool-correctness: Did the agent use the correct tools?
task-completion: Did the agent complete the assigned task?
mcp-task-completion: MCP-specific task completion metric
mcp-use: Correctness of MCP tool invocations
multi-turn-mcp-use: Multi-turn MCP tool use evaluation
goal-accuracy: Did the agent achieve the stated goal?
step-efficiency: How efficiently did the agent solve the task?
plan-adherence: Did the agent follow the expected plan?
plan-quality: Quality of the agent's planning
argument-correctness: Correctness of function call arguments
dag-evaluation: DAG-based evaluation of multi-step workflows
Conversation
Multi-turn conversation evaluation.
conversation: Overall conversation quality assessment
conversation-relevance: Turn-by-turn relevance in conversations
conversation-completeness: Did the conversation address all topics?
knowledge-retention: Does the model retain context across turns?
role-adherence: Does the model stay in its assigned role?
role-violation: Did the model break its role constraints?
RAG
Retrieval-augmented generation quality metrics.
context-faithfulness: Is the output faithful to retrieved context?
context-relevance: Is the retrieved context relevant to the query?
context-recall: Did the retriever find all relevant passages?
context-precision: Are retrieved passages precise and non-redundant?
Safety
Safety and content moderation scorers.
toxicity: Toxicity level of the output
bias: Bias detection in the output
non-advice: Does the output avoid giving dangerous advice?
misuse: Could the output enable misuse?
is-refusal: Did the model appropriately refuse a harmful request?
Multimodal
Image and multimodal output evaluation.
text-to-image: Quality of text-to-image generation
image-coherence: Visual coherence of generated images
image-helpfulness: Helpfulness of image-based outputs
image-reference: Accuracy against a reference image
image-editing: Quality of image editing operations
Performance
Cost, latency, and observability metrics.
cost: Token cost of the LLM call
latency: Response latency in milliseconds
trace-span-count: Number of spans in a trace
trace-span-duration: Duration of trace spans
trace-error-spans: Count of error spans in a trace
Custom
Bring your own scoring logic.
custom-function: Run a custom JavaScript/TypeScript function
webhook: Call an external webhook for scoring
Custom Scorers
Create custom scorers using the custom-function scorer or the webhook scorer.
Custom Function
{
  "scorers": ["custom-function"],
  "scorerOptions": {
    "custom-function": {
      "function": "return output.includes('Paris') ? 1.0 : 0.0"
    }
  }
}

Webhook Scorer
{
  "scorers": ["webhook"],
  "scorerOptions": {
    "webhook": {
      "url": "https://your-server.com/score",
      "headers": { "Authorization": "Bearer your-token" }
    }
  }
}
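A webhook scorer delegates scoring to your own HTTP endpoint. The exact request and response shapes are defined by the platform, so treat the following as a hypothetical sketch: it assumes the webhook receives { input, output, expectedOutput } and must respond with { score } in [0, 1]. The scoring logic is kept as a pure function so it is easy to test independently of the HTTP server.

```typescript
// Hypothetical webhook payload shapes -- the real contract is defined by
// your eval platform, not by this sketch.
interface ScoreRequest {
  input: string;
  output: string;
  expectedOutput?: string;
}

interface ScoreResponse {
  score: number; // assumed to be in [0, 1]
}

// Pure scoring logic, separate from HTTP plumbing.
// Example policy: full credit if the expected answer appears in the output.
function scoreRequest(req: ScoreRequest): ScoreResponse {
  const hit = req.expectedOutput !== undefined && req.output.includes(req.expectedOutput);
  return { score: hit ? 1 : 0 };
}
```

Wrap `scoreRequest` in whatever HTTP server you deploy (Express, a serverless function, etc.), verifying the Authorization header and returning the response as JSON.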