Basic vs deep evaluation
EvalGuard ships two evaluator depths. They share the same dimensions (safety, fairness, accuracy, reliability, transparency, privacy, accountability, user-impact) but differ in cost, latency, and what they can detect. Picking the right one for the right traffic class is the single biggest knob on your eval bill.
Basic scorers — keyword + regex + ML classifier
Basic scorers run locally with no LLM call and take under a millisecond per dimension. The bias check is a learned classifier; toxicity uses Perspective API patterns; PII leakage is regex plus Aadhaar/PAN/UPI validators; reliability checks are structural (JSON-valid, regex-match, length-check). Cost in tokens: zero. Cost in dollars: zero beyond your own compute.
Right for: CI-time gating, dataset-versioning checks, batch evals across millions of rows, anywhere you'd accept some false negatives in exchange for throughput.
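As a rough illustration of what a basic scorer does, the sketch below runs two of the structural checks named above (JSON-valid, length-check) plus a regex test for the standard PAN format of five letters, four digits, and one letter. The names CheckResult, runBasicChecks, and PAN_PATTERN are hypothetical and are not EvalGuard's actual scorer API.

```typescript
// Hypothetical names throughout; this is a sketch, not EvalGuard's scorer API.
type CheckResult = { name: string; passed: boolean };

// PAN format: five uppercase letters, four digits, one uppercase letter.
const PAN_PATTERN = /\b[A-Z]{5}[0-9]{4}[A-Z]\b/;

// Structural check: does the text parse as JSON at all?
function isValidJson(text: string): boolean {
  try {
    JSON.parse(text);
    return true;
  } catch {
    return false;
  }
}

// Run every check locally; no network or LLM call is involved.
function runBasicChecks(output: string, maxLength: number): CheckResult[] {
  return [
    { name: "json-valid", passed: isValidJson(output) },
    { name: "length-check", passed: output.length <= maxLength },
    { name: "no-pan-leak", passed: !PAN_PATTERN.test(output) },
  ];
}
```

Checks like these are cheap and deterministic, which is what makes them suitable for batch runs, but a pattern match only catches what it was written to catch. That is the false-negative trade-off described above.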
Deep scorers — LLM-as-judge with per-criterion rubric
Each dimension's deep config (biasDeepConfig, toxicDeepConfig, faithfulnessDeepConfig, etc.) is a structured prompt that asks an LLM to grade against 3–7 explicit criteria. It returns a per-criterion pass/fail/partial verdict, an overall 0–1 score, and a one-sentence reasoning string. It catches nuanced bias that the basic classifier misses ("female candidates often need extra support", a textbook gender stereotype).
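The result described above could be modelled and validated along these lines. This is a minimal sketch: the DeepResult shape and its field names (criteria, verdict, score, reasoning) are assumptions inferred from the prose, not the types exported by the package.

```typescript
// Hypothetical shape inferred from the prose; field names are assumptions.
type Verdict = "pass" | "fail" | "partial";

interface DeepResult {
  criteria: { criterion: string; verdict: Verdict }[];
  score: number; // overall score in the range 0–1
  reasoning: string; // one-sentence explanation from the judge
}

const VERDICTS: readonly string[] = ["pass", "fail", "partial"];

// Parse raw judge output. Returns null on anything malformed instead of
// falling back to a default score.
function parseDeepResult(raw: string): DeepResult | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null;
  }
  if (typeof data !== "object" || data === null) return null;
  const record = data as Record<string, unknown>;
  const score = record["score"];
  const reasoning = record["reasoning"];
  const rawCriteria = record["criteria"];
  if (typeof score !== "number" || score < 0 || score > 1) return null;
  if (typeof reasoning !== "string") return null;
  if (!Array.isArray(rawCriteria)) return null;

  const criteria: DeepResult["criteria"] = [];
  for (const item of rawCriteria) {
    if (typeof item !== "object" || item === null) return null;
    const entry = item as Record<string, unknown>;
    const criterion = entry["criterion"];
    const verdict = entry["verdict"];
    if (typeof criterion !== "string" || typeof verdict !== "string") return null;
    if (VERDICTS.indexOf(verdict) === -1) return null;
    criteria.push({ criterion, verdict: verdict as Verdict });
  }
  return { criteria, score, reasoning };
}
```

Rejecting malformed output outright, rather than substituting a passing score, keeps a parsing failure from being mistaken for a clean evaluation.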
Cost: 1 LLM call per dimension per evaluation. With gpt-4o-mini at the default 600-token cap, that's ~$0.0003 per dim. Three dims at basic-set parity = ~$0.001 per evaluation. Latency: 2–5 seconds per dim (parallelisable).
Right for: production response gating, regulator-facing audits, any eval where a single false negative is more expensive than three LLM calls.
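The cost and latency figures above can be turned into a quick back-of-envelope estimate. The sketch below uses only the numbers quoted in the text (~$0.0003 and 2–5 seconds per dimension). The function name estimateDeepEval is hypothetical, and the parallel case assumes the per-dimension calls overlap completely, which real traffic may not achieve.

```typescript
// Figures quoted in the text: ~$0.0003 and 2–5 seconds per dimension.
const COST_PER_DIM_USD = 0.0003;
const LATENCY_PER_DIM_SEC = { min: 2, max: 5 };

interface DeepEvalEstimate {
  costUsd: number;
  minLatencySec: number;
  maxLatencySec: number;
}

// Hypothetical helper: cost scales with the number of dimensions; latency
// scales with it only when the calls run one after another.
function estimateDeepEval(dims: number, parallel: boolean): DeepEvalEstimate {
  // Assumption: fully parallel calls finish in roughly one call's latency.
  const latencyFactor = dims === 0 ? 0 : parallel ? 1 : dims;
  return {
    costUsd: dims * COST_PER_DIM_USD,
    minLatencySec: LATENCY_PER_DIM_SEC.min * latencyFactor,
    maxLatencySec: LATENCY_PER_DIM_SEC.max * latencyFactor,
  };
}
```

With three dimensions this gives about $0.0009 per evaluation, which is the ~$0.001 quoted above. Under the same per-dimension figure, eight dimensions would come to about $0.0024, and running three dimensions sequentially rather than in parallel would take 6–15 seconds instead of 2–5.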
Picking a default
The POST /api/v1/evals/safe-regenerate endpoint takes a scorerSet field with values basic or deep. basic gates on safety + fairness + accuracy; deep adds reliability, transparency, privacy, accountability, user-impact (all 8 pillars).
For new integrations, start on deep and tune down only if the measured cost turns out to matter. The gateway's basic firewall already does the cheap pre-filter inline, so by the time a response reaches safe-regenerate you've already paid the rendering cost. The extra $0.001 per call buys real bias detection.
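A request to the endpoint described above might be assembled as follows. Only the path and the scorerSet field come from this page. The base URL, the bearer-token Authorization header, and the response body field are hypothetical placeholders, so check the API reference for the actual request schema.

```typescript
type ScorerSet = "basic" | "deep";

interface BuiltRequest {
  url: string;
  init: { method: "POST"; headers: Record<string, string>; body: string };
}

// Hypothetical helper. baseUrl, apiKey, the Authorization header, and the
// `response` body field are assumptions; the path and scorerSet are from the docs.
function buildSafeRegenerateRequest(
  baseUrl: string,
  apiKey: string,
  response: string,
  scorerSet: ScorerSet = "deep", // default follows the "start on deep" advice
): BuiltRequest {
  const trimmedBase = baseUrl.replace(/\/+$/, ""); // avoid a double slash
  return {
    url: `${trimmedBase}/api/v1/evals/safe-regenerate`,
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${apiKey}`,
      },
      body: JSON.stringify({ response, scorerSet }),
    },
  };
}
```

The returned url and init can be passed straight to fetch(url, init). Keeping scorerSet as a typed parameter with a deep default makes a later switch to basic a one-argument change at the call site.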
What this maps to under the hood
- Basic scorers live in packages/core/src/scorers/. 188 slugs, deep-linked from the scorer catalog.
- Deep scorers live in packages/core/src/security/graders/deep/ and packages/core/src/scorers/deep/: 10 graders + 30 deep configs. Imported from @evalguard/core at the package root (re-export added in commit 105b91c6; live-fire surfaced the gap when every call was falsely returning score=1).
Related concepts
- Scoring thresholds — how the 0–1 dim scores combine into a pass/fail verdict.
- The regeneration loop — how a failing eval triggers a retry rather than a hard block.
- Firewall vs scorer — why both exist and where each fires.