Concept · Evaluation modes

Basic vs deep evaluation

EvalGuard ships two evaluator depths. They share the same dimensions (safety, fairness, accuracy, reliability, transparency, privacy, accountability, user-impact) but differ in cost, latency, and what they can detect. Picking the right one for the right traffic class is the single biggest knob on your eval bill.

Basic scorers — keyword + regex + ML classifier

Run locally, no LLM call. Sub-millisecond per dimension. Bias check is a regex/keyword stereotype-pattern heuristic (a learned LLM-judge is available in deep mode); toxicity uses Perspective API patterns; PII leakage is regex + Aadhaar/PAN/UPI validators; reliability checks are structural (JSON-valid, regex-match, length-check). Cost in tokens: zero. Cost in dollars: zero beyond your own compute.
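To make the "structural, no LLM call" idea concrete, here is a minimal sketch of three local checks: JSON validity, a length cap, and a PAN-shaped regex. The names `BasicCheckResult`, `runBasicChecks`, and the check slugs are hypothetical and are not EvalGuard's actual scorer API. The PAN pattern only matches the public five-letter, four-digit, one-letter layout and does no further validation.

```typescript
// Hypothetical result shape for illustration; the real scorer output may differ.
interface BasicCheckResult {
  name: string;
  passed: boolean;
}

// PAN layout: five uppercase letters, four digits, one uppercase letter.
// No global flag, so repeated .test() calls carry no state between them.
const PAN_PATTERN = /\b[A-Z]{5}[0-9]{4}[A-Z]\b/;

// Structural reliability check: does the output parse as JSON at all?
function isValidJson(text: string): boolean {
  try {
    JSON.parse(text);
    return true;
  } catch (err) {
    return false;
  }
}

// Runs every check locally; no network call and no tokens spent.
function runBasicChecks(output: string, maxLength: number): BasicCheckResult[] {
  return [
    { name: "json-valid", passed: isValidJson(output) },
    { name: "length-check", passed: output.length <= maxLength },
    { name: "no-pan-leak", passed: !PAN_PATTERN.test(output) },
  ];
}
```

A pattern check like this is cheap but literal: it flags a PAN-shaped string and nothing subtler, which is the false-negative trade-off basic mode accepts.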

Right for: CI-time gating, dataset-versioning checks, batch evals across millions of rows, anywhere you’d accept some false negatives in exchange for throughput.

Deep scorers — LLM-as-judge with per-criterion rubric

Each dimension’s deep config (biasDeepConfig, toxicDeepConfig, faithfulnessDeepConfig, etc.) is a structured prompt that asks an LLM to grade against 3–7 explicit criteria. Each call returns a per-criterion verdict (pass, fail, or partial), an overall 0–1 score, and a one-sentence reasoning string. This catches nuanced bias the basic classifier misses, for example “female candidates often need extra support”, a textbook gender claim.
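The result described above can be modelled roughly as follows. This is a sketch under assumptions: the field names `criteria`, `score`, and `reasoning`, and the `parseDeepResult` helper, are hypothetical and not taken from @evalguard/core. It only shows how a caller might validate a judge's JSON output before trusting it.

```typescript
type Verdict = "pass" | "fail" | "partial";

// Hypothetical shape: field names are assumptions, not the library's types.
interface DeepResult {
  criteria: { [criterion: string]: Verdict }; // one verdict per rubric criterion
  score: number;                              // overall score in [0, 1]
  reasoning: string;                          // one-sentence explanation
}

// Parses raw judge output and rejects anything outside the expected shape.
function parseDeepResult(raw: string): DeepResult {
  const data = JSON.parse(raw);
  const allowed = ["pass", "fail", "partial"];
  if (typeof data !== "object" || data === null) {
    throw new Error("judge output is not an object");
  }
  if (typeof data.score !== "number" || data.score < 0 || data.score > 1) {
    throw new Error("score must be a number between 0 and 1");
  }
  if (typeof data.reasoning !== "string") {
    throw new Error("reasoning must be a string");
  }
  if (typeof data.criteria !== "object" || data.criteria === null) {
    throw new Error("criteria must be an object");
  }
  for (const key of Object.keys(data.criteria)) {
    if (allowed.indexOf(data.criteria[key]) === -1) {
      throw new Error("unknown verdict for criterion: " + key);
    }
  }
  return data as DeepResult;
}
```

Validating the shape matters because an LLM judge can return malformed or out-of-range output; failing loudly is safer than treating a bad parse as a pass.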

Cost: 1 LLM call per dimension per evaluation. With gpt-4o-mini at the default 600-token cap, that’s ~$0.0003 per dim. Running the same three dims as the basic set comes to ~$0.001 per evaluation. Latency: 2–5 seconds per dim (parallelisable).
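The arithmetic behind those figures can be sketched as below. The function names are hypothetical, and the per-dim price is simply the ~$0.0003 estimate quoted above, so treat the outputs as rough planning numbers rather than a quote.

```typescript
// Approximate per-dimension cost quoted in the text (gpt-4o-mini, 600-token cap).
const COST_PER_DIM_USD = 0.0003;

// One LLM call per dimension per evaluation, so cost scales linearly in both.
function estimateDeepCostUsd(dims: number, evaluations: number): number {
  return dims * COST_PER_DIM_USD * evaluations;
}

// Sequential calls add up; parallel calls are bounded by a single call,
// assuming every dimension takes roughly perDimSeconds.
function estimateLatencySeconds(dims: number, perDimSeconds: number, parallel: boolean): number {
  return parallel ? perDimSeconds : dims * perDimSeconds;
}
```

Under these assumptions, three dims cost 3 × $0.0003 = $0.0009 per evaluation, which rounds to the ~$0.001 quoted above, and a million such evaluations would come to roughly $900. At 5 seconds per dim, three dims take about 15 seconds sequentially versus about 5 seconds in parallel.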

Right for: production response gating, regulator-facing audits, any eval where a single false negative is more expensive than three LLM calls.

Picking a default

The POST /api/v1/evals/safe-regenerate endpoint takes a scorerSet field with values basic or deep. basic gates on safety + fairness + accuracy; deep adds reliability, transparency, privacy, accountability, user-impact (all 8 pillars).
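A request to this endpoint might be assembled as in the sketch below. Only the path and the scorerSet field with its two values come from the text above. The base URL, the bearer-token header, and the `prompt` and `response` body fields are assumptions made for illustration, so check the API reference for the real request schema.

```typescript
type ScorerSet = "basic" | "deep";

// Dimensions gated by each set, as listed in the text above.
const DIMENSIONS_BY_SET: { [K in ScorerSet]: string[] } = {
  basic: ["safety", "fairness", "accuracy"],
  deep: [
    "safety", "fairness", "accuracy", "reliability",
    "transparency", "privacy", "accountability", "user-impact",
  ],
};

// Hypothetical body: only scorerSet is confirmed by the documentation.
interface SafeRegenerateBody {
  scorerSet: ScorerSet;
  prompt: string;   // assumed field name
  response: string; // assumed field name
}

// Builds the URL and request options without sending anything.
function buildSafeRegenerateRequest(baseUrl: string, apiKey: string, body: SafeRegenerateBody) {
  return {
    url: baseUrl + "/api/v1/evals/safe-regenerate",
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: "Bearer " + apiKey, // assumed auth scheme
      },
      body: JSON.stringify(body),
    },
  };
}
```

The returned `url` and `init` can be passed straight to `fetch(url, init)`. Keeping the builder separate from the network call makes it easy to switch scorerSet per traffic class and to test the payload without hitting the API.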

For new integrations, start on deep and tune down only if measurement shows the cost actually matters. The gateway’s basic firewall already does the cheap pre-filter inline, so by the time a response reaches safe-regenerate you’ve already paid the rendering cost. The extra $0.001 per call buys real bias detection.

What this maps to under the hood

Basic mode draws on 200+ scorers, each addressable by its own slug and deep-linked from the scorer catalog.
Deep mode adds 10 LLM-judge graders and 29 deep configs (biasDeepConfig, toxicDeepConfig, …), all importable directly from @evalguard/core.

Related concepts

Scoring thresholds — how the 0–1 dim scores combine into a pass/fail verdict.
The regeneration loop — how a failing eval triggers a retry rather than a hard block.
Firewall vs scorer — why both exist and where each fires.