Scoring thresholds

Every EvalGuard evaluator returns a score in [0, 1] per dimension. A response is judged against a threshold — pass if the score meets or exceeds it, fail otherwise. The whole product comes down to picking the right thresholds for the right traffic.

The 0–1 scale

Scores are not probabilities. They’re calibrated grades from the underlying scorer: an ML classifier returns its confidence in “this output is safe”; an LLM-as-judge returns a 0–1 value derived from per-criterion pass/fail counts; a regex check returns 1.0 if nothing matches and 0.0 if anything does. The contract is uniform (every scorer must return a number in [0, 1]), but what the number means depends on the dimension.
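As a rough illustration of that uniform contract, the sketch below shows two hypothetical scorers that both return a value in [0, 1]. The function names, the example pattern, and the choice to compute the judge score as the fraction of criteria passed are assumptions made for illustration; the text above only says the judge score is derived from pass/fail counts.

```python
import re


def regex_score(output: str, pattern: str) -> float:
    """Regex check: 1.0 if the pattern never matches, 0.0 if it matches anywhere."""
    return 0.0 if re.search(pattern, output) else 1.0


def judge_score(criteria_passed: list[bool]) -> float:
    """LLM-as-judge style score.

    Assumption: the score is the fraction of criteria that passed.
    """
    if not criteria_passed:
        raise ValueError("at least one criterion is required")
    return sum(criteria_passed) / len(criteria_passed)


def check_contract(score: float) -> float:
    """Enforce the shared contract: every scorer returns a number in [0, 1]."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score {score} is outside [0, 1]")
    return score
```

Both scorers satisfy the same range check even though a regex score is all-or-nothing and a judge score is graded, which is why a single threshold can mean different things on different dimensions.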

For display, the EvalGuard dashboard renders scores as percentages — 0.75 shows as “75%”. The one exception is the public “Try the evaluator” demo on /try, which shows a 0–10 scale (“7.5 / 10”) for at-a-glance familiarity. The core evaluation API (/api/v1) returns the raw 0–1 score per dimension; only the demo’s /api/public/try endpoint returns its headline overallScore pre-scaled to 0–10.
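The two display conventions are simple rescalings of the same raw score. The helpers below are a minimal sketch, not EvalGuard code: the function names and the rounding behaviour are assumptions.

```python
def as_percent(score: float) -> str:
    """Dashboard-style rendering of a raw 0-1 score, e.g. 0.75 -> '75%'."""
    return f"{round(score * 100)}%"


def as_ten_point(score: float) -> str:
    """Demo-style rendering of a raw 0-1 score, e.g. 0.75 -> '7.5 / 10'."""
    return f"{score * 10:.1f} / 10"


def from_ten_point(overall_score: float) -> float:
    """Convert a pre-scaled 0-10 demo score back to the raw 0-1 scale."""
    return overall_score / 10
```

If you compare demo output against thresholds from the core API, convert the demo’s 0–10 value back to 0–1 first; a threshold of 0.8 corresponds to 8.0 on the demo scale.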

The 0.8 default threshold

Most endpoints default to threshold: 0.8. It’s not a guess: it’s the empirical breakpoint where the deep grader’s false-positive rate drops sharply on our internal eval set. Because a response passes when its score meets or exceeds the threshold, going below 0.8 means you start missing real failures, and going above 0.85 means you start flagging legitimate edge-case responses. The workable band is narrow, so we’d rather you tune the threshold than leave it at the default.

MIN-of-dims gate semantics

When an evaluation covers multiple dimensions (safety + fairness + accuracy, say), the overall verdict is the MIN of the reported dimension scores, not the mean. One dim below threshold fails the whole response.

gate semantics

overall = min(safety, fairness, accuracy)
verdict = overall >= threshold ? "pass" : "fail"

Why MIN and not a weighted average: averaging lets a strong score on one dim mask a critical failure on another. A response that’s 1.0 safe + 0.3 fair averages to 0.65, well below threshold, but a response that’s 1.0 safe + 1.0 accurate + 0.3 fair averages to about 0.77, which passes at a 0.75 threshold and falls only just short of the 0.8 default. The fairness failure didn’t get less serious because the other dims happened to be fine. MIN forces every dim to clear independently.
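The difference between the two aggregation rules can be checked with a few lines of Python. This is a sketch of the semantics described above, not the evaluator’s implementation; the function names are hypothetical, and the 0.75 threshold is chosen only to make the masking effect visible.

```python
def min_gate(scores: dict[str, float], threshold: float = 0.8) -> tuple[float, str]:
    """MIN-of-dims gate: the lowest dimension score decides the verdict."""
    overall = min(scores.values())
    return overall, "pass" if overall >= threshold else "fail"


def mean_gate(scores: dict[str, float], threshold: float = 0.8) -> tuple[float, str]:
    """Mean-based gate, shown only for comparison with the MIN gate."""
    overall = sum(scores.values()) / len(scores)
    return overall, "pass" if overall >= threshold else "fail"


scores = {"safety": 1.0, "accuracy": 1.0, "fairness": 0.3}

# At a 0.75 threshold the mean (about 0.77) passes,
# while the MIN gate fails on the 0.3 fairness score.
mean_overall, mean_verdict = mean_gate(scores, threshold=0.75)
min_overall, min_verdict = min_gate(scores, threshold=0.75)
```

With the MIN gate, adding more well-scoring dimensions can never rescue a response, whereas with a mean each extra high score pulls the overall value further up.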

Calibrating for your domain

The default 0.8 is for general-purpose chat. For specific verticals:

Healthcare patient-facing: tighten safety + accuracy to 0.9, leave fairness at 0.8. A medical hallucination is more expensive than a regulatory finding.
Internal developer tooling: loosen to 0.7 across the board. Engineers can self-correct; the goal is to keep them unblocked.
Public-facing customer support: safety + fairness at 0.85, accuracy at 0.75. Tone matters more than perfect factual accuracy.
Compliance evidence (DPDP/GDPR): set per-section thresholds via the policy engine — S.6 consent attestation needs 1.0, while S.11 sensitive-PD scan can default to 0.85.

Calibrate by sampling 100 production responses, gating them at your candidate threshold, and manually reviewing the false positives (acceptable responses that failed) and false negatives (bad responses that passed). Move the threshold in 0.05 increments. Most teams land within 0.05 of 0.8 after two iterations.
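One way to organise that review is to label each sampled response by hand and then count gate errors at several candidate thresholds. The sketch below assumes a hypothetical Sample record with a reviewer-supplied acceptable flag; none of these names come from the EvalGuard API.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    score: float      # evaluator score for the sampled response
    acceptable: bool  # manual reviewer judgment (hypothetical label)


def errors_at(samples: list[Sample], threshold: float) -> dict[str, int]:
    """Count gate errors at one threshold.

    false positive: an acceptable response that the gate fails
    false negative: an unacceptable response that the gate passes
    """
    fp = sum(1 for s in samples if s.acceptable and s.score < threshold)
    fn = sum(1 for s in samples if not s.acceptable and s.score >= threshold)
    return {"false_positives": fp, "false_negatives": fn}


def sweep(samples: list[Sample], center: float = 0.8, step: float = 0.05,
          steps_each_side: int = 2) -> dict[float, dict[str, int]]:
    """Evaluate thresholds around `center` in `step` increments."""
    results = {}
    for i in range(-steps_each_side, steps_each_side + 1):
        threshold = round(center + i * step, 2)  # avoid float drift in the keys
        results[threshold] = errors_at(samples, threshold)
    return results
```

Reading the sweep side by side shows the trade-off directly: raising the threshold tends to convert false negatives into false positives, and the right setting depends on which error is more costly in your domain.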

Per-dimension thresholds

The default flat threshold is the easy path. For production rigor, every endpoint that takes threshold also accepts dimensions as an array, so you can gate only on the dims that matter for your use case. To set a different threshold for each dim, use the policy engine: it lets you declare rules like “block if safety < 0.9 OR fairness < 0.85” without changing the eval call shape.
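The logic behind a rule such as “block if safety < 0.9 OR fairness < 0.85” can be sketched in plain Python. This is not the policy engine’s syntax, which is documented separately; the function names and the "block" / "allow" labels are assumptions used to show the semantics.

```python
def violations(scores: dict[str, float], minimums: dict[str, float]) -> list[str]:
    """Return the gated dimensions whose score falls below its own minimum.

    Only dimensions listed in `minimums` are checked; any other reported
    dimension is ignored. A gated dimension missing from `scores` raises KeyError.
    """
    return [dim for dim, floor in minimums.items() if scores[dim] < floor]


def decide(scores: dict[str, float], minimums: dict[str, float]) -> str:
    """Block if any gated dimension is under its minimum (an OR across rules)."""
    return "block" if violations(scores, minimums) else "allow"


# Hypothetical rule equivalent to: block if safety < 0.9 OR fairness < 0.85
rule = {"safety": 0.9, "fairness": 0.85}
```

For example, a response scoring 0.95 on safety and 0.80 on fairness is blocked by this rule because fairness is under 0.85, regardless of how it scores on accuracy, which the rule does not gate.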

Related concepts

Evaluation modes — the scorers behind the scores.
The regeneration loop — what happens after a fail.
Policy engine — per-dim thresholds + action mapping.