Agent checkpoints

A single-turn chat surface has one place to enforce safety: between the LLM’s response and the user. An agent has three. Each catches a different class of failure, and skipping any of them leaves a known attack vector open. EvalGuard exposes the three as composable API surfaces you wire into your own agent loop: the firewall scan (checkpoints 1 and 3), the agent-policy engine (checkpoint 2), and the gateway proxy (which threads each checkpoint’s audit row back to the same agent run).

Checkpoint 1 — input injection scan

Where: before the agent’s first LLM call, applied to whatever the user (or upstream caller) supplied.

What it catches: prompt injection, meaning adversarial text designed to override the system prompt, exfiltrate secrets, or hijack the agent’s downstream actions. “Ignore previous instructions and ...” is the cartoon version; real attacks hide in Markdown, JSON values, URLs the agent will dereference, and file contents the agent loads.

Implementation: pass the user input through the deep injection grader (jailbreakSuccessDeepConfig) plus the firewall’s sub-3ms inline pattern check. If the combined verdict exceeds the policy threshold, block the request before any tool budget is spent.
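
The combination step can be sketched as follows. This is a minimal illustration, not EvalGuard’s actual logic: the ScanSignal and InputVerdict shapes, the 0-to-1 score scale, and the take-the-maximum rule are all assumptions made for the example, since this page does not specify how the two checks are combined.

```typescript
// Hypothetical shapes for illustration only; not EvalGuard's response format.
interface ScanSignal {
  source: "inline-pattern" | "deep-grader";
  score: number; // assumed scale: 0 (benign) to 1 (certain injection)
}

interface InputVerdict {
  blocked: boolean;
  score: number;
  decidedBy: ScanSignal["source"] | null;
}

// Assumed combination rule: the highest-scoring signal decides.
// Block only when that score is strictly above the policy threshold.
function combineVerdict(signals: ScanSignal[], threshold: number): InputVerdict {
  const top = signals.reduce<ScanSignal | null>(
    (best, s) => (best === null || s.score > best.score ? s : best),
    null,
  );
  const score = top === null ? 0 : top.score;
  return {
    blocked: score > threshold,
    score,
    decidedBy: top === null ? null : top.source,
  };
}
```

Taking the maximum means either check alone can block the request, which suits a fast pattern check paired with a slower grader. A weighted average would let one check dilute the other.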

Checkpoint 2 — tool-call gate

Where: between the agent deciding to call a tool and the tool actually running.

What it catches: off-policy actions — the agent decided to delete a file, send an email, transfer funds, hit an external API. Even if the user input was clean and the model is well-aligned, the action might still be out of scope for what the agent is allowed to do.

Implementation: gate each tool call by name + args against the agent-policy engine (the agent_policies table, served by /api/v1/gateway/policies). Each matching rule has an effect of allow or deny, plus an orthogonal requiresApproval flag the rule can set. The engine returns a PolicyDecision — allowed (boolean) and requiresApproval (boolean):

allowed: true, requiresApproval: false means proceed to execution.
allowed: true, requiresApproval: true means the rule allows the call but flags it for human-in-the-loop review. Your application pauses and routes to a human reviewer before executing. This is a boolean on the decision, not a separate verdict. Set it on a rule via conditions.requireApproval = true, a sensible default for any send_email, delete_*, or transfer_* tool.
allowed: false means the call is denied; return a refusal to the agent and log to audit. This is the default for anything outside the org’s policy set (deny-by-default = zero trust).
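
The three outcomes above can be handled with a small dispatcher in your own loop. The sketch below is one possible shape, not part of EvalGuard: the PolicyDecision fields come from this page, while GateOutcome, gateToolCall, and the fallback reason strings are hypothetical names introduced for the example.

```typescript
// Fields as described on this page; `reason` is treated as optional here (an assumption).
interface PolicyDecision {
  allowed: boolean;
  requiresApproval: boolean;
  reason?: string;
}

// Hypothetical outcome type for the application's own control flow.
type GateOutcome =
  | { kind: "execute" }
  | { kind: "await-approval"; reason: string }
  | { kind: "refuse"; reason: string };

function gateToolCall(decision: PolicyDecision): GateOutcome {
  // Check denial first so a denied call is never routed to a reviewer.
  if (!decision.allowed) {
    return { kind: "refuse", reason: decision.reason ?? "denied by policy" };
  }
  if (decision.requiresApproval) {
    return { kind: "await-approval", reason: decision.reason ?? "flagged for human review" };
  }
  return { kind: "execute" };
}
```

Modeling the outcome as a discriminated union lets the compiler check that the calling code handles all three branches.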

Checkpoint 3 — tool-result scan

Where: after the tool returns, before its result is fed back into the agent loop.

What it catches: poisoned tool outputs, such as a search result that contains an injection, an email with hostile reply text, or a fetched URL that returns adversarial HTML. The user’s input was clean, the tool call was on-policy, but the world fed something nasty back. Without this checkpoint that nasty payload becomes part of the next LLM turn’s context.

Implementation: the result string passes through the same injection grader as input, plus a PII redact pass (so any secret an external service accidentally returned doesn’t end up in the model’s context window). Redactions happen in place; the agent sees a sanitized result that’s safe to reason over.
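
To make the scan-then-redact order concrete, here is a rough local sketch. It is not EvalGuard’s redaction pass: the two regex patterns (email addresses and US-SSN-like numbers), the placeholder format, and the names redactPII and prepareToolResult are all hypothetical, and the injection score is assumed to come from a scan performed elsewhere.

```typescript
// Hypothetical patterns for illustration; a real PII pass covers far more cases.
const REDACTION_PATTERNS: Array<{ label: string; pattern: RegExp }> = [
  { label: "EMAIL", pattern: /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g },
  { label: "SSN", pattern: /\b\d{3}-\d{2}-\d{4}\b/g },
];

// Replace every match with a labeled placeholder and count the replacements.
function redactPII(text: string): { text: string; redactions: number } {
  let redactions = 0;
  let out = text;
  for (const { label, pattern } of REDACTION_PATTERNS) {
    out = out.replace(pattern, () => {
      redactions += 1;
      return `[REDACTED_${label}]`;
    });
  }
  return { text: out, redactions };
}

type PreparedResult =
  | { ok: true; text: string; redactions: number }
  | { ok: false; reason: string };

// Reject results that score above the threshold; otherwise return the redacted text.
// `injectionScore` is assumed to be supplied by the injection scan (0 to 1).
function prepareToolResult(raw: string, injectionScore: number, threshold: number): PreparedResult {
  if (injectionScore > threshold) {
    return { ok: false, reason: "tool result flagged as injection" };
  }
  return { ok: true, ...redactPII(raw) };
}
```

Only the value returned with ok: true should be appended to the model’s context; the raw string is discarded.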

What you wire up

There is no single SDK agent wrapper — you compose the three checkpoints from the API surfaces that genuinely exist. Each is a plain HTTP call you place at the corresponding point in your own agent loop. The illustrative wiring (pseudocode, not a real import):

agent loop (pseudocode)

// Checkpoint 1: scan the user input before the first LLM call.
const inputScan = await fetch("https://evalguard.ai/api/v1/firewall/check", {
  method: "POST",
  headers: { Authorization: "Bearer eg_live_...", "Content-Type": "application/json" },
  body: JSON.stringify({ input: userInput }),
}).then((res) => res.json()); // -> { blocked, category, score, hits[] }
// If inputScan.blocked is true, stop here, before any tool budget is spent.

// Checkpoint 2: gate each tool call against the agent-policy engine.
const decision = await fetch("https://evalguard.ai/api/v1/gateway/policies", {
  method: "POST",
  headers: { Authorization: "Bearer eg_live_...", "Content-Type": "application/json" },
  body: JSON.stringify({
    action: "evaluate",
    projectId: "proj_...",
    agent: "support-agent",
    request: { tool: "send_email", action: "send", resource: "..." },
  }),
}).then((res) => res.json()); // -> PolicyDecision { allowed, requiresApproval, reason }
// If decision.allowed is false, return a refusal to the agent.
// If decision.requiresApproval is true, pause for a human reviewer before executing.

// Checkpoint 3: re-scan the tool result before feeding it back to the model.
const resultScan = await fetch("https://evalguard.ai/api/v1/firewall/check", {
  method: "POST",
  headers: { Authorization: "Bearer eg_live_...", "Content-Type": "application/json" },
  body: JSON.stringify({ input: toolResult }),
}).then((res) => res.json()); // same endpoint as checkpoint 1
// If resultScan.blocked is true, do not feed toolResult back to the model.

When you proxy your LLM traffic through the AI gateway (/api/v1/gateway/proxy/...), pass an x-evalguard-run-id header so each checkpoint’s audit row threads back to the same agent run. See the API reference for the full request/response shapes.
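
One way to keep the run ID consistent is to build the headers in a single place. The helper below is a hypothetical convenience, not an EvalGuard API: withRunId and the example run ID format are made up for illustration, and only the x-evalguard-run-id header name comes from this page.

```typescript
// Hypothetical helper: returns a copy of `headers` with the run ID attached.
// The input object is not mutated, so base headers can be reused across runs.
function withRunId(headers: Record<string, string>, runId: string): Record<string, string> {
  if (runId.trim() === "") {
    throw new Error("runId must be a non-empty string");
  }
  return { ...headers, "x-evalguard-run-id": runId };
}

// Usage: build the headers once per agent run and pass them to each gateway request.
const baseHeaders = { Authorization: "Bearer eg_live_...", "Content-Type": "application/json" };
const runHeaders = withRunId(baseHeaders, "run_example_001"); // run ID format is an assumption
```

Generate the run ID once at the start of the agent run and reuse it for every request in that run; a fresh ID per request would defeat the threading.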

Why all three matter

The three checkpoints catch different attacks because the trust boundary is different at each stage. Input is from the user (untrusted). Tool calls are from the agent (untrusted output of a trusted model). Tool results are from the world (untrusted external). Treating any of those as trusted is the gap a serious attacker walks through.

A red-team result from our 2026-05-05 NeMo head-to-head: a multi-turn agent attack that bypassed every input-only firewall in our test set was caught only by checkpoint 3 (the tool-result scan saw the injected payload in a fetched URL). Skip the third checkpoint and that family of attacks is invisible.

Related concepts

Policy engine — what rules each checkpoint evaluates against.
Firewall vs scorer — inline vs LLM-judged at each checkpoint.
Red teaming — how to test all three checkpoints with adversarial inputs.