Agent checkpoints

A single-turn chat surface has one place to enforce safety: between the LLM's response and the user. An agent has three. Each catches a different class of failure, and skipping any of them leaves a known attack vector open. EvalGuard's agent layer wraps a model and its tools with all three.

Checkpoint 1 — input injection scan

Where: before the agent's first LLM call, applied to whatever the user (or upstream caller) supplied.

What it catches: prompt injection — adversarial text designed to override the system prompt, exfiltrate secrets, or hijack the agent's downstream actions. "Ignore previous instructions and ..." is the cartoon version; real attacks hide in Markdown, JSON values, URLs the agent will dereference, file contents the agent loads.

Implementation: pass the user input through the deep injection grader (jailbreakSuccessDeepConfig) plus the firewall's sub-3ms inline pattern check. If the combined verdict exceeds the policy threshold, the request is blocked before any tool budget is spent.
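
A minimal sketch of how the two signals might be combined into one verdict. Everything here is an assumption made for illustration: scanInput, InputVerdict, the two example patterns, the max-based combination rule, and the 0.5 default threshold are hypothetical and are not taken from the EvalGuard SDK. The deep grader's score is passed in as a plain number so the example stays self-contained.

```typescript
// Hypothetical sketch: none of these names come from the EvalGuard SDK.
interface InputVerdict {
  blocked: boolean;
  score: number; // combined score in [0, 1]
  reasons: string[];
}

// Cheap inline patterns standing in for a fast firewall-style check.
const INLINE_PATTERNS: RegExp[] = [
  /ignore\s+(all\s+)?(previous|prior)\s+instructions/i,
  /reveal\s+(your\s+)?system\s+prompt/i,
];

function inlinePatternHit(text: string): boolean {
  return INLINE_PATTERNS.some((p) => p.test(text));
}

// deepScore is assumed to be a grader output in [0, 1].
function scanInput(
  text: string,
  deepScore: number,
  threshold: number = 0.5
): InputVerdict {
  const reasons: string[] = [];
  const patternHit = inlinePatternHit(text);
  if (patternHit) reasons.push("inline-pattern");
  if (deepScore > threshold) reasons.push("deep-grader");
  // Assumed combination rule: a pattern hit counts as a score of 1.
  const score = Math.max(patternHit ? 1 : 0, deepScore);
  return { blocked: score > threshold, score, reasons };
}
```

Taking the maximum means either signal alone can block the request, which suits a checkpoint whose job is to stop the run before any tool executes. A weighted sum would be a reasonable alternative if the grader is noisy.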

Checkpoint 2 — tool-call gate

Where: between the agent deciding to call a tool and the tool actually running.

What it catches: off-policy actions — the agent decided to delete a file, send an email, transfer funds, hit an external API. Even if the user input was clean and the model is well-aligned, the action might still be out of scope for what the agent is allowed to do.

Implementation: gate each tool call by name and arguments against a policy. Allowed tools pass through transparently; flagged tools pause the agent until a human approves the call; blocked tools fail the call. The gate has three verdicts:

  • allow — proceed to execution.
  • flag — pause, write to agent_pending_approvals, wait for human-in-the-loop. Default for any send_email, delete_*, or transfer_* tool.
  • block — return a refusal to the agent, log to audit. Default for anything outside the project's allowlist.
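
The three verdicts can be sketched as a small rule table with a default of block. This is an illustrative assumption rather than the SDK's implementation: PolicyRule, DEFAULT_RULES, and gateToolCall are hypothetical names, the trailing "*" is treated as a simple prefix wildcard, and the sketch matches on tool name only, leaving argument checks out for brevity.

```typescript
// Hypothetical sketch: names and rule format are not from the EvalGuard SDK.
type Verdict = "allow" | "flag" | "block";

interface PolicyRule {
  pattern: string; // exact tool name, or a prefix ending in "*"
  verdict: Verdict;
}

const DEFAULT_RULES: PolicyRule[] = [
  { pattern: "search", verdict: "allow" },
  { pattern: "send_email", verdict: "flag" },
  { pattern: "delete_*", verdict: "flag" },
  { pattern: "transfer_*", verdict: "flag" },
];

// A trailing "*" matches any tool name that starts with the prefix.
function matchesPattern(pattern: string, toolName: string): boolean {
  if (pattern.slice(-1) === "*") {
    const prefix = pattern.slice(0, -1);
    return toolName.slice(0, prefix.length) === prefix;
  }
  return pattern === toolName;
}

// First matching rule wins; anything unlisted is outside the allowlist.
function gateToolCall(
  toolName: string,
  rules: PolicyRule[] = DEFAULT_RULES
): Verdict {
  for (const rule of rules) {
    if (matchesPattern(rule.pattern, toolName)) return rule.verdict;
  }
  return "block";
}
```

Defaulting to block keeps the gate fail-closed: a tool added to the registry without a matching rule is refused rather than silently allowed.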

Checkpoint 3 — tool-result scan

Where: after the tool returns, before its result is fed back into the agent loop.

What it catches: poisoned tool outputs — a search result that contains injection, an email with hostile reply-text, a fetched URL that returns adversarial HTML. The user's input was clean, the tool call was on-policy, but the world fed something nasty back. Without this checkpoint that nasty payload becomes part of the next LLM turn's context.

Implementation: the result string passes through the same injection grader as the input, plus a PII redaction pass (so any secret an external service accidentally returned doesn't end up in the model's context window). Redactions happen in place; the agent sees a sanitized result that's safe to reason over.
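
A rough sketch of the redact-then-forward step, under stated assumptions: scanToolResult, the two redaction patterns (email addresses and an "sk-" style key), and the single injection regex are hypothetical stand-ins. A production grader and PII detector would be far more thorough than a handful of regular expressions.

```typescript
// Hypothetical sketch: patterns and names are illustrative only.
interface ToolResultScan {
  sanitized: string;
  redactions: number;
  injectionSuspected: boolean;
}

const REDACTORS: Array<{ label: string; pattern: RegExp }> = [
  { label: "EMAIL", pattern: /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g },
  { label: "API_KEY", pattern: /\bsk-[A-Za-z0-9]{16,}\b/g },
];

// Stand-in for the injection grader: one cartoon-level pattern.
const INJECTION_HINT = /ignore\s+(all\s+)?(previous|prior)\s+instructions/i;

function scanToolResult(raw: string): ToolResultScan {
  let redactions = 0;
  let sanitized = raw;
  for (const { label, pattern } of REDACTORS) {
    // Replace every match with a labeled placeholder and count it.
    sanitized = sanitized.replace(pattern, () => {
      redactions += 1;
      return "[REDACTED_" + label + "]";
    });
  }
  return {
    sanitized,
    redactions,
    injectionSuspected: INJECTION_HINT.test(raw),
  };
}
```

The caller would feed only sanitized back into the agent loop, and drop or quarantine the result when injectionSuspected is true.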

What you wire up

The @evalguard/sdk agent helper wraps your tool registry with all three checkpoints. Pseudocode:

import { agentLoop } from "@evalguard/sdk";

// sendEmailFn, searchFn, deleteFn, and userInput are defined by your application.
const result = await agentLoop({
  model: "openai/gpt-4o-mini",
  tools: {
    send_email: { handler: sendEmailFn, policy: "flag" },  // pauses for human approval
    search:     { handler: searchFn,    policy: "allow" }, // proceeds to execution
    delete:     { handler: deleteFn,    policy: "block" }, // refused and logged to audit
  },
  prompt: userInput,
  checkpoints: ["input", "tool-call", "tool-result"], // all three by default
});

For HTTP-level integration without the SDK, the three checkpoints each have their own API endpoint — firewall (input), the gateway's tool-call hook, and the audit-log writer. See the API reference for the full surface.

Why all three matter

The three checkpoints catch different attacks because the trust boundary is different at each stage. Input is from the user (untrusted). Tool calls are from the agent (untrusted output of a trusted model). Tool results are from the world (untrusted external). Treating any of those as trusted is the gap a serious attacker walks through.

A red-team result from our 2026-05-05 NeMo head-to-head: a multi-turn agent attack that bypassed every input-only firewall in our test set was caught only by checkpoint 3 (the tool-result scan saw the injected payload in a fetched URL). Skip the third checkpoint and that family of attacks is invisible.
