CI/CD Quality Gating
Block deployments when LLM evaluation scores drop. The evalguard gate CLI command runs an eval, compares the pass rate to a threshold, and exits non-zero when the threshold isn't met — any CI system that respects exit codes will fail the build automatically.
What this closes vs competitors
- Arize AX: ships CI/CD experiment gating as an Enterprise-only feature. We ship it free in the CLI.
- Promptfoo: ships a GitHub Action only. We support GitHub Actions, GitLab CI, CircleCI, Jenkins, and Bitbucket Pipelines from the same CLI.
- LangSmith: doesn't ship a gate command — gating requires custom scripts that poll their API. Our exit-code contract is the standard CI primitive.
The contract — exit codes that any CI can read
evalguard gate follows the standard Unix exit-code convention. Every CI platform treats non-zero exit codes as build failures automatically — no custom parsing required.
| Exit code | Meaning | CI behavior |
|---|---|---|
| 0 | Pass rate ≥ threshold (or zero failures in --strict) | Build passes |
| 1 | Pass rate < threshold | Build fails, deploy blocked |
| 2+ | Configuration error (missing API key, bad config file, network failure) | Build fails — distinguishes infra failure from quality failure |
CLI flags
| Flag | Default | Description |
|---|---|---|
| --threshold <0.0-1.0> | 0.9 | Minimum pass rate to allow deploy |
| --config <file> | auto-detect | Path to evalguard.json / .yaml / .yml |
| --model <model> | gpt-4o-mini | Model to evaluate |
| --provider <name> | auto-detect | Provider override (openai/anthropic/...) |
| --suite <name> | general | Built-in suite: faithfulness / safety / hallucination / general |
| --strict | false | Fail on any single test failure (ignore threshold) |
| --json | false | Output results as JSON for CI parsing |
Built-in suites — no config file needed
For quick gates, four pre-built suites ship with the CLI. Each has 5-10 reference test cases. Use --suite <name> and no config file:
Faithfulness
--suite faithfulnessFactual accuracy on standard knowledge questions (capitals, math, history).
Safety
--suite safetyRefusal on harmful requests (weapons, malware, illegal acts).
Hallucination
--suite hallucinationCalibration on uncertain questions — should answer or say 'I don't know'.
General Quality
--suite generalMixed-domain factual accuracy. Default when no suite is specified.
Custom evaluation config
For project-specific test cases, point --config at a JSON or YAML file:
# evalguard.yaml
name: "QA Regression Suite"
model: gpt-4o-mini
prompt: "You are a helpful customer service agent. Answer: {{input}}"
scorers: ["exact-match", "faithfulness", "no-pii"]
cases:
- input: "What are your business hours?"
expected_output: "9 AM to 5 PM"
- input: "What is your refund policy?"
expected_output: "30-day money-back guarantee"
- input: "How do I cancel my subscription?"
expected_output: "Visit /account/cancel"Then in CI:
evalguard gate --config evalguard.yaml --threshold 0.95
Platform templates
Drop-in templates for the major CI platforms. Each runs on every pull request, fails the build when the pass rate dips below 90%, and uploads the JSON result as an artifact.
GitHub Actions
.github/workflows/eval-gate.yml
name: LLM Quality Gate
on:
pull_request:
branches: [main]
jobs:
gate:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/setup-node@v4
with:
node-version: "20"
- name: Install EvalGuard CLI
run: npm install -g @evalguard/cli
- name: Run quality gate
env:
EVALGUARD_API_KEY: ${{ secrets.EVALGUARD_API_KEY }}
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
run: |
evalguard gate \
--threshold 0.9 \
--model gpt-4o-mini \
--suite faithfulness \
--json > gate-result.json
- name: Comment results on PR
if: always()
uses: actions/github-script@v7
with:
script: |
const fs = require('fs');
const r = JSON.parse(fs.readFileSync('gate-result.json'));
const body = `### LLM Quality Gate ${r.gate === 'PASS' ? '✅' : '❌'}
**Pass rate:** ${(r.passRate * 100).toFixed(1)}% (threshold: ${(r.threshold * 100).toFixed(0)}%)
**Results:** ${r.passed} passed / ${r.failed} failed / ${r.total} total
**Model:** ${r.model}
**Latency:** ${r.latencyMs}ms`;
github.rest.issues.createComment({
issue_number: context.issue.number,
owner: context.repo.owner,
repo: context.repo.repo,
body,
});GitLab CI
.gitlab-ci.yml
stages:
- quality-gate
llm-gate:
stage: quality-gate
image: node:20
before_script:
- npm install -g @evalguard/cli
script:
- >
evalguard gate
--threshold 0.9
--suite faithfulness
--json > gate-result.json
artifacts:
when: always
paths:
- gate-result.json
reports:
# Surfaces in GitLab's merge-request UI under "Quality reports"
codequality: gate-result.json
only:
- merge_requestsCircleCI
.circleci/config.yml
version: 2.1
jobs:
llm-gate:
docker:
- image: cimg/node:20.10
steps:
- checkout
- run:
name: Install EvalGuard CLI
command: npm install -g @evalguard/cli
- run:
name: Run quality gate
command: |
evalguard gate \
--threshold 0.9 \
--suite faithfulness \
--json > gate-result.json
- store_artifacts:
path: gate-result.json
destination: eval-gate
workflows:
pull-request:
jobs:
- llm-gate:
filters:
branches:
ignore: mainJenkins
Jenkinsfile
pipeline {
agent any
environment {
EVALGUARD_API_KEY = credentials('evalguard-api-key')
OPENAI_API_KEY = credentials('openai-api-key')
}
stages {
stage('Install') {
steps {
sh 'npm install -g @evalguard/cli'
}
}
stage('LLM Quality Gate') {
steps {
sh '''
evalguard gate \
--threshold 0.9 \
--suite faithfulness \
--json > gate-result.json
'''
archiveArtifacts artifacts: 'gate-result.json', fingerprint: true
}
}
}
post {
failure {
// exitCode 1 = gate failed; Jenkins fails the build automatically
mail to: 'eng@example.com', subject: 'LLM gate failed on ${env.JOB_NAME}'
}
}
}Bitbucket Pipelines
bitbucket-pipelines.yml
image: node:20
pipelines:
pull-requests:
'**':
- step:
name: LLM Quality Gate
caches:
- node
script:
- npm install -g @evalguard/cli
- >
evalguard gate
--threshold 0.9
--suite faithfulness
--json > gate-result.json
artifacts:
- gate-result.jsonJSON output format (for custom integrations)
With --json, the gate command emits a machine-readable result to stdout. The exit code still follows the contract; the JSON is for surfacing details (Slack bot, custom dashboard, etc.).
{
"gate": "PASS",
"threshold": 0.9,
"passRate": 0.95,
"passed": 19,
"failed": 1,
"total": 20,
"score": 19,
"maxScore": 20,
"model": "gpt-4o-mini",
"latencyMs": 12847
}Comparison shortcuts
Three common gate patterns for different release postures:
Hard gate (production releases)
evalguard gate --threshold 0.95 --strict
95% threshold AND zero failures. Use for production-bound merges.
Soft gate (PR feedback)
evalguard gate --threshold 0.8 --json | tee gate.json
80% threshold + JSON output. Comments the result on the PR but lets maintainers override.
Safety-only gate (red-team focused)
evalguard gate --suite safety --threshold 1.0 --strict
Safety suite + 100% threshold + strict mode. Any single jailbreak succeeding fails the build. Use for customer-facing LLM features.