Skip to content

CI/CD Quality Gating

Block deployments when LLM evaluation scores drop. The evalguard gate CLI command runs an eval, compares the pass rate to a threshold, and exits non-zero when the threshold isn't met — any CI system that respects exit codes will fail the build automatically.

What this closes vs competitors

  • Arize AX: ships CI/CD experiment gating as an Enterprise-only feature. We ship it free in the CLI.
  • Promptfoo: ships a GitHub Action only. We support GitHub Actions, GitLab CI, CircleCI, Jenkins, and Bitbucket Pipelines from the same CLI.
  • LangSmith: doesn't ship a gate command — gating requires custom scripts that poll their API. Our exit-code contract is the standard CI primitive.

The contract — exit codes that any CI can read

evalguard gate follows the standard Unix exit-code convention. Every CI platform treats non-zero exit codes as build failures automatically — no custom parsing required.

Exit codeMeaningCI behavior
0Pass rate ≥ threshold (or zero failures in --strict)Build passes
1Pass rate < thresholdBuild fails, deploy blocked
2+Configuration error (missing API key, bad config file, network failure)Build fails — distinguishes infra failure from quality failure

CLI flags

FlagDefaultDescription
--threshold <0.0-1.0>0.9Minimum pass rate to allow deploy
--config <file>auto-detectPath to evalguard.json / .yaml / .yml
--model <model>gpt-4o-miniModel to evaluate
--provider <name>auto-detectProvider override (openai/anthropic/...)
--suite <name>generalBuilt-in suite: faithfulness / safety / hallucination / general
--strictfalseFail on any single test failure (ignore threshold)
--jsonfalseOutput results as JSON for CI parsing

Built-in suites — no config file needed

For quick gates, four pre-built suites ship with the CLI. Each has 5-10 reference test cases. Use --suite <name> and no config file:

Faithfulness

--suite faithfulness

Factual accuracy on standard knowledge questions (capitals, math, history).

Safety

--suite safety

Refusal on harmful requests (weapons, malware, illegal acts).

Hallucination

--suite hallucination

Calibration on uncertain questions — should answer or say 'I don't know'.

General Quality

--suite general

Mixed-domain factual accuracy. Default when no suite is specified.

Custom evaluation config

For project-specific test cases, point --config at a JSON or YAML file:

# evalguard.yaml
name: "QA Regression Suite"
model: gpt-4o-mini
prompt: "You are a helpful customer service agent. Answer: {{input}}"
scorers: ["exact-match", "faithfulness", "no-pii"]
cases:
  - input: "What are your business hours?"
    expected_output: "9 AM to 5 PM"
  - input: "What is your refund policy?"
    expected_output: "30-day money-back guarantee"
  - input: "How do I cancel my subscription?"
    expected_output: "Visit /account/cancel"

Then in CI:

evalguard gate --config evalguard.yaml --threshold 0.95

Platform templates

Drop-in templates for the major CI platforms. Each runs on every pull request, fails the build when the pass rate dips below 90%, and uploads the JSON result as an artifact.

GitHub Actions

.github/workflows/eval-gate.yml

name: LLM Quality Gate

on:
  pull_request:
    branches: [main]

jobs:
  gate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-node@v4
        with:
          node-version: "20"

      - name: Install EvalGuard CLI
        run: npm install -g @evalguard/cli

      - name: Run quality gate
        env:
          EVALGUARD_API_KEY: ${{ secrets.EVALGUARD_API_KEY }}
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          evalguard gate \
            --threshold 0.9 \
            --model gpt-4o-mini \
            --suite faithfulness \
            --json > gate-result.json

      - name: Comment results on PR
        if: always()
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const r = JSON.parse(fs.readFileSync('gate-result.json'));
            const body = `### LLM Quality Gate ${r.gate === 'PASS' ? '✅' : '❌'}

            **Pass rate:** ${(r.passRate * 100).toFixed(1)}% (threshold: ${(r.threshold * 100).toFixed(0)}%)
            **Results:** ${r.passed} passed / ${r.failed} failed / ${r.total} total
            **Model:** ${r.model}
            **Latency:** ${r.latencyMs}ms`;
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body,
            });

GitLab CI

.gitlab-ci.yml

stages:
  - quality-gate

llm-gate:
  stage: quality-gate
  image: node:20
  before_script:
    - npm install -g @evalguard/cli
  script:
    - >
      evalguard gate
      --threshold 0.9
      --suite faithfulness
      --json > gate-result.json
  artifacts:
    when: always
    paths:
      - gate-result.json
    reports:
      # Surfaces in GitLab's merge-request UI under "Quality reports"
      codequality: gate-result.json
  only:
    - merge_requests

CircleCI

.circleci/config.yml

version: 2.1

jobs:
  llm-gate:
    docker:
      - image: cimg/node:20.10
    steps:
      - checkout
      - run:
          name: Install EvalGuard CLI
          command: npm install -g @evalguard/cli
      - run:
          name: Run quality gate
          command: |
            evalguard gate \
              --threshold 0.9 \
              --suite faithfulness \
              --json > gate-result.json
      - store_artifacts:
          path: gate-result.json
          destination: eval-gate

workflows:
  pull-request:
    jobs:
      - llm-gate:
          filters:
            branches:
              ignore: main

Jenkins

Jenkinsfile

pipeline {
  agent any
  environment {
    EVALGUARD_API_KEY = credentials('evalguard-api-key')
    OPENAI_API_KEY = credentials('openai-api-key')
  }
  stages {
    stage('Install') {
      steps {
        sh 'npm install -g @evalguard/cli'
      }
    }
    stage('LLM Quality Gate') {
      steps {
        sh '''
          evalguard gate \
            --threshold 0.9 \
            --suite faithfulness \
            --json > gate-result.json
        '''
        archiveArtifacts artifacts: 'gate-result.json', fingerprint: true
      }
    }
  }
  post {
    failure {
      // exitCode 1 = gate failed; Jenkins fails the build automatically
      mail to: 'eng@example.com', subject: 'LLM gate failed on ${env.JOB_NAME}'
    }
  }
}

Bitbucket Pipelines

bitbucket-pipelines.yml

image: node:20

pipelines:
  pull-requests:
    '**':
      - step:
          name: LLM Quality Gate
          caches:
            - node
          script:
            - npm install -g @evalguard/cli
            - >
              evalguard gate
              --threshold 0.9
              --suite faithfulness
              --json > gate-result.json
          artifacts:
            - gate-result.json

JSON output format (for custom integrations)

With --json, the gate command emits a machine-readable result to stdout. The exit code still follows the contract; the JSON is for surfacing details (Slack bot, custom dashboard, etc.).

{
  "gate": "PASS",
  "threshold": 0.9,
  "passRate": 0.95,
  "passed": 19,
  "failed": 1,
  "total": 20,
  "score": 19,
  "maxScore": 20,
  "model": "gpt-4o-mini",
  "latencyMs": 12847
}

Comparison shortcuts

Three common gate patterns for different release postures:

Hard gate (production releases)

evalguard gate --threshold 0.95 --strict

95% threshold AND zero failures. Use for production-bound merges.

Soft gate (PR feedback)

evalguard gate --threshold 0.8 --json | tee gate.json

80% threshold + JSON output. Comments the result on the PR but lets maintainers override.

Safety-only gate (red-team focused)

evalguard gate --suite safety --threshold 1.0 --strict

Safety suite + 100% threshold + strict mode. Any single jailbreak succeeding fails the build. Use for customer-facing LLM features.