Prompt Optimizer

Closed-loop prompt optimization powered by 4 mutation strategies and meta-LLM scoring. Submit a prompt + an eval dataset and pick a strategy; the optimizer iterates variants, runs each through your evals, and surfaces the highest-scoring prompt — along with a full changelog so you can audit how it got there.

Gaps this closes vs. competitors

  • Arize AX: ships the equivalent (AI-powered prompt builder with meta-prompting) as Enterprise-only. We ship it in the Pro tier.
  • LangSmith Prompt Hub: ships versioning + tags but no optimizer — manual prompt iteration only.
  • DeepEval: ships G-Eval rubrics for grading prompts but no closed-loop variant search.
  • Promptfoo: ships prompt comparison via YAML config — every variant is human-authored. We auto-generate variants and run them through your evals.

How it works — closed-loop optimization

Three components running in concert:

  1. 4 mutation strategies generate prompt variants from the baseline (see strategy catalog below). You pick one strategy per run — iterative refinement, genetic search, few-shot injection, or constraint tightening — and each mutates a different dimension of the prompt.
  2. Eval-driven search loop drives the chosen strategy. Iterative refinement and constraint tightening rewrite-then-re-evaluate until the target score or iteration cap is reached; the genetic strategy evolves a population with elitism, crossover, and mutation across generations.
  3. Meta-LLM scorer runs each variant against your eval cases and grades the outputs with your chosen scorers. The mean score (0–1) across all cases is the reward signal that drives the search. Optimization stops when the score reaches the configured targetScore (default 0.95) or the run hits the maxIterations cap (default 10), whichever comes first.
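The loop described above can be sketched roughly as follows. This is a minimal illustration, not the optimizer's actual implementation: `optimize`, `mean_score`, `RunResult`, and the `mutate` and `score_case` callables are hypothetical names, and keeping only strictly better candidates is an assumption. Only the two stop conditions (targetScore 0.95, maxIterations 10) and the mean-score reward come from the description.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

EvalCase = Tuple[str, str]  # (input, expectedOutput)


@dataclass
class RunResult:
    best_prompt: str
    best_score: float
    iterations: int
    changelog: List[str] = field(default_factory=list)


def mean_score(prompt: str, cases: List[EvalCase],
               score_case: Callable[[str, EvalCase], float]) -> float:
    """Reward signal: mean of the per-case scores, each expected in [0, 1]."""
    return sum(score_case(prompt, case) for case in cases) / len(cases)


def optimize(baseline: str, cases: List[EvalCase],
             mutate: Callable[[str], str],
             score_case: Callable[[str, EvalCase], float],
             target_score: float = 0.95,
             max_iterations: int = 10) -> RunResult:
    """Rewrite, re-evaluate, and stop at the target score or the iteration cap."""
    best_prompt = baseline
    best_score = mean_score(baseline, cases, score_case)
    changelog: List[str] = []
    iterations = 0
    while iterations < max_iterations and best_score < target_score:
        iterations += 1
        candidate = mutate(best_prompt)  # hypothetical strategy step
        candidate_score = mean_score(candidate, cases, score_case)
        if candidate_score > best_score:  # assumption: keep only improvements
            changelog.append(
                f"Iteration {iterations}: {best_score:.2f} -> {candidate_score:.2f}")
            best_prompt, best_score = candidate, candidate_score
    return RunResult(best_prompt, best_score, iterations, changelog)
```

In a real run, `mutate` would call the meta-LLM and `score_case` would call the selected scorers; here they are plain callables so the control flow can be read in isolation.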

4 strategies

Each strategy ships out-of-the-box; pick one per request via the strategy field.

Iterative refinement

iterative-refinement

A meta-LLM diagnoses where the current prompt fails on your eval cases, rewrites it to address those weaknesses, re-evaluates, and repeats until the target score or iteration cap is hit.

Example: 'You are a helpful assistant' → 'You are a senior customer-support specialist for SaaS billing. Reply in ≤2 sentences. Never invent policies.' The rewritten prompt scores higher on the same eval set. Config: maxIterations, targetScore.

Genetic algorithm

genetic

Population-based search: generate a population of prompt variants, evaluate each, keep the top performers (elitism), then crossover and mutate to produce the next generation.

Config: populationSize (default 8), mutationRate (default 0.3), crossoverRate (default 0.5), eliteCount (default 2). Best variant across all generations wins.
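One generation of such a search might look like the sketch below. It assumes a prompt is represented as a tuple of instruction lines and uses single-point crossover; `next_generation`, `crossover`, `fitness`, and `mutate` are hypothetical names, and the parent-selection rule (top half of the population) is an assumption. Only the default rates and the elite count are taken from the config above.

```python
import random
from typing import Callable, List, Tuple

Prompt = Tuple[str, ...]  # assumption: a prompt is a tuple of instruction lines


def crossover(a: Prompt, b: Prompt, rng: random.Random) -> Prompt:
    """Single-point crossover: the head of one parent plus the tail of the other."""
    cut = rng.randint(0, min(len(a), len(b)))
    return a[:cut] + b[cut:]


def next_generation(population: List[Prompt],
                    fitness: Callable[[Prompt], float],
                    mutate: Callable[[Prompt], Prompt],
                    rng: random.Random,
                    mutation_rate: float = 0.3,
                    crossover_rate: float = 0.5,
                    elite_count: int = 2) -> List[Prompt]:
    """Build the next population; assumes at least two prompts in the population."""
    ranked = sorted(population, key=fitness, reverse=True)
    next_gen = ranked[:elite_count]               # elitism: top performers survive as-is
    parents = ranked[:max(2, len(ranked) // 2)]   # assumption: breed from the top half
    while len(next_gen) < len(population):
        a, b = rng.sample(parents, 2)
        child = crossover(a, b, rng) if rng.random() < crossover_rate else a
        if rng.random() < mutation_rate:
            child = mutate(child)
        next_gen.append(child)
    return next_gen
```

Because the elites are copied unchanged, the best fitness in the population never decreases from one generation to the next, which is what makes "best variant across all generations" cheap to track.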

Few-shot injection

few-shot-injection

Automatically selects the most useful few-shot examples from your eval cases and injects them into the prompt, measuring each example's contribution to the score.

Config: maxExamples (default 5). The optimizer picks the examples that lift the score the most and drops the rest.
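A greedy forward selection is one plausible way to pick examples by their contribution to the score. The sketch below is an illustration under that assumption; the source does not say which selection algorithm the optimizer uses. `select_few_shot` and `score_with` are hypothetical names, and `score_with` stands in for a full eval run with the given examples injected.

```python
from typing import Callable, List, Tuple


def select_few_shot(candidates: List[str],
                    score_with: Callable[[List[str]], float],
                    max_examples: int = 5) -> Tuple[List[str], float]:
    """Greedily add the example with the largest score lift until nothing helps."""
    chosen: List[str] = []
    best = score_with(chosen)  # score of the prompt with no examples injected
    remaining = list(candidates)
    while remaining and len(chosen) < max_examples:
        # Score each remaining example as if it were added to the current set.
        scored = [(score_with(chosen + [ex]), i) for i, ex in enumerate(remaining)]
        top_score, idx = max(scored)
        if top_score <= best:  # no remaining example lifts the score: drop the rest
            break
        chosen.append(remaining.pop(idx))
        best = top_score
    return chosen, best
```

Each round costs one eval run per remaining candidate, so in practice the candidate pool would likely be capped or pre-filtered to stay under the cost ceiling.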

Constraint tightening

constraint-tightening

Analyzes the failing eval cases and adds specific constraints / guardrails to the prompt to prevent those failure modes.

Example: repeated policy hallucinations → the optimizer adds 'Never invent policies; if unsure, say you don't know' and re-evaluates against the eval set.
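As a rough illustration of the idea, the sketch below maps recurring failure modes to guardrail sentences and appends the ones the prompt lacks. Everything here is hypothetical: the failure tags, the `CONSTRAINTS` table, `tighten`, and the `min_repeats` threshold are invented for the example, and in the real optimizer a meta-LLM would presumably diagnose the failures and write the constraints.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical lookup table from a failure tag to a guardrail sentence.
CONSTRAINTS: Dict[str, str] = {
    "policy-hallucination": "Never invent policies; if unsure, say you don't know.",
    "too-long": "Reply in at most 2 sentences.",
}


def tighten(prompt: str, failure_tags: List[str],
            min_repeats: int = 2) -> Tuple[str, List[str]]:
    """Append a guardrail for every failure mode seen at least min_repeats times."""
    added: List[str] = []
    for tag, count in Counter(failure_tags).most_common():
        rule = CONSTRAINTS.get(tag)
        if rule and count >= min_repeats and rule not in prompt:
            prompt = f"{prompt} {rule}"
            added.append(rule)
    return prompt, added
```

After tightening, the new prompt would be re-scored on the eval set, and the constraint kept only if the failures it targets actually go away.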

API

POST /api/v1/prompts/optimize

{
  "projectId": "a1b2c3d4-0000-0000-0000-000000000000",
  "prompt": "You are a helpful assistant. Answer: {{input}}",
  "strategy": "iterative-refinement",
  "evalCases": [
    { "input": "What's our refund policy?", "expectedOutput": "30 days" },
    { "input": "How do I cancel?", "expectedOutput": "/account/cancel" }
  ],
  "scorers": ["answer-relevance", "faithfulness"],
  "targetModel": "gpt-4o-mini",
  "maxIterations": 10,
  "targetScore": 0.95,
  "costCeilingUsd": 5
}

Response:

{
  "optimizedPrompt": "You are a senior customer-support specialist...",
  "originalScore": 0.64,
  "optimizedScore": 0.96,
  "improvementPercent": 50,
  "strategy": "iterative-refinement",
  "iterations": 8,
  "changelog": [
    "Iteration 1: tightened role + tone (0.64 → 0.79)",
    "Iteration 4: added refund-policy constraint (0.79 → 0.93)",
    "Iteration 8: target score reached (0.96)"
  ],
  "durationMs": 18420,
  "costUsd": 0.0123
}
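The sample values are consistent with improvementPercent being the relative gain over the original score: (0.96 − 0.64) / 0.64 = 0.5, or 50%. That reading is an inference from the numbers in the sample, not a documented formula, and the helper name below is hypothetical.

```python
def improvement_percent(original_score: float, optimized_score: float) -> float:
    """Relative gain over the original score, as a percentage (assumed formula)."""
    if original_score <= 0:
        raise ValueError("original_score must be positive to compute a relative gain")
    return round((optimized_score - original_score) / original_score * 100, 2)


# Values from the sample response above.
print(improvement_percent(0.64, 0.96))  # 50.0
```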

curl

curl -X POST https://evalguard.ai/api/v1/prompts/optimize \
  -H "Authorization: Bearer $EVALGUARD_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "projectId": "a1b2c3d4-0000-0000-0000-000000000000",
    "prompt": "You are a helpful assistant. Answer: {{input}}",
    "strategy": "iterative-refinement",
    "evalCases": [
      { "input": "What'\''s our refund policy?", "expectedOutput": "30 days" }
    ],
    "scorers": ["answer-relevance", "faithfulness"],
    "targetModel": "gpt-4o-mini",
    "maxIterations": 10,
    "targetScore": 0.95
  }'
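The same request can be built from Python with only the standard library. This is a sketch that mirrors the curl call; the helper names `build_optimize_request` and `send` and the 120-second timeout are assumptions, and the endpoint, headers, and payload fields are copied from the example above.

```python
import json
import os
import urllib.request

API_URL = "https://evalguard.ai/api/v1/prompts/optimize"


def build_optimize_request(payload: dict, api_key: str) -> urllib.request.Request:
    """Build the POST request with the same headers and body as the curl example."""
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 120.0) -> dict:
    """Send the request and decode the JSON response (timeout is an assumed value)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


payload = {
    "projectId": "a1b2c3d4-0000-0000-0000-000000000000",
    "prompt": "You are a helpful assistant. Answer: {{input}}",
    "strategy": "iterative-refinement",
    "evalCases": [{"input": "What's our refund policy?", "expectedOutput": "30 days"}],
    "scorers": ["answer-relevance", "faithfulness"],
    "targetModel": "gpt-4o-mini",
    "maxIterations": 10,
    "targetScore": 0.95,
}
request = build_optimize_request(payload, os.environ.get("EVALGUARD_API_KEY", ""))
# result = send(request)  # uncomment to perform the network call
```

Building the request separately from sending it keeps the payload easy to inspect or log before any spend is incurred.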

Dashboard UI

The same optimizer is available in the dashboard at /dashboard/prompts/optimize. Paste in a prompt and an eval dataset, pick a strategy, and watch the optimizer converge on the best prompt. The result shows the optimized prompt, the original and optimized scores, the improvement percent, and the per-iteration changelog, so you can audit the optimizer's choices.

Best practices

  • Eval cases come first. The optimizer is only as good as your eval. Build a real test set of 50 to 200 cases before optimizing; with a single-digit eval set, the optimizer ends up fitting noise.
  • Pick the strategy that fits the failure. iterative-refinement is a good default. Use constraint-tightening when the prompt keeps making the same mistake, or few-shot-injection when worked examples help.
  • Set a cost ceiling. costCeilingUsd caps total LLM spend on the optimization run itself (default $5, hard cap $100). The run aborts the moment cumulative spend crosses it.
  • Audit the changelog. The optimizer returns a changelog (one entry per iteration) plus the original and optimized scores. Sharp early gains followed by a plateau mean the run has converged. A slow, steady climb means the run was still improving when it stopped, so increase maxIterations.
  • Save winning variants to the Prompt Hub. Once converged, push the winner to your prompt registry with a version tag (production) so it's versioned alongside your other prompt deployments.
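The changelog audit can be automated with a simple heuristic over the per-iteration scores. The sketch below is one way to do it; `classify_trajectory`, the three-iteration window, and the 0.01 flatness threshold are all assumptions chosen for illustration, not documented optimizer behavior.

```python
from typing import List


def classify_trajectory(scores: List[float],
                        window: int = 3,
                        flat_eps: float = 0.01) -> str:
    """Label a run from its score history (baseline first, then one score per iteration).

    Returns "converged" when the last `window` iterations gained at most `flat_eps`,
    "still-climbing" when they gained more, and "too-short" when there is not
    enough history to judge.
    """
    if len(scores) < window + 1:
        return "too-short"
    recent_gain = scores[-1] - scores[-1 - window]
    return "converged" if recent_gain <= flat_eps else "still-climbing"


# Sharp early gains, then flat: treat the run as converged.
print(classify_trajectory([0.64, 0.79, 0.90, 0.93, 0.93, 0.93, 0.93]))
# Slow steady climb: consider raising maxIterations.
print(classify_trajectory([0.60, 0.64, 0.68, 0.72, 0.76]))
```

A "still-climbing" result on a run that stopped at the iteration cap is the signal to rerun with a higher maxIterations, keeping costCeilingUsd in mind.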