Prompt Optimizer

Closed-loop prompt optimization powered by 7 mutation strategies, UCB1 multi-armed-bandit selection, and meta-LLM scoring. Submit a prompt and an eval dataset; the optimizer generates variants in parallel, runs each through your evals, and surfaces the highest-scoring variant, along with the full convergence history so you can audit how it got there.

Gaps this closes vs. competitors

  • Arize AX: ships the equivalent (AI-powered prompt builder with meta-prompting) as Enterprise-only. We ship it in the Pro tier.
  • LangSmith Prompt Hub: ships versioning + tags but no optimizer — manual prompt iteration only.
  • DeepEval: ships G-Eval rubrics for grading prompts but no closed-loop variant search.
  • Promptfoo: ships prompt comparison via YAML config — every variant is human-authored. We auto-generate variants and run them through your evals.

How it works — closed-loop optimization

Three components running in concert:

  1. 7 mutation strategies generate prompt variants from the baseline (see strategy catalog below). Each strategy mutates a different dimension — instruction wording, few-shot examples, output format, parameters, etc.
  2. UCB1 multi-armed bandit decides which variant gets the next eval run. It applies the standard exploration-exploitation balance with exploration constant √2 ≈ 1.41 (Auer et al. 2002): untested variants get priority, and high-mean variants get exploited.
  3. Meta-LLM scorer runs the customer-supplied eval suite (or a faithfulness / relevance / pairwise rubric) against each variant's outputs. The score becomes the UCB1 reward signal. The loop stops once the configured quality threshold or iteration cap is reached.
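
The selection step in item 2 can be sketched as follows. This is a minimal illustration of textbook UCB1 with the √2 exploration constant, not the optimizer's actual code; the function names `ucb1_select` and `simulate`, the `counts`/`totals` bookkeeping, and the simulated pass rates are all assumptions made for the example.

```python
import math
import random


def ucb1_select(counts, totals, c=math.sqrt(2)):
    """Pick the index of the variant to evaluate next (textbook UCB1).

    counts[i] = number of eval runs spent on variant i so far
    totals[i] = sum of scores (rewards in [0, 1]) observed for variant i
    """
    # Untested variants get priority: try each one once first.
    for i, n in enumerate(counts):
        if n == 0:
            return i

    total_runs = sum(counts)

    def upper_bound(i):
        mean = totals[i] / counts[i]  # exploitation term
        bonus = c * math.sqrt(math.log(total_runs) / counts[i])  # exploration term
        return mean + bonus

    return max(range(len(counts)), key=upper_bound)


def simulate(true_pass_rates, budget, seed=0):
    """Spend `budget` eval runs across variants with hypothetical pass rates."""
    rng = random.Random(seed)
    counts = [0] * len(true_pass_rates)
    totals = [0.0] * len(true_pass_rates)
    for _ in range(budget):
        i = ucb1_select(counts, totals)
        # Stand-in for a real eval run: pass (1.0) or fail (0.0).
        reward = 1.0 if rng.random() < true_pass_rates[i] else 0.0
        counts[i] += 1
        totals[i] += reward
    return counts


if __name__ == "__main__":
    # Hypothetical variants; runs should tend to concentrate on the strongest one.
    print(simulate([0.64, 0.80, 0.96], budget=300))
```

The exploration bonus shrinks as a variant accumulates runs, so a weak variant is revisited occasionally but the bulk of the budget drifts toward variants with a high observed mean.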

7 strategies

Each strategy ships out of the box; pick any subset in the request's strategies field.

Few-shot example selection

few-shot-selection

Picks the best few-shot examples from a pool of candidates by measuring per-example contribution to the score.

Starts with 8 examples → optimizer finds 3 do most of the work + 5 hurt → ships the 3 + saves 60% tokens.
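
One simple way to measure per-example contribution is leave-one-out scoring: remove each example in turn and see how far the score drops. The sketch below assumes that technique for illustration (the source does not say which method the optimizer uses), and `toy_score` with its `TOY_EFFECT` table is a hypothetical stand-in for a real eval run.

```python
def leave_one_out_contributions(examples, score_fn):
    """Return, per example, how much the score drops when it is removed."""
    full_score = score_fn(examples)
    contributions = []
    for i in range(len(examples)):
        without_i = examples[:i] + examples[i + 1:]
        contributions.append(full_score - score_fn(without_i))
    return contributions


def keep_helpful(examples, score_fn, min_gain=0.0):
    """Keep only examples whose removal lowers the score by more than min_gain."""
    contributions = leave_one_out_contributions(examples, score_fn)
    return [ex for ex, gain in zip(examples, contributions) if gain > min_gain]


# Hypothetical effect of each example on the score: three help, five hurt.
TOY_EFFECT = {
    "a": 0.10, "b": 0.10, "c": 0.10,
    "d": -0.02, "e": -0.02, "f": -0.02, "g": -0.02, "h": -0.02,
}


def toy_score(examples):
    """Stand-in for running the eval suite with the given few-shot examples."""
    return 0.5 + sum(TOY_EFFECT[ex] for ex in examples)


pool = list("abcdefgh")
kept = keep_helpful(pool, toy_score)
print(kept)  # ['a', 'b', 'c']
```

Leave-one-out costs one eval run per example plus one for the full set, and it ignores interactions between examples, so treat it as a first approximation. Keeping 3 of 8 examples removes 62.5% of them, which lines up with a roughly 60% token saving if the examples are of similar length.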

Instruction refinement

instruction-refinement

Iteratively rewrites the system instruction using a meta-LLM judge that grades each variant on the task.

'You are a helpful assistant' → 'You are a senior customer-support specialist for SaaS billing. Reply in ≤2 sentences. Never invent policies.' — 18% quality lift on the same eval set.

Chain-of-thought injection

chain-of-thought

Adds a 'think step-by-step' scaffold and measures whether the explicit reasoning improves correctness.

Math word problems: adding 'Let me work through this carefully' increased pass rate from 64% → 88% on GSM8K-style evals.

Output format optimization

output-format

Tries JSON / Markdown / structured / freeform variants on the same task and keeps whatever the downstream parser tolerates best.

API extraction task: switched from freeform to strict JSON schema → 41% fewer parsing errors downstream.

Token reduction

token-reduction

Compresses the prompt without losing semantic content. Measures quality retention vs token cost on each compression step.

Strip filler words + collapse repeated context → 35% token savings, <1% quality drop.

Semantic compression

compression

Aggressive token reduction using a meta-LLM to rewrite the prompt in fewer words. Higher savings than naive trimming.

2,300-token system prompt → 1,100 tokens, quality retention 0.97 vs original.
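
The trade-off both compression strategies measure can be written down in a few lines. This sketch assumes "quality retention" means the compressed score divided by the original score, which the source does not define explicitly; the function names and the `min_retention` gate are hypothetical.

```python
def token_savings(original_tokens, compressed_tokens):
    """Fraction of tokens removed by compression."""
    if original_tokens <= 0:
        raise ValueError("original_tokens must be positive")
    return 1.0 - compressed_tokens / original_tokens


def quality_retention(original_score, compressed_score):
    """Compressed score as a fraction of the original score (assumed definition)."""
    if original_score <= 0:
        raise ValueError("original_score must be positive")
    return compressed_score / original_score


def accept_compression(original_tokens, compressed_tokens,
                       original_score, compressed_score, min_retention=0.95):
    """Accept a compression step only if it saves tokens and keeps enough quality."""
    saves_tokens = token_savings(original_tokens, compressed_tokens) > 0
    keeps_quality = quality_retention(original_score, compressed_score) >= min_retention
    return saves_tokens and keeps_quality


# The example above: 2,300 tokens compressed to 1,100.
print(f"{token_savings(2300, 1100):.1%}")  # 52.2%
```

With the numbers from the example, the prompt shrinks by about 52% while retaining 0.97 of the original quality, which would pass a 0.95 retention gate.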

Parameter tuning

parameter-tuning

Joint search over temperature / top_p / max_tokens / frequency_penalty / presence_penalty. Bayesian-style, with the bandit picking the next sample point.

Customer support bot: temperature 0.7 → 0.3 + top_p 1.0 → 0.85 lifted pass rate by 6.2% and reduced hallucinations.

API

POST /api/v1/prompts/optimize

{
  "baselinePrompt": "You are a helpful assistant. Answer: {{input}}",
  "evalCases": [
    { "input": "What's our refund policy?", "expectedOutput": "30 days" },
    { "input": "How do I cancel?", "expectedOutput": "/account/cancel" }
  ],
  "model": "gpt-4o-mini",
  "strategies": [
    "instruction-refinement",
    "few-shot-selection",
    "parameter-tuning"
  ],
  "maxVariants": 20,
  "maxIterations": 100,
  "qualityThreshold": 0.95,
  "tokenBudget": 50000
}

Response:

{
  "bestVariant": {
    "id": "opt_1731234567890_a1b2c3d4",
    "prompt": "You are a senior customer-support specialist...",
    "strategy": "instruction-refinement",
    "generation": 4,
    "metrics": {
      "qualityScore": 0.96,
      "tokenCount": 87,
      "latencyMs": 423,
      "costEstimate": 0.00012,
      "sampleSize": 50,
      "successRate": 0.96
    },
    "parameters": {
      "temperature": 0.3,
      "topP": 0.85,
      "maxTokens": 256,
      "frequencyPenalty": 0,
      "presencePenalty": 0
    }
  },
  "allVariants": [/* 20 variants tested */],
  "improvement": 0.32,
  "iterationsRun": 84,
  "totalTokensSaved": 142000,
  "convergenceHistory": [0.64, 0.71, 0.79, 0.88, 0.93, 0.95, 0.96]
}
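
A client might sanity-check a response like the one above before acting on it. In the sample, improvement (0.32) equals the last convergenceHistory entry minus the first (0.96 − 0.64); whether the API defines improvement that way is not stated, so the sketch below only checks a response against that reading. The `summarize` helper is hypothetical, and `sample_response` is a trimmed copy of the sample.

```python
import json
import math

# Trimmed copy of the sample response shown above.
sample_response = json.loads("""
{
  "bestVariant": {
    "strategy": "instruction-refinement",
    "metrics": {"qualityScore": 0.96, "tokenCount": 87}
  },
  "improvement": 0.32,
  "iterationsRun": 84,
  "convergenceHistory": [0.64, 0.71, 0.79, 0.88, 0.93, 0.95, 0.96]
}
""")


def summarize(response, quality_threshold):
    """Derive a few audit-friendly facts from an optimize response."""
    history = response["convergenceHistory"]
    absolute_gain = history[-1] - history[0]
    # Index of the first history entry that meets the threshold, if any.
    reached_at = next(
        (i for i, score in enumerate(history) if score >= quality_threshold),
        None,
    )
    return {
        "first_score": history[0],
        "final_score": history[-1],
        "absolute_gain": absolute_gain,
        # Assumed reading: improvement == final score minus first score.
        "matches_improvement": math.isclose(
            absolute_gain, response["improvement"], abs_tol=1e-9
        ),
        "threshold_reached_at": reached_at,
        # A best-score-so-far series should never decrease.
        "non_decreasing": all(b >= a for a, b in zip(history, history[1:])),
    }


summary = summarize(sample_response, quality_threshold=0.95)
print(summary)
```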

Python SDK

from evalguard import EvalGuardClient

client = EvalGuardClient()

# Same prompt, eval cases, model, and strategies as the REST example above
result = client.optimize_prompt(
    baseline_prompt="You are a helpful assistant. Answer: {{input}}",
    eval_cases=[
        {"input": "What's our refund policy?", "expected_output": "30 days"},
        {"input": "How do I cancel?", "expected_output": "/account/cancel"},
    ],
    model="gpt-4o-mini",
    strategies=["instruction-refinement", "few-shot-selection", "parameter-tuning"],
    max_variants=20,
    quality_threshold=0.95,
)

# improvement is a fraction (0.32 in the sample response), so scale it to a percentage
print(f"Best prompt: {result.best_variant.prompt}")
print(f"Improvement: {result.improvement * 100:.1f}%")
print(f"Tokens saved: {result.total_tokens_saved:,}")

Dashboard UI

The same optimizer is accessible via the dashboard at /dashboard/prompts/optimize. Drop in a prompt, paste an eval dataset, pick strategies, and watch the UCB1 bandit converge on the best variant in real time. Each generation shows the variant text, the strategy that produced it, its quality score on your eval suite, and its token cost, so you can audit the optimizer's choices.

Best practices

  • Eval cases come first. The optimizer is only as good as your eval. Ship a real test set of 50-200 cases before optimizing — single-digit eval sets will optimize against noise.
  • Mix strategies — don't run all 7 at once. instruction-refinement + few-shot-selection + parameter-tuning is the default-good combo. Add token-reduction when latency / cost is the bottleneck.
  • Set a token budget. Left uncapped, the bandit keeps spending eval runs. tokenBudget caps total LLM spend on the optimization itself; a typical range is 50-200k tokens for a 20-case eval × 20 variants (400 case-variant runs, or roughly 125-500 tokens per run).
  • Audit the convergence history. The optimizer returns convergenceHistory (best score per iteration). Sharp gains in the first 5-10 iterations followed by a flat tail mean the run has converged. A slow, steady climb means the iteration cap (maxIterations) should be raised.
  • Save winning variants to the Prompt Hub. Once converged, push the winner to your prompt registry with a version tag (production) so it's versioned alongside your other prompt deployments.
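
The convergence-history heuristic above can be turned into a quick check. This is a rough sketch, not part of the product: the function name, the `window` of trailing entries, and the `flat_tol` threshold are all hypothetical choices that would need tuning for a real eval suite.

```python
def classify_convergence(history, window=3, flat_tol=0.01):
    """Label a best-score-per-iteration series as converged or still climbing.

    window   = how many trailing steps to inspect (hypothetical default)
    flat_tol = maximum gain over that window still counted as flat
    """
    if len(history) <= window:
        return "too-short"
    recent_gain = history[-1] - history[-1 - window]
    return "converged" if recent_gain <= flat_tol else "still-climbing"


# Sharp early gains, then flat: treat the run as converged.
print(classify_convergence([0.60, 0.80, 0.90, 0.95, 0.95, 0.95, 0.95]))

# Slow steady climb: consider raising the iteration cap.
print(classify_convergence([0.60, 0.62, 0.64, 0.66, 0.68, 0.70]))
```

A run that stops because it hit the quality threshold can still be labeled as climbing by this check, so read the label together with the final score.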