Prompt Optimizer
Closed-loop prompt optimization powered by 7 mutation strategies, UCB1 multi-armed-bandit selection, and meta-LLM scoring. Submit a prompt and an eval dataset; the optimizer generates variants in parallel, runs each through your evals, and surfaces the highest-scoring variant, along with the full convergence history so you can audit how it got there.
Gaps this closes vs. competitors
- Arize AX: ships the equivalent (AI-powered prompt builder with meta-prompting) as Enterprise-only. We ship it in the Pro tier.
- LangSmith Prompt Hub: ships versioning + tags but no optimizer — manual prompt iteration only.
- DeepEval: ships G-Eval rubrics for grading prompts but no closed-loop variant search.
- Promptfoo: ships prompt comparison via YAML config — every variant is human-authored. We auto-generate variants and run them through your evals.
How it works — closed-loop optimization
Three components running in concert:
- 7 mutation strategies generate prompt variants from the baseline (see strategy catalog below). Each strategy mutates a different dimension — instruction wording, few-shot examples, output format, parameters, etc.
- UCB1 multi-armed bandit decides which variant to spend the next eval-run budget on. Standard exploration-exploitation balance with exploration rate √2 ≈ 1.41 (Auer et al. 2002). Untested variants get priority; high-mean variants get exploited.
- Meta-LLM scorer runs the customer-supplied eval suite (or a faithfulness / relevance / pairwise rubric) against each variant's outputs. The score becomes the UCB1 reward signal. The loop stops at the configured quality threshold or iteration cap.
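The selection step can be illustrated with a minimal sketch of textbook UCB1, assuming rewards are eval scores in [0, 1]. The function name `ucb1_select` and the `counts` / `totals` bookkeeping are hypothetical and only illustrate the rule; they are not the optimizer's actual internals.

```python
import math

def ucb1_select(counts, totals, c=math.sqrt(2)):
    """Return the index of the variant to evaluate next under UCB1.

    counts[i]: eval runs already spent on variant i
    totals[i]: sum of scores (rewards in [0, 1]) observed for variant i
    c:         exploration rate (sqrt(2) in the standard formulation)
    """
    # Untested variants get priority: try each arm once first.
    for i, n in enumerate(counts):
        if n == 0:
            return i

    total_runs = sum(counts)

    def upper_bound(i):
        mean = totals[i] / counts[i]  # exploitation term
        bonus = c * math.sqrt(math.log(total_runs) / counts[i])  # exploration term
        return mean + bonus

    return max(range(len(counts)), key=upper_bound)
```

With equal run counts the bonus terms cancel and the highest-mean variant wins; a variant with very few runs gets a large bonus and is revisited even if its mean is lower.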
7 strategies
Each strategy ships out-of-the-box; pick any subset on the request.
Few-shot example selection
few-shot-selection: Picks the best few-shot examples from a pool of candidates by measuring per-example contribution to the score.
Starts with 8 examples → optimizer finds 3 do most of the work + 5 hurt → ships the 3 + saves 60% tokens.
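One simple way to estimate per-example contribution is leave-one-out ablation: score the prompt with all examples, then again with each example removed. The sketch below assumes a hypothetical `score_fn` that stands in for running the eval suite on a given example set; the optimizer's real selection method may differ.

```python
from typing import Callable, List, Sequence

def leave_one_out_contributions(
    examples: Sequence[str],
    score_fn: Callable[[Sequence[str]], float],
) -> List[float]:
    """Contribution of example i = score with all examples - score without i."""
    full_score = score_fn(examples)
    contributions = []
    for i in range(len(examples)):
        without_i = [ex for j, ex in enumerate(examples) if j != i]
        contributions.append(full_score - score_fn(without_i))
    return contributions

def keep_helpful(examples, score_fn, min_gain=0.0):
    """Keep only the examples whose removal lowers the score by more than min_gain."""
    gains = leave_one_out_contributions(examples, score_fn)
    return [ex for ex, gain in zip(examples, gains) if gain > min_gain]
```

Leave-one-out costs one extra eval pass per example, so it suits small pools such as the 8-example case above; it also ignores interactions between examples.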
Instruction refinement
instruction-refinement: Iteratively rewrites the system instruction using a meta-LLM judge that grades each variant on the task.
'You are a helpful assistant' → 'You are a senior customer-support specialist for SaaS billing. Reply in ≤2 sentences. Never invent policies.' — 18% quality lift on the same eval set.
Chain-of-thought injection
chain-of-thought: Adds a 'think step-by-step' scaffold and measures whether the explicit reasoning improves correctness.
Math word problems: adding 'Let me work through this carefully' increased pass rate from 64% → 88% on GSM8K-shape evals.
Output format optimization
output-format: Tries JSON / Markdown / structured / freeform variants on the same task and keeps whatever the downstream parser tolerates best.
API extraction task: switched from freeform to strict JSON schema → 41% fewer parsing errors downstream.
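To compare formats on "what the downstream parser tolerates", a parse-error rate per variant is one possible metric. The sketch below assumes the downstream consumer expects a JSON object with certain keys; the function name `parse_error_rate` and the `required_keys` parameter are illustrative, not part of the product API.

```python
import json

def parse_error_rate(outputs, required_keys=()):
    """Fraction of model outputs (strings) that are not a JSON object
    containing every key in required_keys."""
    if not outputs:
        return 0.0
    failures = 0
    for text in outputs:
        try:
            obj = json.loads(text)
        except json.JSONDecodeError:
            failures += 1  # not valid JSON at all
            continue
        if not isinstance(obj, dict) or any(k not in obj for k in required_keys):
            failures += 1  # valid JSON, but the wrong shape for the parser
    return failures / len(outputs)
```

Running this over each format variant's outputs gives a number that can be compared directly, for example freeform versus strict-JSON prompts on the same eval cases.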
Token reduction
token-reduction: Compresses the prompt without losing semantic content. Measures quality retention vs token cost on each compression step.
Strip filler words + collapse repeated context → 35% token savings, <1% quality drop.
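The trade-off between quality retention and token cost can be sketched as a selection over measured compression steps. This assumes each step has already been scored and is given as a hypothetical `(token_count, quality_score)` pair; the 0.99 retention floor mirrors the "<1% quality drop" example above and is not a documented default.

```python
def best_compression_step(steps, baseline_quality, min_retention=0.99):
    """Pick the most compressed step that keeps enough quality.

    steps:            list of (token_count, quality_score), one per compression step
    baseline_quality: quality score of the uncompressed prompt (must be > 0)
    min_retention:    minimum allowed quality_score / baseline_quality
    Returns the qualifying (token_count, quality_score) with the fewest tokens,
    or None if no step meets the retention floor.
    """
    if baseline_quality <= 0:
        raise ValueError("baseline_quality must be positive")
    acceptable = [s for s in steps if s[1] / baseline_quality >= min_retention]
    return min(acceptable, key=lambda s: s[0]) if acceptable else None
```

For a 1,000-token baseline, selecting a 650-token step corresponds to the 35% savings figure in the example above.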
Semantic compression
compression: Aggressive token reduction using a meta-LLM to rewrite the prompt in fewer words. Higher savings than naive trimming.
2,300-token system prompt → 1,100 tokens, quality retention 0.97 vs original.
Parameter tuning
parameter-tuning: Joint search over temperature / top_p / max_tokens / frequency_penalty / presence_penalty. Bayesian-style, with the bandit picking the next sample point.
Customer support bot: temperature 0.7 → 0.3 + top_p 1.0 → 0.85 lifted pass rate +6.2% AND reduced hallucinations.
API
POST /api/v1/prompts/optimize
{
  "baselinePrompt": "You are a helpful assistant. Answer: {{input}}",
  "evalCases": [
    { "input": "What's our refund policy?", "expectedOutput": "30 days" },
    { "input": "How do I cancel?", "expectedOutput": "/account/cancel" }
  ],
  "model": "gpt-4o-mini",
  "strategies": [
    "instruction-refinement",
    "few-shot-selection",
    "parameter-tuning"
  ],
  "maxVariants": 20,
  "maxIterations": 100,
  "qualityThreshold": 0.95,
  "tokenBudget": 50000
}
Response:
{
  "bestVariant": {
    "id": "opt_1731234567890_a1b2c3d4",
    "prompt": "You are a senior customer-support specialist...",
    "strategy": "instruction-refinement",
    "generation": 4,
    "metrics": {
      "qualityScore": 0.96,
      "tokenCount": 87,
      "latencyMs": 423,
      "costEstimate": 0.00012,
      "sampleSize": 50,
      "successRate": 0.96
    },
    "parameters": {
      "temperature": 0.3,
      "topP": 0.85,
      "maxTokens": 256,
      "frequencyPenalty": 0,
      "presencePenalty": 0
    }
  },
  "allVariants": [/* 20 variants tested */],
  "improvement": 0.32,
  "iterationsRun": 84,
  "totalTokensSaved": 142000,
  "convergenceHistory": [0.64, 0.71, 0.79, 0.88, 0.93, 0.95, 0.96]
}
Python SDK
from evalguard import EvalGuardClient

client = EvalGuardClient()

result = client.optimize_prompt(
    baseline_prompt="You are a helpful assistant. Answer: {{input}}",
    eval_cases=[
        {"input": "What's our refund policy?", "expected_output": "30 days"},
        {"input": "How do I cancel?", "expected_output": "/account/cancel"},
    ],
    model="gpt-4o-mini",
    strategies=["instruction-refinement", "few-shot-selection", "parameter-tuning"],
    max_variants=20,
    quality_threshold=0.95,
)

print(f"Best prompt: {result.best_variant.prompt}")
print(f"Improvement: {result.improvement * 100:.1f}%")
print(f"Tokens saved: {result.total_tokens_saved:,}")
Dashboard UI
The same optimizer is accessible via the dashboard at /dashboard/prompts/optimize. Drop in a prompt + paste an eval dataset, pick strategies, watch the UCB1 bandit converge on the best variant in real time. Each generation shows: the variant text, the strategy that produced it, its quality score on your eval suite, and token cost — so you can audit the optimizer's choices.
Best practices
- Eval cases come first. The optimizer is only as good as your eval. Ship a real test set of 50-200 cases before optimizing — single-digit eval sets will optimize against noise.
- Mix strategies, but don't run all 7 at once. instruction-refinement + few-shot-selection + parameter-tuning is the default-good combo. Add token-reduction when latency / cost is the bottleneck.
- Set a token budget. The bandit can run forever. tokenBudget caps total LLM spend on the optimization itself; typical: 50-200k tokens for a 20-case eval × 20 variants.
- Audit the convergence history. The optimizer returns convergenceHistory (best-score per iteration). Sharp gains in the first 5-10 iterations + flat after = converged. Slow steady climb = increase iteration cap.
- Save winning variants to the Prompt Hub. Once converged, push the winner to your prompt registry with a version tag (production) so it's versioned alongside your other prompt deployments.
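The convergence audit can be roughly automated. The sketch below is one possible heuristic, assuming `history` is the convergenceHistory list from the response; the function name `diagnose_convergence` and the `window` and `tolerance` values are illustrative choices, not documented defaults.

```python
def diagnose_convergence(history, window=3, tolerance=0.01):
    """Classify a best-score-per-iteration history.

    Returns "too-short" if there are not enough points to judge,
    "converged" if the best score gained at most `tolerance` over the
    last `window` steps, and "climbing" otherwise.
    """
    if len(history) <= window:
        return "too-short"
    recent_gain = history[-1] - history[-1 - window]
    return "converged" if recent_gain <= tolerance else "climbing"
```

A "climbing" result at the iteration cap suggests raising the cap; "converged" suggests further iterations are unlikely to pay for their token cost.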