Prompt Optimizer
Closed-loop prompt optimization powered by 4 mutation strategies and meta-LLM scoring. Submit a prompt + an eval dataset and pick a strategy; the optimizer iterates variants, runs each through your evals, and surfaces the highest-scoring prompt — along with a full changelog so you can audit how it got there.
Gaps this closes vs. competitors
- Arize AX: ships the equivalent (AI-powered prompt builder with meta-prompting) as Enterprise-only. We ship it in the Pro tier.
- LangSmith Prompt Hub: ships versioning + tags but no optimizer — manual prompt iteration only.
- DeepEval: ships G-Eval rubrics for grading prompts but no closed-loop variant search.
- Promptfoo: ships prompt comparison via YAML config — every variant is human-authored. We auto-generate variants and run them through your evals.
How it works — closed-loop optimization
Three components running in concert:
- 4 mutation strategies generate prompt variants from the baseline (see strategy catalog below). You pick one strategy per run — iterative refinement, genetic search, few-shot injection, or constraint tightening — and each mutates a different dimension of the prompt.
- Eval-driven search loop drives the chosen strategy. Iterative refinement and constraint tightening rewrite-then-re-evaluate until the target score or iteration cap is reached; the genetic strategy evolves a population with elitism, crossover, and mutation across generations.
- Meta-LLM scorer runs the customer-supplied eval cases through your chosen scorers against each variant's outputs. The mean score (0–1) is the reward signal that drives the search. Optimization stops at the configured targetScore (default 0.95) or the maxIterations cap (default 10).
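The loop described above can be sketched roughly as follows. This is a minimal illustration, not the optimizer's actual implementation: run_loop, mean_score, toy_score_case, and toy_rewrite are hypothetical names, and the toy scorer and rewriter stand in for the meta-LLM calls.

```python
from statistics import mean
from typing import Callable, List, Tuple

EvalCase = Tuple[str, str]  # (input, expectedOutput)


def mean_score(prompt: str, cases: List[EvalCase],
               score_case: Callable[[str, EvalCase], float]) -> float:
    """Reward signal: mean of per-case scores, each in [0, 1]."""
    return mean(score_case(prompt, case) for case in cases)


def run_loop(prompt: str, cases: List[EvalCase],
             score_case: Callable[[str, EvalCase], float],
             rewrite: Callable[[str], str],
             target_score: float = 0.95,
             max_iterations: int = 10):
    """Rewrite, re-evaluate, and keep the best variant until a stop condition hits."""
    best_prompt = prompt
    best_score = mean_score(prompt, cases, score_case)
    changelog = []
    for i in range(1, max_iterations + 1):
        if best_score >= target_score:  # target reached: stop early
            break
        candidate = rewrite(best_prompt)
        candidate_score = mean_score(candidate, cases, score_case)
        if candidate_score > best_score:  # only keep improvements
            changelog.append(f"Iteration {i}: {best_score:.2f} -> {candidate_score:.2f}")
            best_prompt, best_score = candidate, candidate_score
    return best_prompt, best_score, changelog


# Toy stand-ins (assumptions): a real run would call an LLM for both steps.
HINTS = ["Reply in <=2 sentences.", "Never invent policies."]


def toy_score_case(prompt: str, case: EvalCase) -> float:
    # Rewards prompts that contain each hint, ignoring the eval case itself.
    return sum(h in prompt for h in HINTS) / len(HINTS)


def toy_rewrite(prompt: str) -> str:
    # Appends the first hint the prompt is still missing.
    for h in HINTS:
        if h not in prompt:
            return f"{prompt} {h}"
    return prompt
```

Calling run_loop("You are a helpful assistant.", cases, toy_score_case, toy_rewrite) with any non-empty case list accepts two rewrites and then stops because the target score is met before the iteration cap.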
4 strategies
Each strategy ships out-of-the-box; pick one per request via the strategy field.
Iterative refinement
iterative-refinement
A meta-LLM diagnoses where the current prompt fails on your eval cases, rewrites it to address those weaknesses, re-evaluates, and repeats until the target score or iteration cap is hit.
'You are a helpful assistant' → 'You are a senior customer-support specialist for SaaS billing. Reply in ≤2 sentences. Never invent policies.' (quality lift on the same eval set).
Config: maxIterations, targetScore.
Genetic algorithm
genetic
Population-based search: generate a population of prompt variants, evaluate each, keep the top performers (elitism), then crossover and mutate to produce the next generation.
Config: populationSize (default 8), mutationRate (default 0.3), crossoverRate (default 0.5), eliteCount (default 2). Best variant across all generations wins.
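To make the roles of the four settings concrete, here is a small sketch of a population-based search. It assumes a prompt can be modeled as a tuple of instruction fragments and uses a toy fitness function in place of your evals; evolve, toy_fitness, random_population, and POOL are hypothetical and not part of the product API.

```python
import random
from typing import Callable, List, Sequence, Tuple

Variant = Tuple[str, ...]  # a prompt modeled as ordered instruction fragments


def evolve(population: List[Variant],
           fitness: Callable[[Variant], float],
           pool: Sequence[str],
           generations: int = 5,
           mutation_rate: float = 0.3,
           crossover_rate: float = 0.5,
           elite_count: int = 2,
           seed: int = 0) -> Variant:
    """Return the best variant seen across all generations."""
    rng = random.Random(seed)
    size = len(population)
    best = max(population, key=fitness)
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        next_gen = ranked[:elite_count]        # elitism: top performers survive unchanged
        parents = ranked[: max(2, size // 2)]  # breed from the better half
        while len(next_gen) < size:
            a, b = rng.sample(parents, 2)
            child = list(a)
            if rng.random() < crossover_rate:  # crossover: splice two parents
                cut = rng.randint(1, len(a) - 1)
                child = list(a[:cut] + b[cut:])
            if rng.random() < mutation_rate:   # mutation: swap one fragment
                child[rng.randrange(len(child))] = rng.choice(pool)
            next_gen.append(tuple(child))
        population = next_gen
        best = max([best] + population, key=fitness)
    return best


# Toy setup (assumption): fitness counts how many "good" fragments a variant has.
POOL = ["Be concise.", "Never invent policies.", "Cite the source.",
        "Use a friendly tone.", "Answer in English."]
GOOD = set(POOL[:3])


def toy_fitness(variant: Variant) -> float:
    return len(GOOD.intersection(variant)) / len(GOOD)


def random_population(size: int = 8, length: int = 3, seed: int = 1) -> List[Variant]:
    rng = random.Random(seed)
    return [tuple(rng.choice(POOL) for _ in range(length)) for _ in range(size)]
```

Because the elite variants are carried over and the best variant is tracked across generations, the returned variant never scores below the best member of the starting population.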
Few-shot injection
few-shot-injection
Automatically selects the most useful few-shot examples from your eval cases and injects them into the prompt, measuring each example's contribution to the score.
Config: maxExamples (default 5). The optimizer picks the examples that lift the score the most and drops the rest.
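One plausible way to pick examples by score contribution is greedy forward selection, sketched below. The docs do not state which selection algorithm the optimizer uses, so treat this as an assumption; select_few_shot, toy_score, and TOPIC are hypothetical names.

```python
from typing import Callable, List, Tuple


def select_few_shot(candidates: List[str],
                    score_with: Callable[[List[str]], float],
                    max_examples: int = 5) -> Tuple[List[str], float]:
    """Greedily add the example that lifts the score most; stop when nothing helps."""
    chosen: List[str] = []
    best = score_with(chosen)
    while len(chosen) < max_examples:
        remaining = [c for c in candidates if c not in chosen]
        if not remaining:
            break
        # Score the prompt with each remaining example added on top of the chosen ones.
        scored = [(score_with(chosen + [c]), c) for c in remaining]
        top_score, pick = max(scored, key=lambda pair: pair[0])
        if top_score <= best:  # no example lifts the score: drop the rest
            break
        chosen.append(pick)
        best = top_score
    return chosen, best


# Toy scorer (assumption): each newly covered topic adds 0.25 to a 0.5 baseline.
TOPIC = {
    "refund Q/A": "refunds",
    "refund Q/A 2": "refunds",
    "cancel Q/A": "cancellation",
}


def toy_score(examples: List[str]) -> float:
    return 0.5 + 0.25 * len({TOPIC[e] for e in examples})
```

With the toy scorer, select_few_shot(list(TOPIC), toy_score) keeps one refund example and the cancellation example, and drops the redundant second refund example because it adds no lift.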
Constraint tightening
constraint-tightening
Analyzes the failing eval cases and adds specific constraints / guardrails to the prompt to prevent those failure modes.
Repeated policy hallucinations → optimizer adds 'Never invent policies; if unsure, say you don't know' and re-evaluates against the eval set.
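The idea can be illustrated with a small sketch. It assumes failing cases have already been labeled with a failure mode (in the real optimizer that diagnosis comes from analyzing the failing eval cases); the label names, the CONSTRAINTS table, and the tighten function are hypothetical.

```python
from collections import Counter
from typing import Iterable

# Hypothetical mapping from a diagnosed failure mode to a guardrail sentence.
CONSTRAINTS = {
    "policy_hallucination": "Never invent policies; if unsure, say you don't know.",
    "too_long": "Reply in <=2 sentences.",
}


def tighten(prompt: str, failure_labels: Iterable[str], min_count: int = 2) -> str:
    """Append a guardrail for each failure mode seen at least min_count times."""
    counts = Counter(failure_labels)
    for label, seen in counts.most_common():
        rule = CONSTRAINTS.get(label)
        # Skip unknown labels, one-off failures, and rules already in the prompt.
        if rule and seen >= min_count and rule not in prompt:
            prompt = f"{prompt} {rule}"
    return prompt
```

After tightening, the new prompt would be re-evaluated against the same eval set, and the constraint kept only if the score improves.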
API
POST /api/v1/prompts/optimize
{
  "projectId": "a1b2c3d4-0000-0000-0000-000000000000",
  "prompt": "You are a helpful assistant. Answer: {{input}}",
  "strategy": "iterative-refinement",
  "evalCases": [
    { "input": "What's our refund policy?", "expectedOutput": "30 days" },
    { "input": "How do I cancel?", "expectedOutput": "/account/cancel" }
  ],
  "scorers": ["answer-relevance", "faithfulness"],
  "targetModel": "gpt-4o-mini",
  "maxIterations": 10,
  "targetScore": 0.95,
  "costCeilingUsd": 5
}
Response:
{
  "optimizedPrompt": "You are a senior customer-support specialist...",
  "originalScore": 0.64,
  "optimizedScore": 0.96,
  "improvementPercent": 50,
  "strategy": "iterative-refinement",
  "iterations": 8,
  "changelog": [
    "Iteration 1: tightened role + tone (0.64 → 0.79)",
    "Iteration 4: added refund-policy constraint (0.79 → 0.93)",
    "Iteration 8: target score reached (0.96)"
  ],
  "durationMs": 18420,
  "costUsd": 0.0123
}
curl
curl -X POST https://evalguard.ai/api/v1/prompts/optimize \
  -H "Authorization: Bearer $EVALGUARD_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "projectId": "a1b2c3d4-0000-0000-0000-000000000000",
    "prompt": "You are a helpful assistant. Answer: {{input}}",
    "strategy": "iterative-refinement",
    "evalCases": [
      { "input": "What'\''s our refund policy?", "expectedOutput": "30 days" }
    ],
    "scorers": ["answer-relevance", "faithfulness"],
    "targetModel": "gpt-4o-mini",
    "maxIterations": 10,
    "targetScore": 0.95
  }'
Dashboard UI
The same optimizer is accessible via the dashboard at /dashboard/prompts/optimize. Drop in a prompt + paste an eval dataset, pick a strategy, and watch the optimizer converge on the best prompt. The result shows the optimized prompt, the original vs optimized score, the improvement percent, and the per-iteration changelog — so you can audit the optimizer's choices.
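If you audit results in code rather than in the dashboard, a sketch like the following may help. It assumes improvementPercent is the relative gain over the original score (consistent with the sample response, where 0.64 → 0.96 is reported as 50) and that changelog entries follow the format shown in that sample; both are inferences from the example, not documented guarantees. The function names are hypothetical.

```python
import re
from typing import List, Tuple


def improvement_percent(original: float, optimized: float) -> float:
    """Relative gain of the optimized score over the original, in percent."""
    if original <= 0:
        raise ValueError("original score must be positive to compute a relative gain")
    return round((optimized - original) / original * 100, 1)


# Matches entries such as "Iteration 1: tightened role + tone (0.64 → 0.79)".
STEP = re.compile(r"Iteration (\d+): .* \((\d\.\d+) → (\d\.\d+)\)")


def score_steps(changelog: List[str]) -> List[Tuple[int, float, float]]:
    """Extract (iteration, score_before, score_after) from entries that report a change."""
    steps = []
    for entry in changelog:
        match = STEP.match(entry)
        if match:  # entries without a before/after pair are skipped
            steps.append((int(match.group(1)),
                          float(match.group(2)),
                          float(match.group(3))))
    return steps
```

Applied to the sample response, improvement_percent(0.64, 0.96) gives 50.0, and score_steps returns the two entries that report a score change.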
Best practices
- Eval cases come first. The optimizer is only as good as your eval. Ship a real test set of 50-200 cases before optimizing — single-digit eval sets will optimize against noise.
- Pick the strategy that fits the failure. iterative-refinement is the default-good choice. Use constraint-tightening when the prompt keeps making the same mistake, or few-shot-injection when worked examples help.
- Set a cost ceiling. costCeilingUsd caps total LLM spend on the optimization run itself (default $5, hard cap $100). The run aborts the moment cumulative spend crosses it.
- Audit the changelog. The optimizer returns a changelog (one entry per iteration) plus the original and optimized scores. Sharp gains early + flat after = converged. Slow steady climb = increase maxIterations.
- Save winning variants to the Prompt Hub. Once converged, push the winner to your prompt registry with a version tag (production) so it's versioned alongside your other prompt deployments.
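The cost-ceiling behavior can be pictured with a small sketch. SpendTracker and CostCeilingExceeded are hypothetical names, and rejecting a ceiling above the hard cap (rather than clamping it) is an assumption; the docs only state the default, the hard cap, and that the run aborts once cumulative spend crosses the ceiling.

```python
class CostCeilingExceeded(RuntimeError):
    """Raised when cumulative spend crosses the configured ceiling."""


class SpendTracker:
    DEFAULT_CEILING_USD = 5.0
    HARD_CAP_USD = 100.0

    def __init__(self, ceiling_usd: float = DEFAULT_CEILING_USD) -> None:
        # Assumption: ceilings above the hard cap are rejected outright.
        if not 0 < ceiling_usd <= self.HARD_CAP_USD:
            raise ValueError(f"ceiling must be in (0, {self.HARD_CAP_USD}] USD")
        self.ceiling_usd = ceiling_usd
        self.spent_usd = 0.0

    def charge(self, usd: float) -> None:
        """Record the cost of one LLM call; abort as soon as the ceiling is crossed."""
        self.spent_usd += usd
        if self.spent_usd > self.ceiling_usd:
            raise CostCeilingExceeded(
                f"spent ${self.spent_usd:.4f} of ${self.ceiling_usd:.2f} ceiling"
            )
```

In a loop, each scoring or rewrite call would be followed by tracker.charge(cost), so the run stops at the first call that pushes total spend past the ceiling.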