OpenAI
openai
YAML config
    providers:
      - id: openai:gpt-4o-mini
        config:
          apiKey: ${OPENAI_API_KEY}
TypeScript usage
    import { createProvider } from "@evalguard/core";

    const provider = createProvider("openai", process.env.OPENAI_API_KEY);
    const response = await provider.complete({
      model: "gpt-4o-mini",
      messages: [{ role: "user", content: "Hello" }],
    });
Authentication
Set `OPENAI_API_KEY` in your environment. EvalGuard validates the key on the first call and surfaces typed errors for 401, 403, and rate-limit (429) responses, parsing the Retry-After header when it is present.
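As a rough illustration of what Retry-After parsing involves, the sketch below converts a header value into a wait time in milliseconds. The function name `retryAfterMs` is hypothetical and is not part of `@evalguard/core`; the only assumption is that the header takes one of the two forms HTTP allows (a delay in seconds or an HTTP date).

```typescript
// Hypothetical helper, not an EvalGuard API: turn a Retry-After header value
// into a delay in milliseconds. Returns null when the value is unusable.
function retryAfterMs(header: string | null, now: Date = new Date()): number | null {
  if (header === null) return null;
  const value = header.trim();
  if (value === "") return null;

  // Form 1: a non-negative integer number of seconds, e.g. "20".
  if (/^\d+$/.test(value)) {
    return Number(value) * 1000;
  }

  // Form 2: an HTTP date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
  const when = Date.parse(value);
  if (Number.isNaN(when)) return null;

  // A date in the past means "retry now", so clamp at zero.
  return Math.max(0, when - now.getTime());
}
```

A caller would typically sleep for the returned delay before retrying, and fall back to its own backoff policy when the result is `null`.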
Setup walkthrough
1. Generate an API key at https://platform.openai.com/api-keys (or use your org's existing one).
2. Set the key in your environment: `export OPENAI_API_KEY=sk-...` for local development; in production, store it in your secret manager (AWS Secrets Manager, GCP Secret Manager, Vault).
3. (Optional but recommended) Store the key in EvalGuard's BYOK vault via `/dashboard/api-keys`; with the vault, the key never appears in logs, traces, or worker memory.
4. Configure the provider in your eval YAML: `providers: [{id: 'openai:gpt-4o-mini', config: {apiKey: '${OPENAI_API_KEY}'}}]`.
5. Smoke test: `evalguard run --provider openai:gpt-4o-mini --prompt 'Hello'` should return a response.
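The `${OPENAI_API_KEY}` placeholder in step 4 only works if the variable from step 2 is actually set. The sketch below shows one way a config loader might expand such placeholders and fail fast on a missing variable. `expandEnv` is a hypothetical name; EvalGuard's real interpolation rules are not described on this page and may differ.

```typescript
// Hypothetical sketch: expand ${VAR} placeholders the way a config loader might.
// This is not EvalGuard's implementation; its actual behavior may differ.
function expandEnv(template: string, env: Record<string, string | undefined>): string {
  return template.replace(/\$\{([A-Za-z_][A-Za-z0-9_]*)\}/g, (_match, name: string) => {
    const value = env[name];
    if (value === undefined || value === "") {
      // Fail fast so a missing key surfaces before any API call is made.
      throw new Error(`Environment variable ${name} is not set`);
    }
    return value;
  });
}

// Usage with a literal environment object; in Node.js you would pass process.env.
// "sk-example" is a placeholder, not a real key.
const apiKey = expandEnv("${OPENAI_API_KEY}", { OPENAI_API_KEY: "sk-example" });
```

Running a check like this before the smoke test in step 5 separates "key not set" failures from genuine 401 responses.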
Gotchas
- Tier 1 rate limits (newly funded org) cap gpt-4o-mini at 500 RPM. Size eval batches accordingly or upgrade your tier.
- Vision models (gpt-4o, gpt-4o-mini) accept `image_url` parts, but the URL must be publicly fetchable from OpenAI's infrastructure. For private images, use base64 `data:` URIs instead.
- `response_format: {type: 'json_object'}` requires the word 'json' in the system or user prompt; otherwise the API returns 400. Use `json_schema` for stricter structured-output enforcement.
- Prompt caching applies automatically to prompts of ≥1024 tokens, including the system prompt. Place dynamic content after the static prompt prefix to maximize cache hits.
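Three of the gotchas above reduce to small checks that can run before a request is sent. The helpers below are illustrative only: their names are hypothetical, the message shape is the minimal `{ role, content }` form used in the TypeScript usage example, and the case-insensitive match on "json" is an assumption about how the API applies its check.

```typescript
type Message = { role: string; content: string };

// Minimum spacing between requests that keeps a sequential sender under an RPM cap.
function minIntervalMs(requestsPerMinute: number): number {
  if (requestsPerMinute <= 0) throw new RangeError("requestsPerMinute must be positive");
  return Math.ceil(60_000 / requestsPerMinute);
}

// Wrap already base64-encoded image bytes in a data: URI for use as an image_url value.
function toImageDataUri(base64: string, mimeType: string): string {
  return `data:${mimeType};base64,${base64}`;
}

// JSON mode needs the word "json" somewhere in the system or user messages.
// Matching is case-insensitive here, which is an assumption.
function mentionsJson(messages: Message[]): boolean {
  return messages.some(
    (m) => (m.role === "system" || m.role === "user") && /json/i.test(m.content),
  );
}
```

At the Tier 1 limit of 500 RPM, `minIntervalMs(500)` gives 120 ms between requests. Parallel workers would need to share that budget rather than each applying it independently.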
Cost note
gpt-4o-mini: $0.15/M input tokens, $0.60/M output tokens. gpt-4o: $2.50/M input tokens, $10/M output tokens. Prompt caching gives a 50% discount on cached prefix tokens. Track per-eval cost via `cost_ledger` (T4.2).
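To make the arithmetic concrete, here is a minimal estimator using only the prices quoted above. The function and constant names are hypothetical and the sketch does not touch `cost_ledger`; it is only meant for back-of-the-envelope sizing before a run.

```typescript
// Prices in USD per million tokens, copied from the cost note above.
const PRICES = {
  "gpt-4o-mini": { input: 0.15, output: 0.6 },
  "gpt-4o": { input: 2.5, output: 10 },
} as const;

type PricedModel = keyof typeof PRICES;

// Estimate the cost of one call. cachedTokens is the part of inputTokens
// served from the prompt cache, billed at 50% of the input price.
function estimateCostUsd(
  model: PricedModel,
  inputTokens: number,
  outputTokens: number,
  cachedTokens = 0,
): number {
  if (cachedTokens > inputTokens) {
    throw new RangeError("cachedTokens cannot exceed inputTokens");
  }
  const price = PRICES[model];
  const uncachedTokens = inputTokens - cachedTokens;
  const totalPerMillion =
    uncachedTokens * price.input +
    cachedTokens * price.input * 0.5 +
    outputTokens * price.output;
  return totalPerMillion / 1_000_000;
}
```

For example, an eval that sends 1,000,000 input tokens and receives 200,000 output tokens on gpt-4o-mini with no cache hits comes to about $0.27 ($0.15 for input plus $0.12 for output).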
Recommended models
- Eval / judge: gpt-4o-mini (best price/quality for scorer-as-judge)
- Agent / tool-use: gpt-4o (function calling reliability + 128K context)
- Code: gpt-4o (best at multi-language code generation)
- Vision: gpt-4o (strong OCR + UI understanding)
Hand-written · 2026-05-21