Benchmarks
28 first-class benchmark suites — 17 academic knowledge + reasoning benchmarks and 11 safety/adversarial datasets. Each ships as a runnable suite with sample cases, automated scoring, and reproducible runs.
Why benchmark coverage matters
Benchmarks anchor model quality to the same numbers cited in OpenAI / Anthropic / DeepMind / Meta model cards. EvalGuard ships every benchmark DeepEval markets (MMLU, HellaSwag, BIG-Bench Hard, DROP, TruthfulQA, HumanEval, GSM8K) plus 10 more academic benchmarks and 11 named safety datasets. Run any of them with one command:
evalguard benchmark run mmlu --model gpt-4o
evalguard benchmark run harmbench --model claude-opus-4
evalguard benchmark run humaneval --model gemini-2.5-pro
Academic & reasoning benchmarks (17)
Knowledge, reasoning, code, math, and vertical-domain benchmarks for comparing models against the same numbers in OpenAI / Anthropic / DeepMind / Meta technical reports.
MMLU
General knowledge
Massive Multitask Language Understanding — 57 academic subjects from elementary school through professional-level (US history, computer science, law, medicine, ethics, etc.).
BIG-Bench Hard (BBH)
Reasoning
23 challenging BIG-Bench tasks on which prior language models did not outperform the average human rater.
DROP
Reading comprehension
Discrete Reasoning Over Paragraphs: requires multi-step arithmetic, counting, and sorting over facts extracted from the passage.
BoolQ
Reading comprehension
15,942 yes/no question-passage pairs from natural Google search queries.
TruthfulQA
Factuality
817 questions across 38 categories designed to elicit false answers from imitative LMs (urban legends, common misconceptions, conspiracy theories).
HellaSwag
Commonsense reasoning
Multiple-choice sentence completion testing commonsense reasoning. The wrong endings are adversarially filtered, so humans score above 95% while models often fail.
HumanEval
Code
164 hand-crafted Python programming problems. Pass@1 measured by running unit tests against generated code.
GSM8K
Math
8,500 grade-school math word problems requiring 2-8 step reasoning chains.
ARC
Science reasoning
AI2 Reasoning Challenge — 7,787 grade-school science questions split into Easy + Challenge sets.
BBQ
Bias
Bias Benchmark for QA — measures social biases in 9 demographic categories (age, disability, gender identity, nationality, physical appearance, race/ethnicity, religion, SES, sexual orientation).
IFEval
Instruction following
Verifiable instruction-following — measures whether a model follows specific formatting / length / structure constraints in a prompt.
MMMU
Multimodal
Massive Multi-discipline Multimodal Understanding — 11,500 questions across 6 disciplines requiring image + text reasoning.
VisionBench
Multimodal
Curated suite covering image-question reasoning, OCR, chart interpretation, scientific figure comprehension.
MedQA
Vertical — Medical
12,723 questions from US Medical Licensing Examination (USMLE) covering disease diagnosis, treatment, ethics.
LegalBench
Vertical — Legal
Collaborative benchmark with 162 tasks covering legal reasoning across issue spotting, rule recall, application, and conclusions.
FinanceBench
Vertical — Finance
10,231 question-answer pairs grounded in real public-company financial documents (10-K filings).
CyberBench
Vertical — Security
Cybersecurity question bank across pentesting, vulnerability classification, threat intelligence, CWE/CVE matching.
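For code suites such as HumanEval, pass@1 with one sample per problem is simply the fraction of problems whose generated code passes every unit test. When several samples are drawn per problem, the unbiased pass@k estimator from the original HumanEval paper is commonly used instead. The sketch below illustrates that estimator; the function names are hypothetical and not part of the EvalGuard API, and EvalGuard's own scorer may differ.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them pass."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the chance that a random size-k subset
    # of the n samples contains no passing sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; each item is (samples_drawn, samples_passing)."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With n = 1 and k = 1 this reduces to a plain pass/fail score per problem, which matches the single-sample pass@1 described for HumanEval.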
Safety & adversarial benchmarks (11)
Adversarial datasets for measuring refusal quality, jailbreak resistance, and harm-category coverage. Each is referenced by ID in our red-team plugin registry (see /docs/plugins), so every dataset works both as a standalone benchmark and as a first-class red-team plugin.
AEGIS
Content safety
NVIDIA AEGIS — 26K prompts across 13 risk categories.
BeaverTails
Harmful content
PKU-Alignment BeaverTails — 333,963 QA pairs across 14 harm categories.
HarmBench
Adversarial
CAIS HarmBench — 510 standardized harmful behaviors across 7 categories. Reference benchmark in Anthropic / OpenAI / DeepMind model cards.
Pliny / L1B3RT4S
Jailbreak
1,500+ field-tested jailbreaks against GPT-4 / Claude / Gemini / Llama / Mistral.
ToxicChat
Production toxicity
10,166 real Vicuna conversations annotated for toxicity + jailbreak attempts.
CyberSecEval
Code security
Meta Purple Llama v3 — 50 CWEs + 10 MITRE ATT&CK categories. Cited in Llama 3 + GPT-4o safety model cards.
UnsafeBench
Multimodal safety
10,000+ unsafe image prompts across 11 categories.
VLGuard
Multimodal safety
3,000 image-question pairs across 4 safety categories.
VLSU
Multimodal safety
Vision-Language Safety Understanding — refusal calibration on visual-textual harmful pairs.
Do-Not-Answer
Refusal calibration
939 prompts an LLM should refuse, across 5 risk areas and 12 subcategories.
XSTest
Over-refusal
250 safe prompts that LOOK harmful + 200 unsafe contrasts.
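Refusal-calibration suites such as XSTest and Do-Not-Answer pair prompts that should be answered with prompts that should be refused, so one way to summarize a run is with two rates: over-refusal (safe prompts refused) and under-refusal (unsafe prompts answered). The sketch below is a minimal illustration under that assumption. The marker list, `looks_like_refusal`, and `refusal_rates` are hypothetical names, not EvalGuard's scoring code, and a keyword heuristic is far cruder than a real refusal classifier.

```python
from dataclasses import dataclass

# Illustrative markers only; a production refusal detector would be more robust.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm sorry", "i am unable")


def looks_like_refusal(response: str) -> bool:
    """Naive heuristic: a common refusal phrase appears near the start."""
    head = response.strip().lower()[:80]
    return any(marker in head for marker in REFUSAL_MARKERS)


@dataclass
class RefusalRates:
    over_refusal: float   # share of safe prompts that were refused
    under_refusal: float  # share of unsafe prompts that were answered


def refusal_rates(records: list[tuple[bool, bool]]) -> RefusalRates:
    """records holds (prompt_is_safe, model_refused) pairs."""
    safe = [refused for is_safe, refused in records if is_safe]
    unsafe = [refused for is_safe, refused in records if not is_safe]
    over = sum(safe) / len(safe) if safe else 0.0
    under = sum(not r for r in unsafe) / len(unsafe) if unsafe else 0.0
    return RefusalRates(over, under)
```

Reporting the two rates separately keeps a model that refuses everything from looking safe: it would score zero under-refusal but a very high over-refusal rate.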
Public release — eg-safety-bench-1k on HuggingFace
Apache-2.0 · 785 prompts · 10 source taxonomies
We've published a unified, license-clean slice of the safety corpus as a public HuggingFace dataset so the research community can benchmark refusal calibration, jailbreak resistance, and content-policy behaviour against the same prompts EvalGuard uses internally.
huggingface.co/datasets/evalguard/eg-safety-bench-1k
Every prompt in the public dataset is original EvalGuard staff writing, inspired by 10 published safety-research taxonomies (AEGIS, BeaverTails, CyberSecEval, Do-Not-Answer, HarmBench, ToxicChat, UnsafeBench, VLGuard, VLSU, XSTest). We license-audited the source corpora before release; the Pliny / L1B3RT4S jailbreak archive is intentionally not included.
Use it from Python:
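from datasets import load_dataset
# your_model and is_refusal are placeholders: supply your own model client and refusal detector.
ds = load_dataset("evalguard/eg-safety-bench-1k")
for row in ds["train"]:
    response = your_model.generate(row["prompt"])
    refused = is_refusal(response)
    # Correct means refusing "block" prompts and answering "allow" prompts.
    correct = (refused and row["expected_behavior"] == "block") or (
        not refused and row["expected_behavior"] == "allow"
    )
CLI + SDK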
Every benchmark is runnable from the CLI:
# Run a single benchmark (suite is a positional arg)
evalguard benchmark run mmlu --model gpt-4o
# Compare 4 models on the same benchmark
evalguard benchmark run humaneval \
  --model gpt-4o,claude-opus-4,gemini-2.5-pro,llama-3.3-70b
# Save the report (emit JSON with --json and redirect stdout)
evalguard benchmark run gsm8k --model gpt-4o --json > report.json
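To script several runs, the same commands can be assembled from Python. The sketch below uses only the subcommand and flags shown in the examples above (`benchmark run`, `--model`, `--json`); the helper names `build_command` and `run_to_file` are hypothetical, and combining a multi-model list with `--json` in one call is an assumption rather than documented behaviour.

```python
import subprocess


def build_command(suite: str, models: list[str], as_json: bool = False) -> list[str]:
    """Assemble the argv for `evalguard benchmark run` from the flags shown above."""
    if not models:
        raise ValueError("at least one model is required")
    # Multiple models are passed as one comma-separated value, as in the CLI example.
    cmd = ["evalguard", "benchmark", "run", suite, "--model", ",".join(models)]
    if as_json:
        cmd.append("--json")
    return cmd


def run_to_file(suite: str, models: list[str], path: str) -> None:
    """Run one suite and write the JSON report from stdout to `path`."""
    result = subprocess.run(
        build_command(suite, models, as_json=True),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the CLI exits non-zero
    )
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(result.stdout)
```

For example, `run_to_file("gsm8k", ["gpt-4o"], "report.json")` would mirror the last command above, provided the `evalguard` executable is on the PATH.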
The CLI runs the suite locally (via @evalguard/core), so it needs no API key. The cross-model leaderboard — aggregated from eval and security-scan data — is also available from the Python SDK:
from evalguard import EvalGuardClient
client = EvalGuardClient(api_key="eg_live_...")
board = client.get_leaderboard(category="overall")
print(board["leaderboard"]) # [{"model": "gpt-4o", "overallScore": 0.87, ...}, ...]
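Once the leaderboard has been fetched, ranking it is ordinary list handling. The sketch below assumes each row carries the `model` and `overallScore` keys shown in the comment above; `top_models` is a hypothetical helper, and the sample rows are made up for illustration rather than real scores.

```python
def top_models(leaderboard: list[dict], n: int = 3) -> list[tuple[str, float]]:
    """Return the n highest-scoring (model, overallScore) pairs."""
    ranked = sorted(leaderboard, key=lambda row: row["overallScore"], reverse=True)
    return [(row["model"], row["overallScore"]) for row in ranked[:n]]


# Made-up rows for illustration; real rows would come from board["leaderboard"].
sample = [
    {"model": "model-a", "overallScore": 0.71},
    {"model": "model-b", "overallScore": 0.87},
    {"model": "model-c", "overallScore": 0.64},
]
print(top_models(sample, n=2))  # [('model-b', 0.87), ('model-a', 0.71)]
```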