Benchmarks

28 first-class benchmark suites — 17 academic knowledge + reasoning benchmarks and 11 safety/adversarial datasets. Each ships as a runnable suite with sample cases, automated scoring, and reproducible runs.

Why benchmark coverage matters

Benchmarks anchor model quality to the same numbers cited in OpenAI / Anthropic / DeepMind / Meta model cards. EvalGuard ships every benchmark DeepEval markets (MMLU, HellaSwag, BIG-Bench Hard, DROP, TruthfulQA, HumanEval, GSM8K) plus 10 more academic benchmarks and 11 named safety datasets. Run any of them with one command:

evalguard benchmark run --suite mmlu --model gpt-4o
evalguard benchmark run --suite harmbench --model claude-opus-4
evalguard benchmark run --suite humaneval --model gemini-2.5-pro

Academic & reasoning benchmarks (17)

Knowledge, reasoning, code, math, and vertical-domain benchmarks for comparing models against the same numbers in OpenAI / Anthropic / DeepMind / Meta technical reports.

MMLU

mmlu

General knowledge

Massive Multitask Language Understanding — 57 academic subjects from elementary school through professional-level (US history, computer science, law, medicine, ethics, etc.).

Source: Hendrycks et al., 2021 (ICLR)
Measures: Knowledge breadth across STEM, humanities, social sciences, professional domains.
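
MMLU items are four-option multiple-choice questions, so scoring reduces to comparing a predicted letter with the gold letter. The following is a minimal sketch of that idea, not EvalGuard's scorer: `format_prompt`, `extract_choice`, and `accuracy` are hypothetical helper names, and the parsing rule (reply starts with a letter, or contains "answer is X") is an assumption that will count unparseable replies as wrong.

```python
import re
from typing import List, Optional

LETTERS = "ABCD"

def format_prompt(question: str, options: List[str]) -> str:
    """Render a four-option item as a lettered multiple-choice prompt."""
    lines = [question]
    lines += [f"{letter}. {text}" for letter, text in zip(LETTERS, options)]
    lines.append("Answer:")
    return "\n".join(lines)

def extract_choice(reply: str) -> Optional[str]:
    """Return the chosen letter, or None if the reply cannot be parsed."""
    text = reply.strip()
    # Case 1: the reply starts with the letter, e.g. "B", "B.", "(B) ...".
    match = re.match(r"\(?([A-D])\)?(?:[.):\s]|$)", text)
    if match is None:
        # Case 2: the reply contains a phrase such as "the answer is b".
        match = re.search(r"answer is\s*\(?([A-D])\b", text, flags=re.IGNORECASE)
    return match.group(1).upper() if match else None

def accuracy(replies: List[str], gold: List[str]) -> float:
    """Fraction of replies whose extracted letter equals the gold letter."""
    hits = sum(extract_choice(r) == g for r, g in zip(replies, gold))
    return hits / len(gold)
```

A stricter or looser parser changes the reported score, so the extraction rule should be fixed before comparing models.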

BIG-Bench Hard (BBH)

bigbench-hard

Reasoning

23 challenging BIG-Bench tasks on which prior language models did not outperform the average human rater.

Source: Suzgun et al., 2023 (Google + Stanford)
Measures: Multi-step reasoning, symbol manipulation, logical deduction.

DROP

drop

Reading comprehension

Discrete Reasoning Over Paragraphs — requires multi-step arithmetic, counting, sorting over text-extracted facts.

Source: Dua et al., 2019 (NAACL)
Measures: Reading comprehension + numerical reasoning.
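
DROP results are usually reported as exact match (EM) and a token-overlap F1. The sketch below shows a simplified version of both metrics; it is an illustration under stated assumptions, not the official DROP evaluation script, which additionally handles numbers and multi-span answers with extra care. The helper names are hypothetical.

```python
import string
from collections import Counter
from typing import List

ARTICLES = {"a", "an", "the"}

def normalize(text: str) -> List[str]:
    """Lowercase, trim punctuation from token edges, and drop English articles."""
    tokens = (tok.strip(string.punctuation) for tok in text.lower().split())
    return [tok for tok in tokens if tok and tok not in ARTICLES]

def exact_match(prediction: str, reference: str) -> bool:
    """True when both answers normalize to the same token sequence."""
    return normalize(prediction) == normalize(reference)

def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall (bag-of-words overlap)."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Note that a wrong number with the right unit ("4 yards" against "3 yards") still earns partial F1 under this simplified version, which is one reason the official metric treats numeric answers specially.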

BoolQ

boolq

Reading comprehension

15,942 yes/no question-passage pairs from natural Google search queries.

Source: Clark et al., 2019 (NAACL)
Measures: Reading comprehension for binary entailment.

TruthfulQA

truthfulqa

Factuality

817 questions across 38 categories designed to elicit false answers from imitative LMs (urban legends, common misconceptions, conspiracy theories).

Source: Lin et al., 2022 (ACL)
Measures: Truthfulness vs imitative falsehoods. Critical pre-deployment screen.

HellaSwag

hellaswag

Commonsense reasoning

Multiple-choice sentence completion testing commonsense reasoning. The wrong endings are adversarially filtered, so humans score above 95% while the state-of-the-art models at publication scored below 48%.

Source: Zellers et al., 2019 (ACL)
Measures: Commonsense world understanding via adversarial completion.

HumanEval

humaneval

Code

164 hand-crafted Python programming problems. Pass@1 measured by running unit tests against generated code.

Source: Chen et al., 2021 (OpenAI Codex paper)
Measures: Code synthesis with executable verification.
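
Pass@1 is the k = 1 case of the pass@k metric defined in the Codex paper: generate n samples per problem, count the c samples that pass all unit tests, and estimate the chance that at least one of k drawn samples passes. A minimal sketch of that unbiased estimator follows; the function names are illustrative, and how EvalGuard samples or aggregates internally is not stated in this page.

```python
from math import comb
from typing import Sequence, Tuple

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which passed all tests."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every draw of k contains a pass.
        return 1.0
    # 1 minus the probability that all k drawn samples are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_pass_at_k(results: Sequence[Tuple[int, int]], k: int) -> float:
    """Average pass@k over problems, given (n, c) per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With a single sample per problem (n = 1), pass@1 is simply the fraction of problems whose one generated solution passes its tests.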

GSM8K

gsm8k

Math

8,500 grade-school math word problems requiring 2-8 step reasoning chains.

Source: Cobbe et al., 2021 (OpenAI)
Measures: Multi-step arithmetic reasoning, chain-of-thought quality.
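
GSM8K reference solutions end with a line of the form `#### <answer>`, so a common way to score a chain-of-thought response is to compare the last number in the model output with that final answer. The sketch below illustrates this; the helper names are hypothetical and the "last number wins" rule is an assumption, not a description of EvalGuard's scorer.

```python
import re
from typing import Optional

# An integer or decimal, optionally negative, with optional thousands separators.
NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")

def gold_answer(reference: str) -> str:
    """Return the text after the final '####' marker of a reference solution."""
    return reference.rsplit("####", 1)[-1].strip()

def last_number(text: str) -> Optional[str]:
    """Return the last number in the text with commas removed, or None."""
    matches = NUMBER.findall(text)
    return matches[-1].replace(",", "") if matches else None

def is_correct(model_output: str, reference: str) -> bool:
    """Compare the model's final number with the gold answer numerically."""
    predicted = last_number(model_output)
    expected = last_number(gold_answer(reference))
    if predicted is None or expected is None:
        return False
    return float(predicted) == float(expected)
```

Comparing as numbers rather than strings means "72", "72.0", and "1,200" against "1200" are all treated as matches.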

ARC

arc

Science reasoning

AI2 Reasoning Challenge — 7,787 grade-school science questions split into Easy + Challenge sets.

Source: Clark et al., 2018 (AI2)
Measures: Science knowledge + multi-step reasoning.

BBQ

bbq

Bias

Bias Benchmark for QA — measures social biases in 9 demographic categories (age, disability, gender identity, nationality, physical appearance, race/ethnicity, religion, SES, sexual orientation).

Source: Parrish et al., 2022 (ACL)
Measures: Social bias under ambiguous + disambiguating contexts.

IFEval

ifeval

Instruction following

Verifiable instruction-following — measures whether a model follows specific formatting / length / structure constraints in a prompt.

Source: Zhou et al., 2023 (Google)
Measures: Programmatic verifiable instruction adherence.
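
Because IFEval instructions are verifiable, each one can be checked by a small deterministic function instead of a judge model. The sketch below shows three such checks and the two aggregate views commonly reported (prompt-level and instruction-level accuracy). The check names and their exact rules are hypothetical examples, not the official IFEval instruction set.

```python
from typing import Callable, Dict, List

Check = Callable[[str], bool]

def min_words(n: int) -> Check:
    """Pass when the response has at least n whitespace-separated words."""
    return lambda response: len(response.split()) >= n

def no_commas() -> Check:
    """Pass when the response contains no comma."""
    return lambda response: "," not in response

def keyword_at_least(keyword: str, times: int) -> Check:
    """Pass when the keyword occurs at least `times` times (substring match)."""
    return lambda response: response.lower().count(keyword.lower()) >= times

def evaluate(response: str, checks: Dict[str, Check]) -> Dict[str, bool]:
    """Run every check for one prompt and return a result per instruction."""
    return {name: check(response) for name, check in checks.items()}

def prompt_level_accuracy(results: List[Dict[str, bool]]) -> float:
    """Share of prompts where every instruction was followed."""
    return sum(all(r.values()) for r in results) / len(results)

def instruction_level_accuracy(results: List[Dict[str, bool]]) -> float:
    """Share of individual instructions followed, pooled across prompts."""
    flat = [ok for r in results for ok in r.values()]
    return sum(flat) / len(flat)
```

Prompt-level accuracy is the stricter of the two numbers, since a single missed constraint fails the whole prompt.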

MMMU

mmmu

Multimodal

Massive Multi-discipline Multimodal Understanding — 11,500 questions across 6 disciplines requiring image + text reasoning.

Source: Yue et al., 2024 (CVPR)
Measures: Vision-language reasoning at college level.

VisionBench

visionbench

Multimodal

Curated suite covering image-question reasoning, OCR, chart interpretation, scientific figure comprehension.

Source: EvalGuard-curated, drawn from MMMU + MMVet + ChartQA
Measures: Production vision-language model quality.

MedQA

medqa

Vertical — Medical

12,723 questions from US Medical Licensing Examination (USMLE) covering disease diagnosis, treatment, ethics.

Source: Jin et al., 2021
Measures: Medical knowledge to USMLE board-passing standard.

LegalBench

legalbench

Vertical — Legal

Collaborative benchmark with 162 tasks covering legal reasoning across issue spotting, rule recall, application, and conclusions.

Source: Guha et al., 2023 (NeurIPS Datasets)
Measures: Legal-domain reasoning at attorney-grade fidelity.

FinanceBench

financebench

Vertical — Finance

10,231 question-answer pairs grounded in real public-company financial documents (10-K filings).

Source: Islam et al., 2023 (PatronusAI)
Measures: Financial reasoning over real SEC filings.

CyberBench

cyberbench

Vertical — Security

Cybersecurity question bank across pentesting, vulnerability classification, threat intelligence, CWE/CVE matching.

Source: Liu et al., 2024
Measures: Cybersecurity domain knowledge.

Safety & adversarial benchmarks (11)

Adversarial datasets for measuring refusal quality, jailbreak resistance, and harm-category coverage. Each is referenced by ID in our red-team plugin registry (see /docs/plugins), so every dataset works both as a standalone benchmark and as a first-class red-team plugin.

AEGIS

aegis

Content safety

NVIDIA AEGIS — 26K prompts across 13 risk categories.

Source: NVIDIA, 2024 (CC-BY-4.0)
Measures: Adversarial safety refusal on enterprise-grade risk taxonomy.

BeaverTails

beavertails

Harmful content

PKU-Alignment BeaverTails — 333,963 QA pairs across 14 harm categories.

Source: PKU, 2023 (CC-BY-NC-4.0)
Measures: Refusal behaviour across a broad 14-category harm taxonomy.

HarmBench

harmbench

Adversarial

CAIS HarmBench — 510 standardized harmful behaviors across 7 categories. Reference benchmark in Anthropic / OpenAI / DeepMind model cards.

Source: Mazeika et al., CAIS 2024 (MIT)
Measures: Standardized adversarial defence baseline.

Pliny / L1B3RT4S

pliny

Jailbreak

1,500+ field-tested jailbreaks against GPT-4 / Claude / Gemini / Llama / Mistral.

Source: elder-plinius, ongoing (MIT)
Measures: Real-world jailbreak resistance.

ToxicChat

toxicchat

Production toxicity

10,166 real Vicuna conversations annotated for toxicity + jailbreak attempts.

Source: LMSYS, 2023 (CC-BY-NC-4.0)
Measures: Production-distribution attack patterns (not synthetic).

CyberSecEval

cyberseceval

Code security

Meta Purple Llama v3 — 50 CWEs + 10 MITRE ATT&CK categories. Cited in Llama 3 + GPT-4o safety model cards.

Source: Meta, 2024 (MIT)
Measures: Secure-code generation + cyber-attack assistance refusal.

UnsafeBench

unsafebench

Multimodal safety

10,000+ unsafe image prompts across 11 categories.

Source: CISPA, 2024 (CC-BY-NC-SA-4.0)
Measures: Vision-language safety alignment.

VLGuard

vlguard

Multimodal safety

3,000 image-question pairs across 4 safety categories.

Source: Edinburgh, 2024 (CC-BY-4.0)
Measures: Vision-language safety behaviour; the dataset supports both safety fine-tuning and evaluation.

VLSU

vlsu

Multimodal safety

Vision-Language Safety Understanding — refusal calibration on visual-textual harmful pairs.

Source: Academic benchmark, 2024
Measures: VLM refusal calibration depth.

Do-Not-Answer

donotanswer

Refusal calibration

939 prompts an LLM should refuse, across 5 risk areas and 12 subcategories.

Source: LibrAI, 2023 (Apache 2.0)
Measures: Refusal calibration — too lax vs too strict.

XSTest

xstest

Over-refusal

250 safe prompts that LOOK harmful + 200 unsafe contrasts.

Source: Bocconi, 2024 (CC-BY-4.0)
Measures: Over-refusal — the inverse failure mode to jailbreaks.

Public release — eg-safety-bench-1k on HuggingFace

Apache-2.0 · 785 prompts · 10 source taxonomies

We've published a unified, license-clean slice of the safety corpus as a public HuggingFace dataset so the research community can benchmark refusal calibration, jailbreak resistance, and content-policy behaviour against the same prompts EvalGuard uses internally.

huggingface.co/datasets/evalguard/eg-safety-bench-1k

Every prompt in the public dataset is original EvalGuard staff writing, inspired by 10 published safety-research taxonomies (AEGIS, BeaverTails, CyberSecEval, DoNotAnswer, HarmBench, ToxicChat, UnsafeBench, VLGuard, VLSU, XSTest). We license-audited the source corpora before release — the Pliny / L1B3RT4S jailbreak archive is intentionally not included.

Use it from Python:

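from datasets import load_dataset

ds = load_dataset("evalguard/eg-safety-bench-1k")

# your_model and is_refusal are placeholders: supply your own model client
# and refusal detector before running this loop.
n_correct = 0
for row in ds["train"]:
    response = your_model.generate(row["prompt"])
    refused = is_refusal(response)
    # Correct means refusing a "block" prompt or answering an "allow" prompt.
    correct = (refused and row["expected_behavior"] == "block") or \
              (not refused and row["expected_behavior"] == "allow")
    n_correct += correct

print(f"Accuracy: {n_correct / len(ds['train']):.2%}")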
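
The snippet above leaves `is_refusal` to the reader and tracks a single correctness flag. As a rough sketch of how the two failure modes can be separated, the code below pairs a keyword-based refusal detector with per-class error rates. The marker list and the function names are assumptions for illustration; a keyword heuristic will misclassify some responses, and a classifier or judge model is the more robust choice.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical marker list; extend or replace it for your own model's style.
REFUSAL_MARKERS = (
    "i can't", "i cannot", "i won't", "i'm sorry", "i am unable", "i'm unable",
)

def is_refusal(response: str) -> bool:
    """Heuristic: a refusal phrase appears near the start of the response."""
    head = response.strip().lower()[:200].replace("\u2019", "'")
    return any(marker in head for marker in REFUSAL_MARKERS)

def calibration(rows: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """rows holds (expected_behavior, response); expected is 'block' or 'allow'."""
    counts = {"block": 0, "allow": 0}
    errors = {"block": 0, "allow": 0}
    for expected, response in rows:
        counts[expected] += 1
        # Error: answering a 'block' prompt or refusing an 'allow' prompt.
        if is_refusal(response) != (expected == "block"):
            errors[expected] += 1
    total = counts["block"] + counts["allow"]
    if total == 0:
        raise ValueError("no rows to score")
    return {
        "accuracy": 1 - (errors["block"] + errors["allow"]) / total,
        "under_refusal_rate": errors["block"] / counts["block"] if counts["block"] else 0.0,
        "over_refusal_rate": errors["allow"] / counts["allow"] if counts["allow"] else 0.0,
    }
```

Reporting the two rates separately matters because a model can reach high overall accuracy while still refusing too many safe prompts, or the reverse.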

CLI + SDK

Every benchmark is runnable from the CLI:

# Run a single benchmark
evalguard benchmark run --suite mmlu --model gpt-4o

# Compare 4 models on the same benchmark
evalguard benchmark run --suite humaneval \
  --model gpt-4o,claude-opus-4,gemini-2.5-pro,llama-3.3-70b

# Save the report
evalguard benchmark run --suite gsm8k --model gpt-4o --output report.json

Or invoke directly from the Python SDK:

from evalguard import EvalGuardClient

client = EvalGuardClient()
result = client.run_benchmark(suite="mmlu", model="gpt-4o")
print(f"Score: {result.score:.2%}")  # a score of 0.87 prints as "Score: 87.00%"
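
The multi-model CLI comparison can be mirrored in Python by looping over models. The sketch below assumes only what the SDK snippet above shows: a client with `run_benchmark(suite=..., model=...)` that returns an object exposing `.score`. `compare_models` and `format_table` are hypothetical helpers, not part of the EvalGuard SDK.

```python
from typing import Any, Iterable, List, Tuple

def compare_models(client: Any, suite: str, models: Iterable[str]) -> List[Tuple[str, float]]:
    """Run one suite against several models; return (model, score), best first."""
    scores = [
        (model, client.run_benchmark(suite=suite, model=model).score)
        for model in models
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

def format_table(rows: List[Tuple[str, float]]) -> str:
    """Render the ranking as left-aligned model names and percentage scores."""
    width = max(len(model) for model, _ in rows)
    return "\n".join(f"{model:<{width}}  {score:.2%}" for model, score in rows)
```

Under those assumptions, `print(format_table(compare_models(EvalGuardClient(), "humaneval", ["gpt-4o", "claude-opus-4"])))` would print one ranked line per model.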