Benchmarks

28 of EvalGuard's 34 benchmark suites are detailed here: 17 academic knowledge and reasoning benchmarks and 11 safety/adversarial datasets. Each ships as a runnable suite with sample cases, automated scoring, and reproducible runs.

Why benchmark coverage matters

Standard benchmarks let you compare a model against the same numbers that OpenAI, Anthropic, DeepMind, and Meta cite in their model cards. EvalGuard ships every benchmark DeepEval markets (MMLU, HellaSwag, BIG-Bench Hard, DROP, TruthfulQA, HumanEval, GSM8K) plus 10 more academic benchmarks and 11 named safety datasets. Run any of them with one command:

evalguard benchmark run mmlu --model gpt-4o
evalguard benchmark run truthfulqa --model claude-opus-4
evalguard benchmark run humaneval --model gemini-2.5-pro

Academic & reasoning benchmarks (17)

Knowledge, reasoning, code, math, and vertical-domain benchmarks for comparing models against the same numbers in OpenAI / Anthropic / DeepMind / Meta technical reports.

MMLU

mmlu

General knowledge

Massive Multitask Language Understanding — 57 academic subjects from elementary school through professional-level (US history, computer science, law, medicine, ethics, etc.).

Source: Hendrycks et al., 2021 (ICLR)

Measures: Knowledge breadth across STEM, humanities, social sciences, professional domains.

BIG-Bench Hard (BBH)

bigbench-hard

Reasoning

23 challenging BIG-Bench tasks on which prior language models scored below the average human rater.

Source: Suzgun et al., 2023 (Google + Stanford)

Measures: Multi-step reasoning, symbol manipulation, logical deduction.

DROP

drop

Reading comprehension

Discrete Reasoning Over Paragraphs — requires multi-step arithmetic, counting, sorting over text-extracted facts.

Source: Dua et al., 2019 (NAACL)

Measures: Reading comprehension + numerical reasoning.

BoolQ

boolq

Reading comprehension

15,942 yes/no question-passage pairs from natural Google search queries.

Source: Clark et al., 2019 (NAACL)

Measures: Reading comprehension for binary entailment.

TruthfulQA

truthfulqa

Factuality

817 questions across 38 categories designed to elicit false answers from imitative LMs (urban legends, common misconceptions, conspiracy theories).

Source: Lin et al., 2022 (ACL)

Measures: Truthfulness vs imitative falsehoods. Critical pre-deployment screen.

HellaSwag

hellaswag

Commonsense reasoning

Multiple-choice sentence completion testing commonsense reasoning. Wrong endings are adversarially filtered so that humans score above 95% while models often fail.

Source: Zellers et al., 2019 (ACL)

Measures: Commonsense world understanding via adversarial completion.

HumanEval

humaneval

Code

164 hand-crafted Python programming problems. Pass@1 measured by running unit tests against generated code.

Source: Chen et al., 2021 (OpenAI Codex paper)

Measures: Code synthesis with executable verification.
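
The Pass@1 figure above is the k = 1 case of the pass@k estimator defined in the Codex paper: for a problem with n sampled completions of which c pass the unit tests, pass@k = 1 - C(n - c, k) / C(n, k). The sketch below illustrates that formula only; the function names are hypothetical, and it does not show how EvalGuard computes the score internally.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: 1 - C(n - c, k) / C(n, k).

    n: completions sampled, c: completions that passed the tests, k: budget.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k samples include a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems given as (n_samples, n_correct) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With a single completion per problem (n = 1, k = 1), the estimate reduces to the fraction of problems whose one completion passes.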

GSM8K

gsm8k

Math

8,500 grade-school math word problems requiring 2-8 step reasoning chains.

Source: Cobbe et al., 2021 (OpenAI)

Measures: Multi-step arithmetic reasoning, chain-of-thought quality.
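
GSM8K reference solutions end with a line of the form `#### <answer>`, so scoring usually comes down to comparing that value with the final number in the model's output. The following is a minimal sketch of that comparison. The helper names and the "last number in the completion" heuristic are assumptions for illustration, not EvalGuard's scorer.

```python
import re
from typing import Optional

# Integers or decimals, optionally negative, with optional thousands separators.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def reference_answer(solution: str) -> str:
    """Return the final answer from a reference solution (text after '####')."""
    return solution.split("####")[-1].strip().replace(",", "")


def predicted_answer(completion: str) -> Optional[str]:
    """Heuristic (assumption): treat the last number in the completion as the answer."""
    matches = _NUMBER.findall(completion)
    return matches[-1].replace(",", "") if matches else None


def is_correct(completion: str, solution: str) -> bool:
    """Numeric exact match between the predicted and reference answers."""
    predicted = predicted_answer(completion)
    if predicted is None:
        return False
    try:
        return float(predicted) == float(reference_answer(solution))
    except ValueError:
        return False
```

Comparing as floats treats `72` and `72.0` as equal; a stricter scorer might compare normalized strings instead.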

ARC

arc

Science reasoning

AI2 Reasoning Challenge — 7,787 grade-school science questions split into Easy + Challenge sets.

Source: Clark et al., 2018 (AI2)

Measures: Science knowledge + multi-step reasoning.

BBQ

bbq

Bias

Bias Benchmark for QA — measures social biases in 9 demographic categories (age, disability, gender identity, nationality, physical appearance, race/ethnicity, religion, SES, sexual orientation).

Source: Parrish et al., 2022 (ACL)

Measures: Social bias under ambiguous + disambiguating contexts.

IFEval

ifeval

Instruction following

Verifiable instruction-following — measures whether a model follows specific formatting / length / structure constraints in a prompt.

Source: Zhou et al., 2023 (Google)

Measures: Programmatic verifiable instruction adherence.
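
"Verifiable" here means each constraint can be checked by code rather than by a judge model. The sketch below shows the general idea with three hypothetical checkers (word limit, bullet count, lowercase only). These are illustrative and are not taken from the IFEval codebase or from EvalGuard.

```python
import re
from typing import Callable

Check = Callable[[str], bool]


def max_words(limit: int) -> Check:
    """Constraint: the response contains at most `limit` whitespace-separated words."""
    return lambda text: len(text.split()) <= limit


def exact_bullets(count: int) -> Check:
    """Constraint: exactly `count` lines start with a '* ' or '- ' bullet."""
    pattern = re.compile(r"^[ \t]*[*-] ", flags=re.MULTILINE)
    return lambda text: len(pattern.findall(text)) == count


def all_lowercase() -> Check:
    """Constraint: the response contains no uppercase letters."""
    return lambda text: text == text.lower()


def adherence(response: str, checks: list[Check]) -> float:
    """Fraction of constraints the response satisfies."""
    return sum(check(response) for check in checks) / len(checks)
```

A prompt such as "answer in exactly three bullets, all lowercase, under ten words" would map to `[exact_bullets(3), all_lowercase(), max_words(10)]`.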

MMMU

mmmu

Multimodal

Massive Multi-discipline Multimodal Understanding — 11,500 questions across 6 disciplines requiring image + text reasoning.

Source: Yue et al., 2024 (CVPR)

Measures: Vision-language reasoning at college level.

VisionBench

visionbench

Multimodal

Curated suite covering image-question reasoning, OCR, chart interpretation, scientific figure comprehension.

Source: EvalGuard-curated, drawn from MMMU + MMVet + ChartQA

Measures: Production vision-language model quality.

MedQA

medqa

Vertical — Medical

12,723 questions from the US Medical Licensing Examination (USMLE) covering disease diagnosis, treatment, and ethics.

Source: Jin et al., 2021

Measures: Medical knowledge to USMLE board-passing standard.

LegalBench

legalbench

Vertical — Legal

Collaborative benchmark with 162 tasks covering legal reasoning across issue spotting, rule recall, application, and conclusions.

Source: Guha et al., 2023 (NeurIPS Datasets)

Measures: Legal-domain reasoning at attorney-grade fidelity.

Financial Reasoning Benchmark

financebench

Vertical — Finance

Financial reasoning over calculations, accounting, market analysis, risk, and compliance. EvalGuard-authored synthetic corpus; the external FinanceBench dataset is not bundled.

Source: EvalGuard synthetic (bring-your-own-license)

Measures: Financial-domain reasoning.

CyberBench

cyberbench

Vertical — Security

Cybersecurity question bank across pentesting, vulnerability classification, threat intelligence, CWE/CVE matching.

Source: Liu et al., 2024

Measures: Cybersecurity domain knowledge.

Safety & adversarial benchmarks (11)

Adversarial datasets for measuring refusal quality, jailbreak resistance, and harm-category coverage. Each is referenced by ID in our red-team plugin registry (see /docs/plugins), so it works both as a standalone benchmark and as a first-class red-team plugin.

AEGIS

aegis

Content safety

NVIDIA AEGIS — 26K prompts across 13 risk categories.

Source: NVIDIA, 2024 (CC-BY-4.0)

Measures: Adversarial safety refusal on enterprise-grade risk taxonomy.

Harmful-Content Refusal

beavertails

Harmful content

Refusal probes across 14 harm categories. EvalGuard-authored synthetic corpus, inspired by the BeaverTails taxonomy — the dataset itself is not bundled.

Source: EvalGuard synthetic (BeaverTails taxonomy, bring-your-own-license)

Measures: Refusal across a broad harm taxonomy.

HarmBench

harmbench

Adversarial

CAIS HarmBench — 510 standardized harmful behaviors across 7 categories. Reference benchmark in Anthropic / OpenAI / DeepMind model cards.

Source: Mazeika et al., CAIS 2024 (MIT)

Measures: Standardized adversarial defence baseline.

Pliny / L1B3RT4S

pliny

Jailbreak

1,500+ field-tested jailbreaks against GPT-4 / Claude / Gemini / Llama / Mistral.

Source: elder-plinius, ongoing (MIT)

Measures: Real-world jailbreak resistance.

Toxic-Conversation Robustness

toxicchat

Production toxicity

Toxic-conversation robustness probes modeling production-distribution attack patterns. EvalGuard-authored synthetic corpus, inspired by the ToxicChat taxonomy — the dataset itself is not bundled.

Source: EvalGuard synthetic (ToxicChat taxonomy, bring-your-own-license)

Measures: Production-distribution attack patterns.

CyberSecEval

cyberseceval

Code security

Meta Purple Llama v3 — 50 CWEs + 10 MITRE ATT&CK categories. Cited in Llama 3 + GPT-4o safety model cards.

Source: Meta, 2024 (MIT)

Measures: Secure-code generation + cyber-attack assistance refusal.

Unsafe-Content Safety

unsafebench

Multimodal safety

Unsafe-content safety probes across 11 categories. EvalGuard-authored synthetic corpus, inspired by the UnsafeBench taxonomy — the dataset itself is not bundled.

Source: EvalGuard synthetic (UnsafeBench taxonomy, bring-your-own-license)

Measures: Vision-language safety alignment.

VLGuard

vlguard

Multimodal safety

3,000 image-question pairs across 4 safety categories.

Source: Edinburgh, 2024 (CC-BY-4.0)

Measures: Vision-language model safety; usable for both fine-tuning and evaluation.

VLSU

vlsu

Multimodal safety

Vision-Language Safety Understanding — refusal calibration on visual-textual harmful pairs.

Source: Academic benchmark, 2024

Measures: VLM refusal calibration depth.

Do-Not-Answer

donotanswer

Refusal calibration

939 prompts an LLM should refuse, across 5 risk areas and 12 subcategories.

Source: LibrAI, 2023 (Apache 2.0)

Measures: Refusal calibration — too lax vs too strict.

XSTest

xstest

Over-refusal

250 safe prompts that superficially look harmful, plus 200 unsafe contrast prompts.

Source: Bocconi, 2024 (CC-BY-4.0)

Measures: Over-refusal — the inverse failure mode to jailbreaks.
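
Because the suite pairs safe prompts with unsafe contrasts, a run can report two error rates instead of one: refusals on safe prompts (over-refusal) and compliance on unsafe prompts (under-refusal). The sketch below shows one way to compute both. The `Outcome` record and the metric names are hypothetical, and how a response gets labeled as a refusal is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    """One scored prompt (hypothetical record shape)."""
    prompt_is_safe: bool
    model_refused: bool


def refusal_rates(outcomes: list[Outcome]) -> dict[str, float]:
    """Return over-refusal (on safe prompts) and under-refusal (on unsafe prompts)."""
    safe = [o for o in outcomes if o.prompt_is_safe]
    unsafe = [o for o in outcomes if not o.prompt_is_safe]
    if not safe or not unsafe:
        raise ValueError("need at least one safe and one unsafe prompt")
    return {
        # Safe prompts the model refused anyway.
        "over_refusal": sum(o.model_refused for o in safe) / len(safe),
        # Unsafe prompts the model answered instead of refusing.
        "under_refusal": sum(not o.model_refused for o in unsafe) / len(unsafe),
    }
```

Reporting the two rates side by side makes it harder for a model to look good simply by refusing everything or refusing nothing.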

Public safety corpus — via the firewall-benchmark API

Apache-2.0 · public sample split · 10 source taxonomies

A unified, license-clean slice of the safety corpus is served from the EvalGuard firewall-benchmark API so the research community can benchmark refusal calibration, jailbreak resistance, and content-policy behaviour against the same prompts EvalGuard uses internally. The public sample split is fully visible; the held-out split is served as metadata only (category list + counts).

GET /api/v1/firewall-benchmark/public-corpus

Every prompt in the public corpus is original EvalGuard staff writing, inspired by 10 published safety-research taxonomies (AEGIS, BeaverTails, CyberSecEval, Do-Not-Answer, HarmBench, ToxicChat, UnsafeBench, VLGuard, VLSU, XSTest). We license-audited the source corpora before release; the Pliny / L1B3RT4S jailbreak archive is intentionally not included.

Fetch it over HTTP (no API key required):

curl https://evalguard.ai/api/v1/firewall-benchmark/public-corpus

# Response: { meta, publicCorpus: [{ input, category, source, expectedBehavior }], note }
# For each prompt, check whether your model's response matches expectedBehavior
# ("block" for prompts that should be refused, "allow" for safe prompts).
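
The check described in the comments above could be scripted roughly as follows. This sketch assumes only the response shape shown (`publicCorpus` entries with `input`, `category`, and `expectedBehavior`); the `decide` callable is a hypothetical stand-in for whatever wraps your model and returns "block" or "allow".

```python
import json
from collections import defaultdict
from typing import Callable
from urllib.request import urlopen

CORPUS_URL = "https://evalguard.ai/api/v1/firewall-benchmark/public-corpus"


def fetch_public_corpus(url: str = CORPUS_URL) -> list[dict]:
    """Download the corpus and return the list stored under 'publicCorpus'."""
    with urlopen(url, timeout=30) as response:
        return json.load(response)["publicCorpus"]


def score_by_category(
    entries: list[dict], decide: Callable[[str], str]
) -> dict[str, float]:
    """Per-category fraction of prompts where decide(input) == expectedBehavior."""
    hits: dict[str, int] = defaultdict(int)
    totals: dict[str, int] = defaultdict(int)
    for entry in entries:
        category = entry["category"]
        totals[category] += 1
        if decide(entry["input"]) == entry["expectedBehavior"]:
            hits[category] += 1
    return {category: hits[category] / totals[category] for category in totals}
```

A typical call would be `score_by_category(fetch_public_corpus(), decide)`, where `decide` sends the prompt to your model and classifies the reply.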

CLI + SDK

Every benchmark is runnable from the CLI:

# Run a single benchmark (suite is a positional arg)
evalguard benchmark run mmlu --model gpt-4o

# Compare models — run the same benchmark once per model
evalguard benchmark run humaneval --model gpt-4o
evalguard benchmark run humaneval --model claude-opus-4

# Save the report (emit JSON with --json and redirect stdout)
evalguard benchmark run gsm8k --model gpt-4o --json > report.json

The CLI runs the suite locally (via @evalguard/core), so it needs no EvalGuard API key. The cross-model leaderboard, aggregated from eval and security-scan data, is available from the Python SDK:

from evalguard import EvalGuardClient

client = EvalGuardClient(api_key="eg_live_...")
board = client.get_leaderboard(category="overall")
print(board["leaderboard"])  # [{"model": "gpt-4o", "overallScore": 0.87, ...}, ...]
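
The "run the same benchmark once per model" pattern from the CLI examples can also be scripted. The sketch below shells out to the commands shown above and assumes that `--json` writes a single JSON document to stdout. The report schema is not documented here, so the parsed result is returned as-is; the function names are hypothetical.

```python
import json
import subprocess
from typing import Any


def build_command(suite: str, model: str) -> list[str]:
    """Argument list for one CLI run: suite is positional, model is a flag."""
    return ["evalguard", "benchmark", "run", suite, "--model", model, "--json"]


def run_suite(suite: str, model: str) -> Any:
    """Run one suite for one model and parse the JSON report from stdout."""
    completed = subprocess.run(
        build_command(suite, model),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the CLI exits non-zero
    )
    return json.loads(completed.stdout)


def compare(suite: str, models: list[str]) -> dict[str, Any]:
    """Run the same suite once per model and key the reports by model name."""
    return {model: run_suite(suite, model) for model in models}
```

For example, `compare("humaneval", ["gpt-4o", "claude-opus-4"])` would return one parsed report per model, which can then be diffed or written to disk.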