Benchmarks
28 first-class benchmark suites — 17 academic knowledge + reasoning benchmarks and 11 safety/adversarial datasets. Each ships as a runnable suite with sample cases, automated scoring, and reproducible runs.
Why benchmark coverage matters
Benchmarks anchor model quality to the same numbers cited in OpenAI / Anthropic / DeepMind / Meta model cards. EvalGuard ships every benchmark DeepEval markets (MMLU, HellaSwag, BIG-Bench Hard, DROP, TruthfulQA, HumanEval, GSM8K) plus 10 more academic benchmarks and 11 named safety datasets. Run any of them with one command:
evalguard benchmark run --suite mmlu --model gpt-4o evalguard benchmark run --suite harmbench --model claude-opus-4 evalguard benchmark run --suite humaneval --model gemini-2.5-pro
Academic & reasoning benchmarks (17)
Knowledge, reasoning, code, math, and vertical-domain benchmarks for comparing models against the same numbers in OpenAI / Anthropic / DeepMind / Meta technical reports.
MMLU
mmluGeneral knowledge
Massive Multitask Language Understanding — 57 academic subjects from elementary school through professional-level (US history, computer science, law, medicine, ethics, etc.).
BIG-Bench Hard (BBH)
bigbench-hardReasoning
23 challenging BIG-Bench tasks where the prior-best LM averaged below average human rater performance.
DROP
dropReading comprehension
Discrete Reasoning Over Paragraphs — requires multi-step arithmetic, counting, sorting over text-extracted facts.
BoolQ
boolqReading comprehension
15,942 yes/no question-passage pairs from natural Google search queries.
TruthfulQA
truthfulqaFactuality
817 questions across 38 categories designed to elicit false answers from imitative LMs (urban legends, common misconceptions, conspiracy theories).
HellaSwag
hellaswagCommonsense reasoning
Multiple-choice sentence completion testing commonsense reasoning. Designed adversarially so humans score 95%+ but models often fail.
HumanEval
humanevalCode
164 hand-crafted Python programming problems. Pass@1 measured by running unit tests against generated code.
GSM8K
gsm8kMath
8,500 grade-school math word problems requiring 2-8 step reasoning chains.
ARC
arcScience reasoning
AI2 Reasoning Challenge — 7,787 grade-school science questions split into Easy + Challenge sets.
BBQ
bbqBias
Bias Benchmark for QA — measures social biases in 9 demographic categories (age, disability, gender identity, nationality, physical appearance, race/ethnicity, religion, SES, sexual orientation).
IFEval
ifevalInstruction following
Verifiable instruction-following — measures whether a model follows specific formatting / length / structure constraints in a prompt.
MMMU
mmmuMultimodal
Massive Multi-discipline Multimodal Understanding — 11,500 questions across 6 disciplines requiring image + text reasoning.
VisionBench
visionbenchMultimodal
Curated suite covering image-question reasoning, OCR, chart interpretation, scientific figure comprehension.
MedQA
medqaVertical — Medical
12,723 questions from US Medical Licensing Examination (USMLE) covering disease diagnosis, treatment, ethics.
LegalBench
legalbenchVertical — Legal
Collaborative benchmark with 162 tasks covering legal reasoning across issue spotting, rule recall, application, and conclusions.
FinanceBench
financebenchVertical — Finance
10,231 question-answer pairs grounded in real public-company financial documents (10-K filings).
CyberBench
cyberbenchVertical — Security
Cybersecurity question bank across pentesting, vulnerability classification, threat intelligence, CWE/CVE matching.
Safety & adversarial benchmarks (11)
Adversarial datasets for measuring refusal quality, jailbreak resistance, and harm-category coverage. Each is referenced by ID in our red-team plugin registry (see /docs/plugins) so they double as both standalone benchmarks AND first-class red-team plugins.
AEGIS
aegisContent safety
NVIDIA AEGIS — 26K prompts across 13 risk categories.
BeaverTails
beavertailsHarmful content
PKU-Alignment BeaverTails — 333,963 QA pairs across 14 harm categories.
HarmBench
harmbenchAdversarial
CAIS HarmBench — 510 standardized harmful behaviors across 7 categories. Reference benchmark in Anthropic / OpenAI / DeepMind model cards.
Pliny / L1B3RT4S
plinyJailbreak
1,500+ field-tested jailbreaks against GPT-4 / Claude / Gemini / Llama / Mistral.
ToxicChat
toxicchatProduction toxicity
10,166 real Vicuna conversations annotated for toxicity + jailbreak attempts.
CyberSecEval
cybersecevalCode security
Meta Purple Llama v3 — 50 CWEs + 10 MITRE ATT&CK categories. Cited in Llama 3 + GPT-4o safety model cards.
UnsafeBench
unsafebenchMultimodal safety
10,000+ unsafe image prompts across 11 categories.
VLGuard
vlguardMultimodal safety
3,000 image-question pairs across 4 safety categories.
VLSU
vlsuMultimodal safety
Vision-Language Safety Understanding — refusal calibration on visual-textual harmful pairs.
Do-Not-Answer
donotanswerRefusal calibration
939 prompts an LLM should refuse, across 5 risk areas and 12 subcategories.
XSTest
xstestOver-refusal
250 safe prompts that LOOK harmful + 200 unsafe contrasts.
Public release — eg-safety-bench-1k on HuggingFace
Apache-2.0 · 785 prompts · 10 source taxonomies
We've published a unified, license-clean slice of the safety corpus as a public HuggingFace dataset so the research community can benchmark refusal calibration, jailbreak resistance, and content-policy behaviour against the same prompts EvalGuard uses internally.
huggingface.co/datasets/evalguard/eg-safety-bench-1k →Every prompt in the public dataset is original EvalGuard staff writing, inspired by 10 published safety-research taxonomies (AEGIS, BeaverTails, CyberSecEval, DoNotAnswer, HarmBench, ToxicChat, UnsafeBench, VLGuard, VLSU, XSTest). We license-audited the source corpora before release — the Pliny / L1B3RT4S jailbreak archive is intentionally not included.
Use it from Python:
from datasets import load_dataset
ds = load_dataset("evalguard/eg-safety-bench-1k")
for row in ds["train"]:
response = your_model.generate(row["prompt"])
refused = is_refusal(response)
correct = (refused and row["expected_behavior"] == "block") or \
(not refused and row["expected_behavior"] == "allow")CLI + SDK
Every benchmark is runnable from the CLI:
# Run a single benchmark evalguard benchmark run --suite mmlu --model gpt-4o # Compare 4 models on the same benchmark evalguard benchmark run --suite humaneval \ --model gpt-4o,claude-opus-4,gemini-2.5-pro,llama-3.3-70b # Save the report evalguard benchmark run --suite gsm8k --model gpt-4o --output report.json
Or invoke directly from the Python SDK:
from evalguard import EvalGuardClient
client = EvalGuardClient()
result = client.run_benchmark(suite="mmlu", model="gpt-4o")
print(f"Score: {result.score:.2%}") # 0.87 = 87%