Agentic

Benchmarks

Explore benchmarks that score full multi-step agent trajectories. Track how frontier models perform on complex, real-world workflows across specialized domains.

Highlighted benchmarks

Objective metrics across 5 specialized domains

Coding & Software Engineering

SWE-bench

The industry standard for multi-step agentic coding. Models navigate codebases to resolve real GitHub issues, and each fix is verified by running the repository's unit tests.

1. Claude 4.5 Opus (high reasoning): 76.80%
2. Gemini 3 Flash (high reasoning): 75.80%
3. MiniMax M2.5 (high reasoning): 75.80%
4. Claude Opus 4.6: 75.60%
5. GPT-5.2 Codex: 72.80%

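
The test-based verification described for SWE-bench can be sketched as follows. This is a minimal illustration, not the official evaluation harness: the function names `is_resolved` and `resolve_rate` are hypothetical, and the `fail_to_pass` / `pass_to_pass` split is assumed here to mean "tests the fix must make pass" and "tests that must keep passing".

```python
from typing import Dict, List


def is_resolved(outcomes: Dict[str, bool],
                fail_to_pass: List[str],
                pass_to_pass: List[str]) -> bool:
    """Return True only if every required test passed after the patch.

    `outcomes` maps test name -> passed (hypothetical format).
    A test missing from `outcomes` is treated as a failure.
    """
    required = list(fail_to_pass) + list(pass_to_pass)
    return all(outcomes.get(name, False) for name in required)


def resolve_rate(flags: List[bool]) -> float:
    """Percentage of issues resolved, rounded to two decimals."""
    if not flags:
        return 0.0
    return round(100.0 * sum(flags) / len(flags), 2)
```

Under these assumptions, a leaderboard percentage is simply `resolve_rate` applied to one `is_resolved` flag per issue.
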
Legal

Legal Agent Benchmark (LAB)

Tests agents on open-ended assignments like M&A data room reviews. Agents must read, cross-reference, and draft documents over long horizons without hallucinating. Scoring uses a strict all-pass metric.

1. Claude 4.7 Opus: 7.1%
2. Claude 4.6 Sonnet: 5.4%
3. Claude 4.6 Opus: 4.2%
4. GPT-5.5: 2.1%
5. Gemini 3.5 Flash: 0.8%

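
To illustrate why a strict all-pass metric produces low scores, here is a small sketch. It assumes each assignment is graded by a list of boolean checks and counts as solved only when every check passes. The function names and the task data are hypothetical and are not taken from the LAB scoring code.

```python
from typing import Dict, List


def all_pass_rate(task_checks: Dict[str, List[bool]]) -> float:
    """Fraction of tasks for which every check passed.

    A task with no checks is counted as unsolved.
    """
    if not task_checks:
        return 0.0
    solved = sum(1 for checks in task_checks.values() if checks and all(checks))
    return solved / len(task_checks)


def per_check_rate(task_checks: Dict[str, List[bool]]) -> float:
    """Fraction of individual checks passed, pooled across tasks (for contrast)."""
    flat = [c for checks in task_checks.values() for c in checks]
    return sum(flat) / len(flat) if flat else 0.0


# Hypothetical grading results for three assignments.
example = {
    "nda_review": [True, True, True],
    "data_room": [True, True, False],
    "memo_draft": [True, False, True],
}
```

In this made-up example the agent passes 7 of 9 individual checks, yet only 1 of 3 assignments counts as solved, because a single failed check fails the whole assignment.
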
Finance

DABstep (Data Agent Bench)

Tests multi-step reasoning by asking agents to query structured databases and read unstructured financial manuals to solve tasks in payments, fraud, and financial analysis.

1. NVIDIA KGMON (NeMo Agent Toolkit) Data Explorer: 89.95%
2. DataPilot: 87.57%
3. gg-agent-gpt5-1104-1: 62.96%
4. CambioML energent.ai DS Agent: 57.67%
5. DS-STAR: 45.24%

STEM

WebArena

Drops an agent into a sandboxed web browser to perform complex, multi-page STEM and IT workflows. Success is verified objectively by checking the final application state.

1. Claude Mythos Preview: 68.7%
2. GPT-5.4 Pro: 65.8%
3. Claude 4.6 Opus: 64.5%
4. GPT-5.3 Codex: 59.1%
5. Gemini 3.1 Pro: 54.3%

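
Verifying success by the final application state, rather than by the agent's transcript, can be sketched as a subset match. This is an assumption-laden illustration: the state is modeled as a nested dictionary, and `state_matches` plus all field names are hypothetical, not part of WebArena's evaluator.

```python
from typing import Any, Dict


def state_matches(final_state: Dict[str, Any], expected: Dict[str, Any]) -> bool:
    """Return True if every expected key/value appears in the final state.

    Extra keys in `final_state` are ignored; nested dicts are compared recursively.
    """
    for key, want in expected.items():
        if key not in final_state:
            return False
        got = final_state[key]
        if isinstance(want, dict):
            if not isinstance(got, dict) or not state_matches(got, want):
                return False
        elif got != want:
            return False
    return True


# Hypothetical snapshot of an application after the agent finishes.
final = {"ticket": {"id": 17, "status": "closed", "assignee": "ops"}, "unread": 0}
goal = {"ticket": {"status": "closed"}}
```

Here `state_matches(final, goal)` holds because the ticket ended up closed, regardless of which pages the agent visited to get there.
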
Medical

MedAgentBench v2

Evaluates multi-step, clinically driven workflows inside a FHIR-compliant Electronic Health Record (EHR) sandbox, requiring tool use and deep diagnostic reasoning.

1. Claude 4.6 Opus: 74.1%
2. Claude 4.6 Sonnet: 72.9%
3. Grok 4: 71.9%
4. GPT-5.5: 57.7%
5. Gemini 3.1 Pro: 39.3%