Agentic Benchmarks

Explore definitive, multi-step trajectory benchmarks. Track how frontier models perform on complex, real-world workflows across specialized domains.

Highlighted benchmarks

Objective metrics across 5 specialized domains

Coding & Software Engineering

SWE-bench

A widely used standard for multi-step agentic coding. Models navigate codebases to resolve real GitHub issues, and each fix is verified by running the repository's unit tests.
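The test-based verification described above can be sketched roughly as follows. SWE-bench distinguishes tests that a fix must flip from failing to passing from tests that must keep passing. The function names and the dictionary format for test results are illustrative assumptions, not the official evaluation harness.

```python
from typing import Dict, List


def is_resolved(
    test_results: Dict[str, bool],
    fail_to_pass: List[str],
    pass_to_pass: List[str],
) -> bool:
    """Return True only if every required test passed after the patch.

    test_results maps a test name to True (passed) or False (failed).
    A test that is missing from the results counts as a failure.
    """
    required = list(fail_to_pass) + list(pass_to_pass)
    return all(test_results.get(name, False) for name in required)


def resolve_rate(outcomes: List[bool]) -> float:
    """Percentage of issues resolved, the figure a leaderboard would report."""
    if not outcomes:
        return 0.0
    return 100.0 * sum(outcomes) / len(outcomes)


# Hypothetical example: one issue with one target test and one regression test.
results = {"test_fix_applies": True, "test_existing_behavior": True}
print(is_resolved(results, ["test_fix_applies"], ["test_existing_behavior"]))
```

Under this scheme a patch that fixes the issue but breaks an existing test still counts as unresolved.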

Claude 4.5 Opus (high reasoning)

76.80%

Gemini 3 Flash (high reasoning)

75.80%

MiniMax M2.5 (high reasoning)

Legal Agent Benchmark (LAB)

Tests agents on open-ended assignments like M&A data room reviews. Agents must read, cross-reference, and draft documents over long horizons without hallucinating, and are scored under a strict all-pass metric.
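As a minimal sketch of what an all-pass metric means in general, an assignment earns credit only when every one of its criteria is satisfied. The criterion names and data layout below are hypothetical and are not taken from LAB's actual rubric.

```python
from typing import Dict, List


def all_pass(criteria: Dict[str, bool]) -> bool:
    """An assignment passes only if it has criteria and all of them hold."""
    return bool(criteria) and all(criteria.values())


def all_pass_rate(assignments: List[Dict[str, bool]]) -> float:
    """Fraction of assignments on which every criterion was satisfied."""
    if not assignments:
        return 0.0
    return sum(all_pass(c) for c in assignments) / len(assignments)


# Hypothetical criteria for two assignments.
assignments = [
    {"cited_correct_clause": True, "no_fabricated_facts": True},
    {"cited_correct_clause": True, "no_fabricated_facts": False},
]
print(all_pass_rate(assignments))
```

The second assignment satisfies one of its two criteria but scores zero, which is why all-pass rates tend to sit well below per-criterion accuracy.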

DABstep (Data Agent Bench)

Tests multi-step reasoning by asking agents to query structured databases and read unstructured financial manuals to solve use cases in payments, fraud, and financial analysis.

NVIDIA KGMON (NeMo Agent Toolkit) Data Explorer

CambioML energent.ai DS Agent

WebArena

Drops an agent into a sandboxed web browser to perform complex, multi-page STEM and IT workflows. Success is verified objectively by checking the final application state.
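Checking the final application state, rather than the agent's transcript, can be sketched as a comparison between the state observed after the run and the fields a task requires. The field names and the flat dictionary representation are assumptions for illustration, not WebArena's actual evaluator.

```python
from typing import Any, Dict


def state_matches(final_state: Dict[str, Any], expected: Dict[str, Any]) -> bool:
    """Return True if every expected field is present with the expected value.

    Fields in final_state that the task does not mention are ignored.
    """
    return all(
        key in final_state and final_state[key] == value
        for key, value in expected.items()
    )


# Hypothetical task: the agent was asked to cancel an order.
final_state = {"order_status": "cancelled", "cart_items": 0, "user": "alice"}
expected = {"order_status": "cancelled"}
print(state_matches(final_state, expected))
```

Because only the end state is inspected, any sequence of page visits that reaches it counts as a success.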

Claude Mythos Preview

MedAgentBench v2

Evaluates multi-step, clinically driven workflows inside a FHIR-compliant Electronic Health Record (EHR) sandbox, requiring tool use and deep diagnostic reasoning.