OpenAI and Paradigm have released EVMbench, a benchmark designed to evaluate the ability of AI agents to detect, patch, and exploit vulnerabilities in smart contracts.
The benchmark draws on 120 curated vulnerabilities sourced from 40 audits, the majority of which were taken from open-code audit competitions. It also incorporates vulnerability scenarios from the security auditing process for the Tempo blockchain, a Layer 1 network built for high-throughput, low-cost stablecoin payments. This inclusion is intended to extend the benchmark's coverage into payment-oriented smart contract code, reflecting anticipated growth in agentic stablecoin payment activity.
Three evaluation modes and key findings
EVMbench assesses AI agents across three capability modes. In detect mode, agents audit a smart contract repository and are scored on how thoroughly they identify known vulnerabilities. In patch mode, agents must modify vulnerable contracts to remove exploitability while preserving intended functionality, verified through automated testing. In exploit mode, agents execute fund-draining attacks against contracts deployed in a sandboxed blockchain environment, with results verified programmatically.
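The scoring logic described above can be illustrated with a minimal sketch. It is not EVMbench's actual grader: the names `Finding`, `detect_recall`, and `patch_passes` are hypothetical, and it assumes detect mode is scored as plain recall over the known vulnerabilities and that patch mode passes only when functional tests succeed and the original exploit no longer works.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """Hypothetical identifier for one vulnerability in an audited repository."""
    contract: str
    vuln_id: str


def detect_recall(known, reported):
    """Assumed detect-mode score: fraction of known vulnerabilities that were reported."""
    known_set = set(known)
    if not known_set:
        raise ValueError("at least one known vulnerability is required")
    hits = known_set & set(reported)
    return len(hits) / len(known_set)


def patch_passes(functional_tests_pass, exploit_still_succeeds):
    """Assumed patch-mode check: functionality preserved and exploit removed."""
    return functional_tests_pass and not exploit_still_succeeds


# Made-up example data: four known issues, of which the agent reports one.
known = [
    Finding("Vault.sol", "H-01"),
    Finding("Vault.sol", "H-02"),
    Finding("Router.sol", "H-03"),
    Finding("Router.sol", "M-01"),
]
reported = [Finding("Vault.sol", "H-01")]

score = detect_recall(known, reported)
print(f"detect recall: {score:.2f}")
print("patch accepted:", patch_passes(True, False))
```

Under a recall-style score like this, an agent that stops after its first finding is capped at a low score however accurate that one finding is.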
Across the frontier models tested, performance was strongest in exploit mode, where the objective is explicit and agents can keep iterating until funds are drained. GPT-5.3-Codex, running via Codex CLI, scored 72.2% in exploit mode, compared with 31.9% for GPT-5, which was released approximately six months earlier. Performance in detect and patch modes remained below full coverage. In detect mode, agents sometimes stopped after identifying a single vulnerability rather than completing a full audit; in patch mode, the difficulty lay in removing the vulnerability while preserving full contract functionality.
Ecosystem implications and defensive investment
Smart contracts currently secure more than USD 100 billion in open-source crypto assets. OpenAI framed EVMbench as both a measurement tool and a signal for the security community to incorporate AI-assisted auditing into standard workflows. Alongside the benchmark release, the company announced USD 10 million in API credits to support cyber defence work, with priority given to open-source software and critical infrastructure. The benchmark's tasks, tooling, and evaluation framework have been made publicly available.