Aionda

2026-03-07

EVMbench Benchmarks Detect Patch And Exploit Agent Workflows

EVMbench evaluates agent smart-contract security across detection, patching with tests, and exploit attempts in a sandboxed EVM.


On a local chain, transactions can be replayed deterministically inside a constrained sandbox, where an agent can read, modify, and execute smart contracts. In that setting, the goal extends beyond “finding” bugs: it includes reproducing the vulnerability, applying a patch, and reaching a point where the exploit no longer succeeds.
EVMbench evaluates agents across three modes: Detect, Patch, and Exploit.

TL;DR

  • EVMbench evaluates Detect, Patch, and Exploit as one workflow, not detection alone.
  • This framing highlights dual-use automation risks alongside defensive value.
  • Consider adopting Patch-style grading in CI, then validate fit against your own standards.

Example: A team runs an agent in a local sandbox. The agent patches a contract, the team adds unseen attack tests, and the team then checks that the existing tests pass while the exploits fail.

Current state

EVMbench starts from the premise that smart contracts can hold significant value and that vulnerabilities can contribute to losses. According to the OpenAI intro post, the benchmark uses 117 vulnerabilities; the PDF says these were filtered to high-severity issues that could directly lead to loss of funds.

The evaluation modes are Detect, Patch, and Exploit. Detect uses audit reports as ground truth and measures how well the agent finds vulnerabilities; it is described as recall-focused.

Patch follows a procedure closer to practice. After the agent fixes the code, the grader checks whether existing tests still pass, then checks whether the known attacks still succeed against the patched contract. The intended result is that those exploits fail.
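The two-check logic can be sketched as a small pass/fail function. This is a minimal illustration of the scoring described above, not EVMbench's actual grader, and the function and parameter names are hypothetical:

```python
# Sketch of EVMbench-style Patch grading (names are hypothetical, not the
# benchmark's actual API). A patch passes only if the existing tests still
# succeed AND every previously working exploit now fails.

def grade_patch(existing_tests_pass: bool, exploit_results: list[bool]) -> bool:
    """exploit_results holds True for each exploit that still succeeds post-patch."""
    functionality_preserved = existing_tests_pass
    exploits_blocked = not any(exploit_results)
    return functionality_preserved and exploits_blocked

# A patch that keeps tests green but leaves one exploit working still fails:
assert grade_patch(True, [False, False]) is True
assert grade_patch(True, [False, True]) is False
assert grade_patch(False, []) is False
```

Note that both conditions are conjunctive: a patch that blocks all exploits but breaks existing tests scores no better than one that changes nothing.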

Exploit evaluates attacks programmatically. The OpenAI intro post says Exploit runs in an isolated local sandbox based on Anvil, with a Rust-based harness that deploys contracts, replays transactions deterministically, and restricts unsafe RPC methods. It also mentions a custom grader, and says the team red-teamed the environment and patched the bypass methods they found.
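One common way to restrict unsafe RPC methods is an allowlist filter in front of the node. The sketch below illustrates that idea in Python; the actual harness is Rust-based, and the method lists here are examples rather than EVMbench's real configuration:

```python
# Illustrative JSON-RPC allowlist, in the spirit of the sandbox's
# "restrict unsafe RPC methods" step. Method lists are examples only.

ALLOWED_METHODS = {"eth_call", "eth_sendRawTransaction", "eth_getBalance",
                   "eth_getTransactionReceipt", "eth_blockNumber"}
BLOCKED_METHODS = {"anvil_setBalance", "anvil_setCode",
                   "anvil_impersonateAccount", "evm_increaseTime"}

def filter_rpc(request: dict) -> dict:
    """Reject any JSON-RPC request whose method is not explicitly allowed."""
    method = request.get("method", "")
    if method in BLOCKED_METHODS or method not in ALLOWED_METHODS:
        return {"jsonrpc": "2.0", "id": request.get("id"),
                "error": {"code": -32601,
                          "message": f"method {method!r} blocked by sandbox"}}
    return request  # would be forwarded to the node unchanged

assert "error" in filter_rpc({"jsonrpc": "2.0", "id": 1, "method": "anvil_setBalance"})
assert "error" not in filter_rpc({"jsonrpc": "2.0", "id": 2, "method": "eth_call"})
```

Blocking state-manipulation methods such as `anvil_setBalance` matters because an agent that can mint itself funds can "succeed" at an exploit without exploiting anything.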

Analysis

EVMbench centers on an end-to-end workflow rather than on matching a single “correct answer text.” In real audits, “done” often implies several steps: detection, reproduction, patching, and re-testing. EVMbench encodes similar steps into its scoring.

In Patch mode, scoring depends on two checks: the patch should keep functionality intact, approximated by passing existing tests, and it should prevent exploitation.

The benchmark also highlights dual-use risk: automation can support both attack and defense, and including an Exploit mode makes that scope explicit.

Representativeness may be limited. OpenAI states: “EVMbench does not represent the full difficulty of real-world smart contract security.” Public audit or competition data may miss operational constraints, large production systems, multi-contract interactions, and live triage during incidents. External reviews, based on summarized investigation results, have also raised label-quality concerns, reporting items marked high severity that were not exploitable. Organizations can benefit from checking ground-truth quality internally.

Practical application

The main transferable idea may be the grading rules. Patch has two stages that map naturally onto CI: stage one checks functional preservation via existing tests, and stage two checks re-attack resistance via unseen exploit tests.
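The two CI stages could be wired as a small gate that shells out to a project's own test runners. The commands below are placeholders (for example, a Foundry suite for stage one and a separate exploit suite for stage two); this is a sketch under those assumptions, not a prescribed setup:

```python
import subprocess
import sys

# Two-stage CI gate: stage 1 runs the existing test suite (must exit 0),
# stage 2 runs exploit tests (must exit non-zero, i.e. attacks no longer land).
# The commands passed in are placeholders for your own runners.

def ci_gate(functional_cmd: list[str], exploit_cmd: list[str]) -> bool:
    stage1 = subprocess.run(functional_cmd).returncode == 0   # tests pass
    stage2 = subprocess.run(exploit_cmd).returncode != 0      # exploits fail
    return stage1 and stage2

# Demo with stand-in commands: tests "pass", exploit "fails to land".
ok = ci_gate([sys.executable, "-c", "pass"],
             [sys.executable, "-c", "raise SystemExit(1)"])
assert ok
```

Inverting the exit-code expectation for stage two is the key design choice: the exploit suite is written to succeed against the vulnerable contract, so a green patch gate requires it to break.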

When introducing an agent, the pass criterion can shift from “found the location” to “patch is deployable,” with evidence grounded in test execution outputs rather than explanatory prose.

Checklist for Today:

  • Add a Patch gate that requires existing tests to pass and exploit tests to fail.
  • Run exploit verification on an isolated local chain, such as an Anvil sandbox.
  • Report agent performance using post-patch regression and attack test outcomes.

FAQ

Q1. What does EVMbench evaluate?
A1. It evaluates Detect, Patch, and Exploit, framing them as an agent workflow for smart contract security.

Q2. How do you verify that a patch is a ‘good patch’?
A2. It checks whether existing tests pass after patching, and whether exploitation fails on unseen exploit tests. It also resets disallowed test modifications before execution, a step intended to reduce tampering with test files.
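One simple way to implement such a reset is to snapshot test-file hashes before the agent runs and restore any file that changed. The layout and helper names below are illustrative, not EVMbench's actual code:

```python
import hashlib
import pathlib

# Sketch of a "reset disallowed test modifications" step: record a hash of
# each protected test file before the agent runs, then restore any file the
# agent altered before the grading suite executes. Names are illustrative.

def snapshot(paths: list[str]) -> dict[str, str]:
    """Map each protected file to the SHA-256 of its current contents."""
    return {p: hashlib.sha256(pathlib.Path(p).read_bytes()).hexdigest()
            for p in paths}

def restore_tampered(paths: list[str], baseline: dict[str, str],
                     originals: dict[str, bytes]) -> list[str]:
    """Rewrite any file whose hash changed since the snapshot; return the list."""
    tampered = []
    for p in paths:
        if hashlib.sha256(pathlib.Path(p).read_bytes()).hexdigest() != baseline[p]:
            pathlib.Path(p).write_bytes(originals[p])
            tampered.append(p)
    return tampered
```

In a CI setting, `git checkout -- <test-dir>` before grading achieves a similar effect with less machinery.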

Q3. Can we trust this benchmark as-is to reflect real-world difficulty?
A3. It has stated limitations: OpenAI says it does not cover the full difficulty of real-world security work, and some reports raise concerns about label quality. Calibrating against internal standards can reduce mismatch risk.

Conclusion

EVMbench evaluates smart contract security as a loop of detection, patching, and re-attack resistance. The key question is how it will be applied in practice: it could raise expectations toward patch-ready detection, and it could also lower the cost of attack automation.


Source: arxiv.org