
SWE-bench Verified

SWE-bench Verified (human-validated subset)


SWE-bench Verified is a 500-instance subset of SWE-bench, released by OpenAI, in which every instance was validated by hand by professional engineers: the problem statement is unambiguous, the test patch correctly covers the bug, and the gold patch actually fixes it. It is the cleanest basis for apples-to-apples comparison in the SWE-bench family.


Run                  Score
Chimera, full 500    NOT RUN

The adapter is identical to the SWE-bench adapter: the SWEBenchVerified class is a thin specialisation with a different default name(). Once the Lite gap closes (see docs/benchmarks/README.md), the Verified run becomes the canonical headline number.
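To make "thin specialisation" concrete, here is a minimal sketch of the pattern. The base class below is a hypothetical stand-in, not Chimera's actual SWE-bench adapter: its name, constructor, and JSONL loading are assumptions made for illustration. Only the "swe-bench-verified" name comes from this page.

```python
import json


class SWEBench:
    """Hypothetical stand-in for the base SWE-bench adapter (assumed API)."""

    def __init__(self, dataset_path: str) -> None:
        self.dataset_path = dataset_path

    def name(self) -> str:
        # Assumed default name for the base benchmark.
        return "swe-bench"

    def tasks(self) -> list[dict]:
        # Assumed on-disk format: one JSON object per non-empty line.
        with open(self.dataset_path, encoding="utf-8") as fh:
            return [json.loads(line) for line in fh if line.strip()]


class SWEBenchVerified(SWEBench):
    """Thin specialisation: loading and scoring are inherited unchanged."""

    def name(self) -> str:
        return "swe-bench-verified"
```

Because nothing but name() is overridden in this sketch, any fix to the base adapter would apply to the Verified run automatically.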

# Pull the dataset
huggingface-cli download princeton-nlp/SWE-bench_Verified \
--repo-type dataset --local-dir ./swe-bench-verified
# Run via Python
python -c "
from chimera.eval.benchmarks.swe_bench_verified import SWEBenchVerified
from chimera.eval.harness import Harness
bench = SWEBenchVerified(dataset_path='./swe-bench-verified/data.jsonl')
print(bench.name(), len(bench.tasks()))
"

From Python:

from chimera.eval.benchmarks.swe_bench_verified import SWEBenchVerified
bench = SWEBenchVerified(dataset_path="swe-bench-verified.jsonl")
print(bench.name()) # "swe-bench-verified"
print(len(bench.tasks())) # 500

Each task has the same fields as a SWE-bench task, plus an is_verified: true flag.

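Scoring follows the same FAIL_TO_PASS / PASS_TO_PASS contract as SWE-bench: an instance counts as resolved only if every FAIL_TO_PASS test passes once the candidate patch is applied and every PASS_TO_PASS test still passes. Run evaluations inside the per-instance Docker images for byte-identical reproducibility.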

  • A full 500-instance run is expensive — budget 100+ hours of agent runtime and $100–500 in API cost depending on the model.
  • The verified subset is biased toward tractable instances. A high Verified score does not extrapolate to the full 2,294-instance SWE-bench.
  • Use SWE-bench top-20 for a fast smoke run before committing to the full sweep.
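As a rough planning aid, the budget figures above can be turned into per-instance estimates and scaled down to a smoke run. This sketch assumes cost and runtime scale linearly with instance count, takes 100 hours as the lower bound of "100+ hours", and treats the smoke run as 20 instances. The scale_budget helper is hypothetical, and real runs will vary by instance and model.

```python
TOTAL_INSTANCES = 500          # size of SWE-bench Verified
RUNTIME_HOURS = 100            # lower bound of the "100+ hours" estimate
COST_USD = (100, 500)          # API cost range for the full sweep


def scale_budget(n_instances: int) -> dict[str, float]:
    """Linearly scale the full-sweep budget to a run of n_instances."""
    return {
        "hours": RUNTIME_HOURS * n_instances / TOTAL_INSTANCES,
        "cost_low_usd": COST_USD[0] * n_instances / TOTAL_INSTANCES,
        "cost_high_usd": COST_USD[1] * n_instances / TOTAL_INSTANCES,
    }


# Average agent runtime per instance, in minutes.
minutes_per_instance = RUNTIME_HOURS * 60 / TOTAL_INSTANCES

# Budget for a 20-instance smoke run under the linear assumption.
smoke = scale_budget(20)
print(f"{minutes_per_instance:.0f} min/instance, smoke run: {smoke}")
```

Under these assumptions the full sweep averages about 12 minutes and $0.20 to $1.00 per instance, so a 20-instance smoke run costs roughly 4 hours and $4 to $20.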