SWE-bench Verified
SWE-bench Verified (human-validated subset)
SWE-bench Verified is OpenAI’s 500-instance subset of SWE-bench where every instance was hand-validated by professional engineers: the problem statement is unambiguous, the test patch correctly covers the bug, and the gold patch actually fixes it. It’s the cleanest apples-to-apples surface in the SWE-bench family.
References:
- Announcement: https://openai.com/index/introducing-swe-bench-verified/
- Dataset: https://huggingface.co/datasets/princeton-nlp/SWE-bench_Verified
Status: TODO (full 500 not yet run)
| Run | Score |
|---|---|
| Chimera, full 500 | NOT RUN |
The adapter is identical to SWE-bench — the SWEBenchVerified class is a thin specialisation with a different default name(). Once the Lite gap closes (see docs/benchmarks/README.md), the Verified run becomes the canonical headline number.
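A thin specialisation of this kind could look like the sketch below. The `SWEBench` base class and its `"swe-bench"` name are hypothetical stand-ins, not Chimera's actual adapter; only the `SWEBenchVerified` class name, the `name()` method, and the `"swe-bench-verified"` value come from this page.

```python
# Hypothetical stand-in for the SWE-bench adapter; the real class lives in Chimera.
class SWEBench:
    def __init__(self, dataset_path: str) -> None:
        self.dataset_path = dataset_path

    def name(self) -> str:
        return "swe-bench"  # assumed default name


class SWEBenchVerified(SWEBench):
    """Inherits loading and grading unchanged; only the default name differs."""

    def name(self) -> str:
        return "swe-bench-verified"


bench = SWEBenchVerified(dataset_path="swe-bench-verified.jsonl")
print(bench.name())
```

Keeping the subclass this small means any fix to the SWE-bench adapter applies to Verified runs automatically.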
How to run
```bash
# Pull the dataset
huggingface-cli download princeton-nlp/SWE-bench_Verified \
  --repo-type dataset --local-dir ./swe-bench-verified

# Run via Python
python -c "
from chimera.eval.benchmarks.swe_bench_verified import SWEBenchVerified
from chimera.eval.harness import Harness
bench = SWEBenchVerified(dataset_path='./swe-bench-verified/data.jsonl')
print(bench.name(), len(bench.tasks()))
"
```

```python
from chimera.eval.benchmarks.swe_bench_verified import SWEBenchVerified

bench = SWEBenchVerified(dataset_path="swe-bench-verified.jsonl")
print(bench.name())        # "swe-bench-verified"
print(len(bench.tasks()))  # 500
```

Task shape
Same as SWE-bench plus an is_verified: true flag.
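For orientation, a single task record might look like the sketch below. The field names other than `is_verified` follow the upstream SWE-bench dataset columns and are an assumption about Chimera's task shape; every value is a made-up placeholder, not a real instance.

```python
import json

# Illustrative record only: field names follow the upstream SWE-bench dataset
# columns (assumed to carry over to Chimera); all values are placeholders.
task = {
    "instance_id": "example__repo-1234",
    "repo": "example/repo",
    "base_commit": "0000000000000000000000000000000000000000",
    "problem_statement": "Describe the bug here.",
    "FAIL_TO_PASS": ["tests/test_bug.py::test_fixed"],
    "PASS_TO_PASS": ["tests/test_core.py::test_unchanged"],
    "is_verified": True,  # the one flag Verified adds
}

# Round-trip through JSON, as a JSONL line would be stored and read back.
line = json.dumps(task)
restored = json.loads(line)
print(restored["instance_id"], restored["is_verified"])
```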
Grading
Same FAIL_TO_PASS / PASS_TO_PASS contract as SWE-bench. Run inside per-instance Docker images for byte-identical reproducibility.
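The contract can be read as: an instance counts as resolved only if, after the candidate patch is applied, every FAIL_TO_PASS test passes and every PASS_TO_PASS test still passes. A minimal sketch of that check follows; the `is_resolved` helper and its input shape are hypothetical, not Chimera's grading API.

```python
def is_resolved(fail_to_pass, pass_to_pass, passed_tests):
    """Return True only if every FAIL_TO_PASS and PASS_TO_PASS test passed.

    `passed_tests` is the set of test IDs that passed after the candidate
    patch was applied (hypothetical input shape).
    """
    required = set(fail_to_pass) | set(pass_to_pass)
    return required <= set(passed_tests)


# The fix works and nothing regressed: resolved.
print(is_resolved(["t_bug"], ["t_core"], {"t_bug", "t_core"}))  # True
# The previously passing test broke: not resolved.
print(is_resolved(["t_bug"], ["t_core"], {"t_bug"}))  # False
```

Note that a patch fixing the target bug still fails grading if it breaks any PASS_TO_PASS test.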
Gotchas
- A full 500-instance run is expensive: budget 100+ hours of agent runtime and $100–500 in API cost depending on the model.
- The verified subset is biased toward tractable instances. A high Verified score does not extrapolate to the full 2,294-instance SWE-bench.
- Use SWE-bench top-20 for a fast smoke run before committing to the full sweep.
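To put the budget figures in per-instance terms, the arithmetic below divides the quoted totals by 500 and scales them to a 20-instance smoke run. It assumes cost and runtime are spread evenly across instances, which is a simplification; real instances vary widely.

```python
INSTANCES = 500
HOURS_MIN = 100                 # lower bound of the "100+ hours" figure
COST_LOW, COST_HIGH = 100, 500  # USD range quoted above
SMOKE = 20                      # size of the smoke run

# Even-split assumption: every instance takes the same time and money.
minutes_per_instance = HOURS_MIN * 60 / INSTANCES
cost_low = COST_LOW / INSTANCES
cost_high = COST_HIGH / INSTANCES

print(f"{minutes_per_instance:.0f}+ min per instance")       # 12+ min per instance
print(f"${cost_low:.2f} to ${cost_high:.2f} per instance")   # $0.20 to $1.00 per instance
print(f"smoke run: ${cost_low * SMOKE:.0f} to ${cost_high * SMOKE:.0f}")  # smoke run: $4 to $20
```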