SWE-bench
SWE-bench (real GitHub issues)
SWE-bench (Princeton, 2023) gives an agent a real GitHub issue plus a frozen repository snapshot. The agent edits the codebase; the patch is graded by running the issue’s FAIL_TO_PASS tests (which fail before the fix and must pass after it) and PASS_TO_PASS tests (which must keep passing).
References:
- Website: https://www.swebench.com/
- Paper: arXiv:2310.06770
- GitHub: https://github.com/princeton-nlp/SWE-bench
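The grading rule can be sketched in a few lines. This is an illustration, not the harness's actual code: `is_resolved` and the `results` mapping (test ID to pass/fail) are hypothetical names. It encodes the usual SWE-bench criterion, where an instance counts as resolved only if every FAIL_TO_PASS test now passes and every PASS_TO_PASS test still passes.

```python
from typing import Iterable, Mapping


def is_resolved(
    fail_to_pass: Iterable[str],
    pass_to_pass: Iterable[str],
    results: Mapping[str, bool],
) -> bool:
    """Return True if the patch fixes the issue without breaking anything.

    `results` maps a test ID to True (passed) or False (failed) after the
    agent's patch is applied. A test missing from `results` counts as failed.
    """
    fixed = all(results.get(t, False) for t in fail_to_pass)
    unbroken = all(results.get(t, False) for t in pass_to_pass)
    return fixed and unbroken


# Hypothetical post-patch test outcomes for one instance.
results = {"tests.test_x::test_y": True, "tests.test_x::test_z": True}
print(is_resolved(["tests.test_x::test_y"], ["tests.test_x::test_z"], results))
```

Treating a missing test as a failure is a conservative choice for this sketch: a test that never ran should not count toward a resolved instance.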
Variants
| Variant | Instances | Notes |
|---|---|---|
| Lite | 300 (filtered for tractability) | The most common comparison surface. |
| Lite top-20 | 20 (smallest patches, hand-picked) | Smoke set used in Chimera’s reports. |
| Verified | 500 | OpenAI’s human-validated subset; see swe-bench-verified. |
| Full | 2,294 | All collected instances. Rarely run end-to-end. |
Status: GAP (active work to close the score)
| Run | Score | Notes |
|---|---|---|
| Chimera + GLM-5.1, top-20 Lite | 10% (2 / 20) | Report |
| GLM-5 + OpenHands (reference) | 77.8% Lite | Published. |
The root-cause analysis in docs/benchmarks/README.md identifies five gaps, in priority order: max iterations (100 → 500), action space (bash → bash + IPython), LLM-based condensation, native str_replace, and multi-action turns.
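A 20-instance smoke set gives a coarse estimate, so it helps to read the 10% figure with an interval around it. The sketch below computes a 95% Wilson score interval for 2 passes out of 20. The function name `wilson_interval` is illustrative and not part of Chimera.

```python
import math


def wilson_interval(passed: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial pass rate (z = 1.96 gives ~95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = passed / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half


low, high = wilson_interval(2, 20)
print(f"pass rate {2 / 20:.0%}, 95% interval {low:.1%} to {high:.1%}")
```

For 2 of 20 the interval spans roughly 3% to 30%, so a top-20 result is better read as a smoke test than as a number to set against published Lite scores.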
How to run
```bash
# Docker-isolated, proper FAIL_TO_PASS / PASS_TO_PASS
python examples/benchmarks/swe_bench_proper.py --count 10

# Canonical Lite top-20 (matches the 10% number above)
python examples/benchmarks/swe_bench_lite_run.py --count 20 --max-steps 50
```

```python
from chimera.eval.benchmarks import SWEBench
from chimera.eval.harness import Harness

bench = SWEBench(dataset_path="swe-bench-lite.jsonl", limit=20)
print(bench.name())        # "swe-bench"
print(len(bench.tasks()))  # 20

# my_agent is an agent instance you have constructed elsewhere.
harness = Harness(agent=my_agent, benchmark=bench)
results = harness.run()
print(results.pass_rate())
```

Task shape
```json
{
  "instance_id": "django__django-12345",
  "repo": "django/django",
  "base_commit": "abc123",
  "problem_statement": "Fix the bug where ...",
  "test_patch": "...",
  "FAIL_TO_PASS": ["tests.test_x::test_y"],
  "PASS_TO_PASS": ["tests.test_x::test_z"]
}
```

Gotchas
- Docker required for proper grading. The `swe_bench_proper.py` runner pulls a per-instance image (`princeton-nlp/sweb.eval.x86_64.<instance_id>`); the first run can take 15+ minutes and more than 50 GB of disk.
- Cost ceiling. A 20-instance run with 50 max-steps costs $30–60 on Claude Sonnet, $5–10 on GLM-5.
- Rather than the cherry-picked top-20, use `swe-bench-verified` for human-validated numbers comparable to published research.
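The cost figures above can be turned into a rough budget for larger runs. The sketch below assumes cost scales linearly with instance count at a fixed 50 max-steps, which is a simplification. `estimate_cost` is a hypothetical helper, and its only inputs are the 20-instance ranges quoted in the list.

```python
def estimate_cost(
    instances: int,
    ref_cost_usd: tuple[float, float],
    ref_instances: int = 20,
) -> tuple[float, float]:
    """Linearly scale a (low, high) USD cost range from a reference run."""
    low, high = ref_cost_usd
    scale = instances / ref_instances
    return low * scale, high * scale


# Reference ranges for a 20-instance, 50 max-steps run (from the list above).
REFERENCE_COSTS = {"Claude Sonnet": (30.0, 60.0), "GLM-5": (5.0, 10.0)}

for model, ref in REFERENCE_COSTS.items():
    low, high = estimate_cost(300, ref)  # 300 = size of SWE-bench Lite
    print(f"{model}: ${low:,.0f} to ${high:,.0f} for 300 instances")
```

Under that linear assumption, a full Lite run lands around $450 to $900 on Claude Sonnet and $75 to $150 on GLM-5. Actual spend depends on how many steps each instance uses, so treat these as planning numbers only.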