SWE-PolyBench

SWE-PolyBench (Amazon Science, 2025) is a repository-level coding benchmark that extends the SWE-bench format across four languages: Python, Java, JavaScript, and TypeScript. Each task is a real GitHub bug fix, feature addition, or refactor; agents are graded by running the upstream test suite against their patch.

| Split | Instances | Notes |
| --- | --- | --- |
| pb500 | 500 | The standard headline split, balanced across languages and task types. |
| verified | smaller | Human-validated subset. |
| full | all | Full collected pool. Rarely run end-to-end. |

Languages: python · java · javascript · typescript. Pass language= to filter to one.

# 1. Pull the dataset to disk. The upstream lives on HuggingFace; one easy
# path is the datasets library:
python -c "
from datasets import load_dataset
import json
ds = load_dataset('AmazonScience/SWE-PolyBench', split='train')
with open('swe-polybench.jsonl', 'w') as f:
    for row in ds:
        f.write(json.dumps(dict(row)) + '\n')
"
# 2. Run via the bench harness
chimera bench swe-polybench --dataset swe-polybench.jsonl --limit 10
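
Before running the harness, it can help to sanity-check the exported file. The following is a minimal sketch that counts records per language; it assumes each JSONL record carries a "language" key matching the instance field of the same name, and `count_by_language` is a hypothetical helper, not part of chimera.

```python
import json
from collections import Counter
from pathlib import Path


def count_by_language(jsonl_path):
    """Count records per value of the (assumed) "language" field."""
    counts = Counter()
    with Path(jsonl_path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines in the export
            record = json.loads(line)
            counts[record.get("language", "<missing>")] += 1
    return counts


if __name__ == "__main__":
    path = Path("swe-polybench.jsonl")
    if path.exists():
        for language, n in sorted(count_by_language(path).items()):
            print(f"{language}: {n}")
```

A large `<missing>` bucket would suggest the export uses a different field name than this sketch assumes.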

Programmatic use:

from chimera.eval.benchmarks.swe_polybench import SWEPolyBench
from chimera.eval.harness import Harness
bench = SWEPolyBench(
    dataset_path="swe-polybench.jsonl",
    split="pb500",
    language="python",  # or None for all four
    limit=10,
)
print(bench.name()) # "swe-polybench"
print(len(bench.tasks())) # 10

| Argument | Default | Notes |
| --- | --- | --- |
| dataset_path | None | Path to JSON or JSONL file with instances. None keeps the benchmark empty (useful for smoke tests / programmatic seed). |
| split | "pb500" | One of full, pb500, verified. Used as a filter when records carry a "split" field. |
| language | None | Optional filter; one of python, java, javascript, typescript. |
| limit | None | Max tasks to keep after filtering. |

An invalid split or language raises ValueError. A dataset_path that points to a file that does not exist raises FileNotFoundError.
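
The argument semantics described here can be sketched in plain Python. This is one possible reading, not the adapter's actual code: `select_records`, `VALID_SPLITS`, and `VALID_LANGUAGES` are hypothetical names, and the sketch assumes records are plain dicts in which a record without a "split" key is never excluded by the split filter.

```python
VALID_SPLITS = ("full", "pb500", "verified")
VALID_LANGUAGES = ("python", "java", "javascript", "typescript")


def select_records(records, split="pb500", language=None, limit=None):
    """Hypothetical helper mirroring the documented argument semantics."""
    if split not in VALID_SPLITS:
        raise ValueError(f"unknown split: {split!r}")
    if language is not None and language not in VALID_LANGUAGES:
        raise ValueError(f"unknown language: {language!r}")

    kept = []
    for record in records:
        # The split only filters records that actually carry a "split" field.
        if "split" in record and record["split"] != split:
            continue
        if language is not None and record.get("language") != language:
            continue
        kept.append(record)

    # The limit applies after filtering, not before.
    return kept if limit is None else kept[:limit]


if __name__ == "__main__":
    records = [
        {"instance_id": "a", "language": "python", "split": "pb500"},
        {"instance_id": "b", "language": "java", "split": "pb500"},
        {"instance_id": "c", "language": "python", "split": "full"},
        {"instance_id": "d", "language": "python"},  # no "split" field: kept
    ]
    selected = select_records(records, split="pb500", language="python", limit=10)
    print([r["instance_id"] for r in selected])  # ['a', 'd']
```

Applying the limit last means limit=10 yields up to ten tasks of the requested language rather than whatever remains of the first ten records.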

Each task is a SWEPolyBenchInstance dataclass with:

  • instance_id, repo, base_commit
  • problem_statement: natural-language issue text
  • language: one of the four
  • task_type: one of bug_fix, feature, refactor
  • test_patch: the upstream gold test diff
  • patch: the gold solution patch
  • modified_files: files the gold patch touches
  • cst_nodes: CST-level annotations used by some upstream analyses
  • hints_text: optional hints from the original GitHub thread
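
As a rough illustration of the field list above, a stand-in dataclass might look like the following. Only the field names come from this page; the types, the defaults, and the `InstanceSketch` and `from_record` names are assumptions made for the example, and the real SWEPolyBenchInstance may differ.

```python
import json
from dataclasses import dataclass, field, fields
from typing import List, Optional


@dataclass
class InstanceSketch:
    """Illustrative stand-in for SWEPolyBenchInstance; types are assumed."""
    instance_id: str
    repo: str
    base_commit: str
    problem_statement: str
    language: str
    task_type: str
    test_patch: str
    patch: str
    modified_files: List[str] = field(default_factory=list)
    cst_nodes: list = field(default_factory=list)  # element shape is not documented here
    hints_text: Optional[str] = None


def from_record(record: dict) -> InstanceSketch:
    """Build an instance from a dict, ignoring keys the sketch does not model."""
    known = {f.name for f in fields(InstanceSketch)}
    return InstanceSketch(**{k: v for k, v in record.items() if k in known})


if __name__ == "__main__":
    line = json.dumps({
        "instance_id": "example-1", "repo": "org/project", "base_commit": "abc123",
        "problem_statement": "Example issue text", "language": "java",
        "task_type": "bug_fix", "test_patch": "", "patch": "",
        "split": "pb500",  # extra key, dropped by from_record
    })
    print(from_record(json.loads(line)).language)  # java
```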

evaluate(task, agent_output) runs the language-appropriate test command against the agent’s patched workspace:

| Language | Test runner |
| --- | --- |
| Python | pytest -x |
| JavaScript | npm test --silent |
| TypeScript | npm test --silent |
| Java | mvn -q test |

A task passes if the test runner exits 0.
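
The pass rule can be sketched as a lookup plus an exit-code check. This is an illustration under stated assumptions rather than the adapter's evaluate() implementation: `TEST_COMMANDS`, `command_for`, and `workspace_passes` are hypothetical names, and the sketch assumes the patched workspace is a local directory with the relevant toolchain on PATH.

```python
import subprocess
import sys

# Commands taken from the runner table; the dict itself is illustrative.
TEST_COMMANDS = {
    "python": ["pytest", "-x"],
    "javascript": ["npm", "test", "--silent"],
    "typescript": ["npm", "test", "--silent"],
    "java": ["mvn", "-q", "test"],
}


def command_for(language):
    """Return the test command for a language, or raise ValueError."""
    try:
        return TEST_COMMANDS[language]
    except KeyError:
        raise ValueError(f"no test runner for language: {language!r}") from None


def workspace_passes(workspace, language, command=None):
    """Run the test command in the workspace; pass means exit status 0."""
    cmd = command if command is not None else command_for(language)
    completed = subprocess.run(cmd, cwd=workspace, capture_output=True, text=True)
    return completed.returncode == 0


if __name__ == "__main__":
    # Stand-in command so the example runs without pytest, npm, or mvn installed.
    ok = workspace_passes(".", "python", command=[sys.executable, "-c", "raise SystemExit(0)"])
    print(ok)  # True
```

The `command` override exists only so the sketch can be exercised without a real project; a sandboxed harness would likely also enforce timeouts and isolate the process.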

The adapter is implemented and registered. Live tier-1 runs against real models are tracked under the benchmark transparency framework. To seed the comparison table, contribute a run to data/swe-polybench-<model>-results.jsonl.

Most agent benchmarks are Python-only. SWE-PolyBench is the cleanest existing test of cross-language generalization for a coding agent: the same model has to navigate setup.py, pom.xml, and package.json without losing the plot. Coupled with chimera/ferret’s sandbox-first execution and the language-aware runner table, it is the right benchmark for asking “does my agent know what kind of project it’s in?”