SWE-PolyBench
SWE-PolyBench (multi-language)
SWE-PolyBench (Amazon Science, 2025) is a repository-level coding benchmark that extends the SWE-bench format across four languages: Python, Java, JavaScript, and TypeScript. Each task is a real GitHub bug fix, feature addition, or refactor; agents are graded by running the upstream test suite against their patch.
References:
- HuggingFace dataset: https://huggingface.co/datasets/AmazonScience/SWE-PolyBench
- GitHub: https://github.com/amazon-science/SWE-PolyBench
- Paper: arXiv:2504.08703
Splits
| Split | Instances | Notes |
|---|---|---|
| pb500 | 500 | The standard headline split, balanced across languages and task types. |
| verified | smaller | Human-validated subset. |
| full | all | Full collected pool. Rarely run end-to-end. |
Languages: python · java · javascript · typescript. Pass language= to filter to one.
How to run

    # 1. Pull the dataset to disk. The upstream lives on HuggingFace; one easy
    # path is the datasets library:
    python -c "
    from datasets import load_dataset
    import json
    ds = load_dataset('AmazonScience/SWE-PolyBench', split='train')
    with open('swe-polybench.jsonl', 'w') as f:
        for row in ds:
            f.write(json.dumps(dict(row)) + '\n')
    "

    # 2. Run via the bench harness
    chimera bench swe-polybench --dataset swe-polybench.jsonl --limit 10

Programmatic use:

    from chimera.eval.benchmarks.swe_polybench import SWEPolyBench
    from chimera.eval.harness import Harness

    bench = SWEPolyBench(
        dataset_path="swe-polybench.jsonl",
        split="pb500",
        language="python",  # or None for all four
        limit=10,
    )

    print(bench.name())        # "swe-polybench"
    print(len(bench.tasks()))  # 10

Constructor arguments
| Argument | Default | Notes |
|---|---|---|
| dataset_path | None | Path to JSON or JSONL file with instances. None keeps the benchmark empty (useful for smoke tests / programmatic seed). |
| split | "pb500" | One of full, pb500, verified. Used as a filter when records carry a "split" field. |
| language | None | Optional filter; one of python, java, javascript, typescript. |
| limit | None | Max tasks to keep after filtering. |
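An invalid split or language raises ValueError. A dataset_path that points to a file that does not exist raises FileNotFoundError; passing None is not an error and simply leaves the benchmark empty.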
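The validation and filtering rules for the constructor arguments can be sketched roughly as follows. This is an illustration under stated assumptions, not the adapter's actual code: load_records is a hypothetical helper, a JSON file is assumed to hold a top-level list, and the split filter is assumed to be a plain equality check on records that carry a "split" field.

```python
import json
from pathlib import Path

VALID_SPLITS = {"full", "pb500", "verified"}
VALID_LANGUAGES = {"python", "java", "javascript", "typescript"}


def load_records(dataset_path=None, split="pb500", language=None, limit=None):
    """Hypothetical helper mirroring the documented argument semantics."""
    if split not in VALID_SPLITS:
        raise ValueError(f"unknown split: {split!r}")
    if language is not None and language not in VALID_LANGUAGES:
        raise ValueError(f"unknown language: {language!r}")

    # None keeps the benchmark empty (smoke tests, programmatic seeding).
    if dataset_path is None:
        return []

    path = Path(dataset_path)
    if not path.is_file():
        raise FileNotFoundError(path)

    text = path.read_text(encoding="utf-8")
    if path.suffix == ".jsonl":
        records = [json.loads(line) for line in text.splitlines() if line.strip()]
    else:
        records = json.loads(text)  # assumption: a top-level JSON list

    kept = []
    for record in records:
        # Assumption: the split filter only applies when the record has a "split" field.
        if "split" in record and record["split"] != split:
            continue
        if language is not None and record.get("language") != language:
            continue
        kept.append(record)

    # limit is applied after filtering.
    return kept if limit is None else kept[:limit]
```

Applying limit last matches the table's "after filtering" wording, so language="java", limit=10 yields up to ten Java tasks rather than whatever Java tasks happen to be in the first ten records.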
Instance shape
Each task is a SWEPolyBenchInstance dataclass with:
- instance_id, repo, base_commit
- problem_statement: natural-language issue text
- language: one of the four
- task_type: bug_fix, feature, refactor
- test_patch: the upstream gold test diff
- patch: the gold solution patch
- modified_files: files the gold patch touches
- cst_nodes: CST-level annotations used by some upstream analyses
- hints_text: optional hints from the original GitHub thread
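For orientation, a dataclass with these fields might look like the sketch below. The field names come from the list above; the types, defaults, the InstanceSketch name, and the from_record helper are assumptions made for illustration and may differ from the real SWEPolyBenchInstance.

```python
from dataclasses import dataclass, field, fields
from typing import Any, Optional


@dataclass
class InstanceSketch:
    """Illustrative stand-in for SWEPolyBenchInstance; types are assumed."""
    instance_id: str
    repo: str
    base_commit: str
    problem_statement: str
    language: str    # python, java, javascript, or typescript
    task_type: str   # bug_fix, feature, or refactor
    test_patch: str  # upstream gold test diff
    patch: str       # gold solution patch
    modified_files: list[str] = field(default_factory=list)
    cst_nodes: list[Any] = field(default_factory=list)
    hints_text: Optional[str] = None


def from_record(record: dict) -> InstanceSketch:
    """Build an instance from a raw record, ignoring keys the sketch does not model."""
    known = {f.name for f in fields(InstanceSketch)}
    return InstanceSketch(**{k: v for k, v in record.items() if k in known})
```

Dropping unknown keys keeps the sketch tolerant of extra columns in the upstream dataset, at the cost of silently ignoring them.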
Grading
evaluate(task, agent_output) runs the language-appropriate test command against the agent’s patched workspace:
| Language | Test runner |
|---|---|
| Python | pytest -x |
| JavaScript | npm test --silent |
| TypeScript | npm test --silent |
| Java | mvn -q test |
A task passes if the test runner exits 0.
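The runner table and the exit-code rule can be sketched as below. The commands are taken from the table; the run_tests function, its parameters, and the use of subprocess are assumptions for illustration and do not describe how evaluate is actually implemented (for example, any sandboxing is left out).

```python
import subprocess

# Test command per language, as listed in the table above.
TEST_COMMANDS = {
    "python": ["pytest", "-x"],
    "javascript": ["npm", "test", "--silent"],
    "typescript": ["npm", "test", "--silent"],
    "java": ["mvn", "-q", "test"],
}


def run_tests(language, workspace, commands=TEST_COMMANDS, timeout=None):
    """Hypothetical helper: run the language's test command in the workspace.

    Returns True only when the runner exits with status 0. A missing
    toolchain raises FileNotFoundError rather than counting as a failure.
    """
    result = subprocess.run(
        commands[language],
        cwd=workspace,
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.returncode == 0
```

A call such as run_tests("java", "/path/to/patched/repo") would then report whether mvn -q test exited cleanly in that directory; the path here is a placeholder.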
Status
The adapter is implemented and registered. Live tier-1 runs against real models are tracked under the benchmark transparency framework. To seed the comparison table, contribute a run to data/swe-polybench-<model>-results.jsonl.
Why this benchmark matters
Most agent benchmarks are Python-only. SWE-PolyBench is the cleanest existing test of cross-language generalization for a coding agent: the same model has to navigate setup.py, pom.xml, and package.json without losing the plot. Coupled with chimera/ferret’s sandbox-first execution and the language-aware runner table, it’s the right benchmark to ask “does my agent know what kind of project it’s in?”
Related
- SWE-bench: Python-only ancestor
- MultiSWE-bench: sibling, similar multi-language posture
- Benchmarks overview