Aider Polyglot
Aider Polyglot (multi-language code edits)
Aider Polyglot is the standard “code edit” benchmark: instead of generating code from scratch, the agent receives a buggy or incomplete file and must edit it. It comprises 225 exercises across Python, JavaScript, Rust, Go, C++, and Java.
References:
- Aider leaderboard: https://aider.chat/docs/leaderboards/
- GitHub (exercises): https://github.com/Aider-AI/polyglot-benchmark
Status: TODO (adapter wired; not benchmarked yet)
| Run | Score |
|---|---|
| Chimera | NOT RUN |
The adapter supports two grading modes that coexist on a per-task basis:
| Mode | Grader |
|---|---|
| `diff-match` | Compare the agent’s patch against the gold patch line-for-line. Strict. |
| `test-pass` | Run the language’s native test runner against the patched tree. Loose. |
Diff-match takes precedence; the adapter falls through to test-pass when the task carries tests but no canonical patch.
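The precedence rule above can be sketched as follows. This is a minimal illustration, not Chimera’s actual adapter code: the `gold_patch` field and both function names are hypothetical, while `tests_path` follows the task shape shown on this page. The comparison assumes that “line-for-line” means exact equality once both patches are split into lines.

```python
from typing import Optional


def select_grading_mode(task: dict) -> Optional[str]:
    """Pick a grading mode for one task (hypothetical helper)."""
    # Assumption: a canonical patch, when present, is stored under "gold_patch".
    if task.get("gold_patch"):
        return "diff-match"  # diff-match takes precedence
    if task.get("tests_path"):
        return "test-pass"  # tests exist but no canonical patch
    return None  # nothing to grade against


def diff_match(agent_patch: str, gold_patch: str) -> bool:
    """Strict comparison: every line must be identical, in the same order."""
    return agent_patch.splitlines() == gold_patch.splitlines()
```

Because the comparison is exact, a patch that is functionally correct but formatted differently would fail diff-match, which is why the mode is described as strict.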
How to run
```bash
git clone https://github.com/Aider-AI/polyglot-benchmark ~/.chimera/datasets/aider-polyglot
```

```python
from chimera.eval.benchmarks import AiderPolyglot
from chimera.eval.harness import Harness

# All languages
bench = AiderPolyglot(dataset_path="~/.chimera/datasets/aider-polyglot")
print(bench.name())  # "aider-polyglot"

# Filter
bench = AiderPolyglot(
    dataset_path="~/.chimera/datasets/aider-polyglot",
    languages=["python", "rust"],
)
print(bench.name())  # "aider-polyglot:python+rust"

harness = Harness(agent=my_agent, benchmark=bench)
results = harness.run()
```

Task shape
```json
{
  "exercise": "reverse-string",
  "language": "rust",
  "instructions": "Reverse a string. ...",
  "files": {"src/lib.rs": "pub fn reverse(s: &str) -> String { todo!() }"},
  "tests_path": "tests/reverse_string.rs"
}
```

Grading
In test-pass mode, the agent’s patch is applied, then `cargo test` / `pytest` / `go test` / `npm test` is run inside the workspace. Tools live under `chimera/eval/benchmarks/runners/`.
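The test-pass flow described above could look roughly like the sketch below. It is not the code under `chimera/eval/benchmarks/runners/`: the `TEST_COMMANDS` table, the `run_tests` function, the timeout, and the `"go"` and `"javascript"` language keys are all assumptions made for illustration. Only the four test commands themselves come from the text.

```python
import subprocess
from pathlib import Path

# Commands named in the text; the language keys are assumed identifiers.
TEST_COMMANDS = {
    "rust": ["cargo", "test"],
    "python": ["pytest"],
    "go": ["go", "test"],
    "javascript": ["npm", "test"],
}


def run_tests(language: str, workspace: Path, timeout_s: int = 300) -> bool:
    """Return True if the language's test command exits with status 0."""
    command = TEST_COMMANDS.get(language)
    if command is None:
        raise ValueError(f"no test command configured for {language!r}")
    try:
        completed = subprocess.run(
            command,
            cwd=workspace,
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except (FileNotFoundError, subprocess.TimeoutExpired):
        # Missing toolchain or a hung test run is counted as a failure here.
        return False
    return completed.returncode == 0
```

Treating a missing toolchain as a failed task is one possible design choice; a harness might prefer to skip such tasks so that they do not lower the score.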
Gotchas
- The dataset is not pip-installable; clone the GitHub repo directly.
- Each language brings its own toolchain (`rustc`, `cargo`, `go`, etc.). Run inside a fat Docker image or skip the languages you don’t have set up.
- Aider’s own leaderboard uses test-pass mode with their custom diff format. To match their numbers, use `mode="test-pass"` and Aider’s edit format.
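To decide which languages to skip, one option is to probe the local `PATH` before building the benchmark. The sketch below is an assumption-laden helper, not part of Chimera: `REQUIRED_TOOLS`, `available_languages`, and the choice of executables per language are hypothetical, and the `"go"` and `"javascript"` keys are assumed language identifiers.

```python
import shutil

# Assumption: representative executables per language; adjust to your setup.
REQUIRED_TOOLS = {
    "python": ["pytest"],
    "rust": ["rustc", "cargo"],
    "go": ["go"],
    "javascript": ["npm"],
}


def available_languages(required=REQUIRED_TOOLS) -> list[str]:
    """Return the languages whose required tools are all found on PATH."""
    return [
        language
        for language, tools in required.items()
        if all(shutil.which(tool) is not None for tool in tools)
    ]
```

The resulting list could then be passed as the `languages=` argument shown in the How to run section, so that only languages with a working toolchain are evaluated.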