
Mink Benchmarks

Benchmark adapters that ship with Chimera and can be driven by the evaluation harness (chimera/eval/harness.py).

A benchmark in Chimera is a Benchmark subclass (chimera/eval/harness.py) exposing three methods: name(), tasks(), and evaluate(task, output, env). The Harness runs an agent against every task (optionally creating a fresh Environment per task), then aggregates pass rate, total cost, and per-task results into an EvalResult.

Adapter status is one of:

  • validated — adapter has unit tests and/or a recorded GLM-5/GLM-5.1 baseline in data/.
  • scaffolded — adapter shape is in place (loader, tasks(), evaluate()) but has not been driven against a real dataset / Docker harness in this repo; a follow-up issue tracks the gap.

The statuses below were reconstructed from research/mink/A{9,10,11,14,17}-REPORT.md (the reports that landed before the polling cutoff), the source files, chimera/eval/benchmarks/__init__.py, and the GitHub issue comments on #84-#96.

| Benchmark             | Issue | Status     | File                                           | Baseline / Notes                                 |
| --------------------- | ----- | ---------- | ---------------------------------------------- | ------------------------------------------------ |
| SWE-bench Lite        | #84   | scaffolded | chimera/eval/benchmarks/swe_bench.py           | 10% (2/20) GLM-5.1, 20 smallest patches          |
| SWE-bench Verified    | #84   | scaffolded | chimera/eval/benchmarks/swe_bench_verified.py  | adapter + 500-step / IPython / condense plumbing |
| Terminal-Bench 2.0    | #85   | validated  | chimera/benchmarks/terminal_bench_agent.py     | 30% (3/10) GLM-5; follow-up #139                 |
| FeatureBench          | #86   | scaffolded | chimera/eval/benchmarks/feature_bench.py       | needs HF dataset + Docker images                 |
| Cline Bench           | #87   | scaffolded | chimera/eval/benchmarks/cline_bench.py         | needs RL container images                        |
| DPAI Arena            | #88   | scaffolded | chimera/eval/benchmarks/dpai_arena.py          | Java/Spring; six tracks; no baseline             |
| SWT-Bench             | #89   | scaffolded | chimera/eval/benchmarks/swt_bench.py           | 15 unit tests pass; needs Docker + C metric      |
| tau-bench             | #90   | scaffolded | chimera/eval/benchmarks/tau_bench.py           | needs tau2-bench upstream package                |
| Context-Bench (Letta) | #91   | scaffolded | chimera/eval/benchmarks/context_bench.py       | needs Letta evals dataset                        |
| SWE-PolyBench         | #92   | scaffolded | chimera/eval/benchmarks/swe_polybench.py       | needs HF dataset + JS/TS/Java toolchains         |
| HumanEval+            | #93   | scaffolded | chimera/eval/benchmarks/humaneval_plus.py      | needs evalplus extras                            |
| MBPP                  | #94   | scaffolded | chimera/eval/benchmarks/mbpp.py                | local JSON loader; sanitized split recommended   |
| LiveCodeBench         | #95   | scaffolded | chimera/eval/benchmarks/livecodebench.py       | date-window filter for contamination control     |
| MATH-500 / AIMO       | #96   | scaffolded | chimera/eval/benchmarks/math500.py, aimo.py    | AIMO has live-LLM tests; MATH-500 loader-only    |
| WebArena              | n/a   | scaffolded | chimera/eval/benchmarks/webarena.py            | string_match + url_match; no upstream sandbox    |
| HumanEval (base)      | n/a   | validated  | chimera/eval/benchmarks/human_eval.py          | 66.5% (109/164) GLM-5.1; raw in data/            |
| Aider Polyglot        | n/a   | scaffolded | chimera/eval/benchmarks/aider_polyglot.py      | 6 langs; diff-match + test-cmd; shrew wrapper    |
| Custom                | n/a   | validated  | chimera/eval/benchmarks/custom.py              | user-defined tasks; in-tree tests                |

Issue links: https://github.com/0bserver07/chimera/issues/<N>.

SWE-bench Lite (#84)

Real GitHub issues with test verification. SWEBench loads SWEBenchInstance records from JSON / JSONL (or the {"tasks": [...]} wrapper) and evaluates by applying test_patch in the supplied environment, then running env.run_tests().

  • File: chimera/eval/benchmarks/swe_bench.py
  • Tests: tests/eval/test_swe_bench.py (11 unit tests), tests/eval/test_bench_swe.py
  • Baseline: 10% (2/20) on SWE-bench Lite, 20 smallest patches, GLM-5.1. Raw in data/swebench-lite-glm51-results.jsonl.
  • Run: chimera eval --benchmark swe-bench --dataset path/to/instances.jsonl
  • Full run example: examples/benchmarks/swe_bench_proper.py, examples/benchmarks/swe_bench_docker.py.
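
For orientation, a minimal harness-drive sketch mirroring the SWT-Bench snippet later on this page (the SWEBench export name and the agent / env_factory stand-ins are assumptions):

from chimera.eval.benchmarks import SWEBench
from chimera.eval.harness import Harness

bench = SWEBench(dataset_path="path/to/instances.jsonl")
# harness = Harness(benchmark=bench, agent=my_agent, env_factory=my_env_factory)
# print(harness.run().pass_rate)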

SWE-bench Verified (#84, dedicated adapter)


The Verified split is a 500-task human-validated subset of SWE-bench Full with cleaner problem statements and deterministic test specifications. The dataset schema is identical to Lite (so the loader inherits from SWEBench); the differences live in the agent configuration the adapter recommends.

  • File: chimera/eval/benchmarks/swe_bench_verified.py
  • Tests: tests/eval/benchmarks/test_swe_bench_verified.py (24 unit tests covering variant config, max-step plumbing, IPython tool surface, condensation trigger).
  • Baseline: not yet run live. Lite baseline (10%) is the reference point; Verified live run is open follow-up under issue #84.
  • Configuration knobs (with their defaults):
    • max_steps=500 — Verified default. Lite default is 100. The step budget is exposed as bench.max_steps and as bench.config.max_steps for callers to plug into a LoopConfig.
    • ipython=True — when set, bench.build_ipython_tool() returns a chimera.tools.ipython.IPythonTool instance. The tool wraps a stateful ipython --no-banner (or python -i -u fallback) subprocess so variables, imports, and instrumentation persist across tool calls. Each session is single-threaded; supply a fresh tool per task for clean state.
    • condense_every_n_steps=25 — every N steps the agent loop should call bench.should_condense(step); when it returns True, run bench.build_condensation(provider=...) to get a SummaryCompaction and apply it to the message log. 0 disables condensation entirely (matching Lite behavior). A loop sketch follows this list.
  • Helpers: SWEBenchConfig.for_lite(...) and SWEBenchConfig.for_verified(...) build the recommended runtime config for callers that don’t want to subclass.
  • Run (loader only — Docker still required for live eval):
    from chimera.eval.benchmarks import SWEBenchVerified

    bench = SWEBenchVerified(
        dataset_path="path/to/swe-bench-verified.jsonl",
        max_steps=500,
        ipython=True,
        condense_every_n_steps=25,
    )
    for task in bench.tasks():
        ...  # drive the agent; max_steps/IPython/condense via bench.config
  • Status: scaffolded only — adapter, config, IPython tool, and the should_condense trigger are wired and unit-tested. A live run on the Verified Docker harness is the next milestone.
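
A rough sketch of the condensation protocol from the knobs above, assuming a plain message-list agent loop (apply() is a hypothetical stand-in for however your loop rewrites its message log):

from chimera.eval.benchmarks import SWEBenchVerified

provider = ...   # your chimera provider (see create_provider())
bench = SWEBenchVerified(dataset_path="swe-bench-verified.jsonl",
                         max_steps=500, condense_every_n_steps=25)
messages = []    # the agent's message log
for step in range(bench.max_steps):
    # ... one agent step: provider call, tool execution, append to messages ...
    if bench.should_condense(step):
        compaction = bench.build_condensation(provider=provider)
        messages = compaction.apply(messages)  # hypothetical apply(); adapt to your loop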

Terminal-Bench 2.0 (#85)

Containerised terminal tasks driven through the tb CLI. Chimera plugs in as ChimeraAgent(BaseAgent), a thin ReAct loop that drives a TmuxSession through provider.complete().

  • File: chimera/benchmarks/terminal_bench_agent.py (168 LoC)
  • Baseline: 30% (3/10) GLM-5, 2026-03-20. See docs/benchmarks/2026-03-30-terminal-bench-glm5.md.
  • Follow-up: issue #139 lists adaptive-wait, max_turns 30 -> 50, richer system prompt, error recovery, swap to claude_code preset.
  • Run: requires pip install terminal-bench and Docker; invoke via tb run --agent chimera ... once configured.

FeatureBench (#86)

End-to-end Python feature development with a test-driven grader.

  • File: chimera/eval/benchmarks/feature_bench.py
  • Loader: local JSON / JSONL plus opt-in load_from_hub('LiberCoders/FeatureBench').
  • evaluate() chains env.run_tests(test_files) -> env.run_command('python -m pytest -x ...') -> non-empty-output fallback (sketched after this list).
  • Status: scaffolded only; needs HF dataset pull and ~13 Docker images.
  • Run: uv run python -c "from chimera.eval.benchmarks import FeatureBench; b = FeatureBench(dataset_path='...'); ...".
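
A rough sketch of that grading chain, under stated assumptions (the Environment method names come from this page; the run_command result shape and the exact fall-through conditions are guesses):

def evaluate(self, task, output, env) -> bool:
    # 1. Test-file grader.
    if task.get("test_files") and env.run_tests(task["test_files"]):
        return True
    # 2. Direct pytest invocation (exit_code is an assumed field).
    if env.run_command("python -m pytest -x").exit_code == 0:
        return True
    # 3. Non-empty-output fallback.
    return bool(output and output.strip())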

Cline Bench (#87)

Real-world engineering tasks from Cline user sessions, packaged as Docker RL environments with binary test-suite graders.

  • File: chimera/eval/benchmarks/cline_bench.py
  • Loader: directory of per-task JSON, single JSON file, or JSONL.
  • Status: scaffolded only; needs the upstream cline/cline-bench task definitions and container images.

DPAI Arena (#88)

JetBrains Developer Productivity AI Arena: Java/Spring tasks across six tracks (issue-to-patch, pr-review, coverage, static-analysis, upgrade, compliance).

  • File: chimera/eval/benchmarks/dpai_arena.py
  • Status: scaffolded only; needs the Spring task corpus and per-track grader wiring.

SWT-Bench (#89)

Test-generation analogue of SWE-bench: the agent must produce tests that fail on the buggy base and pass after the gold patch.

  • File: chimera/eval/benchmarks/swt_bench.py
  • Modes: unit_test (integrate into suite), reproduction (script exit codes).
  • Tests: tests/eval/test_bench_swt.py (15 tests, all passing).
  • Status: F2P contract enforced in-process; deferred work covers Change-Coverage (C) metric, predictions JSONL writer, and Docker smoke run on the Lite subset.
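
The F2P contract in one hedged sketch (apply_patch and the boolean run_tests return are illustrative, not the adapter's actual surface):

def f2p_holds(env, generated_tests: str, gold_patch: str) -> bool:
    env.apply_patch(generated_tests)      # install the agent-written tests
    fails_on_base = not env.run_tests()   # must FAIL on the buggy base...
    env.apply_patch(gold_patch)           # ...then apply the reference fix
    return fails_on_base and env.run_tests()  # ...and PASS afterwards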

tau-bench (#90)

Multi-turn tool-use and conversational agent evaluation across airline / retail / telecom / banking domains. Stateful: the end-state database is compared against the annotated goal; reliability is measured as pass^k.
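
pass^k is the probability that all k i.i.d. trials of a task succeed, averaged over tasks. A sketch of the standard unbiased estimator (from the tau-bench paper, not this adapter's code):

from math import comb

def pass_hat_k(n: int, c: int, k: int) -> float:
    """n i.i.d. trials of one task, c successes; unbiased pass^k estimate (requires n >= k)."""
    return comb(c, k) / comb(n, k)

# headline number: mean of pass_hat_k over all tasks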

  • File: chimera/eval/benchmarks/tau_bench.py
  • Tests: tests/eval/benchmarks/test_tau_bench.py (39 tests, dataset-absent skip path + scoring logic).
  • Status: scaffolded; the adapter is wired, but full simulated-environment execution still requires the upstream tau-bench / tau2-bench package (we do not vendor or pip-install upstream). The adapter loads task definitions from a local directory, normalises them, and scores in-process via terminal-action match (with a goal_state fallback when present). When an upstream env exposing evaluate_task(task, output) is passed in, the adapter delegates to it.
Terminal window
# 1. Clone upstream tasks (read-only — we never pip install it):
git clone https://github.com/sierra-research/tau-bench /tmp/tau-bench
# 2. Stage the JSON task dumps under the default dataset dir:
mkdir -p ~/.chimera/datasets/tau-bench
cp /tmp/tau-bench/tau_bench/envs/retail/tasks_train.json \
~/.chimera/datasets/tau-bench/retail_train.json
cp /tmp/tau-bench/tau_bench/envs/airline/tasks.json \
~/.chimera/datasets/tau-bench/airline.json
# 3. Smoke-run the adapter:
uv run python -m chimera.eval.benchmarks.tau_bench --limit 3 --domain airline

Override the dataset directory with CHIMERA_TAU_BENCH_PATH=/path/to/dir.

| Flag       | Default       | Description                                     |
| ---------- | ------------- | ----------------------------------------------- |
| --domain   | airline       | One of airline, retail, telecom, banking, mock. |
| --limit    | 3             | Maximum tasks to run.                           |
| --model    | glm-5         | Provider model id passed to create_provider().  |
| --dataset  | env / default | Override the dataset path (file or directory).  |
| --no-color | off           | Disable ANSI colour in the results table.       |

When the dataset is absent the CLI prints a friendly setup hint and exits with status 2 — safe to wire into CI smoke gates.

The in-process evaluator matches the agent’s terminal action (name + arguments) against the annotated actions[-1] from the task JSON. This mirrors the upstream tau-bench convention: only the final mutating call needs to match, since that’s the call that drives the database into the goal state. Two acceptable agent output shapes:

{"actions": [{"name": "cancel_reservation", "arguments": {"id": "r1"}}]}

or, when the task carries a goal_state field:

{"final_state": {"reservations": []}}

Plain-text outputs are scored leniently against the terminal action name (substring match) — useful for early scaffold runs before the agent reliably emits structured tool-call traces.
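
A hedged sketch of that matcher (field names follow the shapes above; the adapter's real normalisation may differ):

import json

def score(task: dict, output: str) -> bool:
    gold = task["actions"][-1]  # only the final mutating call must match
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        # lenient plain-text path: substring match on the action name
        return gold["name"] in output
    if "goal_state" in task and "final_state" in parsed:
        return parsed["final_state"] == task["goal_state"]
    last = (parsed.get("actions") or [{}])[-1]
    return last.get("name") == gold["name"] and last.get("arguments") == gold.get("arguments")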

Context-Bench (#91)

Letta long-running-context benchmark. Programmatic SQL-derived questions over a fictional-entity database; the agent must navigate semi-structured text files with grep/open-style tools.

  • File: chimera/eval/benchmarks/context_bench.py
  • Suites: filesystem (default), skills.
  • Status: scaffolded only; lazy-loads the Letta evals framework and falls back to a user-supplied JSON dataset offline.

SWE-PolyBench (#92)

Multi-language repository-level benchmark (Python / Java / JS / TS).

  • File: chimera/eval/benchmarks/swe_polybench.py
  • Filters: split in {full, pb500, verified}, language in {python, java, javascript, typescript}, limit.
  • evaluate() applies test_patch then runs the language-appropriate command (pytest -x, npm test --silent, mvn -q test); the dispatch is sketched after this list.
  • Extra metrics: localization_accuracy() (file-level recall), cst_node_recall() (CST-node recall, paper-specific).
  • Status: scaffolded only; needs HF dataset dump and JS/TS/Java toolchain images.
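
The per-language dispatch, roughly (the TypeScript entry sharing the JavaScript runner is an assumption):

LANGUAGE_TEST_COMMANDS = {
    "python": "pytest -x",
    "javascript": "npm test --silent",
    "typescript": "npm test --silent",  # assumed to share the JS runner
    "java": "mvn -q test",
}
# cmd = LANGUAGE_TEST_COMMANDS[task["language"]], run in the task's checkout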

HumanEval+ (#93)

EvalPlus extension to HumanEval with ~80x more test cases per problem; exposes brittle solutions.

  • File: chimera/eval/benchmarks/humaneval_plus.py
  • Status: scaffolded only; pulls from the optional evalplus package when installed, falls back to local JSONL otherwise.

MBPP (#94)

974-problem entry-level Python benchmark; the sanitized split is 427 hand-verified problems.

  • File: chimera/eval/benchmarks/mbpp.py
  • Loader: local JSON / JSONL only (zero-dependency core; no HF import).
  • evaluate() runs the test_list asserts in-process or via env.run_command; the in-process path is sketched after this list.
  • Status: scaffolded only; needs a downloaded MBPP dataset file.
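
The in-process path is essentially exec-then-assert; a minimal sketch (not the adapter's exact code):

def check_in_process(solution: str, test_list: list[str]) -> bool:
    ns: dict = {}
    try:
        exec(solution, ns)           # define the candidate function(s)
        for assertion in test_list:  # e.g. "assert add(1, 2) == 3"
            exec(assertion, ns)
        return True
    except Exception:
        return False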

LiveCodeBench (#95)

Contamination-controlled competitive-programming benchmark from LeetCode / AtCoder / CodeForces. Each problem is timestamped so evaluation can restrict to post-cutoff problems.

  • File: chimera/eval/benchmarks/livecodebench.py
  • Date-window helpers: LiveCodeBench(start_date=..., end_date=...), LiveCodeBench.rotated_window(model_cutoff=..., months=3). A window sketch follows this list.
  • Scenarios: codegeneration wired; selfrepair, codeexecution, testoutput raise NotImplementedError until the upstream JSON schema is wired in.
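
A sketch of the date-window idea, assuming each problem record carries an ISO-8601 contest date (the contest_date field name is hypothetical):

from datetime import date

problems: list[dict] = []  # loaded problem records

def in_window(problem: dict, start: date, end: date) -> bool:
    stamp = date.fromisoformat(problem["contest_date"][:10])
    return start <= stamp <= end

# e.g. keep only problems published after a model's training cutoff
post_cutoff = [p for p in problems if in_window(p, date(2026, 1, 1), date(2026, 3, 31))]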

MATH-500 / AIMO (#96)

Mathematical reasoning. AIMO is the AI Mathematical Olympiad adapter; MATH-500 is the 500-problem subset of MATH covering seven competition-math subjects.

  • Files: chimera/eval/benchmarks/aimo.py, chimera/eval/benchmarks/math500.py
  • Tests: tests/eval/test_bench_aimo.py, tests/eval/test_aimo_integration.py (latter is live-LLM).
  • AIMO answer extraction handles ANSWER: <n>, \boxed{<n>}, and trailing-integer fallback (sketched after this list).
  • MATH-500 evaluator does normalised string equivalence first, then optional sympy symbolic equivalence when installed.
  • Run: chimera eval --benchmark aimo --dataset path/to/aimo.json.
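
A hedged sketch of that extraction order (the adapter's actual regexes may differ):

import re

def extract_answer(text: str) -> int | None:
    m = re.search(r"ANSWER:\s*(-?\d+)", text)
    if m:
        return int(m.group(1))
    m = re.search(r"\\boxed\{(-?\d+)\}", text)
    if m:
        return int(m.group(1))
    trailing = re.findall(r"-?\d+", text)  # trailing-integer fallback
    return int(trailing[-1]) if trailing else None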

WebArena

Web-agent benchmark: realistic tasks across self-hosted e-commerce, GitLab, CMS, and Reddit sandbox sites. Each task carries an intent (natural-language goal), start_url, and an eval_types list declaring how success is judged: string_match (against a reference_answer or compound reference_answers map), url_match (against reference_url), and program_html (programmatic DOM checks).

  • File: chimera/eval/benchmarks/webarena.py
  • Tests: tests/eval/benchmarks/test_webarena.py (47 unit tests covering dataset-absent skip, JSON / JSONL load, scoring round-trip for both string_match and url_match, compound reference_answers, combined eval-types AND semantics, and the upstream env escape hatch).
  • Status: scaffolded — string_match + url_match scorers are wired in-process. program_html is recognised but deferred (fails closed so a stub never falsely scores). Full execution requires the upstream sandbox sites (Docker — heavyweight) plus the upstream webarena package for DOM/accessibility observations. We do not vendor or pip-install upstream — the licence on the task corpus is unclear.
Terminal window
# 1. Clone upstream tasks (read-only — we never pip install it):
git clone https://github.com/web-arena-x/webarena /tmp/webarena
# 2. Stage the JSON task config dump under the default dataset dir:
mkdir -p ~/.chimera/datasets/webarena
cp /tmp/webarena/config_files/test.raw.json \
~/.chimera/datasets/webarena/test.json
# 3. Stand up the upstream sandbox sites (Docker, heavyweight):
# https://github.com/web-arena-x/webarena/blob/main/environment_docker/README.md

Override the dataset directory with CHIMERA_WEBARENA_PATH=/path/to/dir.

The agent’s output may be either a JSON envelope {"answer": "...", "url": "..."} or two named lines:

ANSWER: Widget Pro Max
URL: http://shop.example.test/p/widget-pro-max

url_match compares scheme + netloc + path (query + fragment ignored, trailing slash normalised). string_match lowercases and collapses whitespace before comparing; the upstream reference_answers.must_include / fuzzy_match / exact_match shape is honoured. When multiple eval types are declared, all must pass.
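
A sketch of the url_match normalisation just described (illustrative, not the adapter's code):

from urllib.parse import urlparse

def url_match(candidate: str, reference: str) -> bool:
    def norm(url: str) -> tuple:
        p = urlparse(url)
        # scheme + netloc + path; query and fragment ignored, trailing slash normalised
        return (p.scheme, p.netloc, p.path.rstrip("/") or "/")
    return norm(candidate) == norm(reference)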

When an upstream env exposing evaluate_task(task, output) is passed in, the adapter delegates to it — same escape hatch as tau-bench.

HumanEval (base)

Original HumanEval — 164 hand-written Python problems.

  • File: chimera/eval/benchmarks/human_eval.py
  • Tests: tests/eval/test_bench_human_eval.py
  • Baseline: 66.5% pass@1 (109/164), GLM-5.1. Raw in data/humaneval-glm51-results.json. (Earlier 90.9% GLM-5 figure from project memory predates the recorded raw data.)
  • Run: chimera eval --benchmark human-eval --dataset path/to/humaneval.json.

Aider Polyglot

Multi-language coding benchmark from github.com/Aider-AI/polyglot-benchmark. Drawn from Exercism exercises across six target languages: Python, JavaScript, Rust, Go, Java, C++. Each task ships a stub plus a read-only test file; the agent fills in the stub and is graded by either expected-file diff-match or by running the language’s test command and checking the exit code.

This adapter is the general flavour usable by every Chimera CLI (otter / weasel / shrew / mink / ferret). The shrew flavour at chimera/shrew/benchmarks/aider_polyglot.py is a thin subclass that exposes a small-model-friendly default language subset (SHREW_DEFAULT_LANGUAGES = python, javascript, rust, go) — Java and C++ need toolchains that aren’t always installed on a small-model laptop.

  • File: chimera/eval/benchmarks/aider_polyglot.py
  • Tests: tests/eval/benchmarks/test_aider_polyglot.py (26 unit tests covering ABC conformance, dataset loading, single + multi-language filters, env-var override, diff-match scorer, test-command scorer).
  • Dataset: not vendored — licenses are mixed. Stage locally under ~/.chimera/datasets/aider-polyglot/ or override with CHIMERA_AIDER_POLYGLOT_PATH=/abs/path.
  • Constructor: AiderPolyglot(dataset_path=None, limit=None, languages=None, language=None). The list-form languages=[...] filter wins when both are supplied; the single-form language="..." is preserved for back-compat.
  • Run (loader only):
    from chimera.eval.benchmarks import AiderPolyglot

    bench = AiderPolyglot(languages=["python", "rust"], limit=10)
    for task in bench.tasks():
        ...  # drive the agent; bench.evaluate(task, output, env) -> bool
Terminal window
# 1. Clone upstream (read-only — we never pip install it):
git clone https://github.com/Aider-AI/polyglot-benchmark /tmp/polyglot
# 2. Stage tasks.json under the default dataset dir:
mkdir -p ~/.chimera/datasets/aider-polyglot
# (author tasks.json from the upstream tree — see the adapter docstring
# for the per-task schema)
# 3. Smoke-run via shrew:
chimera shrew bench aider-polyglot --bench-limit 5 --language python

Per-task schema in tasks.json:

{
  "id": "python/hello-world",
  "language": "python",
  "prompt": "Implement hello().",
  "expected_files": {"hello_world.py": "def hello():\n return 'Hello, World!'\n"},
  "test_command": "pytest -x -q",
  "exercise_dir": "hello-world",
  "timeout_s": 90
}

Either expected_files or test_command (or both) must be present. When both are set, diff-match is tried first; the test command is the fallback. The exercise_dir is the subdir under <dataset_root>/exercises/ to use as the cwd for test_command.
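
A hedged sketch of that grading order (read_file, the run_command cwd keyword, and the exit_code field are assumptions about the Environment surface):

def evaluate(self, task, output, env) -> bool:
    expected = task.get("expected_files")
    if expected and all(env.read_file(path) == body for path, body in expected.items()):
        return True  # diff-match: every expected file matches exactly
    if task.get("test_command"):
        result = env.run_command(task["test_command"], cwd=task.get("exercise_dir"))
        return result.exit_code == 0  # test-command fallback
    return False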

Custom

User-defined task list or directory of task JSON. Useful for one-off harness runs and integration smoke tests.

  • File: chimera/eval/benchmarks/custom.py
  • Tests: tests/eval/test_bench_custom.py
  • Run: chimera bench --suite custom --tasks-dir path/to/tasks/.

CLI usage

CLI front door (registered names: human-eval, humaneval, swe-bench, swebench, aimo, custom):

Terminal window
chimera eval --benchmark swe-bench --dataset path/to/instances.jsonl --limit 10 --output results.json
chimera bench --suite custom --tasks-dir path/to/tasks/ --output results.json

The scaffolded adapters above (feature_bench, cline_bench, dpai_arena, swt_bench, tau_bench, context_bench, swe_polybench, humaneval_plus, mbpp, livecodebench, math500) are not yet wired into _BENCHMARKS in chimera/cli/main.py. Drive them directly through the harness:

Terminal window
uv run python - <<'PY'
from chimera.eval.benchmarks import SWTBench
from chimera.eval.harness import Harness
bench = SWTBench(dataset_path="path/to/swt.jsonl", mode="unit_test")
# harness = Harness(benchmark=bench, agent=my_agent, env_factory=my_env_factory)
# print(harness.run().pass_rate)
PY

To add an adapter to the CLI, append an entry to _BENCHMARKS in chimera/cli/main.py and update _load_benchmark if its constructor takes anything beyond dataset_path / limit.
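
A sketch of the shape, assuming _BENCHMARKS maps registered names to adapter classes (mirror the real entries in chimera/cli/main.py):

from chimera.eval.benchmarks import SWTBench

_BENCHMARKS = {
    # ...existing entries (human-eval, swe-bench, aimo, custom)...
    "swt-bench": SWTBench,  # then: chimera eval --benchmark swt-bench
}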

Writing a new benchmark

Subclass chimera.eval.harness.Benchmark and implement three methods:

from chimera.eval.harness import Benchmark

class MyBench(Benchmark):
    def name(self) -> str: ...
    def tasks(self) -> list[dict]: ...  # each task needs at least 'id', 'prompt'
    def evaluate(self, task, agent_output, env) -> bool: ...

tasks() should return dicts shaped for whatever Agent.run(prompt, env) your harness uses. evaluate() receives the original task dict, the agent’s stringified output, and (optionally) the per-task Environment.
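
A concrete toy example of that contract (illustrative only):

from chimera.eval.harness import Benchmark

class EchoBench(Benchmark):
    """Passes when the agent repeats the expected word back."""

    def name(self) -> str:
        return "echo-bench"

    def tasks(self) -> list[dict]:
        return [{"id": "echo-1", "prompt": "Say exactly: hello"}]

    def evaluate(self, task, agent_output, env) -> bool:
        return "hello" in agent_output.lower()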

Drop the file under chimera/eval/benchmarks/, export from chimera/eval/benchmarks/__init__.py, and wire into the CLI map if you want a chimera eval --benchmark <name> shortcut. See the SWE-bench adapter (chimera/eval/benchmarks/swe_bench.py) for a complete reference implementation, and chimera/eval/benchmarks/README.md for additional notes on the SWE-bench scaffold.