# Shrew benchmarks
## Benchmarks

Shrew ships a small benchmark harness for evaluating small-model coding capability. Two benchmarks are wired today:
- Aider Polyglot — per-language code-edit tasks scored by diff-match or test-pass.
- GAIA — research-task Q&A scored by GAIA-style answer match.
A third (`terminal-bench`) is reserved by the parser for forward compatibility but not yet wired. The harness lives in `chimera/shrew/benchmarks/`.
## Why these two

Aider Polyglot exercises code-editing competence: can the agent read a stub file, understand the test, and produce a working implementation? It’s the closest analogue to the day-to-day work shrew is built for.
GAIA exercises multi-step reasoning under tool use: can the agent decompose a research question, pick the right tool, and arrive at a single short answer? It catches the failure mode where small models can edit code but lose the plot on multi-hop questions.
Together they bracket the small-model coding agent posture: tight toolbox, real tasks, deterministic scoring.
## Command surface

```
chimera shrew bench aider-polyglot --bench-limit 5
chimera shrew bench gaia --bench-limit 5
```

Flags:
| Flag | Default | Meaning |
|---|---|---|
| `--bench-limit N` | `5` | Max tasks to run; pass `0` for a full run. |
| `--model <id>` | `qwen3.6-35b-a3b` | Uses the same model resolution as the rest of shrew. |
| `--cwd <dir>` | `.` | Working directory for the agent. |
The harness builds a default agent via `build_shrew_agent_for_eval()`. That helper assembles the full `AGENT_TOOLS` group (not the small-model `--allowed-tools` subset) because benchmark runs deliberately use the broadest tool surface so the model has every chance to succeed; the small-model defaults bite at production-call time, not eval time.
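Conceptually, the assembly looks something like the sketch below. Only `build_shrew_agent_for_eval()` and the `AGENT_TOOLS` name appear on this page; the import paths and the `Agent` constructor are assumptions for illustration.

```python
# Hypothetical sketch only; real import paths and the Agent signature may differ.
from chimera.agent import Agent               # assumed path
from chimera.shrew.tools import AGENT_TOOLS   # assumed path

def build_shrew_agent_for_eval() -> Agent:
    # Benchmarks deliberately get the full tool group rather than the
    # --allowed-tools subset, so the model has every chance to succeed.
    return Agent(tools=AGENT_TOOLS)
```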
## Exit codes

| Code | Meaning |
|---|---|
| `0` | Benchmark ran and at least one task passed. |
| `1` | Benchmark ran, nothing passed. |
| `2` | Malformed invocation (missing or unknown benchmark name). |
| `3` | Dataset not staged, or runtime failure during the run. |
Exit 3 is the “needs setup” signal. Outer CI scripts can treat
it distinctly from “ran but nothing passed”.
## Staging Aider Polyglot

The polyglot benchmark is the Exercism polyglot exercise corpus plus a per-task index. Shrew does not vendor the dataset — licenses are mixed and we don’t ship third-party content.
### Default location

```
~/.chimera/datasets/aider-polyglot/
  tasks.json
  exercises/<id>/
    stub.py
    <id>_test.py
    ...
```

Override via `$CHIMERA_AIDER_POLYGLOT_PATH=/abs/path/to/dir`.
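A sketch of that resolution order (env-var override first, then the default path); illustrative only, not the in-tree lookup:

```python
import os
from pathlib import Path

# Resolve the aider-polyglot dataset directory: the env var wins, otherwise
# fall back to the default under ~/.chimera. Illustrative sketch only.
default = Path.home() / ".chimera" / "datasets" / "aider-polyglot"
dataset_dir = Path(os.environ.get("CHIMERA_AIDER_POLYGLOT_PATH", default))

if not (dataset_dir / "tasks.json").exists():
    raise SystemExit(f"dataset not staged at {dataset_dir}")
```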
### tasks.json schema

A list of task dicts:

```json
[
  {
    "id": "python/hello-world",
    "language": "python",
    "prompt": "Implement the hello() function so the test passes.",
    "expected_files": {
      "hello_world.py": "def hello():\n    return 'Hello, World!'\n"
    },
    "test_command": "pytest -x -q",
    "exercise_dir": "hello-world",
    "timeout_s": 90
  }
]
```

| Key | Required | Meaning |
|---|---|---|
| `id` | yes | Used as the task id; should be unique. |
| `language` | no | Threaded into the prompt. |
| `prompt` | yes | Agent prompt body. |
| `expected_files` | no | Diff-match scoring (byte-for-byte). |
| `test_command` | no | Test-pass scoring (subprocess). |
| `exercise_dir` | no | Subdir under `exercises/` to stage. |
| `timeout_s` | no | Test command timeout; default 90. |
When both `expected_files` and `test_command` are present, `expected_files` wins. When only `test_command` is present, `evaluate()` runs the command from the staged exercise copy and passes when the exit code is zero.
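That precedence reads roughly like the sketch below; this illustrates the rule just described and is not the in-tree `evaluate()`.

```python
import subprocess
from pathlib import Path

def evaluate_task(task: dict, workdir: Path) -> bool:
    """Sketch of the scoring precedence: expected_files (byte-for-byte
    diff-match) wins over test_command (pass when the exit code is zero)."""
    expected = task.get("expected_files")
    if expected:
        return all(
            (workdir / name).exists() and (workdir / name).read_text() == body
            for name, body in expected.items()
        )
    command = task.get("test_command")
    if command:
        proc = subprocess.run(
            command, shell=True, cwd=workdir, timeout=task.get("timeout_s", 90)
        )
        return proc.returncode == 0
    return False
```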
### Setup steps

- Clone the polyglot exercise corpus locally — the upstream Aider project hosts a recipe for assembling it from the Exercism tracks. Follow their instructions; do not vendor it into chimera.
- Author `tasks.json` in the schema above. Start with five tasks, confirm the harness runs end-to-end, then expand (see the sketch after this list).
- Optionally stage exercise trees under `exercises/<id>/`. The harness copies these into the agent’s working directory at the start of each task.
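A starter `tasks.json` can be written from Python so the string escaping stays correct; the single task below is a placeholder, not a shipped fixture.

```python
import json
from pathlib import Path

# Write a minimal starter tasks.json with one placeholder task.
tasks = [
    {
        "id": "python/hello-world",
        "language": "python",
        "prompt": "Implement the hello() function so the test passes.",
        "test_command": "pytest -x -q",
        "exercise_dir": "hello-world",
        "timeout_s": 90,
    }
]

out = Path("~/.chimera/datasets/aider-polyglot").expanduser()
out.mkdir(parents=True, exist_ok=True)
(out / "tasks.json").write_text(json.dumps(tasks, indent=2))
```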
### Run it

```
chimera shrew bench aider-polyglot --bench-limit 5
```

When the dataset is missing, shrew prints a setup hint with the
expected path, the env-var override, and a reminder of the schema,
then exits with code 3.
## Staging GAIA

GAIA is a gated research dataset — you need to accept the dataset license on Hugging Face before downloading. Shrew does not vendor it.
### Default location

```
~/.chimera/datasets/gaia/
  tasks.json
```

Override via `$CHIMERA_GAIA_PATH=/abs/path/to/dir`.
### tasks.json schema

```json
[
  {
    "task_id": "abc-123",
    "Question": "What was the population of ... in 2010?",
    "Final answer": "12345",
    "Level": 1,
    "file_name": "data.xlsx"
  }
]
```

Both `"Question"` / `"question"` and `"Final answer"` / `"final_answer"` keys are accepted to match the upstream parquet schema. `Level` is informational; the adapter accepts an optional `level=` filter.
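A rough sketch of loading the file with that dual-key tolerance and an optional level filter; illustrative, not the in-tree adapter.

```python
import json
from pathlib import Path

def load_gaia_tasks(path: Path, level: int | None = None) -> list[dict]:
    """Accept either capitalised or snake_case keys; optionally filter by Level."""
    tasks = json.loads(path.read_text())
    selected = []
    for task in tasks:
        if level is not None and task.get("Level") != level:
            continue
        selected.append({
            "task_id": task["task_id"],
            "question": task.get("Question") or task.get("question"),
            "answer": task.get("Final answer") or task.get("final_answer"),
        })
    return selected
```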
### Setup steps

- Accept the GAIA dataset license on Hugging Face.
- Download the validation set (or test set, if you have access).
- Convert the parquet to `tasks.json` matching the schema above. A one-line conversion is sufficient: pandas → records → json (see the sketch after this list).
- Drop `tasks.json` at `~/.chimera/datasets/gaia/`.
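For example, assuming the validation split has been downloaded as a local parquet file (the file name below is a placeholder):

```python
from pathlib import Path

import pandas as pd

# "validation.parquet" stands in for the file downloaded from Hugging Face.
df = pd.read_parquet("validation.parquet")

out = Path("~/.chimera/datasets/gaia").expanduser()
out.mkdir(parents=True, exist_ok=True)
df.to_json(out / "tasks.json", orient="records")
```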
### Run it

```
chimera shrew bench gaia --bench-limit 5
chimera shrew bench gaia --bench-limit 0   # full run (~165 tasks)
```

The default of `--bench-limit 5` is intentional: an unguarded `shrew bench gaia` would otherwise kick off all 165 validation tasks on a paid LLM call when the user just wanted to smoke-test.
### Scoring

Shrew re-implements the GAIA scorer locally rather than depending on an upstream `gaia_scorer` module. The scorer extracts an `Answer: <value>` line from the agent’s final reply and compares it to the gold answer using GAIA-style normalisation (accent stripping, lowercasing, article removal, list/numeric awareness).
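A minimal sketch of that extraction and normalisation, for orientation only (the in-tree scorer also handles lists and numbers):

```python
import re
import unicodedata

def extract_answer(reply: str) -> str | None:
    """Pull the last 'Answer: <value>' line out of the agent's final reply."""
    matches = re.findall(r"^\s*answer:\s*(.+)$", reply, flags=re.IGNORECASE | re.MULTILINE)
    return matches[-1].strip() if matches else None

def normalise(value: str) -> str:
    """GAIA-style normalisation sketch: accent stripping, lowercasing,
    article removal, whitespace collapsing."""
    value = unicodedata.normalize("NFKD", value)
    value = "".join(ch for ch in value if not unicodedata.combining(ch))
    value = value.lower()
    value = re.sub(r"\b(a|an|the)\b", " ", value)
    return re.sub(r"\s+", " ", value).strip()

def score(reply: str, gold: str) -> bool:
    answer = extract_answer(reply)
    return answer is not None and normalise(answer) == normalise(gold)
```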
If the grader is wrong on a specific task, record the raw answer and flag it for manual review. Do not loosen the scorer — the GAIA-style normalisation is faithful to the upstream rules; an incorrect score on a single task is a labelling issue, not a scorer bug.
## Output shape

The harness prints a one-line summary on stdout:

```
aider-polyglot: passed=3/5 rate=60.0% cost=$0.0142
gaia: passed=2/5 rate=40.0% cost=$0.0089
```

Per-task event streams persist to `~/.chimera/eventlog/shrew-<id>/` like any other shrew session.
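If you want to consume the summary programmatically, a small parse along these lines works; the regex is derived from the format shown above, not from the harness source.

```python
import re

# Parse a summary line such as "gaia: passed=2/5 rate=40.0% cost=$0.0089".
SUMMARY = re.compile(
    r"^(?P<bench>\S+): passed=(?P<passed>\d+)/(?P<total>\d+) "
    r"rate=(?P<rate>[\d.]+)% cost=\$(?P<cost>[\d.]+)$"
)

def parse_summary(line: str) -> dict:
    match = SUMMARY.match(line.strip())
    if match is None:
        raise ValueError(f"not a shrew bench summary line: {line!r}")
    fields = match.groupdict()
    return {
        "bench": fields["bench"],
        "passed": int(fields["passed"]),
        "total": int(fields["total"]),
        "rate": float(fields["rate"]),
        "cost_usd": float(fields["cost"]),
    }

print(parse_summary("gaia: passed=2/5 rate=40.0% cost=$0.0089"))
```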
## terminal-bench

Reserved by the parser for parity with otter / mink. Currently returns a friendly “not yet wired” message and exit code 3. The adapter is on the roadmap; the polyglot + GAIA pair is the smallest useful surface for shrew’s small-model focus, so they shipped first.
## Wiring your own benchmark

Inherit from `chimera.eval.harness.Benchmark`:

```python
from chimera.eval.harness import Benchmark, EvalResult

class MyBench(Benchmark):
    def tasks(self) -> list[dict]:
        ...

    def evaluate(self, task, agent_output) -> bool:
        ...
```

Then drive it with the same `Harness` shrew uses:

```python
from chimera.eval.harness import Harness
from chimera.shrew.benchmarks.cli import build_shrew_agent_for_eval

harness = Harness(benchmark=MyBench(...), agent=build_shrew_agent_for_eval())
result = harness.run()
print(result)
```

For the in-tree examples, read `chimera/shrew/benchmarks/aider_polyglot.py` and `chimera/shrew/benchmarks/gaia.py`.
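As a toy illustration of that interface (the task shape and pass rule here are made up, and the `Benchmark` constructor arguments are an assumption):

```python
from chimera.eval.harness import Benchmark

class SubstringBench(Benchmark):
    """Toy benchmark: each task is a prompt plus a string the agent's
    output must contain. Illustrative only, not an in-tree benchmark."""

    def __init__(self, task_list: list[dict]):
        self._tasks = task_list

    def tasks(self) -> list[dict]:
        return self._tasks

    def evaluate(self, task: dict, agent_output: str) -> bool:
        return task["expected_substring"] in agent_output

toy = SubstringBench([
    {"id": "greet", "prompt": "Say hello.", "expected_substring": "hello"},
])
```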
## CI integration tips

- Pin a model. Set `$SHREW_MODEL` so the harness doesn’t silently fall back to whatever cloud key happens to be in the CI environment.
- Cache the dataset. Stage the dataset directory once per CI worker and reuse it via the `$CHIMERA_*_PATH` overrides.
- Treat exit `3` as skip. The dataset may not be on every worker; `3` means “not staged here”, which is different from “ran and failed”. Skip the job rather than failing the build (see the sketch after this list).
- Record the result line. The one-line summary is parseable (`bench: passed=N/M rate=X% cost=$Y`); pipe it to your CI artifact store.
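A hedged sketch of the exit-code handling in a CI wrapper; mapping exit 3 to a skip (here, exit 0 plus a log line) is a choice about your CI, not something shrew does itself.

```python
import subprocess
import sys

proc = subprocess.run(
    ["chimera", "shrew", "bench", "aider-polyglot", "--bench-limit", "5"],
    capture_output=True,
    text=True,
)
sys.stdout.write(proc.stdout)          # includes the parseable summary line

if proc.returncode == 3:
    print("dataset not staged on this worker; skipping benchmark job")
    sys.exit(0)                        # skip, don't fail the build

sys.exit(proc.returncode)
```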
## See also

- `quickstart.md` — first-run walkthrough.
- `small-model-setup.md` — getting llama.cpp in place before benchmarking.
- `extensions.md` — the small-model adjustments that bite at runtime; turned off in the eval agent for fair comparison.
- `parity-matrix.md` — benchmark coverage status.
- `docs/playbooks/07-benchmarking.md` — generic Chimera benchmarking guidance.