
# chimera.eval

`chimera.eval` is the evaluation harness: it runs an agent against a benchmark, computes metrics, and writes reports.

```python
from chimera.eval import (
    Harness,
    Benchmark,
    pass_at_k,
    resolve_rate,
    avg_cost,
)
from chimera.eval.benchmarks import (
    SWEBenchBenchmark,
    HumanEvalBenchmark,
    AIMOBenchmark,
    CustomBenchmark,
)
```
| Symbol | Purpose |
| --- | --- |
| `Harness` | Drives an agent through a benchmark: `Harness(agent_factory, benchmark).run(limit=...)`. |
| `Benchmark` | Abstract base class. Implement `tasks()` and `score(task, result)`. |
| `pass_at_k(results, k)` | Pass@k metric. |
| `resolve_rate(results)` | Fraction of tasks marked resolved. |
| `avg_cost(results)` | Mean cost in USD per task. |
| `SWEBenchBenchmark` | Loader for SWE-bench / SWE-bench Lite. |
| `HumanEvalBenchmark` | HumanEval (164 problems). |
| `AIMOBenchmark` | AIMO math benchmark. |
| `CustomBenchmark` | Bring-your-own task list. |
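To give a feel for what the metric functions compute, here is a minimal sketch, not `chimera.eval`'s actual implementation: the standard unbiased pass@k estimator expressed per task as `(n, c, k)` (n samples, c correct), whereas the library's `pass_at_k(results, k)` presumably aggregates this over a results list, plus a `resolve_rate` over an assumed `{"resolved": bool}` result schema.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one task: probability that at least one of k
    samples drawn without replacement from n attempts (c correct) passes.
    Per-task signature is an assumption for illustration."""
    if n - c < k:
        return 1.0  # fewer failures than draws, so a pass is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

def resolve_rate(results: list[dict]) -> float:
    """Fraction of tasks marked resolved (result schema is assumed)."""
    return sum(r.get("resolved", False) for r in results) / len(results)
```

With a single correct sample out of two, `pass_at_k(2, 1, 1)` gives 0.5, matching the intuition that one draw from {pass, fail} succeeds half the time.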

To plug in a custom benchmark, subclass `Benchmark` and implement `tasks()` and `score(task, result)`.
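A sketch of such a subclass, under stated assumptions: the stand-in `Benchmark` ABC below exists only so the snippet runs on its own (in real use, `from chimera.eval import Benchmark` replaces it), and the task/score dict schemas are illustrative, not the library's.

```python
from abc import ABC, abstractmethod

# Stand-in for chimera.eval.Benchmark (ABC with tasks() and score()),
# included so this sketch is self-contained.
class Benchmark(ABC):
    @abstractmethod
    def tasks(self): ...
    @abstractmethod
    def score(self, task, result): ...

class AdditionBenchmark(Benchmark):
    """Toy benchmark: the agent must add two integers."""
    def tasks(self):
        # Hypothetical task schema: id, inputs, expected answer.
        return [{"id": i, "a": i, "b": i + 1, "expected": 2 * i + 1}
                for i in range(3)]
    def score(self, task, result):
        # Hypothetical score schema keyed on "resolved".
        return {"resolved": result == task["expected"]}
```

An instance would then be handed to the harness, e.g. `Harness(agent_factory, AdditionBenchmark()).run(limit=3)`.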