HumanEval+

HumanEval+ keeps HumanEval’s 164 Python problems but expands the test suite to roughly 80× as many test cases per problem. Many models that score ≥80% on base HumanEval drop 10–20 points on the plus suite. It’s the standard “is your model actually correct?” follow-up.

Status: SCAFFOLD (adapter wired; runner gated on evalplus)

| Surface | State |
| --- | --- |
| HumanEvalPlus adapter | DONE |
| Falls back to local JSON / JSONL dataset | DONE |
| Uses evalplus.evaluate when the pip package is installed | DONE |
| Discoverable via chimera eval --benchmark humaneval+ | TODO (#10); load directly in Python today |
pip install "chimera-run[anthropic]" evalplus

from chimera.eval.benchmarks import HumanEvalPlus
from chimera.eval.harness import Harness

bench = HumanEvalPlus(version="plus")  # or "base"
print(bench.name())        # "human-eval-plus"
print(len(bench.tasks()))  # 164

# my_agent is the agent under test; construct it before this point.
harness = Harness(agent=my_agent, benchmark=bench)
results = harness.run()
print(results.pass_rate())

Without evalplus installed, point dataset_path at a local JSON / JSONL dump.
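
As a rough illustration of what a local JSONL dump could look like, the sketch below writes and re-reads a one-record file. The field names (task_id, prompt, entry_point, test) are an assumption modeled on the public HumanEval layout; the fields the HumanEvalPlus adapter actually reads are not documented here, so check the adapter before relying on them. write_jsonl and read_jsonl are hypothetical helpers, not part of Chimera.

```python
import json
import tempfile
from pathlib import Path

# Assumed record layout (HumanEval-style); not confirmed against the adapter.
records = [
    {
        "task_id": "Example/0",
        "prompt": 'def add(a, b):\n    """Return a + b."""\n',
        "entry_point": "add",
        "test": "def check(candidate):\n    assert candidate(1, 2) == 3\n",
    },
]


def write_jsonl(path, rows):
    """Write one JSON object per line."""
    with path.open("w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")


def read_jsonl(path):
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


dump_path = Path(tempfile.mkdtemp()) / "humaneval_plus.jsonl"
write_jsonl(dump_path, records)
loaded = read_jsonl(dump_path)
```

A file like the one at dump_path is the kind of target dataset_path would point at.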

evaluate() defers to evalplus.evaluate, which runs the base and plus test sets for each problem with a 30-second timeout per test case. A task passes only if every test passes.
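
The pass rule described above (every case must pass, each under its own timeout) can be sketched in a few lines. This is not evalplus’s implementation; it is a minimal stand-in that runs each case in a fresh interpreter via subprocess. The names run_case and task_passes are hypothetical, arguments and results are assumed to be JSON-serializable, and results are compared with exact equality, whereas a real harness may apply float tolerances.

```python
import json
import subprocess
import sys


def run_case(solution_src, entry_point, args, timeout_s=30.0):
    """Run one test input in a fresh interpreter; return (ok, result)."""
    program = (
        solution_src
        + "\nimport json, sys\n"
        + f"print(json.dumps({entry_point}(*json.loads(sys.argv[1]))))\n"
    )
    try:
        proc = subprocess.run(
            [sys.executable, "-c", program, json.dumps(args)],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False, None  # the case exceeded its time budget
    if proc.returncode != 0:
        return False, None  # the solution raised or failed to run
    # The harness prints the result last, so parse the final stdout line.
    return True, json.loads(proc.stdout.strip().splitlines()[-1])


def task_passes(solution_src, entry_point, cases, timeout_s=30.0):
    """A task passes only if every (args, expected) case passes."""
    for args, expected in cases:
        ok, got = run_case(solution_src, entry_point, args, timeout_s)
        if not ok or got != expected:
            return False
    return True


good = "def add(a, b):\n    return a + b\n"
bad = "def add(a, b):\n    return a - b\n"
hang = "def add(a, b):\n    while True:\n        pass\n"
cases = [([1, 2], 3), ([0, 0], 0)]
```

Here task_passes(good, "add", cases) succeeds, while bad fails on the first case and hang fails once the timeout expires (pass a small timeout_s to see that quickly).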

  • evalplus is heavy (numpy + sandboxing deps). Run it in a fresh venv to avoid version conflicts with the rest of Chimera’s deps.
  • The plus suite includes adversarial inputs (extreme floats, empty containers); models that rely on common-case prompts often regress.
  • Use HumanEval for the baseline number that’s directly comparable to OpenAI’s published score.
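
To make the adversarial-input caveat concrete, here is a toy example, not taken from the HumanEval+ dataset: a solution that handles the common case but breaks on an empty container. The spec (the mean of an empty list is 0.0) and all names are invented for illustration.

```python
def mean_naive(xs):
    # Fine for non-empty input, but raises ZeroDivisionError on [].
    return sum(xs) / len(xs)


def mean_robust(xs):
    # Hypothetical spec for this example: the mean of an empty list is 0.0.
    return sum(xs) / len(xs) if xs else 0.0


def passes(fn, cases):
    """Return True only if fn matches the expected value on every case."""
    for arg, expected in cases:
        try:
            if fn(arg) != expected:
                return False
        except Exception:
            return False
    return True


# "Base"-style cases cover only typical inputs.
base_cases = [([1, 2, 3], 2.0), ([10], 10.0)]
# "Plus"-style cases add an empty container and extreme floats.
plus_cases = base_cases + [([], 0.0), ([1e308, -1e308], 0.0)]

naive_base = passes(mean_naive, base_cases)
naive_plus = passes(mean_naive, plus_cases)
robust_plus = passes(mean_robust, plus_cases)
```

The naive version clears the base cases and fails the extended set, which is the kind of regression the plus suite is meant to surface.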