HumanEval+
HumanEval+ (extended tests on HumanEval)
HumanEval+ keeps HumanEval’s 164 Python problems but ships about 80× as many test cases per problem. Many models that score ≥80% on base HumanEval drop 10–20 points on the plus suite. It’s the standard “is your model actually correct?” follow-up.
References:
- GitHub: https://github.com/evalplus/evalplus
- Paper: arXiv:2305.01210
- PyPI: `evalplus`
Status: SCAFFOLD (adapter wired; runner gated on evalplus)
| Surface | State |
|---|---|
| `HumanEvalPlus` adapter | DONE |
| Falls back to local JSON / JSONL dataset | DONE |
| Uses `evalplus.evaluate` when the pip package is installed | DONE |
| Discoverable via `chimera eval --benchmark humaneval+` | TODO (#10); load directly in Python today |
How to run
```bash
pip install chimera-run[anthropic] evalplus
```

```python
from chimera.eval.benchmarks import HumanEvalPlus
from chimera.eval.harness import Harness

bench = HumanEvalPlus(version="plus")  # or "base"
print(bench.name())        # "human-eval-plus"
print(len(bench.tasks()))  # 164

harness = Harness(agent=my_agent, benchmark=bench)
results = harness.run()
print(results.pass_rate())
```

Without `evalplus` installed, point `dataset_path` at a local JSON / JSONL dump.
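Before pointing `dataset_path` at a local dump, it can help to check that the file parses. The sketch below is a standalone helper, not part of Chimera or evalplus. The names `load_tasks` and `missing_fields` are hypothetical, and the field names (`task_id`, `prompt`, `entry_point`) are assumed from the original HumanEval JSONL layout; the schema Chimera's adapter actually expects is not stated on this page.

```python
import json
from pathlib import Path


def load_tasks(path):
    """Read task records from a .jsonl (one object per line) or .json file."""
    text = Path(path).read_text(encoding="utf-8")
    if str(path).endswith(".jsonl"):
        # JSONL: one JSON object per non-empty line.
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    data = json.loads(text)
    # JSON: accept either a list of records or a {task_id: record} mapping.
    return list(data.values()) if isinstance(data, dict) else list(data)


def missing_fields(record, required=("task_id", "prompt", "entry_point")):
    """Return the assumed-required field names that a record lacks."""
    return [name for name in required if name not in record]
```

Calling `load_tasks("dump.jsonl")` and checking that `missing_fields` returns an empty list for every record gives a quick sanity check before a full run.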
Grading contract
`evaluate()` defers to `evalplus.evaluate`, which runs both the base and plus test sets for each problem with a 30-second timeout per test case. A task passes only if every test passes.
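The all-or-nothing rule with a per-case timeout can be illustrated with a small standalone sketch. This is not how `evalplus` implements grading and it does no sandboxing; `run_case` and `task_passes` are hypothetical names, and test cases are assumed to be plain `assert` statements appended to the solution source.

```python
import subprocess
import sys


def run_case(solution_src, assertion, timeout_s=30.0):
    """Run one assertion against a solution in a fresh interpreter."""
    program = solution_src + "\n" + assertion + "\n"
    try:
        proc = subprocess.run(
            [sys.executable, "-c", program],
            capture_output=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # A case that exceeds its time budget counts as a failure.
        return False
    # A failed assert or any uncaught exception gives a non-zero exit code.
    return proc.returncode == 0


def task_passes(solution_src, base_cases, plus_cases, timeout_s=30.0):
    """All-or-nothing: every base and plus case must pass."""
    cases = list(base_cases) + list(plus_cases)
    return all(run_case(solution_src, case, timeout_s) for case in cases)
```

Because `all()` short-circuits, the first failing or timed-out case ends the task, which mirrors the contract that a single failure fails the whole task.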
Gotchas
- `evalplus` is heavy (numpy + sandboxing deps). Run it in a fresh venv to avoid version conflicts with the rest of Chimera’s deps.
- The `plus` suite includes adversarial inputs (extreme floats, empty containers); models that rely on common-case prompts often regress.
- Use HumanEval for the baseline number that’s directly comparable to OpenAI’s published score.
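The kind of regression the `plus` suite exposes can be shown with a toy example. The task here (return the longest string, or `None` for an empty list) and both test lists are made up for illustration and are not taken from the HumanEval+ dataset.

```python
def longest_naive(strings):
    # Works on typical inputs, but max() raises ValueError on an empty list.
    return max(strings, key=len)


def longest_robust(strings):
    # Handles the empty-container edge case explicitly.
    if not strings:
        return None
    return max(strings, key=len)


def passes(fn, cases):
    """True only if fn returns the expected value for every (args, expected) pair."""
    for args, expected in cases:
        try:
            if fn(*args) != expected:
                return False
        except Exception:
            return False
    return True


# Illustrative cases only: the "plus" list adds edge cases to the base list.
BASE_CASES = [((["a", "bb", "ccc"],), "ccc"), ((["x", "y"],), "x")]
PLUS_CASES = BASE_CASES + [(([],), None), (([""],), "")]

print(passes(longest_naive, BASE_CASES), passes(longest_naive, PLUS_CASES))    # True False
print(passes(longest_robust, BASE_CASES), passes(longest_robust, PLUS_CASES))  # True True
```

Both solutions look identical on the base cases; only the extended cases separate them, which is why a base-only score can overstate correctness.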