HumanEval
HumanEval (function generation)
Section titled “HumanEval (function generation)”HumanEval is OpenAI’s original code-generation eval: 164 Python problems, each with a function signature, docstring, and 7.7 unit tests on average. The agent fills in the function body; pass@1 is computed against the test set.
References:
- GitHub: https://github.com/openai/human-eval
- Paper: arXiv:2107.03374
- HuggingFace dataset: https://huggingface.co/datasets/openai_humaneval
Status: DONE (partial run reported)
Section titled “Status: DONE (partial run reported)”| Model | Score | Notes |
|---|---|---|
| GLM-5 | 90.9% pass@1 (claimed) | Raw data lost — unverifiable. |
| GLM-5.1 | 66.5% pass@1 (reported) | Raw data in data/. |
See docs/benchmarks/2026-03-30-humaneval-glm51.md for the report.
How to run
Section titled “How to run”# Smallest path — Python scriptsource .envpython examples/benchmarks/humaneval_full.py --count 164
# Via the CLIchimera eval --benchmark humaneval --dataset ./humaneval.jsonl --limit 164 --output results.jsonOr directly in code:
from chimera.eval.benchmarks import HumanEvalfrom chimera.eval.harness import Harness
bench = HumanEval(dataset_path="humaneval.jsonl")print(bench.name()) # "human-eval"print(len(bench.tasks())) # 164
harness = Harness(agent=my_agent, benchmark=bench)results = harness.run()print(results.pass_rate()) # e.g. 0.665Task shape
Section titled “Task shape”{ "task_id": "HumanEval/0", "prompt": "def has_close_elements(numbers, threshold):\n \"\"\"...\"\"\"\n", "canonical_solution": "...", "test": "def check(candidate):\n assert candidate([1.0, 2.0], 0.5) is False\n...", "entry_point": "has_close_elements"}Grading
Section titled “Grading”In-process: the agent’s output is concatenated onto the prompt, the file is executed, and the test’s check(candidate) is called. A 30-second timeout applies per task.
Gotchas
Section titled “Gotchas”- Dataset shape: HumanEval ships as JSONL via the
human_evalpip package, but the adapter also accepts{"tasks": [...]}JSON. Pass either. - Common failure mode: the model emits the full function instead of just the body. The adapter handles both — the post-processor strips a redundant signature when the prompt is repeated.
- For stricter coverage (extra hidden tests on the same problems), use HumanEval+.
- For multi-language coverage, use HumanEval-X.