HumanEval-X

HumanEval-X translates the original 164 HumanEval problems into 5 languages (Python, Java, C++, Go, JavaScript): the problems are the same, while the function signatures and tests are language-specific. It is a different surface from humaneval_plus.py, which keeps the same Python problems and adds more tests for them.

| Surface | State |
| --- | --- |
| HumanEvalXTask dataclass | DONE |
| Loader (JSON / JSONL / {"tasks": [...]} / language filter / limit) | DONE |
| Python in-process grading | DONE (mirrors HumanEval) |
| Java / C++ / Go / JavaScript grading | STUBBED: returns False (not NotImplementedError) |
| Discoverable via chimera eval --benchmark humaneval-x | DONE |
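
The three input shapes listed for the loader can be told apart from the file contents alone. The following is a minimal sketch of that idea, not the chimera loader itself: `parse_tasks` and its parameters are hypothetical names, and the only task fields assumed are `language` and `prompt`.

```python
import json
from typing import Any, Optional


def parse_tasks(
    text: str,
    language: Optional[str] = None,
    limit: Optional[int] = None,
) -> list[dict[str, Any]]:
    """Hypothetical helper: parse JSON, JSONL, or {"tasks": [...]} content."""
    try:
        # Whole-file JSON: a bare list, a {"tasks": [...]} wrapper,
        # or a single task object (a one-line JSONL file).
        data = json.loads(text)
    except json.JSONDecodeError:
        # Multi-line JSONL: one JSON object per non-empty line.
        data = [json.loads(line) for line in text.splitlines() if line.strip()]

    if isinstance(data, dict):
        tasks = data["tasks"] if "tasks" in data else [data]
    else:
        tasks = list(data)

    # Apply the language filter before the limit, so the limit counts
    # only tasks in the requested language.
    if language is not None:
        tasks = [t for t in tasks if t.get("language") == language]
    return tasks if limit is None else tasks[:limit]


jsonl = (
    '{"language": "python", "prompt": "p"}\n'
    '{"language": "go", "prompt": "g"}\n'
)
print(parse_tasks(jsonl, language="go"))
```

Filtering before truncating is a design choice in this sketch; whether the real loader applies `limit` before or after the language filter is not stated here.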

Live grading for the non-Python languages (Java, C++, Go, JavaScript) should reuse the runners under chimera/eval/benchmarks/runners/ (see MultiSWEBench). This is tracked as a follow-up.
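
Until that follow-up lands, a stubbed grader that returns False looks exactly like a genuinely failing solution. A minimal sketch of one way to keep the two apart when aggregating results, assuming results arrive as (language, passed) pairs; `LIVE_GRADED` and `summarize` are hypothetical names and not part of chimera.

```python
from collections import Counter
from typing import Iterable, Tuple

# Assumption: only Python is graded live, per the status list above.
LIVE_GRADED = frozenset({"python"})


def summarize(results: Iterable[Tuple[str, bool]]) -> dict:
    """Count passed / failed / ungraded results per language."""
    counts: dict = {}
    for language, passed in results:
        bucket = counts.setdefault(language, Counter())
        if language not in LIVE_GRADED:
            # The stub's False carries no information about the solution.
            bucket["ungraded"] += 1
        elif passed:
            bucket["passed"] += 1
        else:
            bucket["failed"] += 1
    return {language: dict(c) for language, c in counts.items()}


print(summarize([("python", True), ("python", False), ("go", False)]))
```

Reporting stubbed languages as "ungraded" avoids publishing a 0% pass rate for Java, C++, Go, or JavaScript that reflects the stub rather than the model.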

from chimera.eval.benchmarks import HumanEvalX
# Filter to Python only
bench = HumanEvalX(dataset_path="humaneval-x.jsonl", language="python")
print(bench.name()) # "humaneval-x-python"
print(HumanEvalX.supported_languages())
# ['cpp', 'go', 'java', 'javascript', 'python']

Python tasks use HumanEval’s self-driving harness: the test code must itself call check(candidate), as in this example task:

task = {
    "language": "python",
    "prompt": "def add(a, b):\n",
    "test": (
        "def check(candidate):\n"
        "    assert candidate(1, 2) == 3\n"
        "check(add)\n"
    ),
}
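
To illustrate why the test has to drive itself, here is a minimal sketch of in-process grading: the prompt, the model's completion, and the test are concatenated and executed in one fresh namespace, and any exception counts as a failure. `grade_python` is a hypothetical helper, not the chimera grader, and it has no sandboxing or timeout, so it should only be run on trusted code.

```python
def grade_python(task: dict, completion: str) -> bool:
    """Hypothetical grader: True if prompt + completion + test runs cleanly."""
    program = task["prompt"] + completion + "\n" + task["test"]
    namespace: dict = {}
    try:
        # AssertionError from check(), SyntaxError from a malformed
        # completion, and any other runtime error all count as failure.
        exec(program, namespace)
    except Exception:
        return False
    return True


example_task = {
    "language": "python",
    "prompt": "def add(a, b):\n",
    "test": (
        "def check(candidate):\n"
        "    assert candidate(1, 2) == 3\n"
        "check(add)\n"
    ),
}

print(grade_python(example_task, "    return a + b\n"))  # True
print(grade_python(example_task, "    return a - b\n"))  # False
```

In this sketch nothing outside the test ever calls check, so a test that defines check(candidate) without the final check(add) line would pass for any completion that parses.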