HumanEval-X
HumanEval-X (multi-language HumanEval)
HumanEval-X translates the original 164 HumanEval problems into 5 languages
(Python, Java, C++, Go, JavaScript): the problems are the same, but the
function signatures and tests are language-specific. This is a different
surface from `humaneval_plus.py`, which extends Python coverage with more
tests for the same Python problems.
References:
- HuggingFace: https://huggingface.co/datasets/THUDM/humaneval-x
- GitHub: https://github.com/THUDM/CodeGeeX
- Paper: arXiv:2303.17568
Status: SCAFFOLD
| Surface | State |
|---|---|
| `HumanEvalXTask` dataclass | DONE |
| Loader (JSON / JSONL / `{"tasks": [...]}` / language filter / limit) | DONE |
| Python in-process grading | DONE (mirrors HumanEval) |
| Java / C++ / Go / JavaScript grading | STUBBED (returns `False`, does not raise `NotImplementedError`) |
| Discoverable via `chimera eval --benchmark humaneval-x` | DONE |
Live grading for compiled languages should reuse the runners under
chimera/eval/benchmarks/runners/ (see MultiSWEBench). Tracked as a
follow-up.
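
The three dataset shapes the loader accepts (a JSON array, a `{"tasks": [...]}` wrapper, and JSONL) can be told apart with a single parse attempt. The sketch below is illustrative only and is not the actual chimera loader: `load_tasks` and its parameters are hypothetical names, and it assumes each record carries a `"language"` key as in the task format shown on this page.

```python
import json
from pathlib import Path
from typing import Any, Optional


def load_tasks(
    path: str,
    language: Optional[str] = None,
    limit: Optional[int] = None,
) -> list[dict[str, Any]]:
    """Hypothetical loader sketch: JSON array, {"tasks": [...]}, or JSONL."""
    text = Path(path).read_text(encoding="utf-8")
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        # Not one JSON document, so treat the file as JSONL (one record per line).
        records = [json.loads(line) for line in text.splitlines() if line.strip()]
    else:
        if isinstance(data, dict) and "tasks" in data:
            records = data["tasks"]   # {"tasks": [...]} wrapper
        elif isinstance(data, list):
            records = data            # plain JSON array
        else:
            records = [data]          # a JSONL file holding a single record

    # Assumption: the language filter is an exact match on the "language" key.
    if language is not None:
        records = [r for r in records if r.get("language") == language]
    return records if limit is None else records[:limit]
```

Applying the language filter before the limit means `limit=10` yields up to ten tasks of the requested language rather than whatever remains of the first ten records. Whether the real loader orders the two this way is not stated here.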
Quick start
```python
from chimera.eval.benchmarks import HumanEvalX

# Filter to Python only
bench = HumanEvalX(dataset_path="humaneval-x.jsonl", language="python")
print(bench.name())  # "humaneval-x-python"
print(HumanEvalX.supported_languages())
# ['cpp', 'go', 'java', 'javascript', 'python']
```

Test harness format
Python tasks use HumanEval's self-driving harness, meaning the test must call
its own `check(candidate)`:
```python
task = {
    "language": "python",
    "prompt": "def add(a, b):\n",
    "test": (
        "def check(candidate):\n"
        "    assert candidate(1, 2) == 3\n"
        "check(add)\n"
    ),
}
```
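
To show why the harness has to be self-driving, here is a minimal sketch of in-process grading: the prompt, the model's completion, and the test are concatenated and executed as one program, so nothing runs the assertions unless the test itself calls `check`. The `grade` function and the `completion` argument are hypothetical names, not the chimera API. The sketch has no sandbox and no timeout, which a real grader would need before running untrusted completions.

```python
from typing import Any


def grade(task: dict[str, Any], completion: str) -> bool:
    """Hypothetical grader sketch: True only if the whole program runs cleanly."""
    if task.get("language") != "python":
        # Mirrors the stub behaviour in the status table: non-Python languages
        # report False instead of raising.
        return False

    program = task["prompt"] + completion + "\n" + task["test"]
    namespace: dict[str, Any] = {}
    try:
        # SyntaxError, AssertionError, and runtime errors all count as failure.
        exec(compile(program, "<humaneval-x-task>", "exec"), namespace)
    except Exception:
        return False
    return True


example_task = {
    "language": "python",
    "prompt": "def add(a, b):\n",
    "test": (
        "def check(candidate):\n"
        "    assert candidate(1, 2) == 3\n"
        "check(add)\n"
    ),
}

passed = grade(example_task, "    return a + b\n")
failed = grade(example_task, "    return a - b\n")
```

If the test only defined `check` without the trailing `check(add)` line, this grader would return `True` for any completion that parses, because no assertion would ever execute.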