BigCodeBench
BigCodeBench (practical library-calling tasks)
BigCodeBench is the BigCode project's 1,140-task eval (hosted on Hugging Face) that exercises real Python library usage. Each task requires calling functions from one or more libraries such as requests, pandas, numpy, pathlib, and subprocess; across the benchmark, the tasks use 723 distinct function calls drawn from 139 libraries. Solutions are graded by executing them against test cases that touch the libraries directly, so a stub that ignores the requested API fails.
References:
- HuggingFace: https://huggingface.co/datasets/bigcode/bigcodebench
- Paper: arXiv:2406.15877
- GitHub: https://github.com/bigcode-project/bigcodebench
Status: TODO (adapter wired; not benchmarked yet)
| Run | Score |
|---|---|
| Chimera | NOT RUN |
Splits
| Split | What it tests |
|---|---|
| `instruct` (default) | Natural-language prompt: “Write a function that …” |
| `complete` | Fill-in-the-blank: a partial function with a `# TODO` block. |
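
The split decides which prompt field of a task the agent sees. As a minimal sketch, assuming the field names shown under "Task shape" (`prompt_for` is a hypothetical helper, not part of the Chimera adapter):

```python
# Hypothetical helper: map a split name to the task field holding its prompt.
PROMPT_FIELDS = {
    "instruct": "instruct_prompt",
    "complete": "complete_prompt",
}


def prompt_for(task: dict, split: str = "instruct") -> str:
    """Return the prompt text for the given split of a BigCodeBench task."""
    if split not in PROMPT_FIELDS:
        raise ValueError(f"unknown split: {split!r}")
    return task[PROMPT_FIELDS[split]]
```

For example, `prompt_for(task, "complete")` would return the partial function, while the default returns the natural-language instruction.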
How to run
```bash
# Stage the dataset locally (one-time)
huggingface-cli download bigcode/bigcodebench --repo-type dataset \
  --local-dir ~/.chimera/datasets/bigcodebench
```

```python
from chimera.eval.benchmarks import BigCodeBench
from chimera.eval.harness import Harness

bench = BigCodeBench(split="instruct")  # or "complete"
print(bench.name())        # "bigcodebench-instruct"
print(len(bench.tasks()))  # 1140

harness = Harness(agent=my_agent, benchmark=bench)
results = harness.run()
```

Override the dataset path via the `CHIMERA_BIGCODEBENCH_PATH` env var or the `dataset_path=` kwarg.
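
The two override mechanisms could be resolved roughly as follows. This is only a sketch: the precedence (kwarg first, then env var, then the staging directory used above) is an assumption, and `resolve_dataset_path` is a hypothetical name rather than the adapter's actual function.

```python
import os
from pathlib import Path
from typing import Optional

# Default staging directory, taken from the download command above.
DEFAULT_PATH = Path.home() / ".chimera" / "datasets" / "bigcodebench"


def resolve_dataset_path(dataset_path: Optional[str] = None) -> Path:
    """Pick the dataset directory.

    Assumed precedence: explicit kwarg, then env var, then the default.
    """
    if dataset_path:
        return Path(dataset_path).expanduser()
    env_value = os.environ.get("CHIMERA_BIGCODEBENCH_PATH")
    if env_value:
        return Path(env_value).expanduser()
    return DEFAULT_PATH
```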
Task shape
```json
{
  "task_id": "BigCodeBench/0",
  "complete_prompt": "...",
  "instruct_prompt": "Write a function that ...",
  "canonical_solution": "...",
  "test": "import unittest\nclass TestCases(unittest.TestCase): ...",
  "entry_point": "task_func",
  "libs": ["numpy", "pandas"]
}
```

Grading
Grading runs in-process: the agent's output is `exec`'d, then the `TestCases` class (a `unittest.TestCase` subclass) defined in the task's `test` field is run. A task passes only when every test case in the suite passes.
Gotchas
- The dataset is not auto-downloaded. `tasks()` returns `[]` if nothing is staged. Pre-flight with `BigCodeBench.dataset_available()`.
- Tests can touch the network and filesystem, so run inside Docker or a sandboxed environment if you don't trust the model's output.
- Some tasks depend on specific library versions. Build a matching venv before grading.
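
One way to catch an incomplete venv before grading is to check that each entry in a task's `libs` field can be imported. The sketch below assumes the entries are importable module names (which may not hold for every task) and does not verify versions; `missing_libs` is a hypothetical helper.

```python
import importlib.util


def missing_libs(task: dict) -> list[str]:
    """List entries of task["libs"] that cannot be found in this environment."""
    missing = []
    for name in task.get("libs", []):
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # Raised for dotted names with a missing parent or a broken spec.
            found = False
        if not found:
            missing.append(name)
    return missing
```

Skipping or flagging tasks whose list is non-empty avoids counting environment problems as model failures.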