
MBPP

MBPP is Google’s 974-problem Python eval. Each task is a one-line natural-language description plus three assert statements. A solution passes when all three asserts hold.

Status: TODO (adapter wired; not benchmarked yet)

Run       Score
Chimera   NOT RUN

The adapter is feature-complete; the gap is a baseline run. Tracked as issue #11.

from chimera.eval.benchmarks import MBPP
from chimera.eval.harness import Harness
bench = MBPP(dataset_path="mbpp.jsonl", split="test")
print(bench.name())  # "mbpp-test"
print(len(bench.tasks()))  # 500 for the test split; 974 across all four splits
harness = Harness(agent=my_agent, benchmark=bench)  # my_agent: your agent instance, defined elsewhere
results = harness.run()
print(results.pass_rate())

Splits: train (374), validation (90), test (500), prompt (10). Together they make up the 974 problems. The string returned by name() is suffixed with the split (mbpp-test, mbpp-validation, …).
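
The split sizes above match the task-ID ranges published with the original MBPP release (1-10 prompt, 11-510 test, 511-600 validation, 601-974 train). The sketch below assumes the adapter follows those upstream ranges, which this page does not confirm; SPLIT_RANGES, split_for and split_sizes are hypothetical names, not part of chimera.

```python
# Inclusive task-ID ranges from the upstream MBPP release.
# Assumption: the adapter's splits use the same ranges.
SPLIT_RANGES = {
    "prompt": (1, 10),
    "test": (11, 510),
    "validation": (511, 600),
    "train": (601, 974),
}


def split_for(task_id: int) -> str:
    """Return the name of the split that a task ID falls into."""
    for name, (low, high) in SPLIT_RANGES.items():
        if low <= task_id <= high:
            return name
    raise ValueError(f"task_id {task_id} is outside 1..974")


def split_sizes() -> dict:
    """Return the number of tasks in each split."""
    return {name: high - low + 1 for name, (low, high) in SPLIT_RANGES.items()}
```

Under these ranges, the sample record shown on this page (task_id 11) is the first task of the test split.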

{
"task_id": 11,
"text": "Write a function to remove first and last occurrence of a given character from the string.",
"code": "def remove_Occ(s, ch): ...",
"test_list": [
"assert remove_Occ(\"hello\", \"l\") == \"heo\"",
"assert remove_Occ(\"abcda\", \"a\") == \"bcd\"",
"assert remove_Occ(\"PHP\", \"P\") == \"H\""
]
}

In-process: the agent’s output is exec()’d, then each assert from test_list is run. A task passes only if all asserts pass (no partial credit).
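
The all-or-nothing rule can be sketched in a few lines. This is a minimal illustration of the scoring described above, not the adapter's actual code: passes is a hypothetical helper, and it has no timeout or sandboxing, so it should only be run on trusted output.

```python
def passes(candidate_code: str, test_list: list) -> bool:
    """Return True only if the candidate defines code that satisfies every assert."""
    namespace = {}
    try:
        # Define the candidate's functions in a fresh namespace.
        exec(candidate_code, namespace)
        # Run each assert against that namespace; the first failure ends the task.
        for test in test_list:
            exec(test, namespace)
    except Exception:
        # Covers AssertionError, SyntaxError, NameError and other runtime errors.
        return False
    return True


# Asserts from the sample record (task_id 11).
TESTS = [
    'assert remove_Occ("hello", "l") == "heo"',
    'assert remove_Occ("abcda", "a") == "bcd"',
    'assert remove_Occ("PHP", "P") == "H"',
]

# One possible solution: drop the first occurrence, then the last remaining one.
GOOD = '''
def remove_Occ(s, ch):
    first = s.find(ch)
    if first != -1:
        s = s[:first] + s[first + 1:]
    last = s.rfind(ch)
    if last != -1:
        s = s[:last] + s[last + 1:]
    return s
'''

# A candidate that returns the input unchanged fails the first assert.
BAD = "def remove_Occ(s, ch):\n    return s\n"
```

Here passes(GOOD, TESTS) is True and passes(BAD, TESTS) is False. A candidate that satisfied two of the three asserts would also score False, since there is no partial credit.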

  • MBPP’s prompt style is terse, so small models often need a few-shot prefix to grasp the convention. The adapter does not prepend examples by default.
  • Watch for prompt leakage: the dataset’s code field is the reference solution. Filter it out of your prompt format.
  • Some asserts depend on Python version quirks (dict ordering, float formatting). The adapter does not retry on near-misses.
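
The first two notes can be combined into one prompt builder: few-shot examples include their reference code, while the target task never does. This is a sketch under stated assumptions. format_task and build_prompt are hypothetical helpers, not adapter API, and the [BEGIN]/[DONE] delimiters are an illustrative convention rather than something the adapter requires.

```python
def format_task(record: dict, include_solution: bool = False) -> str:
    """Render one task as description plus asserts, optionally with its solution."""
    tests = "\n".join(record["test_list"])
    prompt = f"{record['text']}\nYour code should pass these tests:\n{tests}\n"
    if include_solution:
        # Only few-shot examples should ever reach this branch.
        prompt += f"[BEGIN]\n{record['code']}\n[DONE]\n"
    return prompt


def build_prompt(target: dict, shots: list) -> str:
    """Prefix solved examples, then the target task without its 'code' field."""
    parts = [format_task(shot, include_solution=True) for shot in shots]
    parts.append(format_task(target) + "[BEGIN]\n")
    return "\n".join(parts)


# Hypothetical few-shot record, written here for illustration only.
SHOT = {
    "text": "Write a function to add two numbers.",
    "code": "def add(a, b):\n    return a + b",
    "test_list": ["assert add(1, 2) == 3"],
}

# Target shaped like the sample record; its 'code' value is a stand-in.
TARGET = {
    "text": "Write a function to remove first and last occurrence of a given character from the string.",
    "code": "def remove_Occ(s, ch):\n    return 'REFERENCE SOLUTION'",
    "test_list": ['assert remove_Occ("PHP", "P") == "H"'],
}

PROMPT = build_prompt(TARGET, [SHOT])
```

Drawing the few-shot records from the prompt split keeps test tasks out of the prefix.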