LiveCodeBench

LiveCodeBench (contamination-free competitive programming)

LiveCodeBench sources problems from LeetCode, AtCoder, and Codeforces and timestamps each one with its release date. Pick a start_date after your model’s training cutoff and the eval is contamination-free: the model cannot have seen any of these problems during pretraining.

Status: TODO (adapter wired; not benchmarked yet)


| Run     | Score   |
|---------|---------|
| Chimera | NOT RUN |

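| Scenario                 | What it tests                                                          |
|--------------------------|------------------------------------------------------------------------|
| codegeneration (default) | Given a problem statement, generate a solution.                        |
| selfrepair               | Given a buggy solution + failing tests, fix it.                        |
| codeexecution            | Predict the output of code without running it.                         |
| testoutput               | Given a problem statement and a test input, predict the expected output. |
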
from chimera.eval.benchmarks import LiveCodeBench
from chimera.eval.harness import Harness

# Code generation, post-GLM-5 training cutoff
bench = LiveCodeBench(
    dataset_path="livecodebench.jsonl",
    scenario="codegeneration",
    start_date="2025-01-01",  # after model's training cutoff
)
print(bench.name())  # e.g. "livecodebench-codegeneration-2025-01-01_..."

# Self-repair window
bench = LiveCodeBench(
    dataset_path="livecodebench.jsonl",
    scenario="selfrepair",
    start_date="2025-01-01",
    end_date="2025-06-30",
)

{
  "problem_id": "twosum-v2-leetcode",
  "title": "Two Sum II",
  "difficulty": "easy",
  "platform": "leetcode",
  "release_date": "2025-03-15",
  "problem_statement": "...",
  "starter_code": "def two_sum(nums, target): ...",
  "public_tests": [{"input": "...", "output": "..."}],
  "private_tests": [...]
}
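
To make the date window concrete, here is a minimal sketch of how records shaped like the one above could be filtered by release_date. It is not the adapter's actual implementation: the helper names `in_window` and `filter_records` are hypothetical, and treating end_date as inclusive is an assumption.

```python
import json
from datetime import date
from typing import Iterable, Iterator, Optional


def in_window(record: dict, start: date, end: Optional[date] = None) -> bool:
    """True if release_date falls in [start, end]; inclusive bounds are an assumption."""
    released = date.fromisoformat(record["release_date"])
    return released >= start and (end is None or released <= end)


def filter_records(
    lines: Iterable[str], start: date, end: Optional[date] = None
) -> Iterator[dict]:
    """Parse JSONL lines and yield only the records released inside the window."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the file
        record = json.loads(line)
        if in_window(record, start, end):
            yield record


# Illustrative records only; real entries carry the full schema shown above.
sample_lines = [
    '{"problem_id": "a", "difficulty": "easy", "release_date": "2024-11-30"}',
    '{"problem_id": "b", "difficulty": "hard", "release_date": "2025-03-15"}',
    '{"problem_id": "c", "difficulty": "medium", "release_date": "2025-07-01"}',
]
kept = list(filter_records(sample_lines, date(2025, 1, 1), date(2025, 6, 30)))
print([r["problem_id"] for r in kept])  # ['b']
```

With a real file, the same function can be fed an open file handle, for example `filter_records(open("livecodebench.jsonl"), date(2025, 1, 1))`.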

Public and private test cases are run with a 30-second per-task timeout. A solution counts as passing only if both sets pass. The adapter reports per-difficulty breakdowns (easy, medium, hard).
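
The pass rule and the per-difficulty breakdown can be sketched as follows. This is an illustration under stated assumptions rather than the adapter's code: `solves` and `pass_rate_by_difficulty` are hypothetical names, outputs are compared as whitespace-stripped strings, and the 30-second timeout is left out for brevity.

```python
from collections import defaultdict
from typing import Callable, Dict, List


def solves(run: Callable[[str], str], record: dict) -> bool:
    """A problem is solved only if every public AND private test passes."""
    tests = record["public_tests"] + record["private_tests"]
    # Assumption: outputs are compared as stripped strings.
    return all(run(t["input"]).strip() == t["output"].strip() for t in tests)


def pass_rate_by_difficulty(records: List[dict], solved: List[bool]) -> Dict[str, float]:
    """Fraction of solved problems, grouped by the record's difficulty field."""
    totals: Dict[str, int] = defaultdict(int)
    passed: Dict[str, int] = defaultdict(int)
    for record, ok in zip(records, solved):
        totals[record["difficulty"]] += 1
        passed[record["difficulty"]] += int(ok)
    return {d: passed[d] / totals[d] for d in totals}


# Toy candidate solution: doubles an integer read from the input string.
def candidate(stdin: str) -> str:
    return str(int(stdin) * 2)


records = [
    {"difficulty": "easy",
     "public_tests": [{"input": "2", "output": "4"}],
     "private_tests": [{"input": "3", "output": "6"}]},
    {"difficulty": "hard",
     "public_tests": [{"input": "1", "output": "2"}],
     "private_tests": [{"input": "5", "output": "11"}]},  # private test fails
]
solved = [solves(candidate, r) for r in records]
print(pass_rate_by_difficulty(records, solved))  # {'easy': 1.0, 'hard': 0.0}
```

The second record shows why both sets matter: the candidate passes the public test but fails the private one, so the problem counts as unsolved.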

  • Pick start_date carefully. A start date before your model’s cutoff invalidates the contamination guarantee.
  • The dataset is updated monthly. Re-pull periodically to keep the window fresh.
  • Private tests are not redistributable in raw form. Load them via the upstream JSONL releases on the dataset’s GitHub releases page.
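
One way to guard against the first pitfall is to check the window against the model's cutoff before constructing the benchmark. The helper below is a hypothetical sketch, not part of the adapter, and the cutoff date in the example is a made-up placeholder.

```python
from datetime import date


def check_start_date(start_date: str, training_cutoff: str) -> date:
    """Return the parsed start date, or raise if it is not strictly after the cutoff."""
    start = date.fromisoformat(start_date)
    cutoff = date.fromisoformat(training_cutoff)
    if start <= cutoff:
        raise ValueError(
            f"start_date {start} is not after training cutoff {cutoff}; "
            "problems in this window may have been seen during pretraining"
        )
    return start


# Placeholder cutoff for illustration; substitute your model's documented cutoff.
print(check_start_date("2025-01-01", "2024-10-31"))  # 2025-01-01
```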