SWE-Lancer
SWE-Lancer (freelance bug fixes)
SWE-Lancer (OpenAI, 2025) evaluates coding agents on 1,400+ real-world
freelance Upwork tickets totaling over $1M in payouts. Two task categories:

- `ic_swe`: write a fix end-to-end against a Playwright test harness.
- `swe_manager`: pick the best of N proposed fixes (multiple choice).

The headline metric is the dollar-weighted resolve rate:
`sum(payout for passed tasks) / sum(all payouts)`.
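
To make the formula concrete, here is a minimal standalone sketch of the arithmetic. It is not the chimera implementation: the helper name `dollar_weighted_rate`, the `payouts` mapping, and the sample task ids and dollar amounts are all made up for illustration.

```python
def dollar_weighted_rate(payouts, results):
    """Return sum(payout for passed tasks) / sum(all payouts).

    payouts: dict mapping task id -> payout in dollars (hypothetical shape)
    results: iterable of (task_id, passed) pairs
    """
    total = sum(payouts.values())
    if total == 0:
        return 0.0  # avoid dividing by zero on an empty task set
    earned = sum(payouts[task_id] for task_id, passed in results if passed)
    return earned / total


# Illustrative numbers only: two of three tasks pass.
payouts = {"sl_1": 250.0, "sl_2": 750.0, "sl_3": 1000.0}
results = [("sl_1", True), ("sl_2", False), ("sl_3", True)]

rate = dollar_weighted_rate(payouts, results)
print(f"{rate:.3f}")  # (250 + 1000) / 2000 = 0.625
```

The unweighted pass rate for the same results would be 2/3. The weighted figure is lower here because the failed task carries a larger payout than one of the passed tasks.
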
References:
- Paper: arXiv:2502.12115
- GitHub: https://github.com/openai/SWELancer-Benchmark
Status: SCAFFOLD
| Surface | State |
|---|---|
| `SWELancerTask` dataclass (id, payout, category, choices, …) | DONE |
| Loader (JSON / JSONL, category + min_payout filters, limit) | DONE |
| `grade_manager_choice(task, idx)` for `swe_manager` | DONE |
| `dollar_weighted_pass_rate(results)` headline metric | DONE |
| Live grading for `ic_swe` (Docker + Playwright harness) | `NotImplementedError` (follow-up) |
| Discoverable via `chimera eval --benchmark swe-lancer` | DONE |

`evaluate(...)` raises `NotImplementedError` for the live path so that misuse
fails loudly. For `swe_manager` tasks, call `grade_manager_choice` directly;
no environment is required.
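
Grading a multiple-choice task needs no sandbox, which the following sketch illustrates. It uses a hypothetical stand-in dataclass, not chimera's `SWELancerTask`. The `correct_choice` field name and the `grade_choice` helper are assumptions; the real `grade_manager_choice(task, idx)` may store and compare the answer differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ManagerTask:
    """Hypothetical stand-in for a swe_manager task record."""
    id: str
    payout: float
    choices: tuple[str, ...]
    correct_choice: int  # assumed field name, not confirmed by the source


def grade_choice(task: ManagerTask, idx: int) -> bool:
    """Return True when the selected proposal index is the accepted one."""
    if not 0 <= idx < len(task.choices):
        raise IndexError(f"choice {idx} is out of range for task {task.id}")
    return idx == task.correct_choice


# Illustrative task with made-up proposals.
task = ManagerTask(
    id="sl_mgr_1",
    payout=500.0,
    choices=("proposal A", "proposal B", "proposal C"),
    correct_choice=1,
)
print(grade_choice(task, 1))  # True
print(grade_choice(task, 2))  # False
```

Because grading is a pure comparison, the resulting `(task_id, passed)` pairs can be passed directly to the headline metric.
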
Quick start
```python
from chimera.eval.benchmarks import SWELancer

bench = SWELancer(
    dataset_path="swe-lancer.jsonl",
    category="ic_swe",
    min_payout=100.0,  # focus on $100+ tickets
)
print(bench.name())  # "swe-lancer-ic_swe"

# Pre-graded results → headline metric
rate = bench.dollar_weighted_pass_rate([("sl_1", True), ("sl_2", False)])
```

Live integration plan
- Reuse `chimera.env.docker.DockerEnvironment` to provision the upstream harness image per task.
- Write a `SWELancerRunner` analogous to `MultiSWEBench` runners that knows how to invoke the Playwright entrypoint and parse the JUnit-XML result.
- Wire `evaluate` to call the runner (replace the `NotImplementedError`).
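
The result-parsing step of the runner could look roughly like the sketch below. It assumes the harness emits conventional JUnit XML (`<testcase>` elements with `<failure>` or `<error>` children on failure). The helper name `junit_passed` and the sample reports are hypothetical, and the actual upstream output format should be checked before relying on this.

```python
import xml.etree.ElementTree as ET


def junit_passed(xml_text: str) -> bool:
    """Return True only if the report has test cases and none failed or errored."""
    root = ET.fromstring(xml_text)
    cases = list(root.iter("testcase"))
    if not cases:
        return False  # an empty report is treated as a failed run
    for case in cases:
        # Conventional JUnit XML marks failing cases with these child elements.
        if case.find("failure") is not None or case.find("error") is not None:
            return False
    return True


# Made-up sample reports for illustration.
PASSING = """
<testsuites>
  <testsuite name="e2e" tests="1" failures="0">
    <testcase name="fix resolves the ticket" time="3.2"/>
  </testsuite>
</testsuites>
"""

FAILING = """
<testsuites>
  <testsuite name="e2e" tests="1" failures="1">
    <testcase name="fix resolves the ticket" time="3.2">
      <failure message="element not found"/>
    </testcase>
  </testsuite>
</testsuites>
"""

print(junit_passed(PASSING))  # True
print(junit_passed(FAILING))  # False
```

Treating an empty report as a failure is a deliberate choice in this sketch: a harness that crashes before running any test should not count as a resolved task.
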