MultiSWE-bench

MultiSWE-bench extends SWE-bench beyond Python. Each instance carries a language field and is graded with the language’s native test runner — pytest for Python, mvn test for Java, go test for Go, npm test for JavaScript / TypeScript, cargo test for Rust.

References:

| File | Purpose |
| --- | --- |
| chimera/eval/benchmarks/multi_swe_bench.py | MultiSWEBench, MultiSWEBenchInstance |
| chimera/eval/benchmarks/runners/__init__.py | RUNNERS registry, get_runner() |
| chimera/eval/benchmarks/runners/base.py | LanguageRunner, RunnerResult, SkipReason |
| chimera/eval/benchmarks/runners/{python,java,go,javascript,rust}_runner.py | One frozen runner per language |
| tests/eval/test_multi_swe_bench.py | 41 tests (loading, dispatch, skip patterns) |

Usage:

    from chimera.eval.benchmarks import MultiSWEBench

    # Load a JSON or JSON-lines dataset
    bench = MultiSWEBench(dataset_path="multi-swe-bench.jsonl")

    # Filter to a single language
    go_bench = MultiSWEBench(dataset_path="multi-swe-bench.jsonl", language="go")
    print(go_bench.name())  # "multi-swe-bench-go"
    print(go_bench.language_breakdown())  # {"go": N}
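
The loader accepts a JSON array, a JSON-lines file, or an object that wraps the list under "tasks" or "instances". The sketch below shows one way those shapes could be normalized and how a per-language breakdown could be computed. It is not the MultiSWEBench implementation: `load_instances`, `filter_language`, and `language_breakdown` are hypothetical stand-alone helpers, and the record keys other than `language` are made up for illustration.

```python
import json
from collections import Counter
from typing import Any, Optional


def load_instances(text: str) -> list[dict[str, Any]]:
    """Parse dataset text in any of the accepted shapes (hypothetical helper)."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        # Not a single JSON document: treat it as JSON-lines, one record per line.
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    if isinstance(data, list):
        return data
    if isinstance(data, dict):
        for key in ("tasks", "instances"):
            if isinstance(data.get(key), list):
                return data[key]
        # A lone JSON object is treated as a one-record JSON-lines file.
        return [data]
    raise ValueError("unsupported dataset shape")


def filter_language(instances: list[dict[str, Any]], language: Optional[str]) -> list[dict[str, Any]]:
    """Keep only instances whose language field matches; None keeps everything."""
    if language is None:
        return instances
    return [inst for inst in instances if inst.get("language") == language]


def language_breakdown(instances: list[dict[str, Any]]) -> dict[str, int]:
    """Count instances per language."""
    return dict(Counter(inst.get("language", "unknown") for inst in instances))


sample = (
    '{"instance_id": "a", "language": "go"}\n'
    '{"instance_id": "b", "language": "rust"}\n'
)
print(language_breakdown(filter_language(load_instances(sample), "go")))
```

Trying a whole-document parse first and falling back to line-by-line parsing keeps the caller from having to declare the file format up front.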

Supported languages:

| Language | Test command | Toolchain probe |
| --- | --- | --- |
| Python | pytest -x --no-header -rN | python --version |
| Java | mvn -q -B test | mvn --version |
| Go | go test ./... | go version |
| JavaScript / TypeScript | npm test --silent | node --version |
| Rust | cargo test --quiet | cargo --version |

Aliases: js → javascript, ts → typescript, golang → go. The canonical names are returned by MultiSWEBench.supported_languages().
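A minimal sketch of how alias resolution could work, assuming names are normalized before lookup. The `ALIASES` table mirrors the aliases listed above. The `SUPPORTED` tuple, the `canonical_language` helper, and the case-insensitive matching are assumptions for illustration; the real canonical list is whatever MultiSWEBench.supported_languages() returns.

```python
# Alias table taken from the list above.
ALIASES = {"js": "javascript", "ts": "typescript", "golang": "go"}

# Assumed canonical names; the actual list comes from
# MultiSWEBench.supported_languages().
SUPPORTED = ("python", "java", "go", "javascript", "typescript", "rust")


def canonical_language(name: str) -> str:
    """Map an alias or canonical name to its canonical form (hypothetical helper)."""
    key = name.strip().lower()  # case-insensitive matching is an assumption
    key = ALIASES.get(key, key)
    if key not in SUPPORTED:
        raise ValueError(f"unsupported language: {name!r}")
    return key


print(canonical_language("golang"))  # go
print(canonical_language("TS"))      # typescript
```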

When the language toolchain isn’t installed in the execution environment, the runner short-circuits and records the skip reason instead of raising:

    bench.evaluate(task, agent_output="...", env=env)
    # False
    bench.last_skip_reasons
    # [("py__demo__1", "python", "toolchain_missing")]
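Because each entry is an (instance id, language, reason) tuple, the list is easy to aggregate after a run. The sketch below counts skips per language and reason. `summarize_skips` is a hypothetical helper, and every sample entry except the first is invented for illustration.

```python
from collections import Counter


def summarize_skips(skips):
    """Count skipped instances per (language, reason) pair."""
    return Counter((language, reason) for _instance_id, language, reason in skips)


# Sample data in the shape of bench.last_skip_reasons; the last two
# entries are made up.
skips = [
    ("py__demo__1", "python", "toolchain_missing"),
    ("go__demo__7", "go", "toolchain_missing"),
    ("py__demo__2", "python", "patch_failed"),
]

for (language, reason), count in sorted(summarize_skips(skips).items()):
    print(f"{language:<8} {reason:<20} {count}")
```

A summary like this makes it obvious when a low score comes from a missing toolchain rather than from failing patches.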

Skip reasons (chimera.eval.benchmarks.runners.base.SkipReason):

  • no_env: env was None.
  • toolchain_missing: the toolchain probe failed.
  • patch_failed: git apply of the test patch failed.
  • execution_error: the test command itself raised or returned a non-numeric status.

Use MultiSWEBench.evaluate_detailed(task, env) if you want the full RunnerResult (with stdout, stderr, exit_code).
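The short-circuit order described above can be sketched as follows. This is a stand-alone illustration, not the code in runners/base.py. It assumes a hypothetical env interface (a callable that takes a command string and returns an (exit_code, stdout, stderr) tuple), and the `passed` and `skip_reason` fields on `RunnerResult` are assumed names. Only stdout, stderr, and exit_code are confirmed by the text.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional, Tuple


class SkipReason(str, Enum):
    NO_ENV = "no_env"
    TOOLCHAIN_MISSING = "toolchain_missing"
    PATCH_FAILED = "patch_failed"
    EXECUTION_ERROR = "execution_error"


@dataclass(frozen=True)
class RunnerResult:
    passed: bool                              # assumed field name
    skip_reason: Optional[SkipReason] = None  # assumed field name
    stdout: str = ""
    stderr: str = ""
    exit_code: Optional[int] = None


# Assumed env interface: env(command) -> (exit_code, stdout, stderr).
Env = Callable[[str], Tuple[object, str, str]]


def _skip(reason, out="", err="", code=None) -> RunnerResult:
    """Build a non-passing result that records why the run was skipped."""
    return RunnerResult(False, reason, out, err, code if isinstance(code, int) else None)


def run_instance(env: Optional[Env], probe: str, apply_patch: str, test: str) -> RunnerResult:
    """Probe the toolchain, apply the test patch, then run the tests."""
    if env is None:
        return _skip(SkipReason.NO_ENV)

    code, out, err = env(probe)
    if code != 0:
        return _skip(SkipReason.TOOLCHAIN_MISSING, out, err, code)

    code, out, err = env(apply_patch)
    if code != 0:
        return _skip(SkipReason.PATCH_FAILED, out, err, code)

    try:
        code, out, err = env(test)
    except Exception as exc:  # the test command itself raised
        return _skip(SkipReason.EXECUTION_ERROR, err=str(exc))
    if not isinstance(code, int):  # non-numeric status
        return _skip(SkipReason.EXECUTION_ERROR, out, err)

    return RunnerResult(code == 0, None, out, err, code)


# "test.patch" is a placeholder file name.
result = run_instance(None, "go version", "git apply test.patch", "go test ./...")
print(result.skip_reason.value)  # no_env
```

Returning a result object instead of raising lets a benchmark run continue past instances whose toolchain is unavailable while still recording why each one was skipped.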

To add a language:

  1. Create chimera/eval/benchmarks/runners/<lang>_runner.py:

    from chimera.eval.benchmarks.runners.base import LanguageRunner

    KOTLIN_RUNNER = LanguageRunner(
        language="kotlin",
        test_command="gradle test --quiet",
        toolchain_command="gradle --version",
        display_name="Kotlin (Gradle)",
    )

  2. Register it in runners/__init__.py (RUNNERS["kotlin"] = KOTLIN_RUNNER).

  3. Add the language string to SUPPORTED_LANGUAGES in multi_swe_bench.py.

  4. Add a fixture line + dispatch test in test_multi_swe_bench.py.
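The four steps above can be sketched end to end in a single file. `LanguageRunner` here is a stand-in frozen dataclass with the four fields used in step 1, not the class from runners/base.py. The `get_runner` signature, its error behavior, and the "Go" display name are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LanguageRunner:
    """Stand-in for the real class; only these four fields appear in the docs."""
    language: str
    test_command: str
    toolchain_command: str
    display_name: str


# An existing runner, with commands taken from the supported-languages table.
GO_RUNNER = LanguageRunner("go", "go test ./...", "go version", "Go")

# Step 1: define the new runner.
KOTLIN_RUNNER = LanguageRunner(
    language="kotlin",
    test_command="gradle test --quiet",
    toolchain_command="gradle --version",
    display_name="Kotlin (Gradle)",
)

# Step 2: register it.
RUNNERS = {"go": GO_RUNNER}
RUNNERS["kotlin"] = KOTLIN_RUNNER

# Step 3: list the language as supported.
SUPPORTED_LANGUAGES = ("go", "kotlin")


def get_runner(language: str) -> LanguageRunner:
    """Look up the runner for a language (assumed signature and error type)."""
    try:
        return RUNNERS[language]
    except KeyError:
        raise ValueError(f"no runner registered for {language!r}") from None


# Step 4: a dispatch test in the style a pytest suite might use.
def test_kotlin_dispatch():
    runner = get_runner("kotlin")
    assert runner is KOTLIN_RUNNER
    assert runner.test_command == "gradle test --quiet"
    assert "kotlin" in SUPPORTED_LANGUAGES


test_kotlin_dispatch()
```

Freezing the dataclass means a registered runner cannot be mutated by one benchmark run and leak altered commands into the next.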

Status:

| Item | State |
| --- | --- |
| Loader (JSON / JSON-lines / {"tasks": [...]} / {"instances": [...]}) | DONE |
| Per-language dispatch | DONE (5 runners) |
| Skip pattern when toolchain missing | DONE |
| Detailed RunnerResult access | DONE |
| Live run vs upstream dataset | TODO (requires Docker images per language) |
| Score reporting integration with chimera bench CLI | TODO |