tau-bench

tau-bench (Sierra Research, 2024) tests whether an agent can carry out multi-turn business tasks by calling tools against simulated APIs. Each task is a conversation with a simulated user; grading compares the database state at end-of-conversation to a target state.

Status: TODO (adapter wired; not benchmarked yet)

| Run | Score |
|---|---|
| Chimera | NOT RUN (depends on upstream tau-bench install for grading) |

| Domain | Tasks | What it tests |
|---|---|---|
| airline | ~50 | Bookings, cancellations, rebooking. |
| retail | ~115 | Orders, returns, exchanges. |
| telecom | ~50 | Plan changes, troubleshooting flows. |
| banking | ~70 | Transfers, disputes, statements. |
| mock | ~5 | Minimal test domain; no upstream install needed. |

Shell:

    # Install upstream (for the simulated user + DB)
    pip install tau-bench

Python:

    from chimera.eval.benchmarks import TauBench
    from chimera.eval.harness import Harness

    # Load the retail domain from the upstream dataset
    bench = TauBench(domain="retail", dataset_path="tau-bench/retail")
    print(bench.name())        # "tau-bench:retail"
    print(len(bench.tasks()))  # ~115

    # my_agent is the agent under test, constructed elsewhere
    harness = Harness(agent=my_agent, benchmark=bench)
    results = harness.run()

For a sanity check without the upstream install, use domain="mock" — the adapter ships a self-contained 5-task mock domain.

Each task is a record of the following shape:

    {
      "task_id": "retail-0",
      "domain": "retail",
      "instruction": "I want to return order #123 ...",
      "initial_db_state": "...",
      "target_db_state": "...",
      "allowed_tools": ["search_order", "issue_refund", ...]
    }
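
As a rough illustration of how a record with these fields might be loaded and checked before a run, here is a minimal sketch. The `TaskRecord` dataclass and `parse_task` helper are hypothetical names introduced for this example; they are not part of the Chimera adapter or of upstream tau-bench, and the real loader may differ.

```python
from __future__ import annotations

import json
from dataclasses import dataclass
from typing import Any

# Field names taken from the task record shown above.
REQUIRED_FIELDS = (
    "task_id",
    "domain",
    "instruction",
    "initial_db_state",
    "target_db_state",
    "allowed_tools",
)


@dataclass(frozen=True)
class TaskRecord:
    """Hypothetical container mirroring the task record fields."""

    task_id: str
    domain: str
    instruction: str
    initial_db_state: Any
    target_db_state: Any
    allowed_tools: tuple[str, ...]


def parse_task(raw: str) -> TaskRecord:
    """Parse one JSON task record, rejecting records with missing fields."""
    data = json.loads(raw)
    missing = [name for name in REQUIRED_FIELDS if name not in data]
    if missing:
        raise ValueError(f"task record is missing fields: {missing}")
    return TaskRecord(
        task_id=data["task_id"],
        domain=data["domain"],
        instruction=data["instruction"],
        initial_db_state=data["initial_db_state"],
        target_db_state=data["target_db_state"],
        allowed_tools=tuple(data["allowed_tools"]),
    )


# Illustrative record; the values are made up for the example.
example = json.dumps({
    "task_id": "mock-0",
    "domain": "mock",
    "instruction": "I want to return my order.",
    "initial_db_state": {"orders": {"1": {"status": "delivered"}}},
    "target_db_state": {"orders": {"1": {"status": "returned"}}},
    "allowed_tools": ["search_order", "issue_refund"],
})
task = parse_task(example)
print(task.task_id, task.allowed_tools)
```

Validating up front means a malformed record fails before any agent or simulated-user calls are spent on it.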

After the conversation ends, the simulated DB is compared to target_db_state field-by-field. Partial credit is reported per field; a task passes only when the comparison is exact.
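
The grading rule above can be sketched as follows. This is an assumption-laden illustration, not the adapter's actual grader: it assumes the DB state is a nested dictionary, treats each leaf value as one "field", and counts a field present on only one side as a mismatch. The names `flatten` and `compare_states` are hypothetical.

```python
from typing import Any, Dict


def flatten(state: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Flatten nested dicts into dotted paths, e.g. {"a": {"b": 1}} -> {"a.b": 1}."""
    flat: Dict[str, Any] = {}
    for key, value in state.items():
        path = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict) and value:
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def compare_states(final: Dict[str, Any], target: Dict[str, Any]) -> Dict[str, Any]:
    """Compare two DB states field by field.

    Returns per-field matches, the fraction of matching fields (partial
    credit), and a pass flag that is True only when every field matches.
    """
    final_flat, target_flat = flatten(final), flatten(target)
    all_fields = sorted(set(final_flat) | set(target_flat))
    missing = object()  # sentinel: a field absent on one side never matches
    per_field = {
        field: final_flat.get(field, missing) == target_flat.get(field, missing)
        for field in all_fields
    }
    matched = sum(per_field.values())
    partial = matched / len(all_fields) if all_fields else 1.0
    return {
        "per_field": per_field,
        "partial_credit": partial,
        "passed": all(per_field.values()),
    }


# Illustrative states: the status is right but the refund amount is not.
target = {"orders": {"123": {"status": "returned", "refund": 50}}}
final = {"orders": {"123": {"status": "returned", "refund": 0}}}
result = compare_states(final, target)
print(result["per_field"])
print(result["partial_credit"], result["passed"])
```

Here one of two fields matches, so partial credit is 0.5 while the task still fails, which mirrors the "exact match to pass" rule.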

  • The simulated user is a separate LLM call per turn — total cost is (agent turns + user turns) × per-turn cost. Budget accordingly.
  • Tool definitions must match the domain’s allowed set exactly. The adapter loads them from the upstream package.
  • Outcomes are stochastic: a 5-task mock run is enough for a smoke test; for headline numbers, run each task 3× and average the results.
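
The budgeting and averaging notes above can be turned into a quick back-of-the-envelope calculation. The sketch below is illustrative only: the function names are hypothetical, and the turn counts and per-turn price in the example are placeholders, not measured values.

```python
from statistics import mean
from typing import List


def estimate_llm_calls(n_tasks: int, agent_turns: int, user_turns: int,
                       repeats: int = 3) -> int:
    """Total LLM calls: every agent turn and every simulated-user turn is one call."""
    return n_tasks * repeats * (agent_turns + user_turns)


def estimate_cost(n_tasks: int, agent_turns: int, user_turns: int,
                  cost_per_turn: float, repeats: int = 3) -> float:
    """Total cost = (agent turns + user turns) x per-turn cost, over all tasks and repeats."""
    return estimate_llm_calls(n_tasks, agent_turns, user_turns, repeats) * cost_per_turn


def average_pass_rate(runs: List[List[bool]]) -> float:
    """Average pass rate where runs[i][j] says whether task j passed on repeat i."""
    if not runs or not runs[0]:
        raise ValueError("need at least one run with at least one task")
    n_tasks = len(runs[0])
    if any(len(run) != n_tasks for run in runs):
        raise ValueError("every run must cover the same tasks")
    # Per-task pass fraction across repeats, then the mean over tasks.
    per_task = [
        mean([1.0 if run[j] else 0.0 for run in runs])
        for j in range(n_tasks)
    ]
    return mean(per_task)


# Placeholder numbers: 115 tasks, 10 turns per side, 3 repeats.
print(estimate_llm_calls(115, agent_turns=10, user_turns=10, repeats=3))

# Three repeats over two tasks: task 0 always passes, task 1 passes once.
runs = [[True, False], [True, True], [True, False]]
print(round(average_pass_rate(runs), 3))
```

Running the estimate before a full benchmark run makes it easier to decide whether three repeats per task fit the budget.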