Overview

Candidates ship an agent that decomposes ambiguous prompts, calls the right tools in the right order, and recovers gracefully from tool failures. Submissions are graded on plan quality, completion rate, and cost.

Questions: 80
Domains: 4
Duration: 6–10 hours
Slug: agent-orchestration

Skills assessed

Tool selection · Planning · State management · Error recovery · Latency optimization · Cost control

Brief

Agent Orchestration Challenge

Objective

Build an agent that can decompose ambiguous enterprise prompts, call the right tools in the right order, and recover gracefully from tool failures.

Your agent should:

  • Decompose each task into an explicit ordered plan of steps.
  • Pick the correct tools from the provided catalog (tools.json).
  • Recover from flaky tools (one of the mock tools intentionally fails ~30% of the time).
  • Stop early when the task is complete instead of over-calling tools.
  • Track and minimise tool calls and token usage — both are scored.
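One way to handle the flaky tool is a bounded retry with exponential backoff. The sketch below is illustrative only: `ToolError`, `call_with_retry`, and `make_flaky_tool` are hypothetical names, and the brief does not say how the mock tools signal failure or whether retried calls count toward tool_calls_count.

```python
import random
import time


class ToolError(Exception):
    """Raised by a tool wrapper when a call fails (hypothetical name)."""


def call_with_retry(tool_fn, args, max_attempts=4, base_delay=0.05, sleep=time.sleep):
    """Call tool_fn(**args), retrying on ToolError with exponential backoff.

    Returns (result, attempts). Re-raises the last ToolError once
    max_attempts is exhausted so the caller can re-plan.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return tool_fn(**args), attempt
        except ToolError:
            if attempt == max_attempts:
                raise
            # Backoff doubles each attempt: base, 2*base, 4*base, ...
            sleep(base_delay * 2 ** (attempt - 1))


def make_flaky_tool(fail_rate=0.3, seed=0):
    """Stand-in for the flaky mock tool; fail_rate mirrors the ~30% in the brief."""
    rng = random.Random(seed)

    def flaky_lookup(email):
        if rng.random() < fail_rate:
            raise ToolError("transient failure")
        return {"account_id": "acc_92", "email": email}

    return flaky_lookup
```

Keeping `max_attempts` small bounds the extra calls a retry loop can add, which matters because tool calls are scored.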

Required CLI

Your submission must support:

python run.py \
  --tasks  /data/tasks.json \
  --tools  /data/tools.json \
  --output /tmp/results.json
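A minimal sketch of a run.py skeleton that accepts these three flags, using the standard argparse module. `solve_task` is a hypothetical placeholder for the agent loop; the starter agent in the package may be organised differently.

```python
import argparse
import json


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Agent orchestration entrypoint")
    parser.add_argument("--tasks", required=True, help="path to tasks.json")
    parser.add_argument("--tools", required=True, help="path to tools.json")
    parser.add_argument("--output", required=True, help="path to write results JSON")
    return parser.parse_args(argv)


def solve_task(task, tools):
    """Placeholder for the agent loop (hypothetical); returns one results entry."""
    return {
        "id": task["id"],
        "plan": [],
        "tool_calls": [],
        "final_answer": "",
        "tool_calls_count": 0,
    }


def main(argv=None):
    args = parse_args(argv)
    with open(args.tasks, encoding="utf-8") as f:
        tasks = json.load(f)
    with open(args.tools, encoding="utf-8") as f:
        tools = json.load(f)
    # Iterate over whatever is in the file; no task count is assumed.
    results = [solve_task(task, tools) for task in tasks]
    with open(args.output, "w", encoding="utf-8") as f:
        json.dump(results, f, indent=2)
    return 0
```

In a real run.py, `main()` would be called under an `if __name__ == "__main__":` guard.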

Input format

tasks.json is a JSON array:

[
  {
    "id": "task_001",
    "domain": "support",
    "prompt": "A customer is asking why their last invoice is higher than expected. Their email is alex@acme.test."
  }
]

tools.json is a JSON array describing each tool:

[
  {
    "name": "lookup_account",
    "description": "Look up an account by email. Returns account_id, plan, status.",
    "args_schema": { "email": "string" }
  }
]
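Checking a proposed call against the catalog before executing it can avoid wasted tool calls. The sketch below assumes that args_schema maps each argument name to a type name and that every listed argument is required; only "string" appears in the brief, so the other entries in `TYPE_MAP` are guesses to verify against the real tools.json. The helper names are hypothetical.

```python
import json

# Assumed mapping from schema type names to Python types; only "string"
# is confirmed by the brief.
TYPE_MAP = {"string": str, "integer": int, "number": (int, float), "boolean": bool}


def index_tools(tools):
    """Map tool name -> tool spec for quick lookup."""
    return {tool["name"]: tool for tool in tools}


def validate_call(catalog, tool_name, args):
    """Return a list of problems with a proposed call; empty means it looks valid."""
    if tool_name not in catalog:
        return [f"unknown tool: {tool_name}"]
    schema = catalog[tool_name].get("args_schema", {})
    problems = []
    for name, type_name in schema.items():
        if name not in args:
            problems.append(f"missing arg: {name}")
            continue
        expected = TYPE_MAP.get(type_name)
        # Unknown type names are skipped rather than rejected.
        if expected is not None and not isinstance(args[name], expected):
            problems.append(f"arg {name} should be {type_name}")
    problems.extend(f"unexpected arg: {name}" for name in args if name not in schema)
    return problems
```

A non-empty result can be fed back to the model as a correction hint instead of spending a tool call on a request that is likely to fail.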

Output format

Write a JSON array to /tmp/results.json. One entry per task, in any order.

[
  {
    "id": "task_001",
    "plan": ["lookup_account", "fetch_invoice", "summarize"],
    "tool_calls": [
      { "tool": "lookup_account", "args": { "email": "alex@acme.test" } },
      { "tool": "fetch_invoice",  "args": { "account_id": "acc_92" } }
    ],
    "final_answer": "Their May invoice was $312 vs $214 in April because of overage on...",
    "latency_ms": 1840,
    "tokens_used": 612,
    "tool_calls_count": 3
  }
]
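The entries can be assembled and written with a small helper like the sketch below. Two assumptions are baked in: tool_calls_count is derived from the length of tool_calls (the brief does not define how the two relate), and the keys in `REQUIRED_KEYS` are treated as mandatory. The function names are hypothetical.

```python
import json
import os
import tempfile

# Assumed set of mandatory keys; adjust if the grader expects more or fewer.
REQUIRED_KEYS = ("id", "plan", "tool_calls", "final_answer", "tool_calls_count")


def make_entry(task_id, plan, tool_calls, final_answer):
    """Assemble one results entry; tool_calls_count is derived (assumption)."""
    return {
        "id": task_id,
        "plan": list(plan),
        "tool_calls": [{"tool": c["tool"], "args": c["args"]} for c in tool_calls],
        "final_answer": final_answer,
        "tool_calls_count": len(tool_calls),
    }


def write_results(entries, output_path):
    """Check each entry, then write the array via a temp file and rename."""
    seen = set()
    for entry in entries:
        missing = [k for k in REQUIRED_KEYS if k not in entry]
        if missing:
            raise ValueError(f"{entry.get('id')}: missing {missing}")
        if entry["id"] in seen:
            raise ValueError(f"duplicate id: {entry['id']}")
        seen.add(entry["id"])
    directory = os.path.dirname(os.path.abspath(output_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(entries, f, indent=2)
    # Rename last so a crash mid-write never leaves a truncated results file.
    os.replace(tmp_path, output_path)
```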

Scoring dimensions

  • Plan quality — does the plan match a competent operator's plan?
  • Completion rate — fraction of tasks where final_answer resolves the prompt.
  • Tool selection — fraction of tool calls that were appropriate vs. wasted.
  • Recovery — does the agent retry / re-plan after a flaky-tool failure?
  • Cost: tool_calls_count and tokens_used are penalised at the margin.

Model gateway

The sandbox exposes a model gateway via the following env vars:

MODEL_GATEWAY_URL=https://...
MODEL_GATEWAY_KEY=...
CHAT_MODEL=standard-chat

Use it as you would a regular OpenAI-style chat completion endpoint. Do not call external model providers directly — those calls are blocked by the sandbox network policy.
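As a rough sketch, a request to the gateway could be built from those variables with only the standard library. The `/chat/completions` path suffix and the Bearer authorization header are assumptions based on the usual OpenAI-style convention, not details confirmed by the brief; `build_chat_request` and `send` are hypothetical helpers.

```python
import json
import os
import urllib.request


def build_chat_request(messages, env=os.environ):
    """Build (but do not send) an OpenAI-style chat completion request."""
    base_url = env["MODEL_GATEWAY_URL"].rstrip("/")
    body = {"model": env.get("CHAT_MODEL", "standard-chat"), "messages": messages}
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",  # assumed path
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {env['MODEL_GATEWAY_KEY']}",  # assumed auth scheme
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send(request, timeout=30):
    """Send the request and return the first choice's text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    return payload["choices"][0]["message"]["content"]
```

Separating construction from sending keeps the request logic testable offline.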

Local development

The tasks.json shipped with this package contains 5 sample tasks — one per major domain plus one edge case (GDPR). They match the exact shape of the hidden test set used at evaluation time. The hidden set is roughly 4× larger (~20 tasks) and is mounted into the sandbox at /data/tasks.json automatically — your run.py should not assume any particular task count.

Domains in the hidden set:

  • support — customer support triage
  • ops — internal ops automation (filing, scheduling, lookups)
  • data — light analytical tasks over fixture data
  • growth — outbound / lifecycle copy generation

Use the samples to drive your loop locally, then submit. We grade on the full hidden set, not just the samples.

Models & cost — how it works

You don't need an OpenAI API key to submit. When your code runs in our sandbox, we set OPENAI_API_KEY and OPENAI_BASE_URL for you so the standard openai SDK calls our metering proxy. The proxy forwards to OpenAI using the operator's key, records every call's real prompt_tokens / completion_tokens / latency, and the cost panel on your report is computed from those host-measured numbers — not anything you self-report.

What this means for you:

  • Just write from openai import OpenAI; client = OpenAI() — the SDK picks up the env vars automatically. No key setup on your end.
  • You don't need to populate tokens_used, latency_ms, or model in results.json. Those fields are now ignored. (You may still log them for your own debugging.)
  • Each submission has a hard upstream budget of $3.00 USD by default. If you blow through it, further OpenAI calls return HTTP 402 and the rest of your run will fail. Pick efficient models.
  • Model choice is part of what we evaluate. Solving the task on a small, cheap model demonstrates more skill than reaching for the largest one. Suggested defaults:
      • gpt-4o-mini: best price/quality default for almost everything.
      • gpt-4.1-mini: budget alternative when reasoning depth matters.
      • Larger models (gpt-4o, gpt-4.1, o1, o3-mini): only reach for these when you can justify it in your README.
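Because calls return HTTP 402 once the budget is spent, it may be worth making the run degrade rather than crash, so a well-formed results file is still written. In the sketch below, `BudgetExhausted` and `run_all` are hypothetical names; the code that calls the model is assumed to raise `BudgetExhausted` when it sees a 402.

```python
class BudgetExhausted(Exception):
    """Hypothetical wrapper error for an HTTP 402 from the metering proxy."""


def run_all(tasks, solve_task):
    """Solve tasks in order; once the budget is gone, stub the remainder.

    solve_task(task) returns a results entry and is expected to raise
    BudgetExhausted when a model call returns HTTP 402.
    """
    results = []
    exhausted = False
    for task in tasks:
        if not exhausted:
            try:
                results.append(solve_task(task))
                continue
            except BudgetExhausted:
                exhausted = True
        # No model calls are possible any more: emit a well-formed empty entry.
        results.append({
            "id": task["id"],
            "plan": [],
            "tool_calls": [],
            "final_answer": "",
            "tool_calls_count": 0,
        })
    return results
```

With the openai SDK, one plausible way to raise `BudgetExhausted` is to catch `openai.APIStatusError` and check whether its `status_code` is 402.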

Want to test locally before submitting? Use your own OpenAI key and the real https://api.openai.com base URL — your local runs are billed to you, not us. The proxy only kicks in when your code runs inside our evaluation sandbox.

Repository contract

Your zip must match the following layout.

repo.zip
├── run.py            # required entrypoint
├── README.md         # required
├── requirements.txt  # optional
└── ...               # your agent implementation

Entrypoint

run.py must accept the following arguments.

python run.py \
  --tasks  /data/tasks.json \
  --tools  /data/tools.json \
  --output /tmp/results.json

Output schema

Write the result to /tmp/results.json.

[
  {
    "id": "task_001",
    "plan": ["lookup_account", "fetch_invoice", "summarize"],
    "tool_calls": [{"tool": "lookup_account", "args": {...}}],
    "final_answer": "...",
    "latency_ms": 1840,
    "tokens_used": 612,
    "tool_calls_count": 3
  }
]

Challenge package

20 sample tasks · 4 domains · 8 mock tools · starter agent

Download agent-orchestration_candidate_package.zip

Includes

  • 20 sample tasks across 4 domains
  • 8 mock tools with JSON specs (and a flaky one)
  • Starter agent loop & Dockerfile
  • requirements.txt

Hidden test data and ground-truth answers are not included. Your code runs against them inside the sandbox.

Submit

Max 20MB · must contain run.py and README.md
Prompt (optional): We're not against AI-assisted coding; we expect it. We just want to see how well you direct an AI. Your prompt is shown alongside your submission and passed to the LLM judge as extra context, so well-structured prompts with clear intent, constraints, and evaluation criteria can lift your score. Leave blank if you didn't use AI.
Hiring code (optional): Only fill this in if a hiring team gave you a code. Submissions with a code go to that company privately and are not added to the public talent pool.

Validation

  • .zip archive only
  • run.py at root or top-level dir
  • README.md present
  • No hidden evaluation files
  • Size ≤ 20MB
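Most of these checks can be approximated locally before uploading. The sketch below is not the platform's validator: it reads "20MB" as 20 MiB, accepts README.md at the root or one directory down (the brief only says "present"), and cannot check the rule about hidden evaluation files. `check_submission` is a hypothetical helper.

```python
import os
import zipfile

MAX_BYTES = 20 * 1024 * 1024  # "20MB" read as 20 MiB; the brief does not say which


def check_submission(zip_path):
    """Return a list of validation problems; empty means the local checks pass."""
    if not zip_path.lower().endswith(".zip") or not zipfile.is_zipfile(zip_path):
        return ["not a .zip archive"]
    problems = []
    if os.path.getsize(zip_path) > MAX_BYTES:
        problems.append("archive larger than 20MB")
    with zipfile.ZipFile(zip_path) as archive:
        names = [n for n in archive.namelist() if not n.endswith("/")]

    def present(filename):
        # Accept "run.py" at the root or "somedir/run.py" one level down.
        return any(
            n == filename or (n.count("/") == 1 and n.endswith("/" + filename))
            for n in names
        )

    if not present("run.py"):
        problems.append("run.py not found at root or top-level dir")
    if not present("README.md"):
        problems.append("README.md not found")
    return problems
```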

Tip: register a profile first — your submission auto-attaches to it when emails match.