Agent Orchestration
Design multi-agent workflows that compose tools to complete enterprise tasks.
Overview
Candidates ship an agent that decomposes ambiguous prompts, calls the right tools in the right order, and recovers gracefully from tool failures. Submissions are graded on plan quality, completion rate, and cost.
Brief
Agent Orchestration Challenge
Objective
Build an agent that can decompose ambiguous enterprise prompts, call the right tools in the right order, and recover gracefully from tool failures.
Your agent should:
- Decompose each task into an explicit ordered plan of steps.
- Pick the correct tools from the provided catalog (tools.json).
- Recover from flaky tools (one of the mock tools intentionally fails ~30% of the time).
- Stop early when the task is complete instead of over-calling tools.
- Track and minimise tool calls and token usage — both are scored.
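One way to approach the recovery requirement is a bounded retry wrapper around each tool call. The following is a minimal sketch, not the starter agent's implementation: `ToolError`, `call_with_retry`, and the backoff values are hypothetical names and defaults, and it assumes a failing mock tool raises an exception rather than returning an error payload.

```python
import time
from typing import Any, Callable, Dict


class ToolError(Exception):
    """Hypothetical exception a mock tool raises on failure (an assumption)."""


def call_with_retry(
    tool: Callable[..., Any],
    args: Dict[str, Any],
    max_attempts: int = 3,
    base_delay_s: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Call tool(**args), retrying on ToolError with exponential backoff.

    Re-raises the last ToolError once max_attempts is exhausted, so the
    caller can re-plan instead of looping forever.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return tool(**args)
        except ToolError:
            if attempt == max_attempts:
                raise
            # Delay doubles on each attempt: base, 2*base, 4*base, ...
            sleep(base_delay_s * 2 ** (attempt - 1))
```

Keeping `max_attempts` small matters here because tool calls are scored, and retries plausibly count toward that total.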
Required CLI
Your submission must support:
python run.py \
--tasks /data/tasks.json \
--tools /data/tools.json \
--output /tmp/results.json
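The three flags above could be parsed with the standard library's argparse. This is a minimal sketch; the function name `parse_args` and the help strings are illustrative, not part of the package.

```python
import argparse
from typing import List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse the three required flags; argv=None falls back to sys.argv."""
    parser = argparse.ArgumentParser(description="Agent orchestration entrypoint")
    parser.add_argument("--tasks", required=True, help="Path to tasks.json")
    parser.add_argument("--tools", required=True, help="Path to tools.json")
    parser.add_argument("--output", required=True, help="Path for the results JSON")
    return parser.parse_args(argv)
```

In run.py, calling `parse_args()` with no arguments reads the real command line; passing an explicit list is convenient for local tests.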
Input format
tasks.json is a JSON array:
[
{
"id": "task_001",
"domain": "support",
"prompt": "A customer is asking why their last invoice is higher than expected. Their email is alex@acme.test."
}
]
tools.json is a JSON array describing each tool:
[
{
"name": "lookup_account",
"description": "Look up an account by email. Returns account_id, plan, status.",
"args_schema": { "email": "string" }
}
]
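Since tool selection is scored, it may help to check a proposed call against the catalog before executing it. The sketch below assumes every key in args_schema is required and that "string" maps to a Python str; the brief shows only that one type name, so any other type name is left unchecked. `index_tools` and `validate_call` are hypothetical helpers.

```python
import json
from typing import Any, Dict, List

# Only "string" appears in the brief; other type names are not checked.
KNOWN_TYPES = {"string": str}


def index_tools(tools: List[Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Map tool name -> tool spec for quick lookup."""
    return {tool["name"]: tool for tool in tools}


def validate_call(
    catalog: Dict[str, Dict[str, Any]], tool_name: str, args: Dict[str, Any]
) -> List[str]:
    """Return a list of problems; an empty list means the call looks well-formed."""
    if tool_name not in catalog:
        return [f"unknown tool: {tool_name}"]
    schema = catalog[tool_name].get("args_schema", {})
    problems = []
    # Assumption: every schema key is a required argument.
    for key in schema:
        if key not in args:
            problems.append(f"missing arg: {key}")
    for key, value in args.items():
        if key not in schema:
            problems.append(f"unexpected arg: {key}")
            continue
        expected = KNOWN_TYPES.get(schema[key])
        if expected is not None and not isinstance(value, expected):
            problems.append(f"arg {key} should be {schema[key]}")
    return problems
```

Rejecting a malformed call before it runs avoids spending a scored tool call on a request that cannot succeed.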
Output format
Write a JSON array to /tmp/results.json. One entry per task, in any order.
[
{
"id": "task_001",
"plan": ["lookup_account", "fetch_invoice", "summarize"],
"tool_calls": [
{ "tool": "lookup_account", "args": { "email": "alex@acme.test" } },
{ "tool": "fetch_invoice", "args": { "account_id": "acc_92" } }
],
"final_answer": "Their May invoice was $312 vs $214 in April because of overage on...",
"latency_ms": 1840,
"tokens_used": 612,
"tool_calls_count": 3
}
]
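A result entry with the fields shown above could be assembled and written as follows. This is a sketch under assumptions: `make_result` and `write_results` are hypothetical helpers, and tool_calls_count is derived from the length of tool_calls, which presumes every attempted call (including retries) is recorded in that list.

```python
import json
from pathlib import Path
from typing import Any, Dict, List


def make_result(
    task_id: str,
    plan: List[str],
    tool_calls: List[Dict[str, Any]],
    final_answer: str,
    latency_ms: int = 0,
    tokens_used: int = 0,
) -> Dict[str, Any]:
    """Build one output entry using the field names from the brief."""
    return {
        "id": task_id,
        "plan": plan,
        "tool_calls": tool_calls,
        "final_answer": final_answer,
        "latency_ms": latency_ms,
        "tokens_used": tokens_used,
        # Assumption: the count equals the number of recorded calls.
        "tool_calls_count": len(tool_calls),
    }


def write_results(path: str, results: List[Dict[str, Any]]) -> None:
    """Write the results array as JSON, creating parent directories if needed."""
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(results, indent=2), encoding="utf-8")
```

Writing the file once at the end, from a list that is appended to per task, means a task that fails midway can still contribute a partial entry.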
Scoring dimensions
- Plan quality: does the plan match a competent operator's plan?
- Completion rate: fraction of tasks where final_answer resolves the prompt.
- Tool selection: fraction of tool calls that were appropriate vs. wasted.
- Recovery: does the agent retry / re-plan after a flaky-tool failure?
- Cost: tool_calls_count and tokens_used are penalised at the margin.
Model gateway
The sandbox exposes a model gateway through the following env vars:
MODEL_GATEWAY_URL=https://...
MODEL_GATEWAY_KEY=...
CHAT_MODEL=standard-chat
Use it as you would a regular OpenAI-style chat completion endpoint. Do not call external model providers directly — those calls are blocked by the sandbox network policy.
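As an illustration, a chat request against the gateway could be built from those env vars with the standard library alone. This sketch assumes the gateway follows the usual OpenAI-style conventions: a /chat/completions path under MODEL_GATEWAY_URL and a Bearer token in the Authorization header. Neither detail is confirmed by the brief, and `build_chat_request` is a hypothetical helper.

```python
import json
import os
import urllib.request
from typing import Dict, List, Mapping


def build_chat_request(
    messages: List[Dict[str, str]], env: Mapping[str, str] = os.environ
) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completion request."""
    base_url = env["MODEL_GATEWAY_URL"].rstrip("/")
    payload = {"model": env["CHAT_MODEL"], "messages": messages}
    return urllib.request.Request(
        # Assumption: the endpoint path matches the OpenAI convention.
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            # Assumption: the key is sent as a Bearer token.
            "Authorization": f"Bearer {env['MODEL_GATEWAY_KEY']}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

The request can then be sent with `urllib.request.urlopen(request, timeout=30)`; an OpenAI-style response would carry the reply text under `choices[0].message.content`.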
Local development
The tasks.json shipped with this package contains 5 sample tasks —
one per major domain plus one edge case (GDPR). They match the exact
shape of the hidden test set used at evaluation time. The hidden set is
roughly 4× larger (~20 tasks) and is mounted into the sandbox at
/data/tasks.json automatically — your run.py should not assume any
particular task count.
Domains in the hidden set:
- support: customer support triage
- ops: internal ops automation (filing, scheduling, lookups)
- data: light analytical tasks over fixture data
- growth: outbound / lifecycle copy generation
Use the samples to drive your loop locally, then submit. We grade on the full hidden set, not just the samples.
Models & cost — how it works
You don't need an OpenAI API key to submit. When your code runs in our
sandbox, we set OPENAI_API_KEY and OPENAI_BASE_URL for you so the
standard openai SDK calls our metering proxy. The proxy forwards to
OpenAI using the operator's key, records every call's real
prompt_tokens / completion_tokens / latency, and the cost panel on
your report is computed from those host-measured numbers — not anything
you self-report.
What this means for you:
- Just write from openai import OpenAI; client = OpenAI(). The SDK picks up the env vars automatically. No key setup on your end.
- You don't need to populate tokens_used, latency_ms, or model in results.json. Those fields are now ignored. (You may still log them for your own debugging.)
- Each submission has a hard upstream budget of $3.00 USD by default. If you blow through it, further OpenAI calls return HTTP 402 and the rest of your run will fail. Pick efficient models.
- Model choice is part of what we evaluate. Solving the task on a small, cheap model demonstrates more skill than reaching for the largest one. Suggested defaults:
  - gpt-4o-mini: best price/quality default for almost everything.
  - gpt-4.1-mini: budget alternative when reasoning depth matters.
  - Larger models (gpt-4o, gpt-4.1, o1, o3-mini): only reach for these when you can justify it in your README.
Want to test locally before submitting? Use your own OpenAI key and the
real https://api.openai.com base URL — your local runs are billed to
you, not us. The proxy only kicks in when your code runs inside our
evaluation sandbox.
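Because calls start returning HTTP 402 once the budget is spent, it can be worth failing fast instead of sending further requests that are certain to be rejected. The sketch below wraps any model-calling function and trips after the first 402. `BudgetGuard` and `BudgetExhausted` are hypothetical names, and the check relies on the raised exception exposing a `status_code` attribute, as the openai SDK's `APIStatusError` does.

```python
from typing import Any, Callable


class BudgetExhausted(RuntimeError):
    """Hypothetical marker raised once a model call has returned HTTP 402."""


class BudgetGuard:
    """Wrap a model-calling function and stop calling it after a 402."""

    def __init__(self, complete: Callable[..., Any]) -> None:
        self._complete = complete
        self.exhausted = False

    def __call__(self, *args: Any, **kwargs: Any) -> Any:
        if self.exhausted:
            raise BudgetExhausted("budget already exhausted; skipping model call")
        try:
            return self._complete(*args, **kwargs)
        except Exception as exc:
            # Duck-typed check: any exception carrying status_code == 402.
            if getattr(exc, "status_code", None) == 402:
                self.exhausted = True
                raise BudgetExhausted("model budget exhausted (HTTP 402)") from exc
            raise
```

A possible use is `complete = BudgetGuard(client.chat.completions.create)`; when `BudgetExhausted` is caught, the run loop can record whatever partial answers it has and still write results.json for the tasks already finished.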
Repository contract
Your zip must match the following layout.
repo.zip
├── run.py            # required entrypoint
├── README.md         # required
├── requirements.txt  # optional
└── ...               # your agent implementation
Entrypoint
run.py must accept the following arguments.
python run.py \
--tasks /data/tasks.json \
--tools /data/tools.json \
--output /tmp/results.json
Output schema
Write the result to /tmp/results.json.
[
{
"id": "task_001",
"plan": ["lookup_account", "fetch_invoice", "summarize"],
"tool_calls": [{"tool": "lookup_account", "args": {...}}],
"final_answer": "...",
"latency_ms": 1840,
"tokens_used": 612,
"tool_calls_count": 3
}
]
Challenge package
20 sample tasks · 4 domains · 8 mock tools · starter agent
Download agent-orchestration_candidate_package.zip
Includes
- 20 sample tasks across 4 domains
- 8 mock tools with JSON specs (and a flaky one)
- Starter agent loop & Dockerfile
- requirements.txt
Hidden test data and ground-truth answers are not included. Your code runs against them inside the sandbox.
Submit
Validation
- .zip archive only
- run.py at root or top-level dir
- README.md present
- No hidden evaluation files
- Size ≤ 20MB
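Most of these checks can be approximated locally before uploading. The sketch below is not the platform's validator: `check_submission` is a hypothetical helper, it treats 20MB as 20,000,000 bytes (the stricter reading), it assumes README.md may sit at the root or in the top-level dir like run.py, and it cannot check for hidden evaluation files because the brief does not name them.

```python
import os
import zipfile
from typing import List

# Assumption: "20MB" read conservatively as 20 * 10^6 bytes.
MAX_BYTES = 20 * 1000 * 1000


def _at_root_or_top_dir(names: List[str], filename: str) -> bool:
    """True if filename sits at the archive root or one directory down."""
    for name in names:
        parts = name.split("/")
        if parts[-1] == filename and len(parts) <= 2:
            return True
    return False


def check_submission(zip_path: str) -> List[str]:
    """Return a list of problems; an empty list means the local checks pass."""
    problems = []
    if not zip_path.lower().endswith(".zip"):
        problems.append("archive must be a .zip file")
    if not zipfile.is_zipfile(zip_path):
        return problems + ["not a readable zip archive"]
    if os.path.getsize(zip_path) > MAX_BYTES:
        problems.append("archive exceeds 20MB")
    with zipfile.ZipFile(zip_path) as archive:
        names = archive.namelist()
    if not _at_root_or_top_dir(names, "run.py"):
        problems.append("run.py not found at root or top-level dir")
    if not _at_root_or_top_dir(names, "README.md"):
        problems.append("README.md not found")
    return problems
```

Running `check_submission("repo.zip")` and printing the returned list gives a quick pre-upload sanity check.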
Tip: register a profile first — your submission auto-attaches to it when emails match.