Prompt Optimization
Systematically improve a baseline prompt against a held-out test set.
Overview
Candidates iterate on a baseline prompt with a fixed compute budget. Graded on lift over baseline, sample efficiency, and consistency across model providers.
Skills assessed
Brief
Prompt Optimization Challenge
Objective
Take a baseline prompt that achieves ~38% accuracy on a held-out classification task and iterate on it under a fixed compute budget. The final prompt is graded on lift over baseline, sample efficiency, and consistency across model providers.
The task is support-ticket intent classification for the same fictional product (Northstar Workspaces) used in other benchmarks. Inputs are short customer messages; outputs are one of 12 fixed intent labels.
What you submit
A repo.zip containing:
run.py— runs your final prompt against--testand emits predictions.prompt.txt— your final prompt template (the harness reads this for consistency-across-models checks).- Whatever helper code you used to iterate on
--train.
Required CLI
python run.py \
--train /data/train.json \
--test /data/test.json \
--output /tmp/results.json
You may use --train to compute few-shot exemplars at inference time, or
freeze them into prompt.txt ahead of time — both approaches are fine.
Input format
[
{
"id": "ex_001",
"ticket": "How do I add a guest to my workspace?",
"label": "product.guest_access"
}
]
The label field is included in both train.json and test.json in
this candidate package so that you can compute local accuracy honestly. The
hidden test set the harness runs your prompt against will not include
labels — your run.py must work whether or not label is present on a
row, and must never read label at inference time.
Output format
{
"final_prompt": "...full prompt template that was used...",
"predictions": [
{
"id": "ex_001",
"prediction": "product.guest_access",
"latency_ms": 220,
"tokens_used": 180
}
]
}
Scoring dimensions
- Lift over baseline — accuracy on the hidden test set vs. the 38% baseline. Most submissions reach 70%+; the bar for the top quartile is 85%+.
- Sample efficiency — number of train rows you actually used as few-shot exemplars (we extract this from your final prompt automatically).
- Token cost — average
tokens_usedper prediction. - Cross-provider consistency — we re-run your prompt against 3 model providers; high variance is penalised.
- Failure analysis — your
README.mdshould briefly describe what classes of errors remained and what you'd try next.
Local development
This package ships a 25-row labeled train.json (use it freely as an
exemplar pool) and a 5-row labeled test.json you can use to sanity-
check your local accuracy.
The hidden test set the sandbox actually grades against is larger
(25 rows) and follows the exact same shape except label is
stripped — your run.py must work whether or not label is present
on a row, and must never read label at inference time. The host
joins your predictions to the canonical labels on id after the run.
Models & cost — how it works
You don't need an OpenAI API key to submit. When your code runs in our
sandbox, we set OPENAI_API_KEY and OPENAI_BASE_URL for you so the
standard openai SDK calls our metering proxy. The proxy forwards to
OpenAI using the operator's key, records every call's real
prompt_tokens / completion_tokens / latency, and the cost panel on
your report is computed from those host-measured numbers — not anything
you self-report.
What this means for you:
- Just write
from openai import OpenAI; client = OpenAI()— the SDK picks up the env vars automatically. No key setup on your end. - You don't need to populate
tokens_used,latency_ms, ormodelinresults.json. Those fields are now ignored. (You may still log them for your own debugging.) - Each submission has a hard upstream budget of $3.00 USD by default. If you blow through it, further OpenAI calls return HTTP 402 and the rest of your run will fail. Pick efficient models.
- Model choice is part of what we evaluate. Solving the task on a small, cheap model demonstrates more skill than reaching for the largest one. Suggested defaults:
gpt-4o-mini— best price/quality default for almost everything.gpt-4.1-mini— budget alternative when reasoning depth matters.- Larger models (
gpt-4o,gpt-4.1,o1,o3-mini) — only reach for these when you can justify it in your README.
Want to test locally before submitting? Use your own OpenAI key and the
real https://api.openai.com base URL — your local runs are billed to
you, not us. The proxy only kicks in when your code runs inside our
evaluation sandbox.
Repository contract
Your zip must match the following layout.
repo.zip ├── run.py # required entrypoint ├── README.md # required ├── requirements.txt # optional └── prompt.txt # final optimised prompt
Entrypoint
run.py must accept the following arguments.
python run.py \ --train /data/train.json \ --test /data/test.json \ --output /tmp/results.json
Output schema
Write the result to /tmp/results.json.
{
"final_prompt": "...",
"predictions": [
{
"id": "ex_001",
"prediction": "positive",
"latency_ms": 220,
"tokens_used": 180
}
]
}
Challenge package
25 train + 25 test examples · baseline prompt · iteration harness
Download prompt-optimization_candidate_package.zipIncludes
- 25 train + 25 test (input, expected_output) pairs (classification task)
- baseline_prompt.txt with 38% baseline accuracy
- Iteration harness skeleton & Dockerfile
- requirements.txt
Hidden test data and ground-truth answers are not included. Your code runs against them inside the sandbox.
Submit
Validation
- .zip archive only
- run.py at root or top-level dir
- README.md present
- No hidden evaluation files
- Size ≤ 20MB
Tip: register a profile first — your submission auto-attaches to it when emails match.