← Benchmarks

Overview

Candidates iterate on a baseline prompt with a fixed compute budget. Graded on lift over baseline, sample efficiency, and consistency across model providers.

Questions100
Domains3
Duration2–4 hours
Slugprompt-optimization

Skills assessed

Prompt engineeringIteration disciplineA/B methodologyFew-shot selectionCost controlFailure analysis

Brief

Prompt Optimization Challenge

Objective

Take a baseline prompt that achieves ~38% accuracy on a held-out classification task and iterate on it under a fixed compute budget. The final prompt is graded on lift over baseline, sample efficiency, and consistency across model providers.

The task is support-ticket intent classification for the same fictional product (Northstar Workspaces) used in other benchmarks. Inputs are short customer messages; outputs are one of 12 fixed intent labels.

What you submit

A repo.zip containing:

  • run.py — runs your final prompt against --test and emits predictions.
  • prompt.txt — your final prompt template (the harness reads this for consistency-across-models checks).
  • Whatever helper code you used to iterate on --train.

Required CLI

python run.py \
  --train  /data/train.json \
  --test   /data/test.json \
  --output /tmp/results.json

You may use --train to compute few-shot exemplars at inference time, or freeze them into prompt.txt ahead of time — both approaches are fine.

Input format

[
  {
    "id": "ex_001",
    "ticket": "How do I add a guest to my workspace?",
    "label": "product.guest_access"
  }
]

The label field is included in both train.json and test.json in this candidate package so that you can compute local accuracy honestly. The hidden test set the harness runs your prompt against will not include labels — your run.py must work whether or not label is present on a row, and must never read label at inference time.

Output format

{
  "final_prompt": "...full prompt template that was used...",
  "predictions": [
    {
      "id": "ex_001",
      "prediction": "product.guest_access",
      "latency_ms": 220,
      "tokens_used": 180
    }
  ]
}

Scoring dimensions

  • Lift over baseline — accuracy on the hidden test set vs. the 38% baseline. Most submissions reach 70%+; the bar for the top quartile is 85%+.
  • Sample efficiency — number of train rows you actually used as few-shot exemplars (we extract this from your final prompt automatically).
  • Token cost — average tokens_used per prediction.
  • Cross-provider consistency — we re-run your prompt against 3 model providers; high variance is penalised.
  • Failure analysis — your README.md should briefly describe what classes of errors remained and what you'd try next.

Local development

This package ships a 25-row labeled train.json (use it freely as an exemplar pool) and a 5-row labeled test.json you can use to sanity- check your local accuracy.

The hidden test set the sandbox actually grades against is larger (25 rows) and follows the exact same shape except label is stripped — your run.py must work whether or not label is present on a row, and must never read label at inference time. The host joins your predictions to the canonical labels on id after the run.

Models & cost — how it works

You don't need an OpenAI API key to submit. When your code runs in our sandbox, we set OPENAI_API_KEY and OPENAI_BASE_URL for you so the standard openai SDK calls our metering proxy. The proxy forwards to OpenAI using the operator's key, records every call's real prompt_tokens / completion_tokens / latency, and the cost panel on your report is computed from those host-measured numbers — not anything you self-report.

What this means for you:

  • Just write from openai import OpenAI; client = OpenAI() — the SDK picks up the env vars automatically. No key setup on your end.
  • You don't need to populate tokens_used, latency_ms, or model in results.json. Those fields are now ignored. (You may still log them for your own debugging.)
  • Each submission has a hard upstream budget of $3.00 USD by default. If you blow through it, further OpenAI calls return HTTP 402 and the rest of your run will fail. Pick efficient models.
  • Model choice is part of what we evaluate. Solving the task on a small, cheap model demonstrates more skill than reaching for the largest one. Suggested defaults:
  • gpt-4o-mini — best price/quality default for almost everything.
  • gpt-4.1-mini — budget alternative when reasoning depth matters.
  • Larger models (gpt-4o, gpt-4.1, o1, o3-mini) — only reach for these when you can justify it in your README.

Want to test locally before submitting? Use your own OpenAI key and the real https://api.openai.com base URL — your local runs are billed to you, not us. The proxy only kicks in when your code runs inside our evaluation sandbox.

Repository contract

Your zip must match the following layout.

repo.zip
├── run.py            # required entrypoint
├── README.md         # required
├── requirements.txt  # optional
└── prompt.txt        # final optimised prompt

Entrypoint

run.py must accept the following arguments.

python run.py \
  --train  /data/train.json \
  --test   /data/test.json \
  --output /tmp/results.json

Output schema

Write the result to /tmp/results.json.

{
  "final_prompt": "...",
  "predictions": [
    {
      "id": "ex_001",
      "prediction": "positive",
      "latency_ms": 220,
      "tokens_used": 180
    }
  ]
}

Challenge package

25 train + 25 test examples · baseline prompt · iteration harness

Download prompt-optimization_candidate_package.zip

Includes

  • 25 train + 25 test (input, expected_output) pairs (classification task)
  • baseline_prompt.txt with 38% baseline accuracy
  • Iteration harness skeleton & Dockerfile
  • requirements.txt

Hidden test data and ground-truth answers are not included. Your code runs against them inside the sandbox.

Submit

Max 20MB · must contain run.py and README.md
We're not against AI-assisted coding — we expect it. We just want to see how well you direct an AI. Your prompt is shown alongside your submission and passed to the LLM judge as extra context, so well-structured prompts with clear intent, constraints, and evaluation criteria can lift your score. Leave blank if you didn't use AI.
Only fill this in if a hiring team gave you a code. Submissions with a code go to that company privately and are not added to the public talent pool.

Validation

  • .zip archive only
  • run.py at root or top-level dir
  • README.md present
  • No hidden evaluation files
  • Size ≤ 20MB

Tip: register a profile first — your submission auto-attaches to it when emails match.