← Benchmarks

Overview

Candidates author a grading harness for a free-form generation task. Their rubric is itself graded against expert agreement on a held-out set.

Questions40
Domains2
Duration3–6 hours
Slugevaluation-design

Skills assessed

Rubric designInter-rater agreementStatistical powerBias detectionCost-efficient gradingReporting

Brief

Evaluation Design Challenge

Objective

Build a grading harness for a free-form generation task. The task is a writing assistant that turns terse engineering tickets into polished customer-facing release notes. Your harness will be re-applied to a hidden set of model outputs and graded against expert agreement.

You are graded on how well your rubric tracks expert grades, not on how generous or strict your scores are.

What you submit

A repo.zip whose run.py:

  1. Reads the candidate model outputs in samples.json.
  2. Writes your rubric to /tmp/rubric.json.
  3. Writes per-sample scores to /tmp/results.json.

You may use the model gateway as an LLM-as-judge — most strong submissions do.

Required CLI

python run.py \
  --samples /data/samples.json \
  --rubric  /tmp/rubric.json \
  --output  /tmp/results.json

Input format

samples.json is a JSON array:

[
  {
    "id": "sample_001",
    "input_ticket": "JIRA-4421: backend now returns 429 instead of 503 when caller exceeds rate limit. update docs and changelog.",
    "model_output": "We've updated our APIs to return HTTP 429 instead of 503 when..."
  }
]

Rubric format

rubric.json is your design artefact. Suggested shape:

{
  "name": "release_notes_rubric_v1",
  "dimensions": [
    {"key": "clarity", "scale": [1, 5], "description": "..."},
    {"key": "accuracy", "scale": [1, 5], "description": "..."},
    {"key": "tone", "scale": [1, 5], "description": "..."}
  ],
  "weights": {"clarity": 0.4, "accuracy": 0.4, "tone": 0.2},
  "grading_prompt": "..."
}

You may add more dimensions, free-form criteria, or anchor examples — the grader only needs to be able to reproduce your scores.

Output format

results.json is a JSON array, one entry per sample:

[
  {
    "id": "sample_001",
    "scores": {"clarity": 4, "accuracy": 5, "tone": 3},
    "overall": 4.0,
    "rationale": "Clear and accurate, tone is slightly too formal for changelog voice.",
    "latency_ms": 320,
    "tokens_used": 410
  }
]

Scoring dimensions

  • Inter-rater agreement — Cohen's κ between your scores and the held-out expert scores on the hidden set.
  • Bias detection — does your rubric penalise length, formality, or hedging in ways the expert grader does not?
  • Statistical power — your rubric must distinguish good from mediocre outputs at p < 0.05 with 30 samples.
  • Cost-efficient gradingtokens_used per sample is penalised at the margin.
  • Reporting — your rationale must be specific enough for the original author to act on.

Local development

This package ships 5 sample model outputs in samples.json and the corresponding 5 expert grades in expert_grades.json so you can calibrate your rubric locally. The hidden set the sandbox actually runs your grader against is larger (20 samples) and follows the exact same JSON shape — your run.py should not assume any specific sample count, and it will not see the hidden expert grades.

Inter-rater agreement between your scores and the held-out experts is the primary metric. Spend most of your time tightening the rubric, not on prompt micro-tweaks.

Models & cost — how it works

You don't need an OpenAI API key to submit. When your code runs in our sandbox, we set OPENAI_API_KEY and OPENAI_BASE_URL for you so the standard openai SDK calls our metering proxy. The proxy forwards to OpenAI using the operator's key, records every call's real prompt_tokens / completion_tokens / latency, and the cost panel on your report is computed from those host-measured numbers — not anything you self-report.

What this means for you:

  • Just write from openai import OpenAI; client = OpenAI() — the SDK picks up the env vars automatically. No key setup on your end.
  • You don't need to populate tokens_used, latency_ms, or model in results.json. Those fields are now ignored. (You may still log them for your own debugging.)
  • Each submission has a hard upstream budget of $3.00 USD by default. If you blow through it, further OpenAI calls return HTTP 402 and the rest of your run will fail. Pick efficient models.
  • Model choice is part of what we evaluate. Solving the task on a small, cheap model demonstrates more skill than reaching for the largest one. Suggested defaults:
  • gpt-4o-mini — best price/quality default for almost everything.
  • gpt-4.1-mini — budget alternative when reasoning depth matters.
  • Larger models (gpt-4o, gpt-4.1, o1, o3-mini) — only reach for these when you can justify it in your README.

Want to test locally before submitting? Use your own OpenAI key and the real https://api.openai.com base URL — your local runs are billed to you, not us. The proxy only kicks in when your code runs inside our evaluation sandbox.

Repository contract

Your zip must match the following layout.

repo.zip
├── run.py            # required entrypoint
├── README.md         # required
├── requirements.txt  # optional
└── ...               # rubric, grader, calibration code

Entrypoint

run.py must accept the following arguments.

python run.py \
  --samples /data/samples.json \
  --rubric  /tmp/rubric.json \
  --output  /tmp/results.json

Output schema

Write the result to /tmp/results.json.

[
  {
    "id": "sample_001",
    "scores": {"clarity": 4, "accuracy": 5, "tone": 3},
    "overall": 4.0,
    "rationale": "...",
    "latency_ms": 320,
    "tokens_used": 410
  }
]

Challenge package

20 candidate outputs · rubric template · sample expert grades

Download evaluation-design_candidate_package.zip

Includes

  • 20 model outputs to grade (writing-assistant task)
  • Rubric template + 5 expert-graded reference samples
  • Grader starter (LLM-as-judge skeleton) & Dockerfile
  • requirements.txt

Hidden test data and ground-truth answers are not included. Your code runs against them inside the sandbox.

Submit

Max 20MB · must contain run.py and README.md
We're not against AI-assisted coding — we expect it. We just want to see how well you direct an AI. Your prompt is shown alongside your submission and passed to the LLM judge as extra context, so well-structured prompts with clear intent, constraints, and evaluation criteria can lift your score. Leave blank if you didn't use AI.
Only fill this in if a hiring team gave you a code. Submissions with a code go to that company privately and are not added to the public talent pool.

Validation

  • .zip archive only
  • run.py at root or top-level dir
  • README.md present
  • No hidden evaluation files
  • Size ≤ 20MB

Tip: register a profile first — your submission auto-attaches to it when emails match.