Evaluation Design
Build a robust eval harness for an open-ended generation task.
Overview
Candidates author a grading harness for a free-form generation task. Their rubric is itself graded against expert agreement on a held-out set.
Skills assessed
Brief
Evaluation Design Challenge
Objective
Build a grading harness for a free-form generation task. The task is a writing assistant that turns terse engineering tickets into polished customer-facing release notes. Your harness will be re-applied to a hidden set of model outputs and graded against expert agreement.
You are graded on how well your rubric tracks expert grades, not on how generous or strict your scores are.
What you submit
A repo.zip whose run.py:
- Reads the candidate model outputs in
samples.json. - Writes your rubric to
/tmp/rubric.json. - Writes per-sample scores to
/tmp/results.json.
You may use the model gateway as an LLM-as-judge — most strong submissions do.
Required CLI
python run.py \
--samples /data/samples.json \
--rubric /tmp/rubric.json \
--output /tmp/results.json
Input format
samples.json is a JSON array:
[
{
"id": "sample_001",
"input_ticket": "JIRA-4421: backend now returns 429 instead of 503 when caller exceeds rate limit. update docs and changelog.",
"model_output": "We've updated our APIs to return HTTP 429 instead of 503 when..."
}
]
Rubric format
rubric.json is your design artefact. Suggested shape:
{
"name": "release_notes_rubric_v1",
"dimensions": [
{"key": "clarity", "scale": [1, 5], "description": "..."},
{"key": "accuracy", "scale": [1, 5], "description": "..."},
{"key": "tone", "scale": [1, 5], "description": "..."}
],
"weights": {"clarity": 0.4, "accuracy": 0.4, "tone": 0.2},
"grading_prompt": "..."
}
You may add more dimensions, free-form criteria, or anchor examples — the grader only needs to be able to reproduce your scores.
Output format
results.json is a JSON array, one entry per sample:
[
{
"id": "sample_001",
"scores": {"clarity": 4, "accuracy": 5, "tone": 3},
"overall": 4.0,
"rationale": "Clear and accurate, tone is slightly too formal for changelog voice.",
"latency_ms": 320,
"tokens_used": 410
}
]
Scoring dimensions
- Inter-rater agreement — Cohen's κ between your scores and the held-out expert scores on the hidden set.
- Bias detection — does your rubric penalise length, formality, or hedging in ways the expert grader does not?
- Statistical power — your rubric must distinguish good from mediocre outputs at p < 0.05 with 30 samples.
- Cost-efficient grading —
tokens_usedper sample is penalised at the margin. - Reporting — your rationale must be specific enough for the original author to act on.
Local development
This package ships 5 sample model outputs in samples.json and the
corresponding 5 expert grades in expert_grades.json so you can
calibrate your rubric locally. The hidden set the sandbox actually
runs your grader against is larger (20 samples) and follows the exact
same JSON shape — your run.py should not assume any specific sample
count, and it will not see the hidden expert grades.
Inter-rater agreement between your scores and the held-out experts is the primary metric. Spend most of your time tightening the rubric, not on prompt micro-tweaks.
Models & cost — how it works
You don't need an OpenAI API key to submit. When your code runs in our
sandbox, we set OPENAI_API_KEY and OPENAI_BASE_URL for you so the
standard openai SDK calls our metering proxy. The proxy forwards to
OpenAI using the operator's key, records every call's real
prompt_tokens / completion_tokens / latency, and the cost panel on
your report is computed from those host-measured numbers — not anything
you self-report.
What this means for you:
- Just write
from openai import OpenAI; client = OpenAI()— the SDK picks up the env vars automatically. No key setup on your end. - You don't need to populate
tokens_used,latency_ms, ormodelinresults.json. Those fields are now ignored. (You may still log them for your own debugging.) - Each submission has a hard upstream budget of $3.00 USD by default. If you blow through it, further OpenAI calls return HTTP 402 and the rest of your run will fail. Pick efficient models.
- Model choice is part of what we evaluate. Solving the task on a small, cheap model demonstrates more skill than reaching for the largest one. Suggested defaults:
gpt-4o-mini— best price/quality default for almost everything.gpt-4.1-mini— budget alternative when reasoning depth matters.- Larger models (
gpt-4o,gpt-4.1,o1,o3-mini) — only reach for these when you can justify it in your README.
Want to test locally before submitting? Use your own OpenAI key and the
real https://api.openai.com base URL — your local runs are billed to
you, not us. The proxy only kicks in when your code runs inside our
evaluation sandbox.
Repository contract
Your zip must match the following layout.
repo.zip ├── run.py # required entrypoint ├── README.md # required ├── requirements.txt # optional └── ... # rubric, grader, calibration code
Entrypoint
run.py must accept the following arguments.
python run.py \ --samples /data/samples.json \ --rubric /tmp/rubric.json \ --output /tmp/results.json
Output schema
Write the result to /tmp/results.json.
[
{
"id": "sample_001",
"scores": {"clarity": 4, "accuracy": 5, "tone": 3},
"overall": 4.0,
"rationale": "...",
"latency_ms": 320,
"tokens_used": 410
}
]
Challenge package
20 candidate outputs · rubric template · sample expert grades
Download evaluation-design_candidate_package.zipIncludes
- 20 model outputs to grade (writing-assistant task)
- Rubric template + 5 expert-graded reference samples
- Grader starter (LLM-as-judge skeleton) & Dockerfile
- requirements.txt
Hidden test data and ground-truth answers are not included. Your code runs against them inside the sandbox.
Submit
Validation
- .zip archive only
- run.py at root or top-level dir
- README.md present
- No hidden evaluation files
- Size ≤ 20MB
Tip: register a profile first — your submission auto-attaches to it when emails match.