← Benchmarks

Overview

Candidates fine-tune a base model on a held-out task and submit weights + a runner. Graded on held-out accuracy, regression on general benchmarks, and training cost efficiency.

Questions60
Domains3
Duration8–16 hours
Slugllm-fine-tuning

Skills assessed

Data curationLoRA / PEFTEval harnessCatastrophic forgettingInference costReproducibility

Brief

LLM Fine-Tuning Challenge

Objective

Adapt a base model to a domain-specific task using limited compute, then ship the trained adapter and a runner that emits per-example predictions.

The task: classify and respond to inbound support tickets for a fictional SaaS product, Northstar Workspaces. The base model has never seen this product's terminology and is mediocre out of the box.

What you submit

A repo.zip whose run.py performs both training and inference in a single invocation, writing a JSON results file. The sandbox provides a GPU and a 30-minute training budget.

Required CLI

python run.py \
  --train   /data/train.jsonl \
  --val     /data/val.jsonl \
  --weights /tmp/weights \
  --output  /tmp/results.json
  • --train and --val are JSONL files (one JSON object per line).
  • --weights is a writable directory; place your final adapter weights here.
  • --output is the JSON file the harness reads — see the schema below.

Input format

Each line of train.jsonl is a fully-labelled example:

{
  "id": "ex_001",
  "ticket": "Hey — when I share a workspace with a guest, do they count toward our seat limit?",
  "intent": "billing.seat_question",
  "ideal_reply": "Guests don't consume a seat unless..."
}

Each line of val.jsonl is input only — your fine-tuned model has to predict both the intent and the reply. Labels are kept on the host:

{
  "id": "val_001",
  "ticket": "How do I share a benchmark privately with one company only?"
}

Output format

Write a single JSON object to /tmp/results.json:

{
  "training": {
    "steps": 600,
    "train_loss": 0.42,
    "val_loss": 0.58,
    "wall_clock_s": 420
  },
  "predictions": [
    {
      "id": "ex_001",
      "prediction": "Guests don't consume a seat...",
      "predicted_intent": "billing.seat_question",
      "latency_ms": 88,
      "tokens_used": 124
    }
  ]
}

Scoring dimensions

  • Intent accuracy (40%) — exact-match on predicted_intent against the canonical taxonomy (synonyms accepted).
  • Reply quality (30%) — LLM-judged similarity of prediction to the hidden ideal reply, with per-row must-mention / must-not-mention checks.
  • Regression probes (20%) — a held-out probe set covering common intents in fresh paraphrases. Flags fine-tunes that destroyed the base model's general comprehension. The probes are mixed into the same val.jsonl feed as ordinary val rows, so your run.py doesn't need to do anything special — predict an intent and reply for every input row.
  • Training efficiency (10%)wall_clock_s relative to a 600s ideal and a 1800s ceiling.
  • Reproducibility — your run must be deterministic given the same seed.

Recommended approach

LoRA / QLoRA on a 7B-class base model is more than enough for this dataset. A handful of well-curated examples will outperform brute-force training on the full 30 rows.

Local development

The train.jsonl and val.jsonl shipped here are a small sample so you can wire your training + inference loop end-to-end on CPU using a tiny distilled model. The hidden set the sandbox actually trains on is larger (30 train rows + 50 val rows) and follows the exact same JSONL shapes documented above (train.jsonl labelled, val.jsonl input-only) — your run.py should not assume any specific row count.

Iterate locally on the samples; submit when your loop produces a well-formed /tmp/results.json. The sandbox will run your run.py against the full hidden set on a GPU and grade predictions[*].predicted_intent against the canonical taxonomy. The hidden val.jsonl includes both ordinary val rows and a held-out regression-probe set — you receive them as one stream of {id, ticket} rows and emit one prediction per id.

Models & cost — how it works

You don't need an OpenAI API key to submit. When your code runs in our sandbox, we set OPENAI_API_KEY and OPENAI_BASE_URL for you so the standard openai SDK calls our metering proxy. The proxy forwards to OpenAI using the operator's key, records every call's real prompt_tokens / completion_tokens / latency, and the cost panel on your report is computed from those host-measured numbers — not anything you self-report.

What this means for you:

  • Just write from openai import OpenAI; client = OpenAI() — the SDK picks up the env vars automatically. No key setup on your end.
  • You don't need to populate tokens_used, latency_ms, or model in results.json. Those fields are now ignored. (You may still log them for your own debugging.)
  • Each submission has a hard upstream budget of $3.00 USD by default. If you blow through it, further OpenAI calls return HTTP 402 and the rest of your run will fail. Pick efficient models.
  • Model choice is part of what we evaluate. Solving the task on a small, cheap model demonstrates more skill than reaching for the largest one. Suggested defaults:
  • gpt-4o-mini — best price/quality default for almost everything.
  • gpt-4.1-mini — budget alternative when reasoning depth matters.
  • Larger models (gpt-4o, gpt-4.1, o1, o3-mini) — only reach for these when you can justify it in your README.

Want to test locally before submitting? Use your own OpenAI key and the real https://api.openai.com base URL — your local runs are billed to you, not us. The proxy only kicks in when your code runs inside our evaluation sandbox.

Repository contract

Your zip must match the following layout.

repo.zip
├── run.py            # required entrypoint (trains + evaluates)
├── README.md         # required
├── requirements.txt  # optional
└── ...               # adapters, training scripts, etc.

Entrypoint

run.py must accept the following arguments.

python run.py \
  --train   /data/train.jsonl \
  --val     /data/val.jsonl \
  --weights /tmp/weights \
  --output  /tmp/results.json

Output schema

Write the result to /tmp/results.json.

{
  "training": {"steps": 600, "train_loss": 0.42, "val_loss": 0.58, "wall_clock_s": 420},
  "predictions": [
    {
      "id": "ex_001",
      "prediction": "...",
      "predicted_intent": "billing.seat_question",
      "latency_ms": 88,
      "tokens_used": 124
    }
  ]
}

Challenge package

30 train + 10 val examples · LoRA starter · eval harness

Download llm-fine-tuning_candidate_package.zip

Includes

  • 30-row train.jsonl + 10-row val.jsonl (domain-specific support tickets)
  • LoRA fine-tuning starter (PEFT / Transformers)
  • Eval harness skeleton & Dockerfile
  • requirements.txt

Hidden test data and ground-truth answers are not included. Your code runs against them inside the sandbox.

Submit

Max 20MB · must contain run.py and README.md
We're not against AI-assisted coding — we expect it. We just want to see how well you direct an AI. Your prompt is shown alongside your submission and passed to the LLM judge as extra context, so well-structured prompts with clear intent, constraints, and evaluation criteria can lift your score. Leave blank if you didn't use AI.
Only fill this in if a hiring team gave you a code. Submissions with a code go to that company privately and are not added to the public talent pool.

Validation

  • .zip archive only
  • run.py at root or top-level dir
  • README.md present
  • No hidden evaluation files
  • Size ≤ 20MB

Tip: register a profile first — your submission auto-attaches to it when emails match.