LLM Fine-Tuning
Adapt a base model to a domain-specific task with limited compute.
Overview
Candidates fine-tune a base model on a domain-specific task and submit weights plus a runner. Submissions are graded on held-out accuracy, regression on general benchmarks, and training cost efficiency.
Brief
LLM Fine-Tuning Challenge
Objective
Adapt a base model to a domain-specific task using limited compute, then ship the trained adapter and a runner that emits per-example predictions.
The task: classify and respond to inbound support tickets for a fictional SaaS product, Northstar Workspaces. The base model has never seen this product's terminology and is mediocre out of the box.
What you submit
A repo.zip whose run.py performs both training and inference in a
single invocation, writing a JSON results file. The sandbox provides a
GPU and a 30-minute training budget.
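Because training and inference share one invocation, the training loop has to leave time for prediction. A minimal sketch of a budget guard, assuming the 30-minute budget means 1800 seconds of wall-clock time. The TimeBudget class and its reserve_s margin are hypothetical names for illustration, not part of the harness:

```python
import time


class TimeBudget:
    """Track wall-clock time against a fixed budget."""

    def __init__(self, budget_s=1800.0, reserve_s=120.0, clock=time.monotonic):
        # reserve_s is an assumed safety margin held back for inference and I/O.
        self._clock = clock
        self._start = clock()
        self._limit = budget_s - reserve_s

    def elapsed(self):
        """Seconds since the budget was created."""
        return self._clock() - self._start

    def exhausted(self):
        """True once the usable part of the budget has been spent."""
        return self.elapsed() >= self._limit
```

A training loop could call budget.exhausted() once per step and break out when it returns True, so the final adapter is still saved and used for prediction.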
Required CLI
python run.py \
--train /data/train.jsonl \
--val /data/val.jsonl \
--weights /tmp/weights \
--output /tmp/results.json
- --train and --val are JSONL files (one JSON object per line).
- --weights is a writable directory; place your final adapter weights here.
- --output is the JSON file the harness reads; see the schema below.
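One way to accept these four flags is with argparse from the standard library. This is a sketch, not the harness's own parser. The parse_args helper is a hypothetical name, and creating the output directories up front is an assumption made for convenience:

```python
import argparse
from pathlib import Path


def parse_args(argv=None):
    """Parse the four flags listed in the required CLI."""
    parser = argparse.ArgumentParser(description="Train, then predict on the val file.")
    parser.add_argument("--train", type=Path, required=True, help="labelled JSONL file")
    parser.add_argument("--val", type=Path, required=True, help="input-only JSONL file")
    parser.add_argument("--weights", type=Path, required=True, help="directory for adapter weights")
    parser.add_argument("--output", type=Path, required=True, help="path of the results JSON")
    args = parser.parse_args(argv)
    # Assumption: creating these directories is harmless if they already exist.
    args.weights.mkdir(parents=True, exist_ok=True)
    args.output.parent.mkdir(parents=True, exist_ok=True)
    return args
```

Passing argv=None makes argparse read sys.argv, so the same function works from the command line and from a test.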
Input format
Each line of train.jsonl is a fully-labelled example:
{
"id": "ex_001",
"ticket": "Hey — when I share a workspace with a guest, do they count toward our seat limit?",
"intent": "billing.seat_question",
"ideal_reply": "Guests don't consume a seat unless..."
}
Each line of val.jsonl is input only — your fine-tuned model has to
predict both the intent and the reply. Labels are kept on the host:
{
"id": "val_001",
"ticket": "How do I share a benchmark privately with one company only?"
}
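Both files can be loaded with the same reader. The sketch below assumes UTF-8 files and skips blank lines, which the brief does not mention. read_jsonl and check_fields are hypothetical helper names:

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Return a list of dicts, one per non-empty line."""
    rows = []
    with Path(path).open(encoding="utf-8") as fh:
        for line_no, line in enumerate(fh, start=1):
            line = line.strip()
            if not line:
                continue  # assumption: blank lines are tolerated
            try:
                rows.append(json.loads(line))
            except json.JSONDecodeError as exc:
                raise ValueError(f"{path}:{line_no}: invalid JSON ({exc.msg})") from exc
    return rows


def check_fields(rows, required):
    """Raise KeyError if any row lacks one of the required keys."""
    for row in rows:
        missing = [key for key in required if key not in row]
        if missing:
            raise KeyError(f"row {row.get('id', '?')} is missing {missing}")
```

For example, check_fields(train_rows, ["id", "ticket", "intent", "ideal_reply"]) covers the labelled file, and check_fields(val_rows, ["id", "ticket"]) covers the input-only file.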
Output format
Write a single JSON object to /tmp/results.json:
{
"training": {
"steps": 600,
"train_loss": 0.42,
"val_loss": 0.58,
"wall_clock_s": 420
},
"predictions": [
{
"id": "ex_001",
"prediction": "Guests don't consume a seat...",
"predicted_intent": "billing.seat_question",
"latency_ms": 88,
"tokens_used": 124
}
]
}
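A sketch of assembling and writing this object, using the key names from the example above. make_prediction and write_results are hypothetical helpers. Writing to a temporary file first is a design choice of this sketch, not a harness requirement:

```python
import json
from pathlib import Path


def make_prediction(row_id, reply, intent, latency_ms=None, tokens_used=None):
    """Build one entry of the predictions array."""
    entry = {"id": row_id, "prediction": reply, "predicted_intent": intent}
    # The two measurement fields are passed through only when supplied.
    if latency_ms is not None:
        entry["latency_ms"] = latency_ms
    if tokens_used is not None:
        entry["tokens_used"] = tokens_used
    return entry


def write_results(output_path, training, predictions):
    """Write the results object as a single JSON document."""
    required = {"steps", "train_loss", "val_loss", "wall_clock_s"}
    missing = required - training.keys()
    if missing:
        raise ValueError(f"training block is missing {sorted(missing)}")
    payload = {"training": dict(training), "predictions": list(predictions)}
    path = Path(output_path)
    # Write to a sibling file, then rename, so a crash never leaves a partial file.
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(json.dumps(payload, ensure_ascii=False, indent=2), encoding="utf-8")
    tmp.replace(path)
    return payload
```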
Scoring dimensions
- Intent accuracy (40%): exact-match on predicted_intent against the canonical taxonomy (synonyms accepted).
- Reply quality (30%): LLM-judged similarity of prediction to the hidden ideal reply, with per-row must-mention / must-not-mention checks.
- Regression probes (20%): a held-out probe set covering common intents in fresh paraphrases. Flags fine-tunes that destroyed the base model's general comprehension. The probes are mixed into the same val.jsonl feed as ordinary val rows, so your run.py doesn't need to do anything special: predict an intent and reply for every input row.
- Training efficiency (10%): wall_clock_s relative to a 600s ideal and a 1800s ceiling.
- Reproducibility: your run must be deterministic given the same seed.
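For the reproducibility requirement, a common starting point is to seed every random number generator before training. The sketch below is an assumption about what a run might need, not a guarantee of determinism: some GPU kernels can still vary between runs. seed_everything is a hypothetical helper, and NumPy and PyTorch are seeded only if they are installed:

```python
import os
import random


def seed_everything(seed=0):
    """Seed the random number generators this process is likely to use."""
    random.seed(seed)
    # Only affects subprocesses started later; the current interpreter's
    # hash seed was fixed at startup.
    os.environ["PYTHONHASHSEED"] = str(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy is not installed
    try:
        import torch
        torch.manual_seed(seed)
    except ImportError:
        pass  # PyTorch is not installed
```

Calling seed_everything once at the top of run.py, before the model and data loaders are created, keeps shuffling and weight initialisation repeatable.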
Recommended approach
LoRA / QLoRA on a 7B-class base model is more than enough for this dataset. A handful of well-curated examples will outperform brute-force training on the full 30 rows.
Local development
The train.jsonl and val.jsonl shipped here are a small sample so
you can wire your training + inference loop end-to-end on CPU using a
tiny distilled model. The hidden set the sandbox actually trains on is
larger (30 train rows + 50 val rows) and follows the exact same JSONL
shapes documented above (train.jsonl labelled, val.jsonl input-only)
— your run.py should not assume any specific row count.
Iterate locally on the samples; submit when your loop produces a
well-formed /tmp/results.json. The sandbox will run your run.py
against the full hidden set on a GPU and grade predictions[*].predicted_intent
against the canonical taxonomy. The hidden val.jsonl includes both
ordinary val rows and a held-out regression-probe set — you receive them
as one stream of {id, ticket} rows and emit one prediction per id.
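Since every input row needs exactly one prediction, a quick self-check before writing the results file can catch dropped or repeated ids. A minimal sketch, with coverage_report as a hypothetical helper name:

```python
from collections import Counter


def coverage_report(val_rows, predictions):
    """Compare the predicted ids with the ids in the val file."""
    expected = [row["id"] for row in val_rows]
    counts = Counter(p["id"] for p in predictions)
    return {
        # ids from the val file that have no prediction
        "missing": [i for i in expected if counts[i] == 0],
        # ids that were predicted more than once
        "duplicated": sorted(i for i, n in counts.items() if n > 1),
        # predicted ids that are not in the val file
        "unexpected": sorted(set(counts) - set(expected)),
    }
```

A run is consistent with the brief when all three lists are empty.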
Models & cost — how it works
You don't need an OpenAI API key to submit. When your code runs in our
sandbox, we set OPENAI_API_KEY and OPENAI_BASE_URL for you so the
standard openai SDK calls our metering proxy. The proxy forwards to
OpenAI using the operator's key, records every call's real
prompt_tokens / completion_tokens / latency, and the cost panel on
your report is computed from those host-measured numbers — not anything
you self-report.
What this means for you:
- Just write from openai import OpenAI; client = OpenAI(). The SDK picks up the env vars automatically. No key setup on your end.
- You don't need to populate tokens_used, latency_ms, or model in results.json. Those fields are now ignored. (You may still log them for your own debugging.)
- Each submission has a hard upstream budget of $3.00 USD by default. If you blow through it, further OpenAI calls return HTTP 402 and the rest of your run will fail. Pick efficient models.
- Model choice is part of what we evaluate. Solving the task on a small, cheap model demonstrates more skill than reaching for the largest one. Suggested defaults:
  - gpt-4o-mini: best price/quality default for almost everything.
  - gpt-4.1-mini: budget alternative when reasoning depth matters.
  - Larger models (gpt-4o, gpt-4.1, o1, o3-mini): only reach for these when you can justify it in your README.
Want to test locally before submitting? Use your own OpenAI key and the
real https://api.openai.com base URL — your local runs are billed to
you, not us. The proxy only kicks in when your code runs inside our
evaluation sandbox.
Repository contract
Your zip must match the following layout.
repo.zip
├── run.py            # required entrypoint (trains + evaluates)
├── README.md         # required
├── requirements.txt  # optional
└── ...               # adapters, training scripts, etc.
Entrypoint
run.py must accept the following arguments.
python run.py \
--train /data/train.jsonl \
--val /data/val.jsonl \
--weights /tmp/weights \
--output /tmp/results.json
Output schema
Write the result to /tmp/results.json.
{
"training": {"steps": 600, "train_loss": 0.42, "val_loss": 0.58, "wall_clock_s": 420},
"predictions": [
{
"id": "ex_001",
"prediction": "...",
"predicted_intent": "billing.seat_question",
"latency_ms": 88,
"tokens_used": 124
}
]
}
Challenge package
30 train + 10 val examples · LoRA starter · eval harness
Download llm-fine-tuning_candidate_package.zip
Includes
- 30-row train.jsonl + 10-row val.jsonl (domain-specific support tickets)
- LoRA fine-tuning starter (PEFT / Transformers)
- Eval harness skeleton & Dockerfile
- requirements.txt
Hidden test data and ground-truth answers are not included. Your code runs against them inside the sandbox.
Submit
Validation
- .zip archive only
- run.py at root or top-level dir
- README.md present
- No hidden evaluation files
- Size ≤ 20MB
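Most of these checks can be reproduced locally before uploading. The sketch below is a guess at the rules, not the site's validator. It assumes "20MB" means 20 × 1024 × 1024 bytes, applies the "root or top-level dir" rule to README.md as well as run.py, and skips the hidden-evaluation-files check because the brief does not define it. check_submission is a hypothetical helper:

```python
import zipfile
from pathlib import Path, PurePosixPath

MAX_BYTES = 20 * 1024 * 1024  # assumption: "20MB" is read as 20 MiB


def check_submission(zip_path):
    """Return a list of problems; an empty list means the checks passed."""
    path = Path(zip_path)
    if path.suffix.lower() != ".zip" or not zipfile.is_zipfile(path):
        return ["not a .zip archive"]
    problems = []
    if path.stat().st_size > MAX_BYTES:
        problems.append("archive is larger than 20MB")
    with zipfile.ZipFile(path) as zf:
        names = [n for n in zf.namelist() if not n.endswith("/")]
    for required in ("run.py", "README.md"):
        # Accept the file at the archive root or one directory down.
        found = any(
            PurePosixPath(n).name == required and len(PurePosixPath(n).parts) <= 2
            for n in names
        )
        if not found:
            problems.append(f"{required} not found at root or top-level dir")
    return problems
```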
Tip: register a profile first — your submission auto-attaches to it when emails match.