Benchmark catalog
Code-first evaluations for AI engineering work. 5 live, 5 coming.
RAG Engineering
Build a retrieval-augmented generation system over an enterprise knowledge base.
Agent Orchestration
Design multi-agent workflows that compose tools to complete enterprise tasks.
LLM Fine-Tuning
Adapt a base model to a domain-specific task with limited compute.
Evaluation Design
Build a robust eval harness for an open-ended generation task.
Prompt Optimization
Systematically improve a baseline prompt against a held-out test set.
LLM Reasoning
Build and evaluate a reasoning loop on multi-step logic, math, and planning tasks.
AI Safety & Red-Team
Stress-test an LLM application for jailbreaks, prompt injection, data leaks, and unsafe outputs.
LLM Systems & Inference
Optimize an inference stack for latency, throughput, and cost under a fixed quality bar.
AI Product Judgement
Scope an LLM feature, define the eval, and decide what to ship under realistic constraints.
Conversational UX
Design and evaluate a multi-turn assistant for tone, recovery, and task completion.