← Benchmarks

Overview

Candidates take a reference inference stack and improve tokens/sec, p95 latency, and $/1k requests without dropping below a quality floor. Graded on the Pareto frontier they reach against a held-out workload.

QuestionsTBD
DomainsTBD
DurationTBD
Slugllm-systems

Skills assessed

Batching & schedulingKV-cache optimisationQuantisationSpeculative decodingGPU profilingCost / quality trade-offs

Status

This benchmark is being designed. Engineers and hiring partners are giving feedback on the rubric, dataset construction, and runtime. We’ll publish a brief and open submissions once the eval is stable enough to ship signal.

In the meantime, register a profile so we can notify you when it goes live.

Create profile

Get notified

Create a profile and we’ll notify you when this benchmark opens.

Create profile