coming soon
AI Product Judgement
Scope an LLM feature, define the eval, and decide what to ship under realistic constraints.
Overview
Candidates are handed a fuzzy product brief, a noisy dataset, and a model card. They scope the feature, define the offline and online evals, and write a ship/no-ship memo. Submissions are graded by AI product leaders against a hidden rubric.
Questions: TBD
Domains: TBD
Duration: TBD
Slug: ai-product
Skills assessed
Problem framing, Eval design, Quality bar setting, Failure-mode triage, Cost / latency trade-offs, Launch decisions
Status
This benchmark is being designed. Engineers and hiring partners are giving feedback on the rubric, dataset construction, and runtime. We’ll publish a brief and open submissions once the eval is stable enough to ship signal.
In the meantime, register a profile so we can notify you when it goes live.