coming soon
LLM Reasoning
Build and evaluate a reasoning loop on multi-step logic, math, and planning tasks.
Overview
Candidates ship a reasoning system that solves multi-step problems across math, logic puzzles, and constrained planning. Submissions are graded on correctness, sample efficiency, and how robustly the system's verifier catches the system's own mistakes.
Questions: TBD
Domains: TBD
Duration: TBD
Slug: llm-reasoning
Skills assessed
Chain-of-thought design, Self-critique, Verifier construction, Planner / executor split, Search & backtracking, Hallucination control
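To give a rough sense of how verifier construction and search with backtracking can fit together, here is a minimal Python sketch. It is not the benchmark's reference design, which has not been published. The function names (`solve`, `propose`, `verify_partial`, `is_goal`) and the toy task are assumptions made up for illustration, and a real submission would presumably put a model call where `propose` enumerates candidates.

```python
from typing import Callable, Iterable, List, Optional

State = List[int]


def solve(
    state: State,
    propose: Callable[[State], Iterable[int]],
    verify_partial: Callable[[State], bool],
    is_goal: Callable[[State], bool],
    max_depth: int = 10,
) -> Optional[State]:
    """Depth-first search: extend the state one step at a time and
    backtrack whenever the verifier rejects a candidate."""
    if is_goal(state):
        return state
    if len(state) >= max_depth:
        return None
    for step in propose(state):
        candidate = state + [step]
        if not verify_partial(candidate):
            continue  # verifier caught a bad step; try the next proposal
        result = solve(candidate, propose, verify_partial, is_goal, max_depth)
        if result is not None:
            return result
    return None  # every proposal failed, so the caller backtracks


# Hypothetical toy task: pick increasing numbers that sum to TARGET.
NUMBERS = [2, 3, 5, 7, 11]
TARGET = 15


def propose(state: State) -> List[int]:
    # Planner stand-in: only offer numbers larger than the last one chosen.
    last = state[-1] if state else 0
    return [n for n in NUMBERS if n > last]


def verify_partial(state: State) -> bool:
    # Verifier stand-in: reject any partial plan that already overshoots.
    return sum(state) <= TARGET


def is_goal(state: State) -> bool:
    return sum(state) == TARGET


solution = solve([], propose, verify_partial, is_goal)
print(solution)  # prints [3, 5, 7]
```

Keeping the proposer and the verifier as separate callables mirrors the planner / executor split listed above: each side can be swapped or tested independently, and the number of `propose` calls gives one simple proxy for sample efficiency.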
Status
This benchmark is being designed. Engineers and hiring partners are giving feedback on the rubric, dataset construction, and runtime. We’ll publish a brief and open submissions once the eval is stable enough to produce a reliable signal.
In the meantime, register a profile so we can notify you when it goes live.