Overview

Candidates design and ship a multi-turn assistant for a target use case. Submissions are graded on task completion, conversational repair, tone consistency, and user-rated trust, as judged by a panel of testers.

Questions: TBD
Domains: TBD
Duration: TBD
Slug: conversational-ux

Skills assessed

Persona & tone design
Turn-taking
Error recovery
Disambiguation
Persuasion ethics
Task completion metrics

Status

This benchmark is still being designed. Engineers and hiring partners are giving feedback on the rubric, dataset construction, and runtime. We’ll publish a brief and open submissions once the eval is stable enough to produce a reliable signal.

In the meantime, register a profile so we can notify you when it goes live.
