Overview

Candidates build a retrieval pipeline over 40 documents across Finance, Security, Product, Pricing, and Support. The runtime grades 120 hidden questions across 7 dimensions.

Questions: 120
Domains: 5
Duration: 4–8 hours
Slug: rag-engineering

Skills assessed

  • Retrieval
  • Re-ranking
  • Citations
  • Hallucination control
  • Multi-hop reasoning
  • Permission awareness

Brief

AI Engineer RAG Challenge

Objective

Build a retrieval-augmented generation system over the provided enterprise knowledge base.

Your system should:
- Answer questions using only the provided documents.
- Return source filenames supporting each answer.
- Refuse when the documents do not contain enough information.
- Prefer current/final documents over draft, deprecated, or superseded documents.
- Handle plan entitlements, compliance limitations, and cross-document reasoning.
- Use the provided model gateway for all chat and embedding calls.
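
Two of the requirements above, preferring current documents and refusing on weak evidence, can be sketched as a post-retrieval filter. This is a minimal illustration, not the challenge's reference approach. It assumes that document status is visible in the filename (as in finance_01_fy2024_board_update_final.md), that retrieval yields one similarity score per document, and that the penalty (0.5) and threshold (0.35) are arbitrary placeholders to be tuned. The function names are hypothetical.

```python
import re

# Assumption: lifecycle status shows up as a filename token (e.g. "..._draft.md").
# Real documents may state their status in the body instead.
STALE_MARKERS = ("draft", "deprecated", "superseded")


def status_penalty(filename: str) -> float:
    """Return a multiplier below 1.0 for documents that look non-current."""
    tokens = re.split(r"[^a-z0-9]+", filename.lower())
    return 0.5 if any(marker in tokens for marker in STALE_MARKERS) else 1.0


def select_sources(hits, min_score=0.35, top_k=3):
    """hits: list of (filename, similarity). Returns filenames, or [] to signal refusal."""
    adjusted = [(name, score * status_penalty(name)) for name, score in hits]
    adjusted.sort(key=lambda pair: pair[1], reverse=True)
    kept = [name for name, score in adjusted if score >= min_score]
    return kept[:top_k]
```

An empty return value is the cue to answer with a refusal and an empty sources list rather than calling the chat model with weak context.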

Required CLI

Your submission must support:

python run.py \
  --documents /data/documents \
  --questions /data/questions.json \
  --output /tmp/results.json
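
A minimal sketch of an entrypoint that accepts these three arguments, using only the standard library. The answer_question function is a hypothetical placeholder that always refuses; a real submission would retrieve, generate, and cite there.

```python
import argparse
import json
from pathlib import Path


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="RAG challenge entrypoint")
    parser.add_argument("--documents", required=True, type=Path)
    parser.add_argument("--questions", required=True, type=Path)
    parser.add_argument("--output", required=True, type=Path)
    return parser.parse_args(argv)


def answer_question(question: str, documents_dir: Path) -> dict:
    # Hypothetical placeholder: a real pipeline retrieves and generates here.
    return {"answer": "Insufficient information in the provided documents.", "sources": []}


def main(argv=None):
    args = parse_args(argv)
    questions = json.loads(args.questions.read_text(encoding="utf-8"))
    results = []
    for item in questions:
        result = answer_question(item["question"], args.documents)
        results.append({"id": item["id"], **result})
    # Create the output directory if needed, then write one JSON array.
    args.output.parent.mkdir(parents=True, exist_ok=True)
    args.output.write_text(json.dumps(results, indent=2), encoding="utf-8")
    return results
```

Calling main() under an `if __name__ == "__main__":` guard turns this into a usable run.py.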

Input format

[
  {
    "id": "q_001",
    "question": "What was Northstar AI's final FY2024 ARR?"
  }
]

Output format

[
  {
    "id": "q_001",
    "answer": "Northstar AI's final FY2024 ARR was $18.4 million.",
    "sources": ["finance_01_fy2024_board_update_final.md"]
  }
]

Model access

The platform will inject:

MODEL_GATEWAY_URL
MODEL_GATEWAY_KEY
CHAT_MODEL
EMBEDDING_MODEL

You must not use external LLM APIs, search engines, browser tools, or private model accounts.
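
One way to read the injected variables is to load them once at startup and fail fast if any is missing. The sketch below only covers configuration; the brief does not describe the gateway's request format, so no HTTP call is shown. GatewayConfig and from_env are hypothetical names.

```python
import os
from dataclasses import dataclass

REQUIRED_VARS = ("MODEL_GATEWAY_URL", "MODEL_GATEWAY_KEY", "CHAT_MODEL", "EMBEDDING_MODEL")


@dataclass(frozen=True)
class GatewayConfig:
    url: str
    key: str
    chat_model: str
    embedding_model: str

    @classmethod
    def from_env(cls, env=None):
        # Accept a plain dict for testing; default to the process environment.
        env = os.environ if env is None else env
        missing = [name for name in REQUIRED_VARS if not env.get(name)]
        if missing:
            raise RuntimeError("Missing gateway variables: " + ", ".join(missing))
        return cls(
            url=env["MODEL_GATEWAY_URL"].rstrip("/"),
            key=env["MODEL_GATEWAY_KEY"],
            chat_model=env["CHAT_MODEL"],
            embedding_model=env["EMBEDDING_MODEL"],
        )
```

Keeping model names in the config rather than hard-coding them means the same code runs unchanged if the sandbox injects different models.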

README requirements

Your README should explain: 1. Chunking strategy. 2. Retrieval method. 3. Reranking, if used. 4. Refusal strategy for insufficient evidence. 5. Conflict-resolution strategy. 6. Citation strategy. 7. Latency and token-cost optimizations. 8. Known limitations.

Repository contract

Your zip must match the following layout.

repo.zip
├── run.py            # required entrypoint
├── README.md         # required
├── requirements.txt  # optional
└── ...               # your implementation

Entrypoint

run.py must accept the following arguments.

python run.py \
  --documents /data/documents \
  --questions /data/questions.json \
  --output    /tmp/results.json

Output schema

Write the result to /tmp/results.json.

[
  {
    "id": "q_001",
    "answer": "...",
    "sources": ["doc_name.md"],
    "latency_ms": 124,
    "tokens_used": 312
  }
]
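
The latency_ms and tokens_used fields can be filled in by a thin wrapper around whatever function answers a question. This is a sketch under assumptions: answer_fn is a hypothetical callable returning (answer, sources, tokens_used), and the brief does not say how tokens should be counted, so the count is simply passed through.

```python
import time


def timed_answer(question_id, question, answer_fn):
    """Run answer_fn and attach latency_ms and tokens_used to the record."""
    start = time.perf_counter()
    answer, sources, tokens_used = answer_fn(question)
    latency_ms = int(round((time.perf_counter() - start) * 1000))
    return {
        "id": question_id,
        "answer": answer,
        "sources": list(sources),
        "latency_ms": latency_ms,
        "tokens_used": int(tokens_used),
    }
```

time.perf_counter is used rather than time.time because it is monotonic and intended for measuring short intervals.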

Challenge package

40 documents · 5 domains · instructions · starter code

Download rag-engineering_candidate_package.zip

Includes

  • 40 enterprise documents
  • Brief and entrypoint spec
  • Starter code & Dockerfile
  • requirements.txt

Hidden test data and ground-truth answers are not included. Your code runs against them inside the sandbox.

Submit

Max 20MB · must contain run.py and README.md
AI-assisted coding is expected, not discouraged. What matters is how well you direct the AI. If you used one, provide your prompt: it is shown alongside your submission and passed to the LLM judge as extra context, so a well-structured prompt with clear intent, constraints, and evaluation criteria can lift your score. Leave the prompt blank if you didn't use AI.
Enter a company code only if a hiring team gave you one. Submissions with a code go to that company privately and are not added to the public talent pool.

Validation

  • .zip archive only
  • run.py at root or top-level dir
  • README.md present
  • No hidden evaluation files
  • Size ≤ 20MB
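
Most of these checks can be approximated locally before uploading. The sketch below is not the platform's validator. It assumes "20MB" means 20 × 1024 × 1024 bytes and that README.md must sit at the same depth as run.py, neither of which the page states. It cannot check for hidden evaluation files. check_submission is a hypothetical helper name.

```python
import os
import zipfile

MAX_BYTES = 20 * 1024 * 1024  # assumption: the page does not define "MB"


def check_submission(zip_path):
    """Return a list of problems; an empty list means the local checks passed."""
    problems = []
    if not zip_path.lower().endswith(".zip") or not zipfile.is_zipfile(zip_path):
        return ["not a .zip archive"]
    if os.path.getsize(zip_path) > MAX_BYTES:
        problems.append("archive larger than 20MB")
    with zipfile.ZipFile(zip_path) as archive:
        names = [n for n in archive.namelist() if not n.endswith("/")]
    # Accept files at the root ("run.py") or inside one top-level dir ("repo/run.py").
    shallow = {n.split("/", 1)[-1] for n in names if n.count("/") <= 1}
    for required in ("run.py", "README.md"):
        if required not in shallow:
            problems.append(f"missing {required}")
    return problems
```

For example, check_submission("repo.zip") returns an empty list when the archive passes these local checks.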

Tip: register a profile first — your submission auto-attaches to it when emails match.