Overview

Candidates build a retrieval pipeline over 40 documents across Finance, Security, Product, Pricing, and Support. The runtime grades 120 hidden questions across 7 dimensions.

Questions: 120

Domains: 5

Duration: 4–8 hours

Slug: rag-engineering

Skills assessed

Retrieval · Re-ranking · Citations · Hallucination control · Multi-hop reasoning · Permission awareness

Brief

AI Engineer RAG Challenge

Objective

Build a retrieval-augmented generation system over the provided enterprise knowledge base.

Your system should:
- Answer questions using only the provided documents.
- Return source filenames supporting each answer.
- Refuse when the documents do not contain enough information.
- Prefer current/final documents over draft, deprecated, or superseded documents.
- Handle plan entitlements, compliance limitations, and cross-document reasoning.
- Use the provided model gateway for all chat and embedding calls.
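One way to approach the "prefer current/final documents" requirement is to rank retrieved candidates by a status marker before choosing what to cite. The sketch below assumes that status keywords such as `final` or `draft` appear in filenames, as in `finance_01_fy2024_board_update_final.md`; the brief does not confirm that every document follows this pattern, and the keyword table, the `status_rank` and `prefer_current` helpers, and the draft filename in the usage note are all hypothetical.

```python
import re

# Assumed status keywords (not specified by the brief); lower rank is preferred.
STATUS_RANK = {"final": 0, "current": 0, "draft": 2, "deprecated": 3, "superseded": 3}
UNKNOWN_RANK = 1  # files with no recognizable status marker


def status_rank(filename):
    """Return the worst (highest) status rank found among the filename's tokens."""
    tokens = re.split(r"[^a-z0-9]+", filename.lower())
    ranks = [STATUS_RANK[t] for t in tokens if t in STATUS_RANK]
    return max(ranks) if ranks else UNKNOWN_RANK


def prefer_current(candidates):
    """Sort (filename, relevance_score) pairs: best status first, then highest score."""
    return sorted(candidates, key=lambda c: (status_rank(c[0]), -c[1]))
```

For example, `prefer_current([("finance_01_fy2024_board_update_draft.md", 0.9), ("finance_01_fy2024_board_update_final.md", 0.8)])` places the final document first even though the draft scored higher. Sorting on status before relevance is a deliberate simplification; a real pipeline would likely apply it only among candidates that already pass a relevance cutoff, and might read status from document contents rather than filenames.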

Required CLI

Your submission must support:

python run.py \
  --documents /data/documents \
  --questions /data/questions.json \
  --output /tmp/results.json
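
A minimal sketch of how `run.py` might parse these three flags with the standard library's `argparse`. The flag names come from the brief; the `parse_args` helper name and the help strings are illustrative assumptions.

```python
import argparse
from pathlib import Path


def parse_args(argv=None):
    """Parse the three flags named in the brief; argv=None reads sys.argv."""
    parser = argparse.ArgumentParser(description="RAG challenge entrypoint (sketch)")
    parser.add_argument("--documents", type=Path, required=True,
                        help="Directory containing the knowledge-base documents")
    parser.add_argument("--questions", type=Path, required=True,
                        help="JSON file with the questions to answer")
    parser.add_argument("--output", type=Path, required=True,
                        help="Path where the results JSON is written")
    return parser.parse_args(argv)
```

Marking all three flags as required makes a misconfigured invocation fail immediately with a usage message instead of partway through a run.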

Input format

[
  {
    "id": "q_001",
    "question": "What was Northstar AI's final FY2024 ARR?"
  }
]

Output format

[
  {
    "id": "q_001",
    "answer": "Northstar AI's final FY2024 ARR was $18.4 million.",
    "sources": ["finance_01_fy2024_board_update_final.md"]
  }
]
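
The input and output formats above can be handled with the standard `json` module. This is a sketch only: `load_questions` and `write_results` are hypothetical helper names, and the validation is limited to the two input keys shown in the brief.

```python
import json
from pathlib import Path


def load_questions(path):
    """Read the questions file and check each entry has 'id' and 'question'."""
    questions = json.loads(Path(path).read_text(encoding="utf-8"))
    for entry in questions:
        if "id" not in entry or "question" not in entry:
            raise ValueError(f"Malformed question entry: {entry!r}")
    return questions


def write_results(path, results):
    """Write the list of result objects as a JSON array, creating parent dirs."""
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(results, indent=2, ensure_ascii=False), encoding="utf-8")
```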

Model access

The platform will inject:

MODEL_GATEWAY_URL
MODEL_GATEWAY_KEY
CHAT_MODEL
EMBEDDING_MODEL

You must not use external LLM APIs, search engines, browser tools, or private model accounts.
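
The four injected variables can be read once at startup so that a missing value fails fast. The sketch below only loads configuration; it does not call the gateway, because the brief does not describe the gateway's request format. `GatewayConfig` and `load_gateway_config` are hypothetical names.

```python
import os
from dataclasses import dataclass

# Variable names as listed in the brief.
REQUIRED_VARS = ("MODEL_GATEWAY_URL", "MODEL_GATEWAY_KEY", "CHAT_MODEL", "EMBEDDING_MODEL")


@dataclass(frozen=True)
class GatewayConfig:
    url: str
    key: str
    chat_model: str
    embedding_model: str


def load_gateway_config(env=None):
    """Build a config from the environment, raising if any variable is unset or empty."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing gateway variables: " + ", ".join(missing))
    return GatewayConfig(
        url=env["MODEL_GATEWAY_URL"],
        key=env["MODEL_GATEWAY_KEY"],
        chat_model=env["CHAT_MODEL"],
        embedding_model=env["EMBEDDING_MODEL"],
    )
```

Accepting an optional `env` mapping keeps the loader testable without touching the real process environment.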

README requirements

Your README should explain: 1. Chunking strategy. 2. Retrieval method. 3. Reranking, if used. 4. Refusal strategy for insufficient evidence. 5. Conflict-resolution strategy. 6. Citation strategy. 7. Latency and token-cost optimizations. 8. Known limitations.

Repository contract

Your zip must match the following layout.

repo.zip
├── run.py            # required entrypoint
├── README.md         # required
├── requirements.txt  # optional
└── ...               # your implementation

Entrypoint

run.py must accept the following arguments.

python run.py \
  --documents /data/documents \
  --questions /data/questions.json \
  --output    /tmp/results.json

Output schema

Write the result to the path passed via --output (/tmp/results.json in the invocation above).

[
  {
    "id": "q_001",
    "answer": "...",
    "sources": ["doc_name.md"],
    "latency_ms": 124,
    "tokens_used": 312
  }
]
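
This schema adds `latency_ms` and `tokens_used` to the fields shown in the brief's output example. A sketch of building one such record follows, assuming a hypothetical `answer_fn` callable that returns `(answer, sources, tokens_used)` for a question string; how tokens are counted depends on what the model gateway reports, which the brief does not specify.

```python
import time


def timed_record(question, answer_fn):
    """Answer one question and return a result object with all five schema fields.

    answer_fn is a hypothetical callable: str -> (answer, sources, tokens_used).
    """
    start = time.perf_counter()
    answer, sources, tokens_used = answer_fn(question["question"])
    # Wall-clock time for this question, rounded to whole milliseconds.
    latency_ms = round((time.perf_counter() - start) * 1000)
    return {
        "id": question["id"],
        "answer": answer,
        "sources": list(sources),
        "latency_ms": latency_ms,
        "tokens_used": int(tokens_used),
    }
```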

Challenge package

40 documents · 5 domains · instructions · starter code

Download rag-engineering_candidate_package.zip

Includes

40 enterprise documents
Brief and entrypoint spec
Starter code & Dockerfile
requirements.txt

Hidden test data and ground-truth answers are not included. Your code runs against them inside the sandbox.

Submit

repo.zip Max 20MB · must contain run.py and README.md

Prompt(s) you used (optional): We're not against AI-assisted coding; we expect it. We just want to see how well you direct an AI. Your prompt is shown alongside your submission and passed to the LLM judge as extra context, so well-structured prompts with clear intent, constraints, and evaluation criteria can lift your score. Leave blank if you didn't use AI.

Company code (optional): Only fill this in if a hiring team gave you a code. Submissions with a code go to that company privately and are not added to the public talent pool.

Validation

.zip archive only
run.py at root or top-level dir
README.md present
No hidden evaluation files
Size ≤ 20MB
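
Most of these checks can be approximated locally before uploading. The sketch below is not the platform's validator: `check_submission` is a hypothetical helper, it assumes "20MB" means 20,000,000 bytes, it treats "top-level dir" as one directory level, and it does not attempt the "no hidden evaluation files" check because the brief does not say how those files are identified.

```python
import zipfile
from pathlib import Path, PurePosixPath

MAX_BYTES = 20 * 1000 * 1000  # assumption: decimal megabytes, the stricter reading


def check_submission(zip_path):
    """Return a list of problems found; an empty list means the local checks passed."""
    path = Path(zip_path)
    if path.suffix.lower() != ".zip" or not zipfile.is_zipfile(str(path)):
        return ["not a .zip archive"]
    problems = []
    if path.stat().st_size > MAX_BYTES:
        problems.append("archive exceeds 20MB")
    with zipfile.ZipFile(str(path)) as zf:
        # Zip member names always use forward slashes; skip directory entries.
        files = [PurePosixPath(n) for n in zf.namelist() if not n.endswith("/")]
    # Files at the root have 1 path part; files in a top-level directory have 2.
    shallow = {p.name for p in files if len(p.parts) <= 2}
    for required in ("run.py", "README.md"):
        if required not in shallow:
            problems.append(f"{required} missing at root or top-level dir")
    return problems
```

Calling `check_submission("repo.zip")` before submitting gives a quick signal, but the platform's own validation remains authoritative.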

Tip: register a profile first — your submission auto-attaches to it when emails match.
