Benchmark Runner
About 583 wordsAbout 2 min
2026-03-30
The benchmark/ package provides concurrent inference benchmarking for DataMind, calling the Python API directly (no HTTP).
Features
- Configurable concurrency
- Per-request session isolation (no memory cross-contamination)
- Real-time progress bar
- Latency statistics (Avg / P50 / P90 / P95 / Max)
- Throughput (QPS)
- Optional
reference_answerpassthrough for evaluation
Question Set Format
JSONL, one JSON per line:
{"question": "What is RAG?"}
{"question": "When was X born?", "reference_answer": "1982", "question_id": "q_001"}| Field | Type | Required | Description |
|---|---|---|---|
question | string | Yes | Question text |
reference_answer | string | No | Ground truth (for evaluation) |
question_id | string | No | Question ID (for tracking) |
Usage
# Basic (5 concurrent)
python -m benchmark.run --questions data/bench/2wiki.jsonl
# Custom concurrency
python -m benchmark.run --questions data/bench/2wiki.jsonl --concurrency 50
# Custom output
python -m benchmark.run --questions data/bench/2wiki.jsonl --output results.json
# Switch data profile (use different knowledge base)
DATA_PROFILE=2wiki python -m benchmark.run --questions data/bench/2wiki.jsonl
# Combined: different profile + retrieval strategy
DATA_PROFILE=2wiki RETRIEVER_MODE=multi_query SIMILARITY_TOP_K=5 LLM_MODEL=gpt-4o \
python -m benchmark.run --questions data/bench/2wiki.jsonl --concurrency 50Switch Config via Environment
RETRIEVER_MODE=multi_query python -m benchmark.run --questions data/bench/2wiki.jsonl
LLM_MODEL=gpt-4o python -m benchmark.run --questions data/bench/2wiki.jsonl
SIMILARITY_TOP_K=5 python -m benchmark.run --questions data/bench/2wiki.jsonlMultimodal RAG Benchmark
When benchmarking a knowledge base that contains images, set IMAGE_EMBEDDING_MODE as well:
# vlm_describe mode: VLM describes images as text before embedding
DATA_PROFILE=mm_demo IMAGE_EMBEDDING_MODE=vlm_describe \
python -m benchmark.run --questions data/bench/mm_questions.jsonl --concurrency 5
# clip mode: CLIP for unified image-text embedding
DATA_PROFILE=mm_demo IMAGE_EMBEDDING_MODE=clip \
python -m benchmark.run --questions data/bench/mm_questions.jsonl --concurrency 5Things to keep in mind for multimodal benchmarks:
- Higher first-run latency: In
vlm_describemode, the first run calls the VLM API to generate a description for each image, so index building takes longer. Subsequent runs load the cached index and latency returns to normal. - Images must be inside the profile directory:
image_pathis relative todata/profiles/{profile}/; make sure the image files exist. - Questions should target image content: The question set for a multimodal benchmark should include questions that require information from images to answer; otherwise there is no difference from a text-only benchmark.
Output
Terminal
[INFO] Loaded 1000 questions (360 with reference answers), concurrency: 50
Running 1000 queries (concurrency=50) ...
[████████████████████████████████████████] 1000/1000 (100.0%)
==================================================
Benchmark Results
==================================================
Total queries: 1000
Concurrency: 50
Errors: 0
Wall time: 168.090s
--------------------------------------------------
Avg latency: 8.095s
P50 latency: 7.036s
P95 latency: 15.612s
Throughput: 5.95 QPS
==================================================JSON Result File
Each record:
{
"index": 0,
"question": "Where does X's wife work at?",
"answer": "According to the information...",
"error": null,
"latency_s": 5.632,
"reference_answer": "Sunday Times",
"question_id": "9d054e98..."
}reference_answer and question_id appear only when present in the question set.
Using Public RAG Datasets
Recommended: A-RAG Benchmark
| Dataset | Chunks | Questions | Notes |
|---|---|---|---|
2wikimultihop | 658 | 1,000 | Multi-hop reasoning, smallest |
hotpotqa | 1,311 | 1,000 | Multi-hop reasoning |
musique | 1,354 | 1,000 | 2-4 hop reasoning |
medical | 225 | 2,062 | Medical domain |
novel | 1,117 | 2,010 | Long-form literature |
See Data Format for conversion instructions.
Reference Results
Results from 2WikiMultiHop dataset (658 chunks, gpt-4o) at various concurrency levels:
| Concurrency | Questions | Errors | Wall Time | Avg Latency | P50 | P95 | Throughput |
|---|---|---|---|---|---|---|---|
| 3 | 20 | 0 | 56.1s | 6.32s | 4.75s | 32.99s | 0.36 QPS |
| 30 | 20 | 0 | 16.1s | 7.33s | 7.28s | 16.12s | 1.24 QPS |
| 50 | 1000 | 0 | 168.1s | 8.10s | 7.04s | 15.61s | 5.95 QPS |
Accuracy (reference answer contained in response): 36.0%
2WikiMultiHop is a multi-hop reasoning dataset; single-pass vector retrieval with top_k=3 is inherently challenging. Use
RETRIEVER_MODE=multi_queryor increaseSIMILARITY_TOP_Kto improve recall.
