答案评估
234 字小于 1 分钟
2026-03-30
当问题集包含 reference_answer 时,可使用评估脚本对比生成答案与标准答案。
评估指标
| 指标 | 说明 |
|---|---|
| Exact Match (EM) | 标准答案(normalize 后)是否完整出现在生成答案中 |
| Token F1 | 基于 token 重叠的 F1 Score,对表述差异更宽容 |
用法
python -m benchmark.evaluate benchmark_results.json指定输出路径:
python -m benchmark.evaluate benchmark_results.json --output my_eval.json输出
终端
=======================================================
Evaluation Report
=======================================================
Total evaluated: 1000
Exact Match: 360/1000 (36.0%)
Avg Token-F1: 0.4521
=======================================================
✗ 未命中的问题 (640 条):
Q: Where does X's wife work at?
A: I don't have information about X's wife...
R: Sunday Times
F1=0.0000JSON 报告
{
"summary": {
"total": 1000,
"exact_match_count": 360,
"exact_match_rate": 0.36,
"avg_f1": 0.4521
},
"details": [
{
"index": 0,
"question": "...",
"answer": "...",
"reference_answer": "...",
"exact_match": true,
"f1": 0.8523
}
]
}典型工作流
# 1. 运行测评
python -m benchmark.run --questions data/bench/2wiki.jsonl \
--concurrency 50 --output bench_results.json
# 2. 评估
python -m benchmark.evaluate bench_results.json
# 3. 对比不同配置
RETRIEVER_MODE=multi_query python -m benchmark.run \
--questions data/bench/2wiki.jsonl --output bench_multi_query.json
python -m benchmark.evaluate bench_multi_query.json