Educational AI Knowledge Graph Benchmark & Training Data

K12-KGraph

A Curriculum-Aligned Knowledge Graph for Benchmarking & Training Educational LLMs

Built from official People's Education Press (PEP) Chinese K–12 textbooks across Mathematics, Physics, Chemistry, and Biology.

Hao Liang · Peking University
  • 10,685 KG nodes (7 node types)
  • 23,278 KG edges (9 relation types)
  • 23,640 benchmark questions (5 task families)
  • 2,267 training pairs (KG-guided SFT)
  • 4 subjects (Math · Phys · Chem · Bio)

01 Overview

Large language models have strong factual recall but often fail at curriculum cognition: knowing which concepts come first, which skills a problem tests, and where a topic sits in the textbook hierarchy. K12-KGraph is a three-part resource that grounds both evaluation and training in structured textbook knowledge.

  • K12-KGraph. 10,685 nodes and 23,278 edges extracted from official PEP textbooks with OCR, schema-constrained LLM extraction, and human verification.
  • K12-Bench. 23,640 multi-select questions across five curriculum tasks: Ground, Prereq, Neighbor, Evidence, and Locate.
  • K12-Train. 2,267 KG-guided SFT pairs that outperform same-sized subsets of 8 mainstream SFT corpora on GaokaoBench and EduEval.
Figure: schema of K12-KGraph, with 7 node types and 9 relation types grounded in textbook structure.

02 Three Resources

K12-KGraph

Knowledge Graph

A curriculum-aligned KG over four subjects, distilled from the official PEP textbook corpus.

  • Nodes: Concept, Skill, Experiment, Exercise, Section, Chapter, Book
  • Relations: is_a, prerequisites_for, relates_to, verifies, tests_concept, tests_skill, appears_in, is_part_of, leads_to
  • Quality: OCR + schema-constrained extraction + structural validation + expert review
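
To make the structural-validation step concrete, here is a minimal sketch in Python. It assumes a simple edge-list representation with illustrative field names (id, type, head, relation, tail); the checks and names are assumptions for illustration, not the released tooling.

HIERARCHY = {"Book", "Chapter", "Section"}
SPINE = {"Section": "Chapter", "Chapter": "Book"}

def validate(nodes, edges):
    """Collect structural violations in a (nodes, edges) snapshot."""
    by_id = {n["id"]: n for n in nodes}
    problems = []
    # 1. Every edge must connect two known nodes.
    for e in edges:
        if e["head"] not in by_id or e["tail"] not in by_id:
            problems.append(("dangling_edge", e))
    # 2. Every non-hierarchy node needs an appears_in anchor to a section.
    anchored = {e["head"] for e in edges if e["relation"] == "appears_in"}
    for n in nodes:
        if n["type"] not in HIERARCHY and n["id"] not in anchored:
            problems.append(("unanchored_node", n["id"]))
    # 3. is_part_of must follow the Section -> Chapter -> Book spine.
    for e in edges:
        if e["relation"] == "is_part_of":
            head, tail = by_id.get(e["head"]), by_id.get(e["tail"])
            if head and tail and SPINE.get(head["type"]) != tail["type"]:
                problems.append(("broken_spine", e))
    return problems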

K12-Bench

Evaluation

23,640 multi-select questions that isolate five dimensions of curriculum cognition.

  • Ground · which concepts does the question assume?
  • Prereq · what must be learned first?
  • Neighbor · what is structurally adjacent?
  • Evidence · which sections justify the answer?
  • Locate · where in the textbook does this appear?

K12-Train

SFT Data

2,267 compact, KG-guided QA pairs designed to teach curriculum-aware reasoning.

  • Source: Synthesized on top of K12-KGraph subgraphs
  • Efficiency: Beats 8 mainstream SFT corpora at equal budget
  • Validated on: GaokaoBench & EduEval across two backbones

03 K12-Bench at a Glance

Each K12-Bench question is derived deterministically from a K12-KGraph subgraph, which keeps answer keys auditable and makes the benchmark difficulty controllable by subject, grade, and task family.

The benchmark is balanced across four subjects and five task families, giving a compact yet rigorous probe of curriculum cognition in LLMs.

Figure: coverage of K12-Bench across the four subjects and five curriculum tasks, with the difficulty distribution.
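
As an illustration of that deterministic derivation, the sketch below builds a Prereq item from the edges around a concept: the gold options are its prerequisites_for in-neighbors, and distractors are drawn reproducibly from nearby non-prerequisite nodes. This mirrors the recipe described above but is not the released generation code; all names are illustrative.

import random

def make_prereq_item(edges, concept, n_options=4, seed=0):
    """Build one multi-select Prereq item from the subgraph around `concept`."""
    # Gold answers: direct prerequisites of the concept in the KG.
    gold = sorted({e["head"] for e in edges
                   if e["relation"] == "prerequisites_for"
                   and e["tail"] == concept})
    # Distractor pool: other neighbors of the concept (systematic, not
    # arbitrary strings), sampled with a fixed seed for reproducibility.
    neighbors = {x for e in edges for x in (e["head"], e["tail"])
                 if concept in (e["head"], e["tail"])}
    pool = sorted(neighbors - set(gold) - {concept})
    rng = random.Random(seed)
    distractors = rng.sample(pool, min(len(pool), max(0, n_options - len(gold))))
    options = gold + distractors
    rng.shuffle(options)
    return {"task": "Prereq",
            "stem": f"What must be learned before [{concept}]?",
            "options": options,
            "answer": gold}  # answer key stays auditable against the KG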

04 Explore the Data

Real samples from each part of the dataset are shown below.

Seven node types

Concepts, skills, exercises, and experiments are anchored within the textbook hierarchy: Book → Chapter → Section.

  • Concept · 正数 (positive number): "大于 0 的数叫做正数。有时为了明确意义,也可以在前面加 "+" 号。" (A number greater than 0 is called positive; a "+" sign may be prefixed for clarity.) Importance: 掌握 (to master) · Examples: 3, +0.5
  • Skill · 用正负数表示实际问题中的相反量 (using positive and negative numbers for opposite quantities in real problems): choose a positive or negative number according to the direction or meaning of the quantity (increase/decrease, income/expense, rise/fall). Anchored via tests_skill.
  • Exercise · 小明体重增加 2kg,小华减少 1kg,小强无变化。写出这个月的体重增长值。 (Xiao Ming gained 2 kg, Xiao Hua lost 1 kg, Xiao Qiang was unchanged; write each weight change for the month.) Answer: 小明 2kg · 小华 -1kg · 小强 0kg · Difficulty: 1 · Type: 应用题 (word problem)
  • Experiment · 实验 4-1 锌铜原电池中盐桥作用的探究 (Experiment 4-1: exploring the role of the salt bridge in a Zn-Cu galvanic cell): comparing the current with and without a salt bridge verifies its role in maintaining electrical neutrality and keeping the circuit closed. verifies → 盐桥 (salt bridge)
  • Book · 七年级上册 (Grade 7, Volume 1) · 数学 · 人教版 (Mathematics, PEP edition)
  • Chapter · 第一章 有理数 (Chapter 1: Rational Numbers) · contains 15 sections
  • Section · 第一节 正数和负数 (Section 1: Positive and Negative Numbers) · leaf of the hierarchy
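
In serialized form, a node might look like the records below. This is a sketch: the fields follow the attributes shown on the cards above, but the actual release schema may name or structure them differently.

# Illustrative node records; field names are assumptions.
concept = {
    "id": "math:正数",
    "type": "Concept",
    "name": "正数",                       # positive number
    "definition": "大于 0 的数叫做正数。",
    "importance": "掌握",                 # mastery level
    "examples": ["3", "+0.5"],
}
section = {
    "id": "math:grade7a:ch1:sec1",
    "type": "Section",
    "name": "第一节 正数和负数",           # leaf of the Book -> Chapter -> Section spine
}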

Nine relation types

Every edge is traceable to textbook evidence; each relation below is shown with the source passage that justifies it.

  • is_a (taxonomy) · 整数 → 有理数: "整数和分数统称为有理数。" (Integers and fractions are collectively called rational numbers.)
  • prerequisites_for (learning order) · 正数 → 相反意义的量: "如果一个问题中出现相反意义的量,可以用正数和负数分别表示它们。" (When a problem involves quantities of opposite meaning, positive and negative numbers can represent them respectively.)
  • relates_to (association) · 正数 ↔ 负数: "正数与负数是相对的概念;'负'与'正'相对。" (Positive and negative numbers are relative concepts; "negative" is the opposite of "positive".)
  • tests_concept (exercise → concepts) · Exercise #1 → 正数, 负数, 0: one exercise probes multiple concepts at once.
  • tests_skill (exercise → skill) · Exercise #1 → 用正负数表示相反量: maps exercises to the skill they practise.
  • verifies (experiment → law) · 实验 4-1 → 盐桥: links lab experiments to the concept they verify.
  • appears_in (grounding) · 正数 → 第一节 正数和负数: anchors every node back to its textbook section.
  • is_part_of (hierarchy) · 第一节 → 第一章 → 七年级上册: the Section → Chapter → Book spine.
  • leads_to (progression) · 有理数 → 实数: cross-chapter conceptual progression (rational numbers → real numbers).
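
A corresponding edge record would carry the relation plus its evidence passage, which is what keeps every edge traceable. Again a sketch with assumed field names rather than the release schema:

edge = {
    "head": "math:整数",                    # integer
    "relation": "is_a",
    "tail": "math:有理数",                  # rational number
    "evidence": "整数和分数统称为有理数。",   # the justifying textbook passage
    "source": {"book": "七年级上册", "chapter": 1, "section": 1},
}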

Five task families on K12-Bench

Each question is deterministically derived from a KG subgraph, so answer keys are auditable and distractors are systematic.

01 · Ground: which concepts/skills does this problem actually test?
Physics · 【写出 Po-210 的 β 衰变方程。】这道题考察了什么核心方法? (Write the β-decay equation for Po-210; which core method does this problem test?)
  A 光电效应方程分析 (photoelectric-effect equation analysis)
  B 统计分析方法 (statistical analysis)
  C 能量守恒分析 (energy-conservation analysis)
  D 核反应方程书写方法 (writing nuclear-reaction equations)
02 · Prereq: what must be learned before this concept?
Physics · 要掌握物理中的【熵】,应先具备哪些基础? (What foundations are needed before mastering entropy in physics?)
  A 微观态与宏观态 (microstates and macrostates)
  B 熵增加原理 (the principle of entropy increase)
  C 热力学第二定律 (the second law of thermodynamics)
  D 功与热的等效性 (the equivalence of work and heat)
03 · Neighbor: which concepts are structurally adjacent in the KG?
Biology · 以下哪些概念与生物学中的【多倍体】直接相关(包括分类关系或紧密关联)? (Which concepts are directly related to polyploidy in biology, by taxonomy or close association?)
  A 三倍体不育 (triploid sterility)
  B 四倍体 (tetraploid)
  C 染色体组 (chromosome set)
  D 人工诱导多倍体 (artificially induced polyploidy)
04 · Evidence: which textbook experiment supports this claim?
Chemistry · 以下哪些实验可以验证【盐桥】? (Which experiments can verify the role of the salt bridge?)
  A 牺牲阳极法电解实验 (sacrificial-anode electrolysis experiment)
  B 实验 4-1 锌铜原电池中盐桥作用的探究 (Experiment 4-1: the salt bridge in a Zn-Cu galvanic cell)
  C 铁钉的吸氧腐蚀实验 (oxygen-absorption corrosion of an iron nail)
  D CuCl₂ 溶液的电解实验 (electrolysis of CuCl₂ solution)
05 · Locate: where in the textbook does this appear first?
Physics · 学生第一次学习物理中的【熵】是在以下哪个章节? (In which section do students first encounter entropy in physics?)
  A 第二节 热力学第一定律 (Section 2: The First Law of Thermodynamics)
  B 第四节 热力学第二定律 (Section 4: The Second Law of Thermodynamics)
  C 第三节 能量守恒定律 (Section 3: The Law of Conservation of Energy)
  D 第一节 功、热和内能的改变 (Section 1: Work, Heat, and Changes in Internal Energy)
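
Put together, a single K12-Bench record could look like the sketch below; the field names are assumptions, not the released schema. The answer key is a set of option labels, which is what makes the items multi-select rather than single-choice; the gold labels themselves are elided here.

item = {
    "task": "Locate",
    "subject": "Physics",
    "stem": "学生第一次学习物理中的【熵】是在以下哪个章节?",
    "options": {
        "A": "第二节 热力学第一定律",
        "B": "第四节 热力学第二定律",
        "C": "第三节 能量守恒定律",
        "D": "第一节 功、热和内能的改变",
    },
    "answer": [...],   # gold label set, derived from the appears_in edge
    "provenance": {"node": "物理:熵", "relation": "appears_in"},
}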

KG-guided SFT pairs

Each training example anchors a node (definition-style) or an edge (relational reasoning) from K12-KGraph.

Node-centric anchor: 有理数 (rational number)
Question: 什么是有理数?它和整数、分数之间是什么关系? (What is a rational number, and how does it relate to integers and fractions?)
Answer: 整数和分数统称为有理数。也就是说,整数和分数都是有理数的子集,它们合在一起构成有理数这一更大的类别。 (Integers and fractions are collectively called rational numbers; each is a subset, and together they make up the larger class of rational numbers.)
Edge-centric anchor: 微观态与宏观态 prerequisites_for 熵
Question: 在学习"熵"之前,为什么应该先掌握"微观态与宏观态"? (Before studying entropy, why master microstates and macrostates first?)
Answer: 熵从微观角度描述了一个宏观态对应的微观态数目的多少。没有"微观态与宏观态"这一前置概念,就无法从统计力学的角度理解熵所度量的"无序程度"。 (Entropy describes, from the microscopic view, how many microstates a macrostate corresponds to; without this prerequisite concept, the "disorder" that entropy measures cannot be understood statistically.)
Node-centric anchor: 秒 (second)
Question: 秒是什么时间单位? (What kind of time unit is the second?)
Answer: 秒是一种计量很短时间的基本单位。它是我们用来测量瞬间或短暂事件的时间单位。 (The second is a basic unit for measuring very short durations, used for momentary or brief events.)
Edge-centric anchor: 毫米 is_a 厘米
Question: 为什么说毫米是厘米的更小单位? (Why is the millimeter said to be a smaller unit than the centimeter?)
Answer: 因为厘米是长度单位,毫米也是长度单位,且 1 厘米等于 10 毫米,所以毫米是厘米的更小单位。 (Both the centimeter and the millimeter are length units, and 1 centimeter equals 10 millimeters, so the millimeter is the smaller unit.)
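
Schematically, the two anchor styles correspond to two templates over the graph. The sketch below uses hypothetical helper names and template wording of my own; in the actual pipeline the answers are synthesized explanations, so the evidence passage here stands in only as the grounding seed.

def node_pair(node):
    """Definition-style pair anchored on a single node."""
    return {"question": f"什么是{node['name']}?",   # "What is <concept>?"
            "answer": node["definition"],
            "anchor": {"node": node["id"]}}

def edge_pair(edge, by_id):
    """Relational pair anchored on a single edge, e.g. prerequisites_for."""
    head, tail = by_id[edge["head"]], by_id[edge["tail"]]
    question = (f'在学习"{tail["name"]}"之前,'
                f'为什么应该先掌握"{head["name"]}"?')  # "Why learn X before Y?"
    return {"question": question,
            "answer": edge["evidence"],               # grounding seed for the answer
            "anchor": {"edge": (edge["head"], edge["relation"], edge["tail"])}}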

05 Key Findings

  • Curriculum cognition is an open gap. Even the strongest open-source model reaches only about 46% exact match on K12-Bench, despite near-saturated factual recall on standard K-12 knowledge tests.
  • Sample-efficient training. About 2,300 K12-Train examples surpass equally sized subsets of eight mainstream SFT corpora on both GaokaoBench and EduEval.
  • Consistent across backbones. Gains transfer across two model families (Qwen3-4B and Llama3.1-8B backbones) and hold on both instruction-tuned baselines.
  • Takeaway. Grounding both evaluation and supervision in a curriculum-aligned KG is a promising recipe for curriculum-aware educational AI, well beyond the Chinese K–12 setting.

06 Experimental Results

K12-Train is compared against eight mainstream instruction-tuning corpora under a matched 2,300-sample budget on two base models.

Zero-shot, answer-only evaluation: exact match (EM) over 23,640 multi-select items across the 5 task families. Even the strongest model reaches only 57.1%.
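
For clarity, exact match on a multi-select item is taken here to mean that the predicted option set equals the gold set, with no partial credit; a minimal scorer under that assumption:

def exact_match(pred, gold):
    """1 if the predicted option labels equal the gold set exactly, else 0."""
    return int(set(pred) == set(gold))

def overall_em(predictions, items):
    """Mean exact match over the benchmark."""
    return sum(exact_match(p, it["answer"])
               for p, it in zip(predictions, items)) / len(items)

# e.g. exact_match(["A", "C"], ["C", "A"]) -> 1; exact_match(["A"], ["A", "C"]) -> 0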

Open-source models
  • Meta-LLaMA-3-8B-Instruct · 7.2%
  • GLM-4.7-Flash · 31.7%
  • Ministral-3-14B-Instruct · 37.5%
  • Qwen3-32B · 42.6%
  • Gemma-4-31B-IT · 46.4%
Proprietary APIs
  • GPT-4o · 31.1%
  • GPT-5-mini · 31.7%
  • GPT-5.2 · 42.8%
  • Gemini-2.5-Flash · 48.3%
  • Gemini-3-Flash · 57.1%
Scores are overall exact match on K12-Bench; the best in each family are Gemma-4-31B-IT (open-source) and Gemini-3-Flash (proprietary).

GaokaoBench. Total score across 10 subjects after SFT on 2,300 samples from each corpus; K12-Train wins on both backbones.

Qwen3-4B-Base · total / 1,200
  • Base (no SFT) · 445.4
  • Qwen3-4B-Instruct · 895.4
  + 2,300 SFT samples from:
  • LMSYS · 782.4
  • UltraChat · 902.6
  • WizardLM · 922.8
  • Tulu-3-SFT · 951.7
  • Infinity-Instruct · 953.7
  • SmolTalk · 963.6
  • OpenHermes-2.5 · 966.8
  • DataFlow-10K · 985.9
  • K12-Train (ours) · 1009.96
  +24.1 over the strongest baseline (DataFlow-10K) · +114.6 over Qwen3-4B-Instruct
Llama3.1-8B-Base · total / 1,200
  • Base (no SFT) · 130.8
  • Llama3.1-8B-Instruct · 404.5
  + 2,300 SFT samples from:
  • LMSYS · 402.6
  • SmolTalk · 555.6
  • Infinity-Instruct · 573.3
  • OpenHermes-2.5 · 580.2
  • DataFlow-10K · 585.0
  • Tulu-3-SFT · 587.0
  • UltraChat · 592.2
  • WizardLM · 593.1
  • K12-Train (ours) · 625.49
  +32.4 over the strongest baseline (WizardLM) · +220.9 over Llama3.1-8B-Instruct

EduEval. Average across 18 educational sub-tasks covering 5 capability dimensions (Application / Ethics / Memory / Reasoning / Understanding).

Qwen3-4B-Base · EduEval average / 100
  • Base (no SFT) · 34.5
  • Qwen3-4B-Instruct · 66.30
  + 2,300 SFT samples from:
  • LMSYS · 64.68
  • UltraChat · 65.44
  • DataFlow-10K · 65.57
  • OpenHermes-2.5 · 66.01
  • Infinity-Instruct · 66.04
  • SmolTalk · 66.10
  • Tulu-3-SFT · 66.27
  • WizardLM · 66.70
  • K12-Train (ours) · 66.76
  Best average and the strongest Ethics scores among all 2.3k-budget runs.
Llama3.1-8B-Base · EduEval average / 100
  • Base (no SFT) · 20.98
  • Llama3.1-8B-Instruct · 27.31
  + 2,300 SFT samples from:
  • LMSYS · 29.85
  • UltraChat · 31.87
  • DataFlow-10K · 34.53
  • Infinity-Instruct · 37.33
  • SmolTalk · 37.64
  • Tulu-3-SFT · 37.68
  • WizardLM · 38.58
  • OpenHermes-2.5 · 40.05
  • K12-Train (ours) · 40.90
  +0.85 over the strongest baseline (OpenHermes-2.5) · +13.59 over Llama3.1-8B-Instruct
Takeaway. At an equal 2.3k-sample budget, K12-Train is the only corpus that wins on both GaokaoBench and EduEval and on both backbones, pointing to curriculum-grounded synthesis as an effective recipe for educational SFT.

07 Example

Figure: an end-to-end example, from KG subgraph to multi-select benchmark item to KG-guided training pair.

08 Access & Licensing

09 Paper & Citation

A preprint will be released on arXiv. In the meantime, please cite the technical report below.

BibTeX
@techreport{liang2026k12kgraph,
  title  = {K12-KGraph: A Curriculum-Aligned Knowledge Graph for
            Benchmarking and Training Educational LLMs},
  author = {Liang, Hao},
  year   = {2026},
  institution = {Peking University},
  url    = {https://github.com/haolpku/K12-Dataset}
}