Educational AI Knowledge Graph Benchmark & Training Data

K12-KGraph

A Curriculum-Aligned Knowledge Graph for Benchmarking & Training Educational LLMs

Built from official People's Education Press (PEP) Chinese K–12 textbooks across Mathematics, Physics, Chemistry, and Biology.

Hao Liang · Peking University
  • 10,685 KG nodes (7 node types)
  • 23,278 KG edges (9 relation types)
  • 23,640 benchmark questions (5 task families)
  • 2,267 training pairs (KG-guided SFT)
  • 4 subjects (Math · Phys · Chem · Bio)

01 Overview

Large language models have strong factual recall but often fail at curriculum cognition: knowing which concepts come first, which skills a problem tests, and where a topic sits in the textbook hierarchy. K12-KGraph is a three-part resource that grounds both evaluation and training in structured textbook knowledge.

  • K12-KGraph. 10,685 nodes and 23,278 edges extracted from official PEP textbooks with OCR, schema-constrained LLM extraction, and human verification.
  • K12-Bench. 23,640 multi-select questions across five curriculum tasks: Ground, Prereq, Neighbor, Evidence, and Locate.
  • K12-Train. 2,267 KG-guided SFT pairs that outperform same-sized subsets of 8 mainstream SFT corpora on GaokaoBench and EduEval.
Figure: schema of K12-KGraph, with 7 node types and 9 relation types grounded in textbook structure.

02 Three Resources

K12-KGraph

Knowledge Graph

A curriculum-aligned KG over four subjects, distilled from the official PEP textbook corpus.

  • Nodes: Concept, Skill, Experiment, Exercise, Section, Chapter, Book
  • Relations: is_a, prerequisites_for, relates_to, verifies, tests_concept, tests_skill, appears_in, is_part_of, leads_to
  • Quality: OCR + schema-constrained extraction + structural validation + expert review
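
To make the structural-validation step concrete, here is a minimal sketch in Python. It assumes a simple edge-list representation with illustrative field names (id, type, head, relation, tail); the checks and names are assumptions for illustration, not the released tooling.

HIERARCHY = {"Book", "Chapter", "Section"}
SPINE = {"Section": "Chapter", "Chapter": "Book"}

def validate(nodes, edges):
    """Collect structural violations in a (nodes, edges) snapshot."""
    by_id = {n["id"]: n for n in nodes}
    problems = []
    # 1. Every edge must connect two known nodes.
    for e in edges:
        if e["head"] not in by_id or e["tail"] not in by_id:
            problems.append(("dangling_edge", e))
    # 2. Every non-hierarchy node needs an appears_in anchor to a section.
    anchored = {e["head"] for e in edges if e["relation"] == "appears_in"}
    for n in nodes:
        if n["type"] not in HIERARCHY and n["id"] not in anchored:
            problems.append(("unanchored_node", n["id"]))
    # 3. is_part_of must follow the Section -> Chapter -> Book spine.
    for e in edges:
        if e["relation"] == "is_part_of":
            head, tail = by_id.get(e["head"]), by_id.get(e["tail"])
            if head and tail and SPINE.get(head["type"]) != tail["type"]:
                problems.append(("broken_spine", e))
    return problems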

K12-Bench

Evaluation

23,640 multi-select questions that isolate five dimensions of curriculum cognition.

  • Ground · which concepts does the question assume?
  • Prereq · what must be learned first?
  • Neighbor · what is structurally adjacent?
  • Evidence · which sections justify the answer?
  • Locate · where in the textbook does this appear?

K12-Train

SFT Data

2,267 compact, KG-guided QA pairs designed to teach curriculum-aware reasoning.

  • Source: Synthesized on top of K12-KGraph subgraphs
  • Efficiency: Beats 8 mainstream SFT corpora at equal budget
  • Validated on: GaokaoBench & EduEval across two backbones

03 K12-Bench at a Glance

Each K12-Bench question is derived deterministically from a K12-KGraph subgraph, which keeps answer keys auditable and makes the benchmark difficulty controllable by subject, grade, and task family.

The benchmark is balanced across four subjects and five task families, giving a compact yet rigorous probe of curriculum cognition in LLMs.

Figure: coverage of K12-Bench across the four subjects and five curriculum tasks, with the difficulty distribution.
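
As an illustration of that deterministic derivation, the sketch below builds a Prereq item from the edges around a concept: the gold options are its prerequisites_for in-neighbors, and distractors are drawn reproducibly from nearby non-prerequisite nodes. This mirrors the recipe described above but is not the released generation code; all names are illustrative.

import random

def make_prereq_item(edges, concept, n_options=4, seed=0):
    """Build one multi-select Prereq item from the subgraph around `concept`."""
    # Gold answers: direct prerequisites of the concept in the KG.
    gold = sorted({e["head"] for e in edges
                   if e["relation"] == "prerequisites_for"
                   and e["tail"] == concept})
    # Distractor pool: other neighbors of the concept (systematic, not
    # arbitrary strings), sampled with a fixed seed for reproducibility.
    neighbors = {x for e in edges for x in (e["head"], e["tail"])
                 if concept in (e["head"], e["tail"])}
    pool = sorted(neighbors - set(gold) - {concept})
    rng = random.Random(seed)
    distractors = rng.sample(pool, min(len(pool), max(0, n_options - len(gold))))
    options = gold + distractors
    rng.shuffle(options)
    return {"task": "Prereq",
            "stem": f"What must be learned before [{concept}]?",
            "options": options,
            "answer": gold}  # answer key stays auditable against the KG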

04 Explore the Data

Real samples from each part of the dataset are shown below.

Seven node types

Concepts, skills, exercises, and experiments are anchored within the textbook hierarchy: Book → Chapter → Section.

  • Concept · 正数 (positive number): "大于 0 的数叫做正数。有时为了明确意义,也可以在前面加 "+" 号。" (A number greater than 0 is called positive; a "+" sign may be prefixed for clarity.) Importance: 掌握 (to master) · Examples: 3, +0.5
  • Skill · 用正负数表示实际问题中的相反量 (using positive and negative numbers for opposite quantities in real problems): choose a positive or negative number according to the direction or meaning of the quantity (increase/decrease, income/expense, rise/fall). Anchored via tests_skill.
  • Exercise · 小明体重增加 2kg,小华减少 1kg,小强无变化。写出这个月的体重增长值。 (Xiao Ming gained 2 kg, Xiao Hua lost 1 kg, Xiao Qiang was unchanged; write each weight change for the month.) Answer: 小明 2kg · 小华 -1kg · 小强 0kg · Difficulty: 1 · Type: 应用题 (word problem)
  • Experiment · 实验 4-1 锌铜原电池中盐桥作用的探究 (Experiment 4-1: exploring the role of the salt bridge in a Zn-Cu galvanic cell): comparing the current with and without a salt bridge verifies its role in maintaining electrical neutrality and keeping the circuit closed. verifies → 盐桥 (salt bridge)
  • Book · 七年级上册 (Grade 7, Volume 1) · 数学 · 人教版 (Mathematics, PEP edition)
  • Chapter · 第一章 有理数 (Chapter 1: Rational Numbers) · contains 15 sections
  • Section · 第一节 正数和负数 (Section 1: Positive and Negative Numbers) · leaf of the hierarchy
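
In serialized form, a node might look like the records below. This is a sketch: the fields follow the attributes shown on the cards above, but the actual release schema may name or structure them differently.

# Illustrative node records; field names are assumptions.
concept = {
    "id": "math:正数",
    "type": "Concept",
    "name": "正数",                       # positive number
    "definition": "大于 0 的数叫做正数。",
    "importance": "掌握",                 # mastery level
    "examples": ["3", "+0.5"],
}
section = {
    "id": "math:grade7a:ch1:sec1",
    "type": "Section",
    "name": "第一节 正数和负数",           # leaf of the Book -> Chapter -> Section spine
}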

Nine relation types

Every edge is traceable to textbook evidence; each relation below is shown with the source passage that justifies it.

  • is_a (taxonomy) · 整数 → 有理数: "整数和分数统称为有理数。" (Integers and fractions are collectively called rational numbers.)
  • prerequisites_for (learning order) · 正数 → 相反意义的量: "如果一个问题中出现相反意义的量,可以用正数和负数分别表示它们。" (When a problem involves quantities of opposite meaning, positive and negative numbers can represent them respectively.)
  • relates_to (association) · 正数 ↔ 负数: "正数与负数是相对的概念;'负'与'正'相对。" (Positive and negative numbers are relative concepts; "negative" is the opposite of "positive".)
  • tests_concept (exercise → concepts) · Exercise #1 → 正数, 负数, 0: one exercise probes multiple concepts at once.
  • tests_skill (exercise → skill) · Exercise #1 → 用正负数表示相反量: maps exercises to the skill they practise.
  • verifies (experiment → law) · 实验 4-1 → 盐桥: links lab experiments to the concept they verify.
  • appears_in (grounding) · 正数 → 第一节 正数和负数: anchors every node back to its textbook section.
  • is_part_of (hierarchy) · 第一节 → 第一章 → 七年级上册: the Section → Chapter → Book spine.
  • leads_to (progression) · 有理数 → 实数: cross-chapter conceptual progression (rational numbers → real numbers).
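
A corresponding edge record would carry the relation plus its evidence passage, which is what keeps every edge traceable. Again a sketch with assumed field names rather than the release schema:

edge = {
    "head": "math:整数",                    # integer
    "relation": "is_a",
    "tail": "math:有理数",                  # rational number
    "evidence": "整数和分数统称为有理数。",   # the justifying textbook passage
    "source": {"book": "七年级上册", "chapter": 1, "section": 1},
}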

Five task families on K12-Bench

Each question is deterministically derived from a KG subgraph, so answer keys are auditable and distractors are systematic.

01 · Ground: which concepts/skills does this problem actually test?
Physics · 【写出 Po-210 的 β 衰变方程。】这道题考察了什么核心方法? (Write the β-decay equation for Po-210; which core method does this problem test?)
  A 光电效应方程分析 (photoelectric-effect equation analysis)
  B 统计分析方法 (statistical analysis)
  C 能量守恒分析 (energy-conservation analysis)
  D 核反应方程书写方法 (writing nuclear-reaction equations)
02 · Prereq: what must be learned before this concept?
Physics · 要掌握物理中的【熵】,应先具备哪些基础? (What foundations are needed before mastering entropy in physics?)
  A 微观态与宏观态 (microstates and macrostates)
  B 熵增加原理 (the principle of entropy increase)
  C 热力学第二定律 (the second law of thermodynamics)
  D 功与热的等效性 (the equivalence of work and heat)
03 · Neighbor: which concepts are structurally adjacent in the KG?
Biology · 以下哪些概念与生物学中的【多倍体】直接相关(包括分类关系或紧密关联)? (Which concepts are directly related to polyploidy in biology, by taxonomy or close association?)
  A 三倍体不育 (triploid sterility)
  B 四倍体 (tetraploid)
  C 染色体组 (chromosome set)
  D 人工诱导多倍体 (artificially induced polyploidy)
04 · Evidence: which textbook experiment supports this claim?
Chemistry · 以下哪些实验可以验证【盐桥】? (Which experiments can verify the role of the salt bridge?)
  A 牺牲阳极法电解实验 (sacrificial-anode electrolysis experiment)
  B 实验 4-1 锌铜原电池中盐桥作用的探究 (Experiment 4-1: the salt bridge in a Zn-Cu galvanic cell)
  C 铁钉的吸氧腐蚀实验 (oxygen-absorption corrosion of an iron nail)
  D CuCl₂ 溶液的电解实验 (electrolysis of CuCl₂ solution)
05 · Locate: where in the textbook does this appear first?
Physics · 学生第一次学习物理中的【熵】是在以下哪个章节? (In which section do students first encounter entropy in physics?)
  A 第二节 热力学第一定律 (Section 2: The First Law of Thermodynamics)
  B 第四节 热力学第二定律 (Section 4: The Second Law of Thermodynamics)
  C 第三节 能量守恒定律 (Section 3: The Law of Conservation of Energy)
  D 第一节 功、热和内能的改变 (Section 1: Work, Heat, and Changes in Internal Energy)
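
Put together, a single K12-Bench record could look like the sketch below; the field names are assumptions, not the released schema. The answer key is a set of option labels, which is what makes the items multi-select rather than single-choice; the gold labels themselves are elided here.

item = {
    "task": "Locate",
    "subject": "Physics",
    "stem": "学生第一次学习物理中的【熵】是在以下哪个章节?",
    "options": {
        "A": "第二节 热力学第一定律",
        "B": "第四节 热力学第二定律",
        "C": "第三节 能量守恒定律",
        "D": "第一节 功、热和内能的改变",
    },
    "answer": [...],   # gold label set, derived from the appears_in edge
    "provenance": {"node": "物理:熵", "relation": "appears_in"},
}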

KG-guided SFT pairs

Each training example anchors a node (definition-style) or an edge (relational reasoning) from K12-KGraph.

Node-centric anchor: 有理数 (rational number)
Question: 什么是有理数?它和整数、分数之间是什么关系? (What is a rational number, and how does it relate to integers and fractions?)
Answer: 整数和分数统称为有理数。也就是说,整数和分数都是有理数的子集,它们合在一起构成有理数这一更大的类别。 (Integers and fractions are collectively called rational numbers; each is a subset, and together they make up the larger class of rational numbers.)
Edge-centric anchor: 微观态与宏观态 prerequisites_for 熵
Question: 在学习"熵"之前,为什么应该先掌握"微观态与宏观态"? (Before studying entropy, why master microstates and macrostates first?)
Answer: 熵从微观角度描述了一个宏观态对应的微观态数目的多少。没有"微观态与宏观态"这一前置概念,就无法从统计力学的角度理解熵所度量的"无序程度"。 (Entropy describes, from the microscopic view, how many microstates a macrostate corresponds to; without this prerequisite concept, the "disorder" that entropy measures cannot be understood statistically.)
Node-centric anchor: 秒 (second)
Question: 秒是什么时间单位? (What kind of time unit is the second?)
Answer: 秒是一种计量很短时间的基本单位。它是我们用来测量瞬间或短暂事件的时间单位。 (The second is a basic unit for measuring very short durations, used for momentary or brief events.)
Edge-centric anchor: 毫米 is_a 厘米
Question: 为什么说毫米是厘米的更小单位? (Why is the millimeter said to be a smaller unit than the centimeter?)
Answer: 因为厘米是长度单位,毫米也是长度单位,且 1 厘米等于 10 毫米,所以毫米是厘米的更小单位。 (Both the centimeter and the millimeter are length units, and 1 centimeter equals 10 millimeters, so the millimeter is the smaller unit.)
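
Schematically, the two anchor styles correspond to two templates over the graph. The sketch below uses hypothetical helper names and template wording of my own; in the actual pipeline the answers are synthesized explanations, so the evidence passage here stands in only as the grounding seed.

def node_pair(node):
    """Definition-style pair anchored on a single node."""
    return {"question": f"什么是{node['name']}?",   # "What is <concept>?"
            "answer": node["definition"],
            "anchor": {"node": node["id"]}}

def edge_pair(edge, by_id):
    """Relational pair anchored on a single edge, e.g. prerequisites_for."""
    head, tail = by_id[edge["head"]], by_id[edge["tail"]]
    question = (f'在学习"{tail["name"]}"之前,'
                f'为什么应该先掌握"{head["name"]}"?')  # "Why learn X before Y?"
    return {"question": question,
            "answer": edge["evidence"],               # grounding seed for the answer
            "anchor": {"edge": (edge["head"], edge["relation"], edge["tail"])}}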

05 Key Findings

  • Curriculum cognition is an open gap. Even the strongest open-source model reaches only about 46% exact match on K12-Bench, despite near-saturated factual recall on standard K-12 knowledge tests.
  • Sample-efficient training. About 2,300 K12-Train examples surpass equally sized subsets of eight mainstream SFT corpora on both GaokaoBench and EduEval.
  • Consistent across backbones. Gains transfer across two model families (Qwen3-4B and Llama3.1-8B backbones) and hold on both instruction-tuned baselines.
  • Takeaway. Grounding both evaluation and supervision in a curriculum-aligned KG is a promising recipe for curriculum-aware educational AI, well beyond the Chinese K–12 setting.

06 Experimental Results

K12-Train is compared against eight mainstream instruction-tuning corpora under a matched 2,300-sample budget on two base models.

Zero-shot, answer-only evaluation: exact match (EM) over 23,640 multi-select items across the 5 task families. Even the strongest model reaches only 57.1%.
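
For clarity, exact match on a multi-select item is taken here to mean that the predicted option set equals the gold set, with no partial credit; a minimal scorer under that assumption:

def exact_match(pred, gold):
    """1 if the predicted option labels equal the gold set exactly, else 0."""
    return int(set(pred) == set(gold))

def overall_em(predictions, items):
    """Mean exact match over the benchmark."""
    return sum(exact_match(p, it["answer"])
               for p, it in zip(predictions, items)) / len(items)

# e.g. exact_match(["A", "C"], ["C", "A"]) -> 1; exact_match(["A"], ["A", "C"]) -> 0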

Open-source models
  • Meta-LLaMA-3-8B-Instruct · 7.2%
  • GLM-4.7-Flash · 31.7%
  • Ministral-3-14B-Instruct · 37.5%
  • Qwen3-32B · 42.6%
  • Gemma-4-31B-IT · 46.4%
Proprietary APIs
  • GPT-4o · 31.1%
  • GPT-5-mini · 31.7%
  • GPT-5.2 · 42.8%
  • Gemini-2.5-Flash · 48.3%
  • Gemini-3-Flash · 57.1%
Scores are overall exact match on K12-Bench; the best in each family are Gemma-4-31B-IT (open-source) and Gemini-3-Flash (proprietary).

GaokaoBench. Total score across 10 subjects after SFT on 2,300 samples from each corpus; K12-Train wins on both backbones.

Qwen3-4B-Base · total / 1,200
  • Base (no SFT) · 445.4
  • Qwen3-4B-Instruct · 895.4
  + 2,300 SFT samples from:
  • LMSYS · 782.4
  • UltraChat · 902.6
  • WizardLM · 922.8
  • Tulu-3-SFT · 951.7
  • Infinity-Instruct · 953.7
  • SmolTalk · 963.6
  • OpenHermes-2.5 · 966.8
  • DataFlow-10K · 985.9
  • K12-Train (ours) · 1009.96
  +24.1 over the strongest baseline (DataFlow-10K) · +114.6 over Qwen3-4B-Instruct
Llama3.1-8B-Base · total / 1,200
  • Base (no SFT) · 130.8
  • Llama3.1-8B-Instruct · 404.5
  + 2,300 SFT samples from:
  • LMSYS · 402.6
  • SmolTalk · 555.6
  • Infinity-Instruct · 573.3
  • OpenHermes-2.5 · 580.2
  • DataFlow-10K · 585.0
  • Tulu-3-SFT · 587.0
  • UltraChat · 592.2
  • WizardLM · 593.1
  • K12-Train (ours) · 625.49
  +32.4 over the strongest baseline (WizardLM) · +220.9 over Llama3.1-8B-Instruct

EduEval. Average across 18 educational sub-tasks covering 5 capability dimensions (Application / Ethics / Memory / Reasoning / Understanding).

Qwen3-4B-Base · EduEval average / 100
  • Base (no SFT) · 34.5
  • Qwen3-4B-Instruct · 66.30
  + 2,300 SFT samples from:
  • LMSYS · 64.68
  • UltraChat · 65.44
  • DataFlow-10K · 65.57
  • OpenHermes-2.5 · 66.01
  • Infinity-Instruct · 66.04
  • SmolTalk · 66.10
  • Tulu-3-SFT · 66.27
  • WizardLM · 66.70
  • K12-Train (ours) · 66.76
  Best average and the strongest Ethics scores among all 2.3k-budget runs.
Llama3.1-8B-Base · EduEval average / 100
  • Base (no SFT) · 20.98
  • Llama3.1-8B-Instruct · 27.31
  + 2,300 SFT samples from:
  • LMSYS · 29.85
  • UltraChat · 31.87
  • DataFlow-10K · 34.53
  • Infinity-Instruct · 37.33
  • SmolTalk · 37.64
  • Tulu-3-SFT · 37.68
  • WizardLM · 38.58
  • OpenHermes-2.5 · 40.05
  • K12-Train (ours) · 40.90
  +0.85 over the strongest baseline (OpenHermes-2.5) · +13.59 over Llama3.1-8B-Instruct
Takeaway. At an equal 2.3k-sample budget, K12-Train is the only corpus that wins on both GaokaoBench and EduEval and on both backbones, pointing to curriculum-grounded synthesis as an effective recipe for educational SFT.

07 Example

Figure: an end-to-end example, from KG subgraph to multi-select benchmark item to KG-guided training pair.

08 Access & Licensing

09 Paper & Citation

A preprint will be released on arXiv. In the meantime, please cite the technical report below.

BibTeX
@techreport{liang2026k12kgraph,
  title  = {K12-KGraph: A Curriculum-Aligned Knowledge Graph for
            Benchmarking and Training Educational LLMs},
  author = {Liang, Hao},
  year   = {2026},
  institution = {Peking University},
  url    = {https://github.com/haolpku/K12-Dataset}
}