Large language models have strong factual recall but often fail on curriculum cognition: which concepts come first, which skills a problem tests, and where a topic sits in the textbook hierarchy. K12-KGraph is a three-part resource that grounds evaluation and training in structured textbook knowledge.
A curriculum-aligned KG over four subjects, distilled from the official PEP textbook corpus.
23,640 multi-select questions that isolate five dimensions of curriculum cognition.
2,267 compact, KG-guided QA pairs designed to teach curriculum-aware reasoning.
Each K12-Bench question is derived deterministically from a K12-KGraph subgraph, which keeps answer keys auditable and makes the benchmark difficulty controllable by subject, grade, and task family.
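As a rough illustration of what deterministic derivation can look like, here is a minimal Python sketch. The triple schema, field names, and `derive_question` function are assumptions for illustration, not the released K12-KGraph pipeline; a fixed seed makes the same subgraph always yield the same item and answer key.

```python
# Hypothetical sketch of deriving one multi-select item from a KG subgraph.
# Schema and templates are illustrative assumptions, not the released pipeline.
import random

def derive_question(subgraph, distractor_pool, seed=0):
    """Build one auditable multi-select item from prerequisite edges."""
    rng = random.Random(seed)  # fixed seed -> deterministic answer keys
    target = subgraph["target"]
    # Correct options: concepts the KG marks as prerequisites of the target.
    correct = sorted(h for h, r, t in subgraph["edges"]
                     if r == "prerequisites_for" and t == target)
    # Systematic distractors: same-pool concepts with no edge into the target.
    linked = {h for h, _, t in subgraph["edges"] if t == target}
    distractors = sorted(c for c in distractor_pool
                         if c not in linked and c != target)
    options = correct + rng.sample(distractors, k=4 - len(correct))
    rng.shuffle(options)
    return {
        "question": f"Which concepts are prerequisites for '{target}'?",
        "options": options,
        "answer": sorted(options.index(c) for c in correct),
    }

subgraph = {
    "target": "Ohm's law",
    "edges": [("current", "prerequisites_for", "Ohm's law"),
              ("voltage", "prerequisites_for", "Ohm's law")],
}
item = derive_question(subgraph, ["entropy", "inertia", "refraction"])
```

Because every option traces back to the subgraph, the key can be re-derived and audited at any time.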
The benchmark is balanced across four subjects and five task families, giving a compact yet rigorous probe of curriculum cognition in LLMs.
Real samples from the dataset.
Concepts, skills, exercises, and experiments are anchored within the textbook hierarchy: Book → Chapter → Section.
Every edge is traceable to textbook evidence: each one is backed by the source passage that justifies it.
Edge types: is_a (taxonomy), prerequisites_for (learning order), relates_to (association), tests_concept (exercise → concept), tests_skill (exercise → skill), verifies (experiment → law), appears_in (grounding), is_part_of (hierarchy), leads_to (progression). Each question is deterministically derived from a KG subgraph, so answer keys are auditable and distractors are systematic.
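The relation types above can be represented as plain typed triples. A minimal sketch, assuming a `(head, relation, tail)` tuple format (the sample triples and helper below are illustrative, not the released file layout):

```python
# Illustrative sketch: counting K12-KGraph-style edges by relation type.
# The triple format and sample data are assumptions for demonstration.
from collections import Counter

RELATIONS = {"is_a", "prerequisites_for", "relates_to", "tests_concept",
             "tests_skill", "verifies", "appears_in", "is_part_of", "leads_to"}

def relation_histogram(triples):
    """Count edges per relation type, skipping any unknown labels."""
    return Counter(r for _, r, _ in triples if r in RELATIONS)

triples = [("exercise_12", "tests_concept", "entropy"),
           ("entropy", "is_a", "state function"),
           ("entropy", "appears_in", "Chemistry/Book2/Ch1")]
hist = relation_histogram(triples)  # each relation appears once here
```

Keeping the relation vocabulary closed makes it easy to validate a graph dump before deriving benchmark items from it.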
Each training example anchors a node (definition-style) or an edge (relational reasoning) from K12-KGraph.
Example anchors: the edges prerequisites_for and is_a; the nodes 熵 (entropy) and 厘米 (centimeter).
K12-Train is compared against eight mainstream instruction-tuning corpora under a matched 2,300-sample budget on two base models. All results in one place.
Zero-shot, answer-only. Exact-match (EM) over 23,640 multi-select items across 5 task families. Even the strongest model reaches just 57.1%.
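For multi-select items, exact match is strict set equality: an item scores 1 only when the predicted option set equals the gold set, with no partial credit. A minimal sketch (the function name and input shapes are assumptions, not the released evaluation script):

```python
# Illustrative exact-match scorer for multi-select items; not the official
# evaluation script, just the metric's definition in code.
def exact_match(pred_sets, gold_sets):
    """Fraction of items whose predicted option set equals the gold set."""
    assert len(pred_sets) == len(gold_sets)
    hits = sum(set(p) == set(g) for p, g in zip(pred_sets, gold_sets))
    return hits / len(gold_sets)

# Picking only one of two required options scores 0 on that item.
score = exact_match([{"A", "C"}, {"B"}], [{"A", "C"}, {"B", "D"}])  # -> 0.5
```

This strictness is why headline scores stay low: a model must recover every correct option and reject every distractor.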
Total score across 10 subjects after SFT on 2,300 samples from each corpus. K12-Train wins on both backbones.
Average across 18 educational sub-tasks covering 5 capability dimensions (Application / Ethics / Memory / Reasoning / Understanding).
Graph, benchmark, and training splits in one repo.
Open dataset
Construction pipeline, evaluation scripts, training recipes.
View code
Dataset: CC BY-NC-SA 4.0
Code: MIT
A preprint will be released on arXiv. In the meantime, please cite the technical report below.
@techreport{liang2026k12kgraph,
title = {K12-KGraph: A Curriculum-Aligned Knowledge Graph for
Benchmarking and Training Educational LLMs},
author = {Liang, Hao},
year = {2026},
institution = {Peking University},
url = {https://github.com/haolpku/K12-Dataset}
}