About

I am Hao Liang (梁昊), a Ph.D. candidate at the Center for Data Science, Peking University, jointly affiliated with Zhongguancun Academy. I am fortunate to be supervised by Prof. Wentao Zhang and Prof. Bin Dong, and to work closely with Prof. Bin Cui and Prof. Weinan E. I am also a research intern at Tencent HY under the Qingyun Program (青云计划), where I focus on pretraining data preparation.

Prior to this, I received my bachelor’s degree from Beijing Institute of Technology, where I was awarded the Xu Teli Scholarship (the highest honor of BIT) and the National Scholarship. I also visited the University of Oxford, working with Prof. Ismail Ilkan Ceylan and Prof. Michael Bronstein.

Research Interests

My research focuses on Data-Centric AI along four directions. For broader context, see our surveys on text-centric and multimodal perspectives.

  1. Data Infrastructure — Scalable systems and pipelines for data preparation and data–model iterative training at scale. A conceptual overview appears in Towards Next-Generation LLM Training: From the Data-Centric Perspective; open-source stacks include DataFlow and DataFlex.

  2. Data Agents — Autonomous agents that curate, transform, and manage data intelligently. Examples include DataFlow-Skills and Text2SQL-Flow.

  3. Math for Data — How data quality, attribution, and scaling shape model behavior, connecting mathematical analysis with empirical study.

  4. Data for Science — Curating scientific corpora for training and evaluation, including mathematics and formal-verification data (e.g., Lean). Representative work includes MathScape, MM-Verify, and Let’s Verify Math Questions Step by Step.

I have published 9 first-author / co-first-author papers at CCF-A venues and received the Sa Shixuan Best Student Paper Award at NDBC.

Open-Source Contributions

  • DataFlow (3,000+ GitHub stars): Lead designer of this open-source data processing framework. DataFlow achieved 1st place in the ICML SeePhy Challenge and 1st place in the Zhiyuan LIC Challenge.
  • DataFlex: Lead designer of this open-source data-model iterative training framework.
  • LLaMA-Factory (65k+ GitHub stars): Contributed to the data module design.
  • CAMEL (16k+ GitHub stars): Integrated DataFlow into CAMEL’s data pipeline.

Honors & Awards

  • President’s Scholarship, Peking University
  • Industrial Bank Scholarship (兴业奖学金), Peking University
  • Xu Teli Scholarship (徐特立奖学金), Beijing Institute of Technology (Highest Honor)
  • National Scholarship, Beijing Institute of Technology
  • Sa Shixuan Best Student Paper Award (萨师煊优秀学生论文奖), NDBC