Data Format
About 1222 wordsAbout 4 min
2026-03-30
This page describes the data formats accepted by DataMind modules, for developers building upstream data pipelines.
Data Profile mechanism
DataMind can host multiple knowledge bases; each one is a profile, selected with the DATA_PROFILE environment variable. Data and indexes are fully isolated per profile.
Directory layout
data/
├── profiles/ ← knowledge bases (per profile)
│ ├── default/
│ │ ├── chunks/*.jsonl ← pre-chunked RAG data
│ │ ├── triplets/*.jsonl ← GraphRAG triplets
│ │ ├── tables/*.sql ← Database SQL files
│ │ ├── images/ ← multimodal images
│ │ └── *.txt / *.md / ... ← raw documents
│ └── {custom_profile}/
├── bench/ ← question sets (shared)
├── skills/ ← skill docs (shared)
└── bench_raw/ ← raw download cache
storage/
├── default/ ← indexes for default
│ ├── chroma.sqlite3
│ ├── demo.db
│ └── graph/
└── {profile}/End-to-end data flow
Upstream data pipeline
│
├── Unstructured docs ──→ profiles/{profile}/ ───→ RAG mode A + GraphRAG mode A
├── Pre-chunked data ──→ profiles/{profile}/chunks/*.jsonl → RAG mode B
├── Pre-built triplets ─→ profiles/{profile}/triplets/*.jsonl → GraphRAG mode B
├── SQL DDL/DML ───────→ profiles/{profile}/tables/*.sql → Database
├── Skills / SOP docs ──→ data/skills/*.md → Skills
└── Structured data ───→ SQLite file → DatabaseThe four modules are independent; you can prepare one or several.
RAG chunks (JSONL)
Location: data/profiles/{profile}/chunks/*.jsonl
Full field list (including multimodal):
| Field | Type | Required | Description |
|---|---|---|---|
text | string | Yes | Chunk text (may be empty for pure image modality) |
metadata | object | No | Arbitrary key-value pairs |
image_path | string | No | Image path relative to the profile directory |
image_description | string | No | VLM-generated image caption |
modality | string | No | text / image / text_image |
metadata is not used in similarity scoring but is passed to the LLM as context.
GraphRAG triplets (JSONL)
Location: data/profiles/{profile}/triplets/*.jsonl
{"subject": "Alice", "relation": "works_at", "object": "ACME Corp"}
{"subject": "Alice", "relation": "works_at", "object": "ACME Corp", "subject_type": "Person", "object_type": "Organization", "confidence": 0.95, "source": "doc1.md"}| Field | Type | Required | Description |
|---|---|---|---|
subject | string | Yes | Subject entity |
relation | string | Yes | Relation type |
object | string | Yes | Object entity |
subject_type | string | No | Subject type (e.g. "Person"), default "entity" |
object_type | string | No | Object type (e.g. "Organization"), default "entity" |
subject_properties | object | No | Extra subject fields (multimodal hook, e.g. {"image": "img/a.png"}) |
object_properties | object | No | Extra object fields (multimodal hook) |
confidence | float | No | Confidence score (default 1.0) |
source | string | No | Source identifier |
Database SQL
Location: data/profiles/{profile}/tables/*.sql
SQL files run in filename sort order—for example 01_schema.sql then 02_data.sql.
You can also ship a SQLite file directly at storage/{profile}/demo.db.
Benchmark questions (JSONL)
Location: data/bench/*.jsonl
{"question": "What is the core idea of RAG?"}
{"question": "When was X born?", "reference_answer": "1982", "question_id": "q_001"}| Field | Type | Required | Description |
|---|---|---|---|
question | string | Yes | Question text |
reference_answer | string | No | Ground truth for evaluation |
question_id | string | No | Unique id |
Skill documents (Markdown)
Location: data/skills/*.md
Standard Markdown files. Each file is indexed and retrievable via skill_search.
Converting public datasets
When using datasets such as A-RAG:
Download
pip install huggingface_hub
python -c "
from huggingface_hub import hf_hub_download
for f in ['chunks.json', 'questions.json']:
hf_hub_download('Ayanami0730/rag_test', f'2wikimultihop/{f}',
repo_type='dataset', local_dir='data/bench_raw')
"Raw format
chunks.json — JSON array of "id:text" strings:
["0:teutberga (died 11 november...", "1:##lus the little pfalzgraf..."]questions.json — JSON array of objects with question, answer, etc.:
[{"id": "xxx", "question": "When did X happen?", "answer": "1982", ...}]Convert
Write a script to convert these into DataMind JSONL: valid {"text": "..."} lines for chunks and {"question": "..."} lines for questions.
One-shot export script template
Full template (aligned with docs/data.md in the project) for upstream pipelines:
"""
Data export script for preprocessing pipelines.
Writes prepared data under DataMind data/ and storage/.
"""
import os
import json
import sqlite3
DATAMIND_ROOT = "/path/to/DataMind"
PROFILE = "default" # target profile name
DATA_DIR = os.path.join(DATAMIND_ROOT, "data", "profiles", PROFILE)
STORAGE_DIR = os.path.join(DATAMIND_ROOT, "storage", PROFILE)
def export_rag_documents(documents: list[dict]):
"""
Mode A: export raw RAG documents (app chunks + embeds automatically).
Args:
documents: [{"title": "...", "content": "...", "category": "..."}]
"""
for doc in documents:
category_dir = os.path.join(DATA_DIR, doc.get("category", ""))
os.makedirs(category_dir, exist_ok=True)
filepath = os.path.join(category_dir, f"{doc['title']}.md")
with open(filepath, "w", encoding="utf-8") as f:
f.write(doc["content"])
print(f"[Export] RAG: wrote {len(documents)} documents to {DATA_DIR}")
def export_rag_chunks(chunks: list[dict], filename: str = "chunks.jsonl"):
"""
Mode B: export pre-chunked RAG JSONL (embed only, skip splitting).
Args:
chunks: [{"text": "...", "metadata": {"source": "...", ...}}]
filename: output file name
"""
chunks_dir = os.path.join(DATA_DIR, "chunks")
os.makedirs(chunks_dir, exist_ok=True)
filepath = os.path.join(chunks_dir, filename)
with open(filepath, "w", encoding="utf-8") as f:
for chunk in chunks:
f.write(json.dumps(chunk, ensure_ascii=False) + "\n")
print(f"[Export] RAG Chunks: wrote {len(chunks)} chunks to {filepath}")
def export_graph_triplets(triplets: list[dict]):
"""
Export GraphRAG triplets.
Args:
triplets: [{"subject": "...", "relation": "...", "object": "..."}]
"""
triplet_dir = os.path.join(DATA_DIR, "triplets")
os.makedirs(triplet_dir, exist_ok=True)
filepath = os.path.join(triplet_dir, "knowledge_graph.jsonl")
with open(filepath, "w", encoding="utf-8") as f:
for t in triplets:
f.write(json.dumps(t, ensure_ascii=False) + "\n")
print(f"[Export] GraphRAG: wrote {len(triplets)} triplets")
def export_skill_documents(skills: list[dict]):
"""
Export Skills markdown documents.
Args:
skills: [{"title": "...", "content": "Markdown..."}]
"""
skills_dir = os.path.join(DATAMIND_ROOT, "data", "skills")
os.makedirs(skills_dir, exist_ok=True)
for skill in skills:
filepath = os.path.join(skills_dir, f"{skill['title']}.md")
with open(filepath, "w", encoding="utf-8") as f:
f.write(skill["content"])
print(f"[Export] Skills: wrote {len(skills)} files to {skills_dir}")
def export_database_tables(tables: dict):
"""
Export Database tables into demo.db.
Args:
tables: {
"table_name": {
"columns": {"col1": "TEXT", "col2": "INTEGER", ...},
"rows": [{"col1": "val1", "col2": 123}, ...]
}
}
"""
os.makedirs(STORAGE_DIR, exist_ok=True)
db_path = os.path.join(STORAGE_DIR, "demo.db")
conn = sqlite3.connect(db_path)
cursor = conn.cursor()
for table_name, table_data in tables.items():
columns = table_data["columns"]
col_defs = ", ".join(f"{name} {dtype}" for name, dtype in columns.items())
cursor.execute(f"CREATE TABLE IF NOT EXISTS {table_name} ({col_defs})")
if table_data["rows"]:
col_names = list(columns.keys())
placeholders = ", ".join(["?"] * len(col_names))
col_str = ", ".join(col_names)
for row in table_data["rows"]:
values = [row.get(c) for c in col_names]
cursor.execute(
f"INSERT OR REPLACE INTO {table_name} ({col_str}) VALUES ({placeholders})",
values,
)
conn.commit()
conn.close()
print(f"[Export] Database: wrote {len(tables)} tables to {db_path}")
# ---- Example ----
if __name__ == "__main__":
export_rag_documents([
{"title": "Product", "content": "# Product\n\n...", "category": "product"},
{"title": "Architecture", "content": "# Architecture\n\n...", "category": "tech"},
])
export_rag_chunks([
{"text": "LlamaIndex is a Python framework...", "metadata": {"source": "doc.md"}},
{"text": "Vector search matches by semantic similarity...", "metadata": {"source": "doc.md"}},
])
export_graph_triplets([
{"subject": "DataMind", "relation": "uses", "object": "LlamaIndex"},
{"subject": "LlamaIndex", "relation": "built_on", "object": "Python"},
])
export_skill_documents([
{"title": "deploy_runbook", "content": "# Deploy\n\n## When\n..."},
{"title": "troubleshooting", "content": "# Troubleshooting\n\n## Steps\n..."},
])
export_database_tables({
"products": {
"columns": {"id": "INTEGER PRIMARY KEY", "name": "TEXT", "price": "REAL"},
"rows": [
{"id": 1, "name": "Laptop", "price": 6999.0},
{"id": 2, "name": "Keyboard", "price": 399.0},
],
}
})Data refresh strategy
| Module | Incremental | Full rebuild |
|---|---|---|
| RAG | Add files under the profile, then click Rebuild index | Remove storage/{profile}/ and restart |
| GraphRAG | Full rebuild only for now | Remove storage/{profile}/graph/ and restart |
| Skills | Add .md under data/skills/, then Rebuild index | Drop the skills collection and restart |
| Database | Editing storage/{profile}/demo.db takes effect immediately | Delete the .db and restart |
API / token usage:
- After changing
DATA_PROFILE, indexes use that profile’sstorage/subtree; you usually do not need to delete indexes by hand. - RAG mode A (raw docs, auto chunking) and GraphRAG mode A (LLM extraction) call LLM / embedding APIs; large corpora can cost many tokens.
- RAG mode B (pre-chunked JSONL) mainly uses embedding APIs, not an LLM for splitting.
- GraphRAG mode B (pre-built triplets) uses no LLM API tokens—direct graph import.
- Skills indexing mainly uses embedding APIs.
- Prefer rebuilding after data stabilizes.
