Storage Architecture¶
Status: Rust Core — active development on main
Last Updated: 2026-05-31
Overview¶
GraphForge uses a pluggable storage provider model. No provider is the semantic owner of the query language. All providers implement a common Rust trait and are selected at runtime based on the use case.
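pub trait StorageProvider: Send + Sync {
    /// Name that identifies this provider.
    fn provider_name(&self) -> &'static str;
    /// Resolve a qualified table to a scannable table provider.
    fn table_provider(&self, table: &QualifiedTable) -> Result<Arc<dyn TableProvider>, GfError>;
    /// Report what this provider supports.
    fn capabilities(&self) -> ProviderCapabilities;
}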
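Runtime selection through this trait can be illustrated with a trait object behind a lookup table. The following is a minimal sketch, not the GraphForge implementation: it uses a reduced stand-in for the trait (without `table_provider`, which needs DataFusion and project types not shown here), and `ProviderRegistry`, `ParquetProvider`, and the `persistent` field are hypothetical names introduced only for illustration.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Simplified stand-in for the real capabilities type (hypothetical field).
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct ProviderCapabilities {
    pub persistent: bool,
}

/// Reduced version of the trait; `table_provider` is omitted because it
/// depends on types that are not shown in this section.
pub trait StorageProvider: Send + Sync {
    fn provider_name(&self) -> &'static str;
    fn capabilities(&self) -> ProviderCapabilities;
}

/// Hypothetical Parquet-backed provider.
pub struct ParquetProvider;

impl StorageProvider for ParquetProvider {
    fn provider_name(&self) -> &'static str {
        "parquet"
    }
    fn capabilities(&self) -> ProviderCapabilities {
        ProviderCapabilities { persistent: true }
    }
}

/// Hypothetical registry: providers are looked up by name at runtime.
#[derive(Default)]
pub struct ProviderRegistry {
    providers: HashMap<&'static str, Arc<dyn StorageProvider>>,
}

impl ProviderRegistry {
    /// Register a provider under the name it reports.
    pub fn register(&mut self, provider: Arc<dyn StorageProvider>) {
        self.providers.insert(provider.provider_name(), provider);
    }

    /// Return the provider registered under `name`, if any.
    pub fn select(&self, name: &str) -> Option<Arc<dyn StorageProvider>> {
        self.providers.get(name).cloned()
    }
}
```

Callers hold an `Arc<dyn StorageProvider>` and never name the concrete type, which is what keeps the query language independent of any single backend.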
Provider Role: Parquet¶
Parquet is the sole storage provider for the Rust core. It:
- Stores topology, property, and provenance tables as columnar Parquet files
- Carries GraphForge metadata at the file level (ontology version, IR version, query ID)
- Persists the compiled ontology runtime tables for rapid startup
Parquet file-level metadata:
graphforge.dataset_kind = "topology_nodes"
graphforge.ontology_version = "core-2026.05"
graphforge.writer_version = "0.5.0"
graphforge.ir_version = "0.1.0"
graphforge.query_id = "01J..."
graphforge.provenance_policy = "conservative_min"
The StorageProvider trait is designed to be extended with additional backends in future milestones. No other provider is in scope for v0.5.
Identity and Surrogate Keys¶
GraphForge uses a dual-key pattern for all first-class objects:
| Key | Type | Purpose |
|---|---|---|
| UUID (*_uuid) | FixedSizeBinary(16), UUIDv7 | Canonical stable identity. Globally unique. Immutable. Survives project merges, offline generation, and cross-analyst exchanges. |
| Surrogate (*_id) | UInt64 | Execution-time optimization. Assigned at ingest/load time. Used for DataFusion join operations. Never exposed in public API results. |
Why UUIDv7¶
UUIDv7 (RFC 9562) is time-ordered at millisecond precision (the timestamp alone does not order IDs minted within the same millisecond), globally unique without coordination, fits in Arrow FixedSizeBinary(16), and supports offline generation on mobile devices or air-gapped systems. See refactor-v0.5.md §5 for the full rationale.
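To make the 16-byte layout concrete, here is a minimal sketch of the UUIDv7 bit layout from RFC 9562 (section 5.7), using only the standard library. The function names are hypothetical, and the random bytes are supplied by the caller because the standard library has no random source; this is not the generator GraphForge uses.

```rust
/// Assemble a UUIDv7 from a Unix timestamp in milliseconds and 10 bytes of
/// caller-supplied randomness.
pub fn uuid_v7_from_parts(unix_ms: u64, random: [u8; 10]) -> [u8; 16] {
    let mut bytes = [0u8; 16];
    // Bytes 0..6: low 48 bits of the millisecond timestamp, big-endian.
    bytes[..6].copy_from_slice(&unix_ms.to_be_bytes()[2..]);
    // Bytes 6..16: random payload, then overwrite the version and variant bits.
    bytes[6..].copy_from_slice(&random);
    bytes[6] = (bytes[6] & 0x0F) | 0x70; // version 7 in the high nibble
    bytes[8] = (bytes[8] & 0x3F) | 0x80; // variant 0b10 in the top two bits
    bytes
}

/// Recover the embedded millisecond timestamp.
pub fn uuid_v7_timestamp_ms(uuid: &[u8; 16]) -> u64 {
    let mut buf = [0u8; 8];
    buf[2..].copy_from_slice(&uuid[..6]);
    u64::from_be_bytes(buf)
}
```

Because the timestamp occupies the leading bytes in big-endian order, a plain byte-wise comparison of two values minted in different milliseconds sorts them by creation time.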
UUID→Surrogate mapping¶
The relational lowering layer maps node_uuid → node_id once at scan time. All DataFusion join operators use integer surrogates (node_id, edge_id, src_id, dst_id) for performance. Results project back to UUID columns before returning to the caller.
Rule: UUIDs appear in every public API result schema. Surrogates are execution-internal and must never appear in API outputs.
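The mapping and the projection rule can be sketched with a small in-memory structure. This is an illustration under assumptions: `SurrogateMap` and its methods are hypothetical names, and dense surrogates starting at 0 are assumed here rather than taken from the lowering layer itself.

```rust
use std::collections::HashMap;

pub type Uuid = [u8; 16];

/// Hypothetical bidirectional mapping between canonical UUIDs and dense
/// execution-time surrogates.
#[derive(Default)]
pub struct SurrogateMap {
    by_uuid: HashMap<Uuid, u64>,
    by_id: Vec<Uuid>, // index = surrogate id
}

impl SurrogateMap {
    /// Return the surrogate for `uuid`, assigning the next dense id the
    /// first time the UUID is seen.
    pub fn intern(&mut self, uuid: Uuid) -> u64 {
        if let Some(&id) = self.by_uuid.get(&uuid) {
            return id;
        }
        let id = self.by_id.len() as u64;
        self.by_uuid.insert(uuid, id);
        self.by_id.push(uuid);
        id
    }

    /// Project an internal result back to UUIDs before it leaves the engine.
    /// Panics if an id was never assigned by `intern`.
    pub fn project(&self, ids: &[u64]) -> Vec<Uuid> {
        ids.iter().map(|&id| self.by_id[id as usize]).collect()
    }
}
```

Joins and traversals would operate on the `u64` values only; `project` is the single place where surrogates are turned back into the UUIDs that callers see.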
Objects requiring UUID identity¶
| Object | UUID column |
|---|---|
| Node (entity) | node_uuid |
| Edge (relationship) | edge_uuid |
| Document | doc_uuid |
| Provenance event | provenance_uuid |
| Analyst/User | analyst_uuid |
| Project | project_uuid |
| Workflow | workflow_uuid |
| Embedding | embedding_uuid |
| Source reference | source_uuid |
| Ranking output row | rank_uuid |
| Clustering output row | cluster_uuid |
| Generated artifact | artifact_uuid |
Storage Layout¶
GraphForge organises a project as a directory tree. The graph is one asset among several. Folders are capability modules — an absent folder means that capability is not yet enabled, not that it is misconfigured.
project/
├── graphforge.yaml # project manifest (includes capabilities: field)
├── ontology.yaml # domain ontology (optional)
│ # ── GRAPH LAYER ──
├── topology/ # hot path — identity + type only
│ ├── nodes.parquet
│ ├── generation.json # topology_generation counter (staleness signal for derived indexes)
│ └── edges/ # one file per relation type
│ ├── WORKS_AT.parquet
│ └── OWNS.parquet
├── properties/ # warm path — entity property columns
│ ├── Person.parquet
│ └── Organization.parquet
├── indexes/adjacency/ # derived traversal accelerator (ADR 0005)
│ # ── KNOWLEDGE LAYER ──
├── provenance/ # provenance events + lineage (Designed)
│ ├── events.parquet
│ └── lineage.parquet
├── knowledge/ # epistemic model — assertions, supersession, evidence, reasoning (ADR 0007, Designed)
│ ├── assertions.parquet
│ ├── supersession.parquet
│ ├── evidence.parquet
│ └── reasoning.parquet
│ # ── WORKBENCH LAYER ──
├── documents/ # raw source files
├── embeddings/ # vector stores per model
├── indexes/ # FTS indexes (Tantivy)
├── workflows/ # saved analysis pipelines
├── artifacts/ # ranked/clustered outputs
└── sync/ # changeset-based collaboration (post-v0.5)
A minimal project needs only topology/ and properties/ (the graph layer). Every other folder is an
optional capability module. Folders are grouped by layer (ADR 0006), and capabilities are additive.
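The "absent means not enabled" rule can be sketched as a directory probe. The folder names come from the layout above; the function names and the idea of probing the filesystem directly are assumptions of this sketch (the manifest's capabilities: field may be the authoritative source in practice).

```rust
use std::path::Path;

/// Optional capability folders taken from the layout above.
const CAPABILITY_DIRS: [&str; 9] = [
    "indexes/adjacency",
    "provenance",
    "knowledge",
    "documents",
    "embeddings",
    "indexes",
    "workflows",
    "artifacts",
    "sync",
];

/// A project is usable once the graph layer (topology/ + properties/) exists.
pub fn has_graph_layer(root: &Path) -> bool {
    root.join("topology").is_dir() && root.join("properties").is_dir()
}

/// List the capability folders that are present. A missing folder means
/// "not enabled", so it is skipped rather than reported as an error.
pub fn enabled_capabilities(root: &Path) -> Vec<&'static str> {
    CAPABILITY_DIRS
        .iter()
        .copied()
        .filter(|dir| root.join(dir).is_dir())
        .collect()
}
```

Note that the probe never returns an error for a missing folder, which mirrors the additive design: enabling a capability is a matter of the folder appearing.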
Derived Indexes¶
The indexes/ folder holds derived, rebuildable acceleration structures. Nothing here is
canonical: every file under indexes/ can be reconstructed from topology/ (and, for FTS,
properties/) alone. An absent index is not an error — it means the accelerator has not been
built yet, and the engine falls back to building in memory on demand. See
ADR 0005.
indexes/adjacency/ — graph-native adjacency index¶
The adjacency index is a derived CSR (compressed sparse row) representation of the topology,
used by both the Cypher traversal path (variable-length Expand) and the analyst verbs
(rank/cluster/paths/analyze/similar). It is optional: absent ⇒ build in memory
on demand (today's behavior); present ⇒ load from disk. It is surrogate-keyed and never
changes results — only speed.
indexes/
└── adjacency/
├── index_manifest.parquet
├── WORKS_AT.out.csr
├── WORKS_AT.in.csr
├── OWNS.out.csr
├── OWNS.in.csr
├── _all.out.csr # union across relation types (for via=None)
└── _all.in.csr
The builder (gf_storage::adjacency::build_adjacency_index) writes one {out, in} pair per
relation type plus the _all union pair, then the manifest last. Relation names unusable
as file stems (path separators, .., the reserved _all) are skipped — those relations are
served by scan-build, but their rows still flow into the union index. The manifest is stamped
with the topology_generation read before the edge scan, so a concurrent topology write
mid-build can only make the result read as stale, never as falsely fresh.
index_manifest.parquet
| Column | Arrow type | Notes |
|---|---|---|
| relation_type | Utf8 | Relation type name, or _all for the union index |
| direction | Utf8 | "out" \| "in" |
| topology_generation | UInt64 | Generation counter the CSR was built from; bumped on every topology write |
| built_at | Timestamp(Microseconds, UTC) | |
| node_count | UInt64 | Number of source nodes covered (CSR row count) |
| edge_count | UInt64 | Number of (edge, neighbor) entries |
CSR file (<REL_TYPE>.<dir>.csr) — single-batch Arrow IPC, one column, one row per
surrogate node_id ∈ 0..node_count:
| Column | Arrow type | Notes |
|---|---|---|
| adjacency | LargeList<Struct { edge_id: UInt64, neighbor_id: UInt64 }> | Row i holds the adjacency entries of surrogate node_id = i, in CSR order |
This is the CSR structure in its idiomatic Arrow encoding — the two logical arrays cannot be
two top-level columns because a RecordBatch requires equal column lengths. The list's offsets
buffer is the CSR offsets array (length node_count + 1, Int64, starting at 0,
monotone), and the flattened struct child is the targets array (length edge_count):
neighbors of node_id = i are targets[offsets[i]..offsets[i+1]], zero-copy on read.
Conventions:
- Empty graph: a zero-row batch, with logical offsets == [0] and empty targets. The offsets array is never empty.
- Node with no neighbors: an empty list (offsets[i] == offsets[i+1]).
- CSR rows cover exactly node_id ∈ 0..node_count; surrogates beyond node_count simply have no entries.
- In-memory consumers (gf_exec::AdjacencyProvider, #760; scan-build via ScanBuildAdjacencyProvider until the persistent loader, #761) see the logical offsets/targets model; the list encoding is a file-format detail (gf_storage::adjacency::CsrIndex on the storage side).
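The logical offsets/targets model and the conventions above can be sketched in plain Rust. This is not the gf_storage or gf_exec code: `Csr`, `Target`, and `build_out_csr` are hypothetical names, the Arrow encoding is left out, and dropping edges whose source lies outside 0..node_count is a choice made by this sketch.

```rust
/// One adjacency entry, mirroring the struct child of the list column.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Target {
    pub edge_id: u64,
    pub neighbor_id: u64,
}

/// Logical CSR: `offsets` has `node_count + 1` entries and is never empty.
#[derive(Debug, PartialEq, Eq)]
pub struct Csr {
    pub offsets: Vec<i64>,
    pub targets: Vec<Target>,
}

/// Build the outgoing CSR from `(edge_id, src_id, dst_id)` triples.
/// Entries are sorted by `(src_id, edge_id)` so the result is deterministic.
pub fn build_out_csr(node_count: usize, edges: &[(u64, u64, u64)]) -> Csr {
    let mut sorted: Vec<(u64, u64, u64)> = edges
        .iter()
        .copied()
        .filter(|&(_, src, _)| (src as usize) < node_count)
        .collect();
    sorted.sort_by_key(|&(edge_id, src, _)| (src, edge_id));

    let mut offsets = vec![0i64; node_count + 1];
    for &(_, src, _) in &sorted {
        offsets[src as usize + 1] += 1; // per-node degree, shifted by one
    }
    for i in 0..node_count {
        offsets[i + 1] += offsets[i]; // prefix sum turns degrees into offsets
    }
    let targets = sorted
        .iter()
        .map(|&(edge_id, _, dst)| Target { edge_id, neighbor_id: dst })
        .collect();
    Csr { offsets, targets }
}

impl Csr {
    /// Neighbors of `node_id` are `targets[offsets[i]..offsets[i + 1]]`.
    pub fn neighbors(&self, node_id: u64) -> &[Target] {
        let i = node_id as usize;
        if i + 1 >= self.offsets.len() {
            return &[]; // surrogates beyond node_count have no entries
        }
        &self.targets[self.offsets[i] as usize..self.offsets[i + 1] as usize]
    }
}
```

The incoming CSR would follow the same shape with the roles of `src_id` and `dst_id` swapped. A lookup is two offset reads and a slice, with no per-query allocation.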
Rebuild and versioning semantics¶
- Source of truth. A CSR is always reconstructable from topology/edges/<REL_TYPE>.parquet alone, deterministically.
- Counter persistence. The project topology_generation counter lives at topology/generation.json as {"topology_generation": N}, written atomically (sibling temp + rename). An absent file is generation 0, which is safe because every binary that writes indexes/adjacency/ also bumps, so no index manifest can predate the counter. The counter is machine-owned mutable state; it lives beside the topology it versions, not in the user-editable graphforge.yaml.
- Bump rule. The counter is bumped immediately before committing any staged batch that rewrites topology/nodes.parquet or any file under topology/edges/ (including _exploratory.parquet), with one bump per committed batch (gf_storage::commit_topology_aware). Property-only writes (properties/, edge_properties/) never bump: properties cannot change adjacency.
- Crash-safety invariant (bump-before-commit). A crash after the bump but before or during the commit leaves the counter advanced over unchanged topology: the index merely looks stale and is rebuilt (one spurious rebuild, correct results). The reverse order would be unsound: new topology under the old counter makes a stale index look fresh. Spurious bumps are safe; missed bumps are not.
- Staleness detection. The provider compares the manifest's topology_generation against the current value. A corrupt counter file or manifest marks the index always-stale, never fresh.
- Fallback. On mismatch (or absent index), the provider scans the typed edge tables and builds the adjacency in memory, yielding identical results, only slower. A stale or missing index can therefore never cause incorrect output.
- Rebuild triggers. Lazy on first traversal when the indexes/adjacency/ capability is present, or explicit via forge.index("adjacency", ...). Incremental rebuild (append-delta + compaction) is deferred to v0.5.1.
- Determinism (R-ADJ-2). Full rebuild streams each typed edge file once; out entries sort by (src_id, edge_id) and in entries by (dst_id, edge_id). The edge_id tie-break makes the CSR bytes reproducible from topology/ alone. _all.{out,in}.csr are the same sorts over the union of all typed files plus _exploratory.parquet. The manifest's built_at is excluded from the determinism guarantee.
- Build ordering. Builders write all CSR files first and index_manifest.parquet last, so a torn build reads as stale (absent/old manifest), never as fresh.
- Loader semantics (gf_exec::PersistentAdjacencyProvider, #761). Freshness = manifest non-empty AND every row's topology_generation equals the current counter.
  - Fresh + row present ⇒ load (adjacency=hit).
  - Stale or torn ⇒ lazy rebuild, then serve.
  - Fresh but no row for the requested relation ⇒ scan-build without rebuild (rebuilding cannot add an unknown relation; this prevents a rebuild-per-query loop).
  - Corrupt counter or manifest ⇒ always-stale scan-build without rebuild (stamping needs a readable counter).
  - Capability dir absent ⇒ scan-build (adjacency=building).
  - Typed-mode "*" bypasses the index entirely (pre-existing empty-scan behavior; reported as building, never a false miss).
  - A build or load failure never fails the query; it only affects speed.
- Direction. out and in CSRs are stored separately; undirected traversal unions them. In exploratory mode, _exploratory.parquet rows are routed by their rel_type_name column.
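The loader rules can be read as a small decision function. The sketch below encodes only the freshness and fallback cases listed above; the type and function names are hypothetical, direction and the typed-mode "*" bypass are left out, and corrupt inputs are modelled as `None`. It is an illustration of the rules, not the PersistentAdjacencyProvider code.

```rust
/// One row of index_manifest.parquet, reduced to the fields the check needs.
pub struct ManifestRow {
    pub relation_type: String,
    pub topology_generation: u64,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Decision {
    /// Fresh index with a row for the relation: load from disk.
    Load,
    /// Stale or torn index: rebuild lazily, then serve.
    RebuildThenServe,
    /// Serve from the typed edge tables without rebuilding.
    ScanBuild,
}

/// `current_generation` and `manifest` are `None` when the file is corrupt.
/// An absent or empty manifest is passed as an empty slice, and an absent
/// counter file as generation 0.
pub fn decide(
    capability_present: bool,
    current_generation: Option<u64>,
    manifest: Option<&[ManifestRow]>,
    relation: &str,
) -> Decision {
    if !capability_present {
        return Decision::ScanBuild;
    }
    let (current, rows) = match (current_generation, manifest) {
        (Some(current), Some(rows)) => (current, rows),
        // Corrupt counter or manifest: always stale, and no rebuild,
        // because stamping a new manifest needs a readable counter.
        _ => return Decision::ScanBuild,
    };
    let fresh = !rows.is_empty() && rows.iter().all(|r| r.topology_generation == current);
    if !fresh {
        return Decision::RebuildThenServe;
    }
    if rows.iter().any(|r| r.relation_type == relation) {
        Decision::Load
    } else {
        // Rebuilding cannot add an unknown relation, so do not rebuild.
        Decision::ScanBuild
    }
}
```

Every branch ends in a path that can answer the query, which is the property the list above insists on: the index changes speed, never results.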
indexes/<LABEL>/tantivy/ — full-text search index¶
Full-text indexes (Tantivy) are also derived and rebuildable from properties/. See the
Find milestone (forge.find / forge.index).
Graph Fact Schema¶
Topology layer (hot path)¶
Graph traversal reads only the topology layer. No property columns are read unless the query explicitly projects them.
topology/nodes.parquet
| Column | Arrow type | Notes |
|---|---|---|
| node_uuid | FixedSizeBinary(16) | UUIDv7; canonical stable identity |
| node_id | UInt64 | Local surrogate; DataFusion join key |
| type_id | UInt32 | Ontology entity type integer |
| created_at | Timestamp(Microseconds, UTC) | |
| updated_at | Timestamp(Microseconds, UTC) | |
topology/edges/TYPENAME.parquet (one file per relation type)
| Column | Arrow type | Notes |
|---|---|---|
| edge_uuid | FixedSizeBinary(16) | UUIDv7 |
| src_uuid | FixedSizeBinary(16) | References node_uuid |
| dst_uuid | FixedSizeBinary(16) | References node_uuid |
| edge_id | UInt64 | Local surrogate |
| src_id | UInt64 | Local surrogate; DataFusion join key |
| dst_id | UInt64 | Local surrogate; DataFusion join key |
| created_at | Timestamp(Microseconds, UTC) | |
| confidence | Float64 | 0.0–1.0 |
| provenance_uuid | FixedSizeBinary(16) | References provenance/events.parquet |
Typed edge tables (one Parquet file per relation type) replace the unified edge_facts table. This enables direct scans on a single relation type without filtering, yielding significant I/O savings at 100M+ edges. See refactor-v0.5.md §7 for performance analysis.
Properties layer (warm path)¶
properties/ENTITY_TYPE.parquet (one file per entity type, columns per ontology)
| Column | Arrow type | Notes |
|---|---|---|
| node_uuid | FixedSizeBinary(16) | Join key back to topology/nodes.parquet |
| (property columns) | (per ontology) | e.g. name Utf8, age Int64, email Utf8 |
Property access is a join: topology/nodes JOIN properties/Person ON node_uuid. DataFusion handles this as a hash join. Because the two layers live in separate files, graph traversal can skip property I/O entirely.
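The shape of that join can be shown with an in-memory hash join over plain tuples. This is only an illustration of the build/probe pattern keyed on node_uuid, not DataFusion's operator; `join_names` and the tuple layouts are hypothetical, and an inner join is assumed.

```rust
use std::collections::HashMap;

type Uuid = [u8; 16];

/// Inner hash join of topology rows with one property column.
/// Build side: property rows keyed by node_uuid. Probe side: topology rows.
pub fn join_names<'a>(
    topology: &[(Uuid, u32)],         // (node_uuid, type_id)
    properties: &'a [(Uuid, String)], // (node_uuid, name)
) -> Vec<(Uuid, u32, &'a str)> {
    // Build phase: index the property rows by their join key.
    let build: HashMap<Uuid, &'a str> = properties
        .iter()
        .map(|(uuid, name)| (*uuid, name.as_str()))
        .collect();
    // Probe phase: keep topology rows that have a matching property row.
    topology
        .iter()
        .filter_map(|(uuid, type_id)| build.get(uuid).map(|name| (*uuid, *type_id, *name)))
        .collect()
}
```

A traversal that never calls something like `join_names` never touches the properties file at all, which is the point of keeping the layers separate.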
Provenance layer (Knowledge Layer)¶
Layer + status (ADR 0006). provenance/ is part of the knowledge layer, not the graph layer. Graph rows carry a provenance_uuid reference; the events/lineage live here. Status: Designed. These tables are schema-defined but not yet written (the write path is owned by the Knowledge Layer Foundation milestone). The epistemic model (ADR 0007) adds knowledge/ tables (below) on top of this layer.
provenance/events.parquet
| Column | Arrow type | Notes |
|---|---|---|
| provenance_uuid | FixedSizeBinary(16) | UUIDv7 |
| kind | Utf8 | "ingestion" \| "inference" \| "assertion" \| "merge" |
| source_ref | FixedSizeBinary(16) | UUID of source document or system |
| analyst_uuid | FixedSizeBinary(16) | UUID of the contributing analyst |
| rule_id | Utf8 | Ontology inference rule ID (nullable) |
| confidence | Float64 | Confidence score for this event |
| query_id | Utf8 | UUID of the query that produced this fact |
| created_at | Timestamp(Microseconds, UTC) | |
provenance/lineage.parquet
| Column | Arrow type | Notes |
|---|---|---|
| parent_uuid | FixedSizeBinary(16) | Upstream provenance event |
| child_uuid | FixedSizeBinary(16) | Downstream provenance event |
| role | Utf8 | "input" \| "derived" \| "merged" |
| weight | Float64 | Contribution weight |
Knowledge layer — epistemic tables (knowledge/)¶
Layer + status (ADR 0006 / ADR 0007). A capability folder reserved for the epistemic model, which captures how an analyst's understanding evolves. Status: Designed (v0.5.0 target, Full scope; owned by the Epistemic Model milestone). All tables reference graph objects by *_uuid; none alter graph topology, and graph-native query results never depend on them. The folder is a capability module: when it is absent, the epistemic model is simply not enabled, and the graph behaves exactly as before. Full schemas in ADR 0007.
knowledge/
├── assertions.parquet # claims about graph objects: status (hypothesis|supported|refuted|
│ # superseded|disputed), confidence, hypothesis_group, bitemporal
│ # valid_from/valid_to + recorded_at/retracted_at — never deleted
├── supersession.parquet # superseding_uuid → superseded_uuid (+ reason) — append-only history
├── evidence.parquet # assertion_uuid → source_uuid, role (supports|contradicts|context)
└── reasoning.parquet # assertion_uuid → why concluded / why an alternative was rejected
Key properties (preservation-over-deletion): refuting or superseding a claim preserves the
prior assertion, its evidence, and its reasoning; competing assertions about the same question
coexist via a shared hypothesis_group; bitemporal valid-time answers "what did we believe, when,
and why did it change?" Bitemporal querying is capability-gated and off by default, with an
assertion-time-only fallback (see ADR 0007).
Ontology Runtime¶
The ontology is a runtime-loadable knowledge schema, not Rust structs generated into the binary. Three representations serve different purposes:
| Format | Purpose |
|---|---|
| YAML / JSON | Human-authored ontology definitions (Serde-based load) |
| Arrow tables | Compiled execution format — cheap joins during binding and planning |
| Parquet | Persisted for rapid startup or reproducible deployments |
Ontology authoring format (YAML)¶
ontology_id: core
version: "2026.05"
entity_types:
  - name: Person
    abstract: false
  - name: Employee
    parent: Person
relation_types:
  - name: MANAGES
    src: Employee
    dst: Employee
    inverse: MANAGED_BY
    semantic:
      transitive: false
      symmetric: false
      functional: false
properties:
  - owner: Person
    name: name
    type: utf8
    nullable: false
constraints:
  - owner: Employee
    kind: unique_property
    expr:
      property: employee_id
At load time this compiles into Arrow lookup tables keyed by integer type IDs. String-heavy lookups during planning become O(1) integer comparisons.
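The compile step can be sketched as interning type names into dense integer IDs and storing the parent of each type as an ID. The names `EntityTypeDef` and `CompiledTypes` are hypothetical, plain vectors stand in for the Arrow lookup tables, and the sketch assumes names are unique and the hierarchy is acyclic (both are checked at ontology load).

```rust
use std::collections::HashMap;

/// An authored entity type, as it would appear after parsing the YAML.
pub struct EntityTypeDef {
    pub name: &'static str,
    pub parent: Option<&'static str>,
}

/// Compiled form: names resolve to integer ids once, and the hierarchy is
/// walked with integer comparisons afterwards.
pub struct CompiledTypes {
    ids: HashMap<&'static str, u32>,
    parent_of: Vec<Option<u32>>, // index = type id
}

impl CompiledTypes {
    /// Returns `None` if a parent name does not resolve.
    pub fn compile(defs: &[EntityTypeDef]) -> Option<Self> {
        let ids: HashMap<&'static str, u32> = defs
            .iter()
            .enumerate()
            .map(|(i, d)| (d.name, i as u32))
            .collect();
        let mut parent_of = Vec::with_capacity(defs.len());
        for d in defs {
            let parent_id = match d.parent {
                Some(p) => Some(*ids.get(p)?),
                None => None,
            };
            parent_of.push(parent_id);
        }
        Some(Self { ids, parent_of })
    }

    /// The only string lookup; done once per name during binding.
    pub fn type_id(&self, name: &str) -> Option<u32> {
        self.ids.get(name).copied()
    }

    /// True if `ty` is `ancestor` or inherits from it.
    pub fn is_subtype(&self, mut ty: u32, ancestor: u32) -> bool {
        loop {
            if ty == ancestor {
                return true;
            }
            match self.parent_of[ty as usize] {
                Some(parent) => ty = parent,
                None => return false,
            }
        }
    }
}
```

After `type_id` has resolved the names in a query, checks such as "is Employee a Person?" involve no string handling.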
Ontology runtime tables¶
| Table | Purpose |
|---|---|
| ontology_meta | Identity, version, IR compatibility range, checksum |
| entity_types | Node classes and inheritance DAG (acyclicity enforced at load) |
| relation_types | Edge classes, endpoint type constraints, inverse pairs |
| property_types | Name, owner, value type, nullability, cardinality |
| type_constraints | Validation rules (unique, required, range) |
| cardinality_rules | Endpoint multiplicity (min/max per relation type) |
| semantic_flags | transitive, symmetric, reflexive, functional, acyclic |
| aliases | Human-facing and deprecated names |
| migrations | Versioned ontology upgrade transforms |
Ontology versioning¶
Two independent version axes:
| Axis | Meaning |
|---|---|
| ontology_version | Meaning of types and rules; changes when the schema evolves |
| ir_version | Runtime/compiler contract; changes when the IR format changes |
A new ontology version does not require a new IR version, and vice versa. Persisted datasets record the ontology_version used to write them. Arrow schema metadata carries both versions through IPC and Parquet round-trips.
Validation model¶
| Level | When | Examples |
|---|---|---|
| Ontology-load | On file/table load | Duplicate names, missing parents, inheritance cycles, bad inverse references |
| Write-time | On CREATE, MERGE, batch ingest | Unknown property, wrong value type, illegal endpoint type, cardinality overflow |
| Query-time | During binding/planning | Unknown labels/types/properties, illegal pattern shape, ambiguous property resolution |
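Three of the ontology-load checks from the table (duplicate names, missing parents, inheritance cycles) can be sketched as follows. The function, the error enum, and the `(name, parent)` input shape are hypothetical; the real loader may report errors differently.

```rust
use std::collections::{HashMap, HashSet};

#[derive(Debug, PartialEq, Eq)]
pub enum LoadError {
    DuplicateName(String),
    MissingParent { child: String, parent: String },
    InheritanceCycle(String),
}

/// `types` is a list of `(name, parent)` pairs as authored in the ontology.
pub fn validate_entity_types<'a>(
    types: &[(&'a str, Option<&'a str>)],
) -> Result<(), LoadError> {
    // Duplicate names: a second insert under the same key is an error.
    let mut parents: HashMap<&'a str, Option<&'a str>> = HashMap::new();
    for &(name, parent) in types {
        if parents.insert(name, parent).is_some() {
            return Err(LoadError::DuplicateName(name.to_string()));
        }
    }
    // Missing parents: every referenced parent must be a declared type.
    for &(name, parent) in types {
        if let Some(p) = parent {
            if !parents.contains_key(p) {
                return Err(LoadError::MissingParent {
                    child: name.to_string(),
                    parent: p.to_string(),
                });
            }
        }
    }
    // Cycles: walk the parent chain from each type; revisiting a type on
    // the same walk means the chain loops.
    for &(start, _) in types {
        let mut seen: HashSet<&'a str> = HashSet::new();
        let mut current = start;
        while let Some(Some(next)) = parents.get(current).copied() {
            if !seen.insert(current) {
                return Err(LoadError::InheritanceCycle(start.to_string()));
            }
            current = next;
        }
    }
    Ok(())
}
```

Running these checks once at load is what lets later stages (binding, planning, subtype walks) assume a well-formed, acyclic hierarchy.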
Serialization Systems¶
Never mix these two systems:
| System | Purpose | Format |
|---|---|---|
| Arrow / Parquet (gf-storage) | Graph topology, properties, provenance | Binary columnar (Arrow IPC / Parquet) |
| JSON / YAML (gf-ontology) | Ontology definitions and metadata | Text (human-readable, validatable) |
Graph data → Arrow/Parquet. Ontology/metadata → JSON or YAML. Arrow schema metadata carries version and provenance annotations across language boundaries.
Two-Mode Graph Instances¶
// In-memory (fast, volatile)
let forge = GraphForge::new(None)?;
// Persistent (project directory)
let forge = GraphForge::new(Some("path/to/project/"))?;
The storage layer is transparent to all API surfaces.
References¶
- Architecture Overview — workspace layout and provider trait
- Architecture Refactor v0.5 — UUID identity model, typed edge tables, project structure
- Execution Model — how providers connect to DataFusion
- ADR 0002: Rust Core — Parquet-as-primary and provider strategy