Storage Architecture

Status: Rust Core (active development on main)
Last Updated: 2026-05-31


Overview

GraphForge uses a pluggable storage provider model. No provider is the semantic owner of the query language. All providers implement a common Rust trait and are selected at runtime based on the use case.

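use std::sync::Arc;

/// Common contract implemented by every storage backend.
pub trait StorageProvider: Send + Sync {
    /// Backend name, e.g. for logs and diagnostics.
    fn provider_name(&self) -> &'static str;
    /// Resolves a qualified table to a scannable table provider.
    fn table_provider(&self, table: &QualifiedTable) -> Result<Arc<dyn TableProvider>, GfError>;
    /// Reports which optional features this backend supports.
    fn capabilities(&self) -> ProviderCapabilities;
}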

Provider Role: Parquet

Parquet is the sole storage provider for the Rust core. It:

  • Stores topology, property, and provenance tables as columnar Parquet files
  • Carries GraphForge metadata at the file level (ontology version, IR version, query ID)
  • Persists the compiled ontology runtime tables for rapid startup

Parquet file-level metadata:

graphforge.dataset_kind      = "topology_nodes"
graphforge.ontology_version  = "core-2026.05"
graphforge.writer_version    = "0.5.0"
graphforge.ir_version        = "0.1.0"
graphforge.query_id          = "01J..."
graphforge.provenance_policy = "conservative_min"

The StorageProvider trait is designed to be extended with additional backends in future milestones. No other provider is in scope for v0.5.


Identity and Surrogate Keys

GraphForge uses a dual-key pattern for all first-class objects:

Key | Type | Purpose
UUID (*_uuid) | FixedSizeBinary(16), UUIDv7 | Canonical stable identity. Globally unique. Immutable. Survives project merges, offline generation, and cross-analyst exchanges.
Surrogate (*_id) | UInt64 | Execution-time optimization. Assigned at ingest/load time. Used for DataFusion join operations. Never exposed in public API results.

Why UUIDv7

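UUIDv7 (RFC 9562) sorts by creation time at millisecond precision, is globally unique without coordination, fits in Arrow FixedSizeBinary(16), and supports offline generation on mobile devices or air-gapped systems. Ordering among UUIDs minted within the same millisecond is not guaranteed unless the generator adds a monotonic counter. See refactor-v0.5.md §5 for the full rationale.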
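To make the time-ordering property concrete, the sketch below packs a millisecond timestamp and caller-supplied random bytes into the RFC 9562 UUIDv7 field layout. The function name uuid_v7_bytes and the choice to pass randomness in as a parameter are assumptions made for this example; they are not GraphForge APIs.

```rust
/// Packs a UUIDv7 following the RFC 9562 field layout.
/// `unix_ms` is the Unix timestamp in milliseconds (only the low 48 bits are used).
/// `rand` is a hypothetical parameter: a real generator would draw these
/// 10 bytes from a cryptographically secure random source.
pub fn uuid_v7_bytes(unix_ms: u64, rand: [u8; 10]) -> [u8; 16] {
    let mut b = [0u8; 16];
    // Bytes 0..6: 48-bit big-endian timestamp, so byte-wise comparison
    // orders UUIDs by creation millisecond.
    b[..6].copy_from_slice(&unix_ms.to_be_bytes()[2..8]);
    // Bytes 6..16: random payload, then overwrite the version and variant bits.
    b[6..].copy_from_slice(&rand);
    b[6] = (b[6] & 0x0F) | 0x70; // version 7 in the high nibble
    b[8] = (b[8] & 0x3F) | 0x80; // variant 0b10 in the top two bits
    b
}
```

Because the timestamp occupies the leading bytes, comparing two FixedSizeBinary(16) values byte by byte compares their creation milliseconds first, regardless of the random tail.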

UUID→Surrogate mapping

The relational lowering layer maps node_uuid → node_id once at scan time. All DataFusion join operators use integer surrogates (node_id, edge_id, src_id, dst_id) for performance. Results project back to UUID columns before returning to the caller.

Rule: UUIDs appear in every public API result schema. Surrogates are execution-internal and must never appear in API outputs.
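A minimal sketch of the dual-key idea, assuming a simple in-memory map: UUIDs are interned to dense surrogates on first sight, and surrogates are projected back to UUIDs before results leave the engine. SurrogateMap and its methods are hypothetical names for illustration; the actual lowering layer may be organised differently.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory mapping between canonical UUIDs and dense surrogates.
pub struct SurrogateMap {
    by_uuid: HashMap<[u8; 16], u64>,
    by_id: Vec<[u8; 16]>, // index = surrogate id
}

impl SurrogateMap {
    pub fn new() -> Self {
        SurrogateMap { by_uuid: HashMap::new(), by_id: Vec::new() }
    }

    /// Returns the surrogate for `uuid`, assigning the next dense id on first sight.
    pub fn intern(&mut self, uuid: [u8; 16]) -> u64 {
        if let Some(&id) = self.by_uuid.get(&uuid) {
            return id;
        }
        let id = self.by_id.len() as u64;
        self.by_uuid.insert(uuid, id);
        self.by_id.push(uuid);
        id
    }

    /// Projects a surrogate back to its UUID; `None` for an unknown surrogate.
    pub fn uuid_of(&self, id: u64) -> Option<[u8; 16]> {
        self.by_id.get(id as usize).copied()
    }
}
```

Joins then compare u64 values instead of 16-byte keys, and only the final projection touches the UUID column.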

Objects requiring UUID identity

Object | UUID column
Node (entity) | node_uuid
Edge (relationship) | edge_uuid
Document | doc_uuid
Provenance event | provenance_uuid
Analyst/User | analyst_uuid
Project | project_uuid
Workflow | workflow_uuid
Embedding | embedding_uuid
Source reference | source_uuid
Ranking output row | rank_uuid
Clustering output row | cluster_uuid
Generated artifact | artifact_uuid

Storage Layout

GraphForge organises a project as a directory tree. The graph is one asset among several. Folders are capability modules — an absent folder means that capability is not yet enabled, not that it is misconfigured.

project/
├── graphforge.yaml     # project manifest (includes capabilities: field)
├── ontology.yaml       # domain ontology (optional)
│                       # ── GRAPH LAYER ──
├── topology/           # hot path — identity + type only
│   ├── nodes.parquet
│   ├── generation.json # topology_generation counter (staleness signal for derived indexes)
│   └── edges/          # one file per relation type
│       ├── WORKS_AT.parquet
│       └── OWNS.parquet
├── properties/         # warm path — entity property columns
│   ├── Person.parquet
│   └── Organization.parquet
├── indexes/adjacency/  # derived traversal accelerator (ADR 0005)
│                       # ── KNOWLEDGE LAYER ──
├── provenance/         # provenance events + lineage (Designed)
│   ├── events.parquet
│   └── lineage.parquet
├── knowledge/          # epistemic model — assertions, supersession, evidence, reasoning (ADR 0007, Designed)
│   ├── assertions.parquet
│   ├── supersession.parquet
│   ├── evidence.parquet
│   └── reasoning.parquet
│                       # ── WORKBENCH LAYER ──
├── documents/          # raw source files
├── embeddings/         # vector stores per model
├── indexes/            # FTS indexes (Tantivy)
├── workflows/          # saved analysis pipelines
├── artifacts/          # ranked/clustered outputs
└── sync/               # changeset-based collaboration (post-v0.5)

A minimal project needs only topology/ and properties/ (the graph layer); every other folder is an optional capability module. Folders group by layer (ADR 0006), and capabilities are additive.
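The "absent means not enabled" rule can be sketched as a directory probe. This example assumes folder presence alone is the signal and ignores the capabilities: field in graphforge.yaml; the function names and the list of folders checked are illustrative, not the project's API.

```rust
use std::path::Path;

/// Capability folders taken from the layout above (top-level names only).
const CAPABILITY_DIRS: &[&str] = &[
    "provenance", "knowledge", "documents", "embeddings",
    "indexes", "workflows", "artifacts", "sync",
];

/// Returns the capability folders that exist under `project`.
/// A missing folder is simply omitted; it is never reported as an error.
pub fn enabled_capabilities(project: &Path) -> Vec<&'static str> {
    CAPABILITY_DIRS
        .iter()
        .copied()
        .filter(|d| project.join(*d).is_dir())
        .collect()
}

/// A project is minimally usable when both graph-layer folders exist.
pub fn has_graph_layer(project: &Path) -> bool {
    project.join("topology").is_dir() && project.join("properties").is_dir()
}
```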


Derived Indexes

The indexes/ folder holds derived, rebuildable acceleration structures. Nothing here is canonical: every file under indexes/ can be reconstructed from topology/ (and, for FTS, properties/) alone. An absent index is not an error — it means the accelerator has not been built yet, and the engine falls back to building in memory on demand. See ADR 0005.

indexes/adjacency/ — graph-native adjacency index

The adjacency index is a derived CSR (compressed sparse row) representation of the topology, used by both the Cypher traversal path (variable-length Expand) and the analyst verbs (rank/cluster/paths/analyze/similar). It is optional: absent ⇒ build in memory on demand (today's behavior); present ⇒ load from disk. It is surrogate-keyed and never changes results — only speed.

indexes/
└── adjacency/
    ├── index_manifest.parquet
    ├── WORKS_AT.out.csr
    ├── WORKS_AT.in.csr
    ├── OWNS.out.csr
    ├── OWNS.in.csr
    ├── _all.out.csr          # union across relation types (for via=None)
    └── _all.in.csr

The builder (gf_storage::adjacency::build_adjacency_index) writes one {out, in} pair per relation type plus the _all union pair, then the manifest last. Relation names unusable as file stems (path separators, .., the reserved _all) are skipped — those relations are served by scan-build, but their rows still flow into the union index. The manifest is stamped with the topology_generation read before the edge scan, so a concurrent topology write mid-build can only make the result read as stale, never as falsely fresh.
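The file-stem rule can be pictured as a small predicate. The sketch below is an assumption about what "unusable" covers (empty names, path separators, "..", and the reserved _all); the real check inside build_adjacency_index may differ, and both function names are hypothetical.

```rust
/// Stem reserved for the union index files.
const UNION_STEM: &str = "_all";

/// Returns true when `relation` can be used verbatim as a CSR file stem.
/// Assumption: a name is unusable if it is empty, contains a path separator
/// or "..", or collides with the reserved `_all` stem.
pub fn is_usable_file_stem(relation: &str) -> bool {
    !relation.is_empty()
        && relation != UNION_STEM
        && !relation.contains("..")
        && !relation.contains(|c: char| c == '/' || c == '\\')
}

/// File name for one typed CSR, or `None` when the relation must be skipped
/// and served by scan-build instead. The union files themselves are written
/// separately and do not go through this function.
pub fn csr_file_name(relation: &str, direction: &str) -> Option<String> {
    if is_usable_file_stem(relation) {
        Some(format!("{relation}.{direction}.csr"))
    } else {
        None
    }
}
```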

index_manifest.parquet

Column | Arrow type | Notes
relation_type | Utf8 | Relation type name, or _all for the union index
direction | Utf8 | "out" or "in"
topology_generation | UInt64 | Generation counter the CSR was built from; bumped on every topology write
built_at | Timestamp(Microseconds, UTC) |
node_count | UInt64 | Number of source nodes covered (CSR row count)
edge_count | UInt64 | Number of (edge, neighbor) entries

CSR file (<REL_TYPE>.<dir>.csr) — single-batch Arrow IPC, one column, one row per surrogate node_id ∈ 0..node_count:

Column | Arrow type | Notes
adjacency | LargeList<Struct { edge_id: UInt64, neighbor_id: UInt64 }> | Row i holds the adjacency entries of surrogate node_id = i, in CSR order

This is the CSR structure in its idiomatic Arrow encoding — the two logical arrays cannot be two top-level columns because a RecordBatch requires equal column lengths. The list's offsets buffer is the CSR offsets array (length node_count + 1, Int64, starting at 0, monotone), and the flattened struct child is the targets array (length edge_count): neighbors of node_id = i are targets[offsets[i]..offsets[i+1]], zero-copy on read.

Conventions:

  • Empty graph: a zero-row batch — logical offsets == [0], empty targets. The offsets array is never empty.
  • Node with no neighbors: an empty list (offsets[i] == offsets[i+1]).
  • CSR rows cover exactly node_id ∈ 0..node_count; surrogates at or beyond node_count simply have no entries.
  • In-memory consumers (gf_exec::AdjacencyProvider, #760 — scan-build via ScanBuildAdjacencyProvider until the persistent loader, #761) see the logical offsets/targets model; the list encoding is a file-format detail (gf_storage::adjacency::CsrIndex on the storage side).
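The logical offsets/targets model and the conventions above can be sketched in plain Rust, independent of the Arrow encoding. Csr, Target, and build_out are hypothetical names, and the builder assumes every src_id is below node_count; sorting by (src_id, edge_id) follows the ordering the document specifies for out entries.

```rust
/// One adjacency entry, mirroring the struct child of the list column.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Target {
    pub edge_id: u64,
    pub neighbor_id: u64,
}

/// Logical CSR view: offsets.len() == node_count + 1, targets.len() == edge_count.
pub struct Csr {
    pub offsets: Vec<i64>,
    pub targets: Vec<Target>,
}

impl Csr {
    /// Builds an outgoing CSR from (edge_id, src_id, dst_id) triples.
    /// Assumes every src_id < node_count.
    pub fn build_out(node_count: usize, mut edges: Vec<(u64, u64, u64)>) -> Csr {
        // Sort by (src_id, edge_id) so the output does not depend on input order.
        edges.sort_by_key(|&(edge_id, src, _)| (src, edge_id));
        // Count entries per source node, shifted by one slot.
        let mut offsets = vec![0i64; node_count + 1];
        for &(_, src, _) in &edges {
            offsets[src as usize + 1] += 1;
        }
        // Prefix-sum the counts into offsets; offsets[0] stays 0.
        for i in 0..node_count {
            offsets[i + 1] += offsets[i];
        }
        let targets = edges
            .into_iter()
            .map(|(edge_id, _, dst)| Target { edge_id, neighbor_id: dst })
            .collect();
        Csr { offsets, targets }
    }

    /// Neighbors of `node_id`; empty for isolated or out-of-range surrogates.
    pub fn neighbors(&self, node_id: u64) -> &[Target] {
        let i = node_id as usize;
        if i >= self.offsets.len().saturating_sub(1) {
            return &[];
        }
        &self.targets[self.offsets[i] as usize..self.offsets[i + 1] as usize]
    }
}
```

With an empty edge list and node_count = 0, build_out yields offsets == [0] and no targets, matching the empty-graph convention.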

Rebuild and versioning semantics

  • Source of truth. A CSR is always reconstructable from topology/edges/<REL_TYPE>.parquet alone, deterministically.
  • Counter persistence. The project topology_generation counter lives at topology/generation.json as {"topology_generation": N}, written atomically (sibling temp + rename). An absent file is generation 0 — safe, because every binary that writes indexes/adjacency/ also bumps, so no index manifest can predate the counter. The counter is machine-owned mutable state; it lives beside the topology it versions, not in the user-editable graphforge.yaml.
  • Bump rule. The counter is bumped immediately before committing any staged batch that rewrites topology/nodes.parquet or any file under topology/edges/ (including _exploratory.parquet) — one bump per committed batch (gf_storage::commit_topology_aware). Property-only writes (properties/, edge_properties/) never bump: properties cannot change adjacency.
  • Crash-safety invariant (bump-before-commit). A crash after the bump but before or during the commit leaves the counter advanced over unchanged topology: the index merely looks stale and is rebuilt — one spurious rebuild, correct results. The reverse order would be unsound: new topology under the old counter makes a stale index look fresh. Spurious bumps are safe; missed bumps are not.
  • Staleness detection. The provider compares the manifest's topology_generation against the current value. A corrupt counter file or manifest marks the index always-stale, never fresh.
  • Fallback. On mismatch (or absent index), the provider scans the typed edge tables and builds the adjacency in memory — yielding identical results, only slower. A stale or missing index can therefore never cause incorrect output.
  • Rebuild triggers. Lazy on first traversal when the indexes/adjacency/ capability is present, or explicit via forge.index("adjacency", ...). Incremental rebuild (append-delta + compaction) is deferred to v0.5.1.
  • Determinism (R-ADJ-2). Full rebuild streams each typed edge file once; out entries sort by (src_id, edge_id) and in entries by (dst_id, edge_id) — the edge_id tie-break makes the CSR bytes reproducible from topology/ alone. _all.{out,in}.csr are the same sorts over the union of all typed files plus _exploratory.parquet. The manifest's built_at is excluded from the determinism guarantee.
  • Build ordering. Builders write all CSR files first and index_manifest.parquet last, so a torn build reads as stale (absent/old manifest), never as fresh.
  • Loader semantics (gf_exec::PersistentAdjacencyProvider, #761). Freshness = manifest non-empty AND every row's topology_generation equals the current counter. Fresh + row present ⇒ load (adjacency=hit); stale or torn ⇒ lazy rebuild, then serve; fresh but no row for the requested relation ⇒ scan-build without rebuild (rebuilding cannot add an unknown relation — prevents a rebuild-per-query loop); corrupt counter or manifest ⇒ always-stale scan-build without rebuild (stamping needs a readable counter); capability dir absent ⇒ scan-build (adjacency=building). Typed-mode "*" bypasses the index entirely (pre-existing empty-scan behavior; reported as building, never a false miss). A build or load failure never fails the query — only its speed.
  • Direction. out and in CSRs are stored separately; undirected traversal unions them. In exploratory mode, _exploratory.parquet rows are routed by their rel_type_name column.
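The loader cases listed above reduce to a small decision function. The following is a sketch under simplifying assumptions: manifest rows are keyed by relation type only (direction is ignored), an empty row list stands for an absent or torn manifest, and the names Observed, Action, and decide are hypothetical rather than the actual gf_exec types.

```rust
/// What the provider does for one traversal request.
#[derive(Debug, PartialEq, Eq)]
pub enum Action {
    Load,             // fresh index with a row for the relation
    RebuildThenServe, // stale or torn index: lazy rebuild first
    ScanBuild,        // build in memory from the edge tables, no rebuild
}

/// What the provider observed on disk.
pub enum Observed {
    /// indexes/adjacency/ does not exist.
    CapabilityAbsent,
    /// Counter file or manifest is unreadable.
    Corrupt,
    /// Counter and manifest are readable; rows are
    /// (relation_type, topology_generation) pairs.
    Readable { current_generation: u64, rows: Vec<(String, u64)> },
}

pub fn decide(observed: &Observed, relation: &str) -> Action {
    match observed {
        // No capability, or nothing trustworthy to stamp against: never rebuild.
        Observed::CapabilityAbsent | Observed::Corrupt => Action::ScanBuild,
        Observed::Readable { current_generation, rows } => {
            // Fresh means non-empty and every row matches the current counter.
            let fresh = !rows.is_empty()
                && rows.iter().all(|(_, g)| g == current_generation);
            if !fresh {
                Action::RebuildThenServe
            } else if rows.iter().any(|(r, _)| r.as_str() == relation) {
                Action::Load
            } else {
                // Rebuilding cannot add an unknown relation, so skip the rebuild.
                Action::ScanBuild
            }
        }
    }
}
```

Every branch ends in an action that serves the query, which mirrors the rule that a build or load failure affects only speed.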

indexes/<LABEL>/tantivy/ — full-text search index

Full-text indexes (Tantivy) are also derived and rebuildable from properties/. See the Find milestone (forge.find / forge.index).


Graph Fact Schema

Topology layer (hot path)

Graph traversal reads only the topology layer. No property columns are read unless the query explicitly projects them.

topology/nodes.parquet

Column | Arrow type | Notes
node_uuid | FixedSizeBinary(16) | UUIDv7; canonical stable identity
node_id | UInt64 | Local surrogate; DataFusion join key
type_id | UInt32 | Ontology entity type integer
created_at | Timestamp(Microseconds, UTC) |
updated_at | Timestamp(Microseconds, UTC) |

topology/edges/TYPENAME.parquet (one file per relation type)

Column | Arrow type | Notes
edge_uuid | FixedSizeBinary(16) | UUIDv7
src_uuid | FixedSizeBinary(16) | References node_uuid
dst_uuid | FixedSizeBinary(16) | References node_uuid
edge_id | UInt64 | Local surrogate
src_id | UInt64 | Local surrogate; DataFusion join key
dst_id | UInt64 | Local surrogate; DataFusion join key
created_at | Timestamp(Microseconds, UTC) |
confidence | Float64 | 0.0–1.0
provenance_uuid | FixedSizeBinary(16) | References provenance/events.parquet

Typed edge tables (one Parquet file per relation type) replace the unified edge_facts table. This enables direct scans on a single relation type without filtering, yielding significant I/O savings at 100M+ edges. See refactor-v0.5.md §7 for performance analysis.

Properties layer (warm path)

properties/ENTITY_TYPE.parquet (one file per entity type, columns per ontology)

Column | Arrow type | Notes
node_uuid | FixedSizeBinary(16) | Join key back to topology/nodes.parquet
(property columns) | (per ontology) | e.g. name Utf8, age Int64, email Utf8

Property access is a join: topology/nodes JOIN properties/Person ON node_uuid. DataFusion handles this as a hash join. The separation allows graph traversal to skip property I/O entirely.
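For readers unfamiliar with hash joins, the sketch below shows the build and probe phases on plain Rust collections. It is an illustration of the technique only, assuming inner-join semantics and a single name column; it does not use DataFusion, and hash_join is a hypothetical name.

```rust
use std::collections::HashMap;

/// Inner hash join of topology node UUIDs against one property column.
/// Build side: property rows keyed by node_uuid. Probe side: topology rows.
pub fn hash_join<'a>(
    node_uuids: &[[u8; 16]],
    names: &'a [([u8; 16], String)],
) -> Vec<([u8; 16], &'a str)> {
    // Build phase: index the property rows by their join key.
    let build: HashMap<[u8; 16], &str> =
        names.iter().map(|(uuid, name)| (*uuid, name.as_str())).collect();
    // Probe phase: look up each topology row; rows without a match are dropped.
    node_uuids
        .iter()
        .filter_map(|uuid| build.get(uuid).map(|name| (*uuid, *name)))
        .collect()
}
```

A traversal that never projects a property never runs this step, so the properties file is not opened at all.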

Provenance layer (Knowledge Layer)

Layer + status (ADR 0006). provenance/ is part of the knowledge layer, not the graph layer. Graph rows carry a provenance_uuid reference; the events/lineage live here. Status: Designed — these tables are schema-defined but not yet written (the write path is owned by the Knowledge Layer Foundation milestone). The epistemic model (ADR 0007) adds knowledge/ tables (below) on top of this layer.

provenance/events.parquet

Column | Arrow type | Notes
provenance_uuid | FixedSizeBinary(16) | UUIDv7
kind | Utf8 | "ingestion", "inference", "assertion", or "merge"
source_ref | FixedSizeBinary(16) | UUID of source document or system
analyst_uuid | FixedSizeBinary(16) | UUID of the contributing analyst
rule_id | Utf8 | Ontology inference rule ID (nullable)
confidence | Float64 | Confidence score for this event
query_id | Utf8 | UUID of the query that produced this fact
created_at | Timestamp(Microseconds, UTC) |

provenance/lineage.parquet

Column | Arrow type | Notes
parent_uuid | FixedSizeBinary(16) | Upstream provenance event
child_uuid | FixedSizeBinary(16) | Downstream provenance event
role | Utf8 | "input", "derived", or "merged"
weight | Float64 | Contribution weight

Knowledge layer — epistemic tables (knowledge/)

Layer + status (ADR 0006 / ADR 0007). A capability folder reserved for the epistemic model — how an analyst's understanding evolves. Status: Designed (v0.5.0 target, Full scope; owned by the Epistemic Model milestone). All tables reference graph objects by *_uuid; none alter graph topology, and graph-native query results never depend on them. The folder is a capability module — absent means the epistemic model is simply not enabled, and the graph behaves exactly as before. Full schemas in ADR 0007.

knowledge/
├── assertions.parquet     # claims about graph objects: status (hypothesis|supported|refuted|
│                          # superseded|disputed), confidence, hypothesis_group, bitemporal
│                          # valid_from/valid_to + recorded_at/retracted_at — never deleted
├── supersession.parquet   # superseding_uuid → superseded_uuid (+ reason) — append-only history
├── evidence.parquet       # assertion_uuid → source_uuid, role (supports|contradicts|context)
└── reasoning.parquet      # assertion_uuid → why concluded / why an alternative was rejected

Key properties (preservation-over-deletion): refuting or superseding a claim preserves the prior assertion, its evidence, and its reasoning; competing assertions about the same question coexist via a shared hypothesis_group; bitemporal valid-time answers "what did we believe, when, and why did it change?" Bitemporal querying is capability-gated and off by default, with an assertion-time-only fallback (see ADR 0007).


Ontology Runtime

The ontology is a runtime-loadable knowledge schema, not Rust structs generated into the binary. Three representations serve different purposes:

Format | Purpose
YAML / JSON | Human-authored ontology definitions (Serde-based load)
Arrow tables | Compiled execution format; cheap joins during binding and planning
Parquet | Persisted for rapid startup or reproducible deployments

Ontology authoring format (YAML)

ontology_id: core
version: "2026.05"
entity_types:
  - name: Person
    abstract: false
  - name: Employee
    parent: Person
relation_types:
  - name: MANAGES
    src: Employee
    dst: Employee
    inverse: MANAGED_BY
    semantic:
      transitive: false
      symmetric: false
      functional: false
properties:
  - owner: Person
    name: name
    type: utf8
    nullable: false
constraints:
  - owner: Employee
    kind: unique_property
    expr:
      property: employee_id

At load time this compiles into Arrow lookup tables keyed by integer type IDs. String-heavy lookups during planning become O(1) integer comparisons.
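As a rough picture of that compilation step, the sketch below interns entity type names into integer IDs once, after which subtype checks walk an integer parent array. EntityTypes and its methods are hypothetical, and the sketch assumes parents are declared before their children; the real runtime uses Arrow tables rather than a Vec.

```rust
use std::collections::HashMap;

/// Hypothetical compiled form of the entity_types table.
pub struct EntityTypes {
    ids: HashMap<String, u32>, // name -> type_id, consulted once at binding time
    parent: Vec<Option<u32>>,  // index = type_id
}

impl EntityTypes {
    /// Compiles (name, parent name) pairs. Assumes parents are declared
    /// before children; returns `None` when a parent is unknown at that point.
    pub fn compile(defs: &[(&str, Option<&str>)]) -> Option<EntityTypes> {
        let mut ids: HashMap<String, u32> = HashMap::new();
        let mut parent: Vec<Option<u32>> = Vec::new();
        for (name, parent_name) in defs {
            let parent_id = match parent_name {
                Some(p) => Some(*ids.get(*p)?),
                None => None,
            };
            ids.insert((*name).to_string(), parent.len() as u32);
            parent.push(parent_id);
        }
        Some(EntityTypes { ids, parent })
    }

    /// String lookup, done once per name during binding.
    pub fn type_id(&self, name: &str) -> Option<u32> {
        self.ids.get(name).copied()
    }

    /// Integer-only check used afterwards: is `t` equal to or a descendant of `ancestor`?
    /// Both ids must come from `type_id`.
    pub fn is_subtype(&self, mut t: u32, ancestor: u32) -> bool {
        loop {
            if t == ancestor {
                return true;
            }
            match self.parent[t as usize] {
                Some(p) => t = p,
                None => return false,
            }
        }
    }
}
```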

Ontology runtime tables

Table | Purpose
ontology_meta | Identity, version, IR compatibility range, checksum
entity_types | Node classes and inheritance DAG (acyclicity enforced at load)
relation_types | Edge classes, endpoint type constraints, inverse pairs
property_types | Name, owner, value type, nullability, cardinality
type_constraints | Validation rules (unique, required, range)
cardinality_rules | Endpoint multiplicity (min/max per relation type)
semantic_flags | transitive, symmetric, reflexive, functional, acyclic
aliases | Human-facing and deprecated names
migrations | Versioned ontology upgrade transforms

Ontology versioning

Two independent version axes:

Axis | Meaning
ontology_version | Meaning of types and rules; changes when the schema evolves
ir_version | Runtime/compiler contract; changes when the IR format changes

A new ontology version does not require a new IR version, and vice versa. Persisted datasets record the ontology_version used to write them. Arrow schema metadata carries both versions through IPC and Parquet round-trips.

Validation model

Level | When | Examples
Ontology-load | On file/table load | Duplicate names, missing parents, inheritance cycles, bad inverse references
Write-time | On CREATE, MERGE, batch ingest | Unknown property, wrong value type, illegal endpoint type, cardinality overflow
Query-time | During binding/planning | Unknown labels/types/properties, illegal pattern shape, ambiguous property resolution
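The ontology-load checks for entity types can be sketched as three passes over the declared (name, parent) pairs: duplicates, missing parents, then cycles. LoadError and validate_entity_types are hypothetical names, and the sketch covers only the three checks shown, not inverse references or the other validation levels.

```rust
use std::collections::{HashMap, HashSet};

#[derive(Debug, PartialEq, Eq)]
pub enum LoadError {
    DuplicateName(String),
    MissingParent { child: String, parent: String },
    /// Names a type whose ancestor chain loops.
    InheritanceCycle(String),
}

/// Validates (name, parent name) pairs declared in any order.
pub fn validate_entity_types(defs: &[(&str, Option<&str>)]) -> Result<(), LoadError> {
    // Pass 1: every name may be declared only once.
    let mut parent_of: HashMap<&str, Option<&str>> = HashMap::new();
    for &(name, parent) in defs {
        if parent_of.insert(name, parent).is_some() {
            return Err(LoadError::DuplicateName(name.to_string()));
        }
    }
    // Pass 2: every referenced parent must itself be declared.
    for &(name, parent) in defs {
        if let Some(p) = parent {
            if !parent_of.contains_key(p) {
                return Err(LoadError::MissingParent {
                    child: name.to_string(),
                    parent: p.to_string(),
                });
            }
        }
    }
    // Pass 3: walk each ancestor chain; seeing a name twice means it loops.
    for &(name, _) in defs {
        let mut seen = HashSet::new();
        let mut current = Some(name);
        while let Some(n) = current {
            if !seen.insert(n) {
                return Err(LoadError::InheritanceCycle(name.to_string()));
            }
            current = parent_of[n];
        }
    }
    Ok(())
}
```

The chain walk is quadratic in the worst case, which is acceptable for a sketch; a production check would more likely use a single topological pass.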

Serialization Systems

Never mix these two systems:

System | Purpose | Format
Arrow / Parquet (gf-storage) | Graph topology, properties, provenance | Binary columnar (Arrow IPC / Parquet)
JSON / YAML (gf-ontology) | Ontology definitions and metadata | Text (human-readable, validatable)

Graph data → Arrow/Parquet. Ontology/metadata → JSON or YAML. Arrow schema metadata carries version and provenance annotations across language boundaries.


Two-Mode Graph Instances

// In-memory (fast, volatile)
let forge = GraphForge::new(None)?;

// Persistent (project directory)
let forge = GraphForge::new(Some("path/to/project/"))?;

The storage layer is transparent to all API surfaces.


References