Storage Architecture¶

Status: Rust Core (active development on main)
Last Updated: 2026-05-31

Overview¶

GraphForge uses a pluggable storage provider model. No provider is the semantic owner of the query language. All providers implement a common Rust trait and are selected at runtime based on the use case.

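use std::sync::Arc;

pub trait StorageProvider: Send + Sync {
    /// Static name identifying this provider.
    fn provider_name(&self) -> &'static str;
    /// Resolves a qualified table to the table provider that serves it.
    fn table_provider(&self, table: &QualifiedTable) -> Result<Arc<dyn TableProvider>, GfError>;
    /// Reports the capabilities this provider supports.
    fn capabilities(&self) -> ProviderCapabilities;
}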

Provider Role: Parquet¶

Parquet is the sole storage provider for the Rust core. It:

Stores topology, property, and provenance tables as columnar Parquet files
Carries GraphForge metadata at the file level (ontology version, IR version, query ID)
Persists the compiled ontology runtime tables for rapid startup

Parquet file-level metadata:

graphforge.dataset_kind      = "topology_nodes"
graphforge.ontology_version  = "core-2026.05"
graphforge.writer_version    = "0.5.0"
graphforge.ir_version        = "0.1.0"
graphforge.query_id          = "01J..."
graphforge.provenance_policy = "conservative_min"
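
As a rough illustration, the key/value pairs listed above could be assembled like this before being handed to a Parquet writer. The function name and its parameters are hypothetical, not part of the GraphForge API; only the `graphforge.*` keys come from this page.

```rust
use std::collections::BTreeMap;

/// Hypothetical helper: collects the `graphforge.*` file-level metadata
/// pairs into an ordered map. Only the key names come from the document.
pub fn graphforge_file_metadata(
    dataset_kind: &str,
    ontology_version: &str,
    writer_version: &str,
    ir_version: &str,
    query_id: &str,
    provenance_policy: &str,
) -> BTreeMap<String, String> {
    let pairs = [
        ("graphforge.dataset_kind", dataset_kind),
        ("graphforge.ontology_version", ontology_version),
        ("graphforge.writer_version", writer_version),
        ("graphforge.ir_version", ir_version),
        ("graphforge.query_id", query_id),
        ("graphforge.provenance_policy", provenance_policy),
    ];
    pairs
        .into_iter()
        .map(|(key, value)| (key.to_string(), value.to_string()))
        .collect()
}
```

A `BTreeMap` is used here only so that the keys come out in a stable order; any string-to-string map would do.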

The StorageProvider trait is designed to be extended with additional backends in future milestones. No other provider is in scope for v0.5.

Identity and Surrogate Keys¶

GraphForge uses a dual-key pattern for all first-class objects:

Key	Type	Purpose
UUID (`*_uuid`)	`FixedSizeBinary(16)` — UUIDv7	Canonical stable identity. Globally unique. Immutable. Survives project merges, offline generation, and cross-analyst exchanges.
Surrogate (`*_id`)	`UInt64`	Execution-time optimization. Assigned at ingest/load time. Used for DataFusion join operations. Never exposed in public API results.

Why UUIDv7¶

UUIDv7 (RFC 9562) sorts by creation time at millisecond granularity, is globally unique without coordination, fits in Arrow FixedSizeBinary(16), and supports offline generation on mobile devices or air-gapped systems. See refactor-v0.5.md §5 for the full rationale.
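
To make the 16-byte layout concrete, here is a minimal sketch that packs a UUIDv7 from caller-supplied parts following the RFC 9562 field layout. The function name is hypothetical and this is not GraphForge's generator; a real generator would take the timestamp from the system clock and the random bytes from a secure source.

```rust
/// Packs a UUIDv7 from a Unix timestamp in milliseconds and 10 random bytes.
/// Hypothetical helper for illustration. `unix_ms` is truncated to 48 bits.
pub fn uuid_v7_from_parts(unix_ms: u64, rand: [u8; 10]) -> [u8; 16] {
    let mut out = [0u8; 16];
    // Bytes 0..6: 48-bit big-endian Unix timestamp in milliseconds.
    out[..6].copy_from_slice(&unix_ms.to_be_bytes()[2..8]);
    // Bytes 6..16: random payload; two fields are overwritten below.
    out[6..].copy_from_slice(&rand);
    // High nibble of byte 6 carries the version (7).
    out[6] = 0x70 | (out[6] & 0x0F);
    // Top two bits of byte 8 carry the variant (binary 10).
    out[8] = 0x80 | (out[8] & 0x3F);
    out
}
```

Because the timestamp occupies the leading bytes, comparing two such values byte by byte orders them by creation millisecond. Values created within the same millisecond are ordered only by their random bits.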

UUID→Surrogate mapping¶

The relational lowering layer maps node_uuid → node_id once at scan time. All DataFusion join operators use integer surrogates (node_id, edge_id, src_id, dst_id) for performance. Results project back to UUID columns before returning to the caller.

Rule: UUIDs appear in every public API result schema. Surrogates are execution-internal and must never appear in API outputs.
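
The map-once, join-on-integers, project-back flow can be sketched with plain collections. `SurrogateMap` and `one_hop` are hypothetical names for illustration; in GraphForge the joins run inside DataFusion rather than over in-memory slices.

```rust
use std::collections::HashMap;

/// 16 raw UUID bytes, matching the FixedSizeBinary(16) columns.
pub type Uuid = [u8; 16];

/// Hypothetical scan-time mapping between public UUIDs and internal surrogates.
pub struct SurrogateMap {
    to_id: HashMap<Uuid, u64>,
    to_uuid: HashMap<u64, Uuid>,
}

impl SurrogateMap {
    /// Builds both directions once from `(node_uuid, node_id)` rows.
    pub fn from_scan(rows: &[(Uuid, u64)]) -> Self {
        let mut to_id = HashMap::with_capacity(rows.len());
        let mut to_uuid = HashMap::with_capacity(rows.len());
        for &(uuid, id) in rows {
            to_id.insert(uuid, id);
            to_uuid.insert(id, uuid);
        }
        Self { to_id, to_uuid }
    }
}

/// Expands one hop over `(src_id, dst_id)` pairs using integer keys only,
/// then projects the result back to UUIDs so no surrogate leaves the function.
pub fn one_hop(map: &SurrogateMap, edges: &[(u64, u64)], start: &Uuid) -> Vec<Uuid> {
    let src = match map.to_id.get(start) {
        Some(id) => *id,
        None => return Vec::new(),
    };
    edges
        .iter()
        .filter(|&&(s, _)| s == src)
        .filter_map(|&(_, d)| map.to_uuid.get(&d).copied())
        .collect()
}
```

The return type is the point of the sketch: callers pass a UUID in and get UUIDs out, while every comparison in between is on `u64`.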

Objects requiring UUID identity¶

Object	UUID column
Node (entity)	`node_uuid`
Edge (relationship)	`edge_uuid`
Document	`doc_uuid`
Provenance event	`provenance_uuid`
Analyst/User	`analyst_uuid`
Project	`project_uuid`
Workflow	`workflow_uuid`
Embedding	`embedding_uuid`
Source reference	`source_uuid`
Ranking output row	`rank_uuid`
Clustering output row	`cluster_uuid`
Generated artifact	`artifact_uuid`

Storage Layout¶

GraphForge organises a project as a directory tree. The graph is one asset among several. Folders are capability modules — an absent folder means that capability is not yet enabled, not that it is misconfigured.

project/
├── graphforge.yaml     # project manifest (includes capabilities: field)
├── ontology.yaml       # domain ontology (optional)
│                       # ── GRAPH LAYER ──
├── topology/           # hot path — identity + type only
│   ├── nodes.parquet
│   ├── generation.json # topology_generation counter (staleness signal for derived indexes)
│   └── edges/          # one file per relation type
│       ├── WORKS_AT.parquet
│       └── OWNS.parquet
├── properties/         # warm path — entity property columns
│   ├── Person.parquet
│   └── Organization.parquet
├── indexes/adjacency/  # derived traversal accelerator (ADR 0005)
│                       # ── KNOWLEDGE LAYER ──
├── provenance/         # provenance events + lineage (Designed)
│   ├── events.parquet
│   └── lineage.parquet
├── knowledge/          # epistemic model — assertions, supersession, evidence, reasoning (ADR 0007, Designed)
│   ├── assertions.parquet
│   ├── supersession.parquet
│   ├── evidence.parquet
│   └── reasoning.parquet
│                       # ── WORKBENCH LAYER ──
├── documents/          # raw source files
├── embeddings/         # vector stores per model
├── indexes/            # FTS indexes (Tantivy)
├── workflows/          # saved analysis pipelines
├── artifacts/          # ranked/clustered outputs
└── sync/               # changeset-based collaboration (post-v0.5)

A minimal project needs only topology/ + properties/ (graph layer). Folders group by layer (ADR 0006); capabilities are additive.

Derived Indexes¶

The indexes/ folder holds derived, rebuildable acceleration structures. Nothing here is canonical: every file under indexes/ can be reconstructed from topology/ (and, for FTS, properties/) alone. An absent index is not an error — it means the accelerator has not been built yet, and the engine falls back to building in memory on demand. See ADR 0005.

`indexes/adjacency/` — graph-native adjacency index¶

The adjacency index is a derived CSR (compressed sparse row) representation of the topology, used by both the Cypher traversal path (variable-length Expand) and the analyst verbs (rank/cluster/paths/analyze/similar). It is optional: absent ⇒ build in memory on demand (today's behavior); present ⇒ load from disk. It is surrogate-keyed and never changes results — only speed.

indexes/
└── adjacency/
    ├── index_manifest.parquet
    ├── WORKS_AT.out.csr
    ├── WORKS_AT.in.csr
    ├── OWNS.out.csr
    ├── OWNS.in.csr
    ├── _all.out.csr          # union across relation types (for via=None)
    └── _all.in.csr

The builder (gf_storage::adjacency::build_adjacency_index) writes one {out, in} pair per relation type plus the _all union pair, then the manifest last. Relation names unusable as file stems (path separators, .., the reserved _all) are skipped — those relations are served by scan-build, but their rows still flow into the union index. The manifest is stamped with the topology_generation read before the edge scan, so a concurrent topology write mid-build can only make the result read as stale, never as falsely fresh.

index_manifest.parquet

Column	Arrow type	Notes
`relation_type`	`Utf8`	Relation type name, or `_all` for the union index
`direction`	`Utf8`	`"out"` | `"in"`
`topology_generation`	`UInt64`	Generation counter the CSR was built from; bumped on every topology write
`built_at`	`Timestamp(Microseconds, UTC)`
`node_count`	`UInt64`	Number of source nodes covered (CSR row count)
`edge_count`	`UInt64`	Number of `(edge, neighbor)` entries

CSR file (<REL_TYPE>.<dir>.csr) — single-batch Arrow IPC, one column, one row per surrogate node_id ∈ 0..node_count:

Column	Arrow type	Notes
`adjacency`	`LargeList<Struct { edge_id: UInt64, neighbor_id: UInt64 }>`	Row `i` holds the adjacency entries of surrogate `node_id = i`, in CSR order

This is the CSR structure in its idiomatic Arrow encoding — the two logical arrays cannot be two top-level columns because a RecordBatch requires equal column lengths. The list's offsets buffer is the CSR offsets array (length node_count + 1, Int64, starting at 0, monotone), and the flattened struct child is the targets array (length edge_count): neighbors of node_id = i are targets[offsets[i]..offsets[i+1]], zero-copy on read.

Conventions:

Empty graph: a zero-row batch — logical offsets == [0], empty targets. The offsets array is never empty.
Node with no neighbors: an empty list (offsets[i] == offsets[i+1]).
CSR rows cover exactly node_id ∈ 0..node_count; surrogates beyond node_count simply have no entries.
In-memory consumers see the logical offsets/targets model; the list encoding is a file-format detail handled by gf_storage::adjacency::CsrIndex on the storage side. The consumer interface is gf_exec::AdjacencyProvider (#760), served by scan-build via ScanBuildAdjacencyProvider until the persistent loader (#761).
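
The logical offsets/targets model and the conventions above can be sketched as follows. This is an in-memory illustration with hypothetical names (`Csr`, `build_out_csr`), not the `CsrIndex` implementation. It assumes edges whose `src_id` falls outside `0..node_count` are dropped, and it sorts by `(src_id, edge_id)` so the output is reproducible.

```rust
/// Logical CSR: neighbors of node `i` are `targets[offsets[i]..offsets[i + 1]]`.
pub struct Csr {
    /// Length `node_count + 1`, starts at 0, monotone non-decreasing.
    pub offsets: Vec<i64>,
    /// `(edge_id, neighbor_id)` entries, length `edge_count`.
    pub targets: Vec<(u64, u64)>,
}

/// Builds the outgoing CSR from `(edge_id, src_id, dst_id)` rows.
pub fn build_out_csr(node_count: usize, edges: &[(u64, u64, u64)]) -> Csr {
    // Assumption: rows with a source outside the covered range are ignored.
    let mut sorted: Vec<(u64, u64, u64)> = edges
        .iter()
        .copied()
        .filter(|&(_, src, _)| (src as usize) < node_count)
        .collect();
    // Sort by (src_id, edge_id); the edge_id tie-break fixes the entry order.
    sorted.sort_by_key(|&(edge_id, src, _)| (src, edge_id));

    // Count entries per source, then prefix-sum the counts into offsets.
    let mut offsets = vec![0i64; node_count + 1];
    for &(_, src, _) in &sorted {
        offsets[src as usize + 1] += 1;
    }
    for i in 0..node_count {
        offsets[i + 1] += offsets[i];
    }

    let targets = sorted.into_iter().map(|(e, _, d)| (e, d)).collect();
    Csr { offsets, targets }
}

impl Csr {
    /// Returns the adjacency slice for `node_id`, or an empty slice when the
    /// surrogate lies beyond the covered range.
    pub fn neighbors(&self, node_id: u64) -> &[(u64, u64)] {
        let i = node_id as usize;
        if i + 1 >= self.offsets.len() {
            return &[];
        }
        &self.targets[self.offsets[i] as usize..self.offsets[i + 1] as usize]
    }
}
```

With `node_count = 0` the builder returns `offsets == [0]` and empty targets, matching the empty-graph convention. An incoming CSR would be the same construction keyed on `dst_id`.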

Rebuild and versioning semantics¶

Source of truth. A CSR is always reconstructable from topology/edges/<REL_TYPE>.parquet alone, deterministically.
Counter persistence. The project topology_generation counter lives at topology/generation.json as {"topology_generation": N}, written atomically (sibling temp + rename). An absent file is generation 0 — safe, because every binary that writes indexes/adjacency/ also bumps, so no index manifest can predate the counter. The counter is machine-owned mutable state; it lives beside the topology it versions, not in the user-editable graphforge.yaml.
Bump rule. The counter is bumped immediately before committing any staged batch that rewrites topology/nodes.parquet or any file under topology/edges/ (including _exploratory.parquet) — one bump per committed batch (gf_storage::commit_topology_aware). Property-only writes (properties/, edge_properties/) never bump: properties cannot change adjacency.
Crash-safety invariant (bump-before-commit). A crash after the bump but before or during the commit leaves the counter advanced over unchanged topology: the index merely looks stale and is rebuilt — one spurious rebuild, correct results. The reverse order would be unsound: new topology under the old counter makes a stale index look fresh. Spurious bumps are safe; missed bumps are not.
Staleness detection. The provider compares the manifest's topology_generation against the current value. A corrupt counter file or manifest marks the index always-stale, never fresh.
Fallback. On mismatch (or absent index), the provider scans the typed edge tables and builds the adjacency in memory — yielding identical results, only slower. A stale or missing index can therefore never cause incorrect output.
Rebuild triggers. Lazy on first traversal when the indexes/adjacency/ capability is present, or explicit via forge.index("adjacency", ...). Incremental rebuild (append-delta + compaction) is deferred to v0.5.1.
Determinism (R-ADJ-2). Full rebuild streams each typed edge file once; out entries sort by (src_id, edge_id) and in entries by (dst_id, edge_id) — the edge_id tie-break makes the CSR bytes reproducible from topology/ alone. _all.{out,in}.csr are the same sorts over the union of all typed files plus _exploratory.parquet. The manifest's built_at is excluded from the determinism guarantee.
Build ordering. Builders write all CSR files first and index_manifest.parquet last, so a torn build reads as stale (absent/old manifest), never as fresh.
Loader semantics (gf_exec::PersistentAdjacencyProvider, #761). Freshness = manifest non-empty AND every row's topology_generation equals the current counter. Fresh + row present ⇒ load (adjacency=hit); stale or torn ⇒ lazy rebuild, then serve; fresh but no row for the requested relation ⇒ scan-build without rebuild (rebuilding cannot add an unknown relation — prevents a rebuild-per-query loop); corrupt counter or manifest ⇒ always-stale scan-build without rebuild (stamping needs a readable counter); capability dir absent ⇒ scan-build (adjacency=building). Typed-mode "*" bypasses the index entirely (pre-existing empty-scan behavior; reported as building, never a false miss). A build or load failure never fails the query — only its speed.
Direction. out and in CSRs are stored separately; undirected traversal unions them. In exploratory mode, _exploratory.parquet rows are routed by their rel_type_name column.
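
The loader decision table described above can be condensed into a single function. The types below are hypothetical stand-ins, not the `PersistentAdjacencyProvider` API. A torn build is modelled as a readable but empty manifest, and the typed-mode `"*"` bypass is left out.

```rust
/// What the provider does for one traversal request.
#[derive(Debug, PartialEq, Eq)]
pub enum Action {
    /// Fresh index with a matching row: load the CSR from disk.
    Load,
    /// Stale or torn index: rebuild lazily, then serve.
    RebuildThenServe,
    /// Build the adjacency in memory from the typed edge tables, no rebuild.
    ScanBuild,
}

pub struct ManifestRow {
    pub relation_type: String,
    pub direction: String,
    pub topology_generation: u64,
}

pub enum IndexState {
    /// The indexes/adjacency/ capability directory does not exist.
    Absent,
    /// The counter file or the manifest could not be read.
    Corrupt,
    /// Both were read; an empty manifest represents a torn build.
    Readable { manifest: Vec<ManifestRow>, current_generation: u64 },
}

pub fn decide(state: &IndexState, relation: &str, direction: &str) -> Action {
    match state {
        IndexState::Absent | IndexState::Corrupt => Action::ScanBuild,
        IndexState::Readable { manifest, current_generation } => {
            // Fresh = manifest non-empty AND every row matches the counter.
            let fresh = !manifest.is_empty()
                && manifest
                    .iter()
                    .all(|row| row.topology_generation == *current_generation);
            if !fresh {
                return Action::RebuildThenServe;
            }
            let has_row = manifest
                .iter()
                .any(|row| row.relation_type == relation && row.direction == direction);
            // A fresh index without the requested relation is not rebuilt:
            // rebuilding cannot add a relation the topology does not contain.
            if has_row { Action::Load } else { Action::ScanBuild }
        }
    }
}
```

Every branch ends in an action that serves the query, which mirrors the rule that index problems affect speed only, never results.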

`indexes/<LABEL>/tantivy/` — full-text search index¶

Full-text indexes (Tantivy) are also derived and rebuildable from properties/. See the Find milestone (forge.find / forge.index).

Graph Fact Schema¶

Topology layer (hot path)¶

Graph traversal reads only the topology layer. No property columns are read unless the query explicitly projects them.

topology/nodes.parquet

Column	Arrow type	Notes
`node_uuid`	`FixedSizeBinary(16)`	UUIDv7 — canonical stable identity
`node_id`	`UInt64`	Local surrogate — DataFusion join key
`type_id`	`UInt32`	Ontology entity type integer
`created_at`	`Timestamp(Microseconds, UTC)`
`updated_at`	`Timestamp(Microseconds, UTC)`

topology/edges/TYPENAME.parquet (one file per relation type)

Column	Arrow type	Notes
`edge_uuid`	`FixedSizeBinary(16)`	UUIDv7
`src_uuid`	`FixedSizeBinary(16)`	References `node_uuid`
`dst_uuid`	`FixedSizeBinary(16)`	References `node_uuid`
`edge_id`	`UInt64`	Local surrogate
`src_id`	`UInt64`	Local surrogate — DataFusion join key
`dst_id`	`UInt64`	Local surrogate — DataFusion join key
`created_at`	`Timestamp(Microseconds, UTC)`
`confidence`	`Float64`	0.0–1.0
`provenance_uuid`	`FixedSizeBinary(16)`	References `provenance/events.parquet`

Typed edge tables (one Parquet file per relation type) replace the unified edge_facts table. This enables direct scans on a single relation type without filtering, yielding significant I/O savings at 100M+ edges. See refactor-v0.5.md §7 for performance analysis.

Properties layer (warm path)¶

properties/ENTITY_TYPE.parquet (one file per entity type, columns per ontology)

Column	Arrow type	Notes
`node_uuid`	`FixedSizeBinary(16)`	Join key back to `topology/nodes.parquet`
(property columns)	(per ontology)	e.g. `name Utf8`, `age Int64`, `email Utf8`

Property access is a join: topology/nodes JOIN properties/Person ON node_uuid. DataFusion handles this as a hash join. The separation allows graph traversal to skip property I/O entirely.

Provenance layer (Knowledge Layer)¶

Layer + status (ADR 0006). provenance/ is part of the knowledge layer, not the graph layer. Graph rows carry a provenance_uuid reference; the events/lineage live here. Status: Designed — these tables are schema-defined but not yet written (the write path is owned by the Knowledge Layer Foundation milestone). The epistemic model (ADR 0007) adds knowledge/ tables (below) on top of this layer.

provenance/events.parquet

Column	Arrow type	Notes
`provenance_uuid`	`FixedSizeBinary(16)`	UUIDv7
`kind`	`Utf8`	`"ingestion"` | `"inference"` | `"assertion"` | `"merge"`
`source_ref`	`FixedSizeBinary(16)`	UUID of source document or system
`analyst_uuid`	`FixedSizeBinary(16)`	UUID of the contributing analyst
`rule_id`	`Utf8`	Ontology inference rule ID (nullable)
`confidence`	`Float64`	Confidence score for this event
`query_id`	`Utf8`	UUID of the query that produced this fact
`created_at`	`Timestamp(Microseconds, UTC)`

provenance/lineage.parquet

Column	Arrow type	Notes
`parent_uuid`	`FixedSizeBinary(16)`	Upstream provenance event
`child_uuid`	`FixedSizeBinary(16)`	Downstream provenance event
`role`	`Utf8`	`"input"` | `"derived"` | `"merged"`
`weight`	`Float64`	Contribution weight

Knowledge layer — epistemic tables (`knowledge/`)¶

Layer + status (ADR 0006 / ADR 0007). A capability folder reserved for the epistemic model — how an analyst's understanding evolves. Status: Designed (v0.5.0 target, Full scope; owned by the Epistemic Model milestone). All tables reference graph objects by *_uuid; none alter graph topology, and graph-native query results never depend on them. The folder is a capability module — absent means the epistemic model is simply not enabled, and the graph behaves exactly as before. Full schemas in ADR 0007.

knowledge/
├── assertions.parquet     # claims about graph objects: status (hypothesis|supported|refuted|
│                          # superseded|disputed), confidence, hypothesis_group, bitemporal
│                          # valid_from/valid_to + recorded_at/retracted_at — never deleted
├── supersession.parquet   # superseding_uuid → superseded_uuid (+ reason) — append-only history
├── evidence.parquet       # assertion_uuid → source_uuid, role (supports|contradicts|context)
└── reasoning.parquet      # assertion_uuid → why concluded / why an alternative was rejected

Key properties (preservation-over-deletion): refuting or superseding a claim preserves the prior assertion, its evidence, and its reasoning; competing assertions about the same question coexist via a shared hypothesis_group; bitemporal valid-time answers "what did we believe, when, and why did it change?" Bitemporal querying is capability-gated and off by default, with an assertion-time-only fallback (see ADR 0007).
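
As a loose illustration of what a bitemporal lookup involves, the sketch below filters an assertion on both time axes. Everything here is an assumption made for the example: the struct, the `i64` timestamps, and the half-open interval semantics. The actual schemas and query rules are defined in ADR 0007.

```rust
/// Hypothetical in-memory view of the four bitemporal columns.
pub struct Assertion {
    pub valid_from: i64,
    pub valid_to: Option<i64>,
    pub recorded_at: i64,
    pub retracted_at: Option<i64>,
}

/// Returns true if the assertion covers `valid_time` on the valid-time axis
/// and was on record (recorded and not yet retracted) at `as_of`.
/// Assumes half-open intervals: start inclusive, end exclusive.
pub fn believed(a: &Assertion, valid_time: i64, as_of: i64) -> bool {
    let valid = a.valid_from <= valid_time && a.valid_to.map_or(true, |end| valid_time < end);
    let on_record = a.recorded_at <= as_of && a.retracted_at.map_or(true, |end| as_of < end);
    valid && on_record
}
```

Because retraction only sets `retracted_at` and never removes the row, an earlier `as_of` still finds the assertion, which is the preservation-over-deletion property at work.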

Ontology Runtime¶

The ontology is a runtime-loadable knowledge schema, not Rust structs generated into the binary. Three representations serve different purposes:

Format	Purpose
YAML / JSON	Human-authored ontology definitions (Serde-based load)
Arrow tables	Compiled execution format — cheap joins during binding and planning
Parquet	Persisted for rapid startup or reproducible deployments

Ontology authoring format (YAML)¶

ontology_id: core
version: "2026.05"
entity_types:
  - name: Person
    abstract: false
  - name: Employee
    parent: Person
relation_types:
  - name: MANAGES
    src: Employee
    dst: Employee
    inverse: MANAGED_BY
    semantic:
      transitive: false
      symmetric: false
      functional: false
properties:
  - owner: Person
    name: name
    type: utf8
    nullable: false
constraints:
  - owner: Employee
    kind: unique_property
    expr:
      property: employee_id

At load time this compiles into Arrow lookup tables keyed by integer type IDs. String-heavy lookups during planning become O(1) integer comparisons.
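
A minimal sketch of that compilation step, using plain collections in place of Arrow tables. `EntityTypeTable` and its methods are hypothetical names, and the sketch assumes parents are declared before their children; duplicate and cycle checks are left to load-time validation.

```rust
use std::collections::HashMap;

/// Hypothetical compiled form of `entity_types`: names are resolved to
/// integer type IDs once, and inheritance is stored as ID-to-ID links.
pub struct EntityTypeTable {
    ids: HashMap<String, u32>,
    /// Index = type ID; value = parent type ID, if any.
    parent: Vec<Option<u32>>,
}

impl EntityTypeTable {
    /// Compiles `(name, parent name)` pairs in declaration order.
    /// Returns `None` if a parent has not been declared yet (assumption:
    /// parents precede children, so every chain ends at a root).
    pub fn compile(defs: &[(&str, Option<&str>)]) -> Option<Self> {
        let mut ids: HashMap<String, u32> = HashMap::new();
        let mut parent: Vec<Option<u32>> = Vec::new();
        for (name, parent_name) in defs {
            let parent_id = match parent_name {
                Some(p) => Some(*ids.get(*p)?),
                None => None,
            };
            ids.insert(name.to_string(), parent.len() as u32);
            parent.push(parent_id);
        }
        Some(Self { ids, parent })
    }

    /// The only string lookup; done once per name during binding.
    pub fn type_id(&self, name: &str) -> Option<u32> {
        self.ids.get(name).copied()
    }

    /// Walks parent links using integer comparisons only.
    pub fn is_subtype(&self, mut ty: u32, ancestor: u32) -> bool {
        loop {
            if ty == ancestor {
                return true;
            }
            match self.parent[ty as usize] {
                Some(p) => ty = p,
                None => return false,
            }
        }
    }
}
```

With the authoring example above, `Person` would compile to type ID 0 and `Employee` to 1, and a subtype check between them touches no strings.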

Ontology runtime tables¶

Table	Purpose
`ontology_meta`	Identity, version, IR compatibility range, checksum
`entity_types`	Node classes and inheritance DAG (acyclicity enforced at load)
`relation_types`	Edge classes, endpoint type constraints, inverse pairs
`property_types`	Name, owner, value type, nullability, cardinality
`type_constraints`	Validation rules (unique, required, range)
`cardinality_rules`	Endpoint multiplicity (min/max per relation type)
`semantic_flags`	`transitive`, `symmetric`, `reflexive`, `functional`, `acyclic`
`aliases`	Human-facing and deprecated names
`migrations`	Versioned ontology upgrade transforms

Ontology versioning¶

Two independent version axes:

Axis	Meaning
`ontology_version`	Meaning of types and rules — changes when the schema evolves
`ir_version`	Runtime/compiler contract — changes when the IR format changes

A new ontology version does not require a new IR version, and vice versa. Persisted datasets record the ontology_version used to write them. Arrow schema metadata carries both versions through IPC and Parquet round-trips.

Validation model¶

Level	When	Examples
Ontology-load	On file/table load	Duplicate names, missing parents, inheritance cycles, bad inverse references
Write-time	On `CREATE`, `MERGE`, batch ingest	Unknown property, wrong value type, illegal endpoint type, cardinality overflow
Query-time	During binding/planning	Unknown labels/types/properties, illegal pattern shape, ambiguous property resolution
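
The ontology-load checks for entity types can be sketched as below. The error enum and function name are hypothetical, the input is simplified to `(name, parent)` pairs, and bad inverse references are not covered.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical load-time errors for the entity-type section of an ontology.
#[derive(Debug, PartialEq, Eq)]
pub enum LoadError {
    DuplicateName(String),
    MissingParent(String),
    InheritanceCycle(String),
}

/// Validates `(name, parent name)` pairs in any declaration order.
pub fn validate_entity_types(defs: &[(&str, Option<&str>)]) -> Result<(), LoadError> {
    // Pass 1: register every name and reject duplicates.
    let mut parents: HashMap<&str, Option<&str>> = HashMap::new();
    for &(name, parent) in defs {
        if parents.insert(name, parent).is_some() {
            return Err(LoadError::DuplicateName(name.to_string()));
        }
    }
    // Pass 2: every parent must exist, and no parent chain may loop.
    for &(name, parent) in defs {
        if let Some(p) = parent {
            if !parents.contains_key(p) {
                return Err(LoadError::MissingParent(p.to_string()));
            }
        }
        let mut seen = HashSet::new();
        let mut current = Some(name);
        while let Some(ty) = current {
            // Seeing a type twice on one chain means the chain is cyclic.
            if !seen.insert(ty) {
                return Err(LoadError::InheritanceCycle(name.to_string()));
            }
            current = parents.get(ty).copied().flatten();
        }
    }
    Ok(())
}
```

The function returns the first problem it finds. A production loader would more likely collect all errors so an author can fix an ontology file in one pass.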

Serialization Systems¶

Never mix these two systems:

System	Purpose	Format
Arrow / Parquet (`gf-storage`)	Graph topology, properties, provenance	Binary columnar (Arrow IPC / Parquet)
JSON / YAML (`gf-ontology`)	Ontology definitions and metadata	Text (human-readable, validatable)

Graph data → Arrow/Parquet. Ontology/metadata → JSON or YAML. Arrow schema metadata carries version and provenance annotations across language boundaries.

Two-Mode Graph Instances¶

// In-memory (fast, volatile)
let forge = GraphForge::new(None)?;

// Persistent (project directory)
let forge = GraphForge::new(Some("path/to/project/"))?;

Both modes expose the same API; the storage layer is transparent to all API surfaces.

References¶

Architecture Overview — workspace layout and provider trait
Architecture Refactor v0.5 — UUID identity model, typed edge tables, project structure
Execution Model — how providers connect to DataFusion
ADR 0002: Rust Core — Parquet-as-primary and provider strategy