Execution Model¶

Status: Rust Core — active development on main Last Updated: 2026-06-07

Implementation status legend (used across architecture docs): Shipped = implemented and tested on main; Partially built = some paths real, others stubbed; Designed = specified, not yet implemented. As of this writing: the Cypher read pipeline, CREATE, variable-length traversal, OPTIONAL MATCH, and UNWIND are Shipped; the analyst verbs (rank/cluster/paths/ analyze/similar/find) are Designed (present in the Python 0.4.x shim; NotImplemented in the Rust core); provenance and confidence are Designed (schema defined; recorded/propagated by the Knowledge Layer Foundation milestone).

Overview¶

GraphForge uses DataFusion as the primary execution backbone. DataFusion provides the logical/physical planning pipeline, optimizer, table providers, and Arrow-native batch execution. GraphForge extends it with custom physical plan nodes for operations that require graph-native semantics.

The v0.5.0 unified API exposes seven entry points — execute, rank, cluster, paths, analyze, similar, and find — all producing Arrow Tables. execute travels the full Cypher compiler pipeline; the analyst verbs bypass the parser and planner but converge on the same DataFusion execution layer. execute is Shipped; the six analyst verbs are Designed in the Rust core (see status legend above). These verbs are workbench-layer surfaces (ADR 0006): they consume the graph and knowledge layers and hold no graph-semantic state.

Polars is used as a storage-layer companion for IO and sinks (CSV, JSON, Parquet, IPC). It is not the semantic owner of query execution.

Why DataFusion¶

DataFusion exposes the extension points GraphForge needs:

Custom LogicalPlan nodes — graph operators slot in alongside relational operators
Custom ExecutionPlan nodes — variable-length path expansion and provenance-aware joins as physical operators
Custom TableProvider — Parquet as a pluggable scan source
Custom optimizer rules — predicate pushdown and label-constraint hoisting
Custom QueryPlanner — GraphForge IR is lowered to DataFusion's logical plan rather than parsed SQL

DataFusion's Arrow-native execution means results already arrive as RecordBatch streams — no additional conversion is needed before returning to Python, Node, Swift, or Kotlin.

Execution Pipeline¶

GraphPlan (Graph IR)
      ↓
┌────────────────────────────┐
│  gf-rel: relational lower  │
│  Simple ops → LogicalPlan  │
│  Graph ops → custom nodes  │
└────────────────────────────┘
      ↓
┌────────────────────────────┐
│  DataFusion Analyzer       │
│  DataFusion Optimizer      │
│  (predicate pushdown, etc) │
└────────────────────────────┘
      ↓
┌────────────────────────────┐
│  DataFusion Physical Plan  │
│  gf custom PhysicalNodes   │
└────────────────────────────┘
      ↓
  SendableRecordBatchStream   ← Arrow columnar batches
      ↓
  gf-bindings-py / gf-bindings-node / gf-bindings-uniffi (Swift, Kotlin) / Rust API

Custom Graph Execution Nodes¶

GraphForge registers custom ExecutionPlan implementations with DataFusion for operators that cannot be faithfully expressed as relational algebra:

Node	Description
`VarLenExpand`	Iterative or recursive expansion for `*min..max` path patterns
`OptionalMatch`	Left-join semantics with Cypher null-shaping (distinct from SQL LEFT JOIN)
`PathUnique`	Path isomorphism/homomorphism enforcement
`ProvenanceSemijoin`	Semijoin with confidence propagation
`OntologyInfer`	Transitive/symmetric closure materialization
`GraphMerge`	MERGE upsert with write-path locking

All other operators (scan, filter, project, aggregate, sort, limit) run through standard DataFusion physical nodes.

Adjacency-Backed Execution¶

Traversal is fundamentally adjacency-oriented. GraphForge maintains a derived, rebuildable adjacency index (CSR, surrogate-keyed) under indexes/adjacency/ (ADR 0005, storage.md §Derived Indexes). Both execution paths consume it through a single AdjacencyProvider abstraction:

Cypher traversal path — ExpandExec / VarLenExpandExec ask the provider for a node's neighbors instead of rebuilding adjacency from a full edge scan per query.
Analyst-verb path — export_adjacency is a thin adapter from the same provider to the AdjacencyGraph consumed by the AlgorithmBackend trait.

Node	Description
`VarLenExpandExec`	Iterative BFS over the `AdjacencyProvider` for `*min..max` patterns (replaces the per-query in-memory adjacency build)
`ExpandExec`	Adjacency-backed single-hop expansion (#763): chosen at lowering time when the provider reports a `hit` for a typed relation (any direction; undirected wraps in `DISTINCT`, mirroring the join path's union+distinct). Exploratory single-hop and uncovered patterns keep the DataFusion join chain.

Selection is a planner choice, not an IR change. The Graph IR is unchanged: variable-length traversal is still encoded on Expand { …, min_hops, max_hops }. A lowering rule selects an adjacency-backed physical node when the provider covers the relation type + direction, and falls back to the DataFusion hash-join path otherwise. Both paths produce identical results; only speed differs.

The index is optional and never authoritative: a stale or missing index falls back to scan-and-build with identical output (the topology_generation counter detects staleness). The canonical Parquet topology is always the source of truth.

explain annotates each traversal node with adjacency=hit | miss | building, so plan inspection shows whether the accelerator was used (ExecutionSession::explain_physical renders the physical plan without executing it). One PersistentAdjacencyProvider lives per session (= per query) with an interior per-(relation, direction) cache, so multi-expand queries load each CSR once; lifting the provider to the long-lived facade for cross-query caching is a planned follow-on.

Analyst Verbs¶

rank(), cluster(), paths(), analyze(), similar(), and find() bypass the Cypher parser and planner entirely. They accept structured arguments, build their own DataFusion sub-plans, and return Arrow Tables through the same result contract as execute().

Method	Bypasses	Uses	Returns
`rank(label, by=…)`	Parser, planner	Adjacency export → algorithm dispatch → DataFusion project	node properties + `score: Float64`
`cluster(label, by=…)`	Parser, planner	Adjacency export → community algorithm → DataFusion project	node properties + `community_id: Int64`
`paths(source, target, by=…)`	Parser, planner	Adjacency export → path algorithm → DataFusion project	`source_id`, `target_id`, `cost`, `path`
`analyze(label, by=…)`	Parser, planner	Adjacency export → structural analysis → DataFusion project	varies by algorithm
`similar(label, by=…)`	Parser, planner	Adjacency export → similarity → DataFusion project	`node1_id`, `node2_id`, `similarity`
`find(query, vector=…)`	Parser, planner	FTS / vector index → RRF fusion → DataFusion project	node properties + `score` + `matched_on`

Key design facts:

No separate result type. All analyst verbs return Arrow Tables exactly as execute() does.
write_property is opt-in mutation. rank(), cluster(), and paths() are read-only by default.
find() indexes lazily. On the first call for a given label, a full-text and/or vector index is built. Call forge.index(label, …) explicitly to control when indexing occurs.
All paths converge at DataFusion. Adjacency export and scoring stages produce Arrow data that DataFusion projects and returns as RecordBatch streams.

Full algorithm catalog: Algorithm Verbs.

Arrow Result Contract¶

All results are returned as Arrow RecordBatch streams. The schema for query results carries GraphForge metadata:

fn result_schema(fields: Vec<Field>, query_id: &str, ontology_ver: &str) -> Schema {
    let mut meta = HashMap::new();
    meta.insert("graphforge.ir_version".into(), "1.0.0".into());
    meta.insert("graphforge.ontology_version".into(), ontology_ver.into());
    meta.insert("graphforge.query_id".into(), query_id.into());
    meta.insert("graphforge.result_kind".into(), "query_result".into());
    Schema::new_with_metadata(fields, meta)
}

Topology node schema (the shipped layout — see crates/gf-storage/src/schemas.rs):

// topology/nodes.parquet — identity + type only (the GRAPH LAYER hot path)
let node_schema = Schema::new(vec![
    Field::new("node_uuid",  DataType::FixedSizeBinary(16), false), // UUIDv7 — canonical identity
    Field::new("node_id",    DataType::UInt64,  false),             // local surrogate — join key
    Field::new("type_id",    DataType::UInt32,  false),             // ontology entity type
    Field::new("created_at", DataType::Timestamp(TimeUnit::Microsecond, Some("UTC".into())), false),
    Field::new("updated_at", DataType::Timestamp(TimeUnit::Microsecond, Some("UTC".into())), false),
]);

Layer note (ADR 0006). The topology node row holds only identity and type. Property values live in properties/ENTITY_TYPE.parquet (graph layer, warm path). Confidence, provenance, and valid-time are knowledge-layer concerns (ADR 0006) — they are recorded in the knowledge layer and attached to graph objects by node_uuid/edge_uuid, not as columns on topology tables. Bitemporal valid-time lives on epistemic assertions (ADR 0007), not on raw nodes. (Earlier drafts of this doc showed props_json/confidence/provenance_id/valid_*_ts on the node row; that pre-refactor schema has been superseded by the topology/properties split and the layer boundary.)

Typed edge tables carry a confidence: Float64 and a provenance_uuid column (see storage.md). These are knowledge-layer references persisted alongside the edge for locality; the values are produced and interpreted by the knowledge layer (provenance/confidence policy), not by graph-native execution.

Memory Model¶

Arrow is used as the internal execution currency:

Columnar RecordBatch — the unit of data flowing between operators
C Data Interface — zero-copy in-process sharing across Rust and Python
C Stream Interface — batch readers for streaming results
Arrow IPC — serialized stream for cross-process use (Node, Swift, Kotlin)

For Swift and Kotlin, Arrow IPC bytes (Vec<u8> / ByteArray) cross the UniFFI boundary. Callers deserialise with their platform Arrow library or the built-in asDictionary() / asList() convenience methods on GraphForgeResult.

Provenance and Confidence¶

GraphForge tracks two orthogonal concerns:

Provenance is structural and exact:

provenance_events  (provenance_id, kind, source_ref, ingest_run_id, query_id, created_at)
provenance_inputs  (provenance_id, input_provenance_id, role, weight_local)

Confidence is pluggable. The default policy is conservative_min:

Operation	Default policy
Base facts	Explicit `confidence ∈ [0,1]`, default `1.0`
Conjunction (MATCH joins, path composition)	`min(inputs)`
Union-like alternatives	`max(inputs)`
Aggregation	Explicit per-aggregate rule
Inference / materialization	Records `confidence_model` + `rule_id`

Error Handling¶

Layer	Tool	Approach
Public Rust API	`thiserror`	Typed error enums: `ParseError`, `BindError`, `OntologyError`, `StorageError`, `ExecutionError`
Rust internals	`anyhow`	Contextual error accumulation
Python binding	`pyo3` exception mapping	Each Rust error category → distinct Python exception class
Node binding	`napi-rs` error conversion	`Error` with `code`, `message`, and structured `details`
Swift binding	UniFFI error interface	`GraphForgeError` Swift enum with `LocalizedError` conformance
Kotlin binding	UniFFI error interface	`GfException` sealed class hierarchy

References¶

Architecture Overview — workspace layout and unified API
Algorithm Verbs — full algorithm catalog
AST & Planning — compiler pipeline and Graph IR
Storage — StorageProvider trait, Parquet provider
ADR 0002: Rust Core — DataFusion and Arrow choice rationale