Skip to content

Execution Model

Status: Rust Core — active development on main Last Updated: 2026-06-07

Implementation status legend (used across architecture docs): Shipped = implemented and tested on main; Partially built = some paths real, others stubbed; Designed = specified, not yet implemented. As of this writing: the Cypher read pipeline, CREATE, variable-length traversal, OPTIONAL MATCH, and UNWIND are Shipped; the analyst verbs (rank/cluster/paths/ analyze/similar/find) are Designed (present in the Python 0.4.x shim; NotImplemented in the Rust core); provenance and confidence are Designed (schema defined; recorded/propagated by the Knowledge Layer Foundation milestone).


Overview

GraphForge uses DataFusion as the primary execution backbone. DataFusion provides the logical/physical planning pipeline, optimizer, table providers, and Arrow-native batch execution. GraphForge extends it with custom physical plan nodes for operations that require graph-native semantics.

The v0.5.0 unified API exposes seven entry points — execute, rank, cluster, paths, analyze, similar, and find — all producing Arrow Tables. execute travels the full Cypher compiler pipeline; the analyst verbs bypass the parser and planner but converge on the same DataFusion execution layer. execute is Shipped; the six analyst verbs are Designed in the Rust core (see status legend above). These verbs are workbench-layer surfaces (ADR 0006): they consume the graph and knowledge layers and hold no graph-semantic state.

Polars is used as a storage-layer companion for IO and sinks (CSV, JSON, Parquet, IPC). It is not the semantic owner of query execution.


Why DataFusion

DataFusion exposes the extension points GraphForge needs:

  • Custom LogicalPlan nodes — graph operators slot in alongside relational operators
  • Custom ExecutionPlan nodes — variable-length path expansion and provenance-aware joins as physical operators
  • Custom TableProvider — Parquet as a pluggable scan source
  • Custom optimizer rules — predicate pushdown and label-constraint hoisting
  • Custom QueryPlanner — GraphForge IR is lowered to DataFusion's logical plan rather than parsed SQL

DataFusion's Arrow-native execution means results already arrive as RecordBatch streams — no additional conversion is needed before returning to Python, Node, Swift, or Kotlin.


Execution Pipeline

GraphPlan (Graph IR)
      ↓
┌────────────────────────────┐
│  gf-rel: relational lower  │
│  Simple ops → LogicalPlan  │
│  Graph ops → custom nodes  │
└────────────────────────────┘
      ↓
┌────────────────────────────┐
│  DataFusion Analyzer       │
│  DataFusion Optimizer      │
│  (predicate pushdown, etc) │
└────────────────────────────┘
      ↓
┌────────────────────────────┐
│  DataFusion Physical Plan  │
│  gf custom PhysicalNodes   │
└────────────────────────────┘
      ↓
  SendableRecordBatchStream   ← Arrow columnar batches
      ↓
  gf-bindings-py / gf-bindings-node / gf-bindings-uniffi (Swift, Kotlin) / Rust API

Custom Graph Execution Nodes

GraphForge registers custom ExecutionPlan implementations with DataFusion for operators that cannot be faithfully expressed as relational algebra:

Node Description
VarLenExpand Iterative or recursive expansion for *min..max path patterns
OptionalMatch Left-join semantics with Cypher null-shaping (distinct from SQL LEFT JOIN)
PathUnique Path isomorphism/homomorphism enforcement
ProvenanceSemijoin Semijoin with confidence propagation
OntologyInfer Transitive/symmetric closure materialization
GraphMerge MERGE upsert with write-path locking

All other operators (scan, filter, project, aggregate, sort, limit) run through standard DataFusion physical nodes.


Adjacency-Backed Execution

Traversal is fundamentally adjacency-oriented. GraphForge maintains a derived, rebuildable adjacency index (CSR, surrogate-keyed) under indexes/adjacency/ (ADR 0005, storage.md §Derived Indexes). Both execution paths consume it through a single AdjacencyProvider abstraction:

  • Cypher traversal pathExpandExec / VarLenExpandExec ask the provider for a node's neighbors instead of rebuilding adjacency from a full edge scan per query.
  • Analyst-verb pathexport_adjacency is a thin adapter from the same provider to the AdjacencyGraph consumed by the AlgorithmBackend trait.
Node Description
VarLenExpandExec Iterative BFS over the AdjacencyProvider for *min..max patterns (replaces the per-query in-memory adjacency build)
ExpandExec Adjacency-backed single-hop expansion (#763): chosen at lowering time when the provider reports a hit for a typed relation (any direction; undirected wraps in DISTINCT, mirroring the join path's union+distinct). Exploratory single-hop and uncovered patterns keep the DataFusion join chain.

Selection is a planner choice, not an IR change. The Graph IR is unchanged: variable-length traversal is still encoded on Expand { …, min_hops, max_hops }. A lowering rule selects an adjacency-backed physical node when the provider covers the relation type + direction, and falls back to the DataFusion hash-join path otherwise. Both paths produce identical results; only speed differs.

The index is optional and never authoritative: a stale or missing index falls back to scan-and-build with identical output (the topology_generation counter detects staleness). The canonical Parquet topology is always the source of truth.

explain annotates each traversal node with adjacency=hit | miss | building, so plan inspection shows whether the accelerator was used (ExecutionSession::explain_physical renders the physical plan without executing it). One PersistentAdjacencyProvider lives per session (= per query) with an interior per-(relation, direction) cache, so multi-expand queries load each CSR once; lifting the provider to the long-lived facade for cross-query caching is a planned follow-on.


Analyst Verbs

rank(), cluster(), paths(), analyze(), similar(), and find() bypass the Cypher parser and planner entirely. They accept structured arguments, build their own DataFusion sub-plans, and return Arrow Tables through the same result contract as execute().

Method Bypasses Uses Returns
rank(label, by=…) Parser, planner Adjacency export → algorithm dispatch → DataFusion project node properties + score: Float64
cluster(label, by=…) Parser, planner Adjacency export → community algorithm → DataFusion project node properties + community_id: Int64
paths(source, target, by=…) Parser, planner Adjacency export → path algorithm → DataFusion project source_id, target_id, cost, path
analyze(label, by=…) Parser, planner Adjacency export → structural analysis → DataFusion project varies by algorithm
similar(label, by=…) Parser, planner Adjacency export → similarity → DataFusion project node1_id, node2_id, similarity
find(query, vector=…) Parser, planner FTS / vector index → RRF fusion → DataFusion project node properties + score + matched_on

Key design facts:

  • No separate result type. All analyst verbs return Arrow Tables exactly as execute() does.
  • write_property is opt-in mutation. rank(), cluster(), and paths() are read-only by default.
  • find() indexes lazily. On the first call for a given label, a full-text and/or vector index is built. Call forge.index(label, …) explicitly to control when indexing occurs.
  • All paths converge at DataFusion. Adjacency export and scoring stages produce Arrow data that DataFusion projects and returns as RecordBatch streams.

Full algorithm catalog: Algorithm Verbs.


Arrow Result Contract

All results are returned as Arrow RecordBatch streams. The schema for query results carries GraphForge metadata:

fn result_schema(fields: Vec<Field>, query_id: &str, ontology_ver: &str) -> Schema {
    let mut meta = HashMap::new();
    meta.insert("graphforge.ir_version".into(), "1.0.0".into());
    meta.insert("graphforge.ontology_version".into(), ontology_ver.into());
    meta.insert("graphforge.query_id".into(), query_id.into());
    meta.insert("graphforge.result_kind".into(), "query_result".into());
    Schema::new_with_metadata(fields, meta)
}

Topology node schema (the shipped layout — see crates/gf-storage/src/schemas.rs):

// topology/nodes.parquet — identity + type only (the GRAPH LAYER hot path)
let node_schema = Schema::new(vec![
    Field::new("node_uuid",  DataType::FixedSizeBinary(16), false), // UUIDv7 — canonical identity
    Field::new("node_id",    DataType::UInt64,  false),             // local surrogate — join key
    Field::new("type_id",    DataType::UInt32,  false),             // ontology entity type
    Field::new("created_at", DataType::Timestamp(TimeUnit::Microsecond, Some("UTC".into())), false),
    Field::new("updated_at", DataType::Timestamp(TimeUnit::Microsecond, Some("UTC".into())), false),
]);

Layer note (ADR 0006). The topology node row holds only identity and type. Property values live in properties/ENTITY_TYPE.parquet (graph layer, warm path). Confidence, provenance, and valid-time are knowledge-layer concerns (ADR 0006) — they are recorded in the knowledge layer and attached to graph objects by node_uuid/edge_uuid, not as columns on topology tables. Bitemporal valid-time lives on epistemic assertions (ADR 0007), not on raw nodes. (Earlier drafts of this doc showed props_json/confidence/provenance_id/valid_*_ts on the node row; that pre-refactor schema has been superseded by the topology/properties split and the layer boundary.)

Typed edge tables carry a confidence: Float64 and a provenance_uuid column (see storage.md). These are knowledge-layer references persisted alongside the edge for locality; the values are produced and interpreted by the knowledge layer (provenance/confidence policy), not by graph-native execution.


Memory Model

Arrow is used as the internal execution currency:

  • Columnar RecordBatch — the unit of data flowing between operators
  • C Data Interface — zero-copy in-process sharing across Rust and Python
  • C Stream Interface — batch readers for streaming results
  • Arrow IPC — serialized stream for cross-process use (Node, Swift, Kotlin)

For Swift and Kotlin, Arrow IPC bytes (Vec<u8> / ByteArray) cross the UniFFI boundary. Callers deserialise with their platform Arrow library or the built-in asDictionary() / asList() convenience methods on GraphForgeResult.


Provenance and Confidence

GraphForge tracks two orthogonal concerns:

Provenance is structural and exact:

provenance_events  (provenance_id, kind, source_ref, ingest_run_id, query_id, created_at)
provenance_inputs  (provenance_id, input_provenance_id, role, weight_local)

Confidence is pluggable. The default policy is conservative_min:

Operation Default policy
Base facts Explicit confidence ∈ [0,1], default 1.0
Conjunction (MATCH joins, path composition) min(inputs)
Union-like alternatives max(inputs)
Aggregation Explicit per-aggregate rule
Inference / materialization Records confidence_model + rule_id

Error Handling

Layer Tool Approach
Public Rust API thiserror Typed error enums: ParseError, BindError, OntologyError, StorageError, ExecutionError
Rust internals anyhow Contextual error accumulation
Python binding pyo3 exception mapping Each Rust error category → distinct Python exception class
Node binding napi-rs error conversion Error with code, message, and structured details
Swift binding UniFFI error interface GraphForgeError Swift enum with LocalizedError conformance
Kotlin binding UniFFI error interface GfException sealed class hierarchy

References