Execution Model¶
Status: Rust Core — active development on main
Last Updated: 2026-06-07
Implementation status legend (used across architecture docs): Shipped = implemented and tested on
main; Partially built = some paths real, others stubbed; Designed = specified, not yet implemented. As of this writing: the Cypher read pipeline, CREATE, variable-length traversal, OPTIONAL MATCH, and UNWIND are Shipped; the analyst verbs (rank/cluster/paths/analyze/similar/find) are Designed (present in the Python 0.4.x shim;NotImplementedin the Rust core); provenance and confidence are Designed (schema defined; recorded/propagated by the Knowledge Layer Foundation milestone).
Overview¶
GraphForge uses DataFusion as the primary execution backbone. DataFusion provides the logical/physical planning pipeline, optimizer, table providers, and Arrow-native batch execution. GraphForge extends it with custom physical plan nodes for operations that require graph-native semantics.
The v0.5.0 unified API exposes seven entry points — execute, rank, cluster, paths,
analyze, similar, and find — all producing Arrow Tables. execute travels the full
Cypher compiler pipeline; the analyst verbs bypass the parser and planner but converge on
the same DataFusion execution layer. execute is Shipped; the six analyst verbs are
Designed in the Rust core (see status legend above). These verbs are workbench-layer
surfaces (ADR 0006): they consume the graph and knowledge
layers and hold no graph-semantic state.
Polars is used as a storage-layer companion for IO and sinks (CSV, JSON, Parquet, IPC). It is not the semantic owner of query execution.
Why DataFusion¶
DataFusion exposes the extension points GraphForge needs:
- Custom
LogicalPlannodes — graph operators slot in alongside relational operators - Custom
ExecutionPlannodes — variable-length path expansion and provenance-aware joins as physical operators - Custom
TableProvider— Parquet as a pluggable scan source - Custom optimizer rules — predicate pushdown and label-constraint hoisting
- Custom
QueryPlanner— GraphForge IR is lowered to DataFusion's logical plan rather than parsed SQL
DataFusion's Arrow-native execution means results already arrive as RecordBatch streams —
no additional conversion is needed before returning to Python, Node, Swift, or Kotlin.
Execution Pipeline¶
GraphPlan (Graph IR)
↓
┌────────────────────────────┐
│ gf-rel: relational lower │
│ Simple ops → LogicalPlan │
│ Graph ops → custom nodes │
└────────────────────────────┘
↓
┌────────────────────────────┐
│ DataFusion Analyzer │
│ DataFusion Optimizer │
│ (predicate pushdown, etc) │
└────────────────────────────┘
↓
┌────────────────────────────┐
│ DataFusion Physical Plan │
│ gf custom PhysicalNodes │
└────────────────────────────┘
↓
SendableRecordBatchStream ← Arrow columnar batches
↓
gf-bindings-py / gf-bindings-node / gf-bindings-uniffi (Swift, Kotlin) / Rust API
Custom Graph Execution Nodes¶
GraphForge registers custom ExecutionPlan implementations with DataFusion for operators
that cannot be faithfully expressed as relational algebra:
| Node | Description |
|---|---|
VarLenExpand |
Iterative or recursive expansion for *min..max path patterns |
OptionalMatch |
Left-join semantics with Cypher null-shaping (distinct from SQL LEFT JOIN) |
PathUnique |
Path isomorphism/homomorphism enforcement |
ProvenanceSemijoin |
Semijoin with confidence propagation |
OntologyInfer |
Transitive/symmetric closure materialization |
GraphMerge |
MERGE upsert with write-path locking |
All other operators (scan, filter, project, aggregate, sort, limit) run through standard DataFusion physical nodes.
Adjacency-Backed Execution¶
Traversal is fundamentally adjacency-oriented. GraphForge maintains a derived, rebuildable
adjacency index (CSR, surrogate-keyed) under indexes/adjacency/
(ADR 0005, storage.md §Derived Indexes). Both
execution paths consume it through a single AdjacencyProvider abstraction:
- Cypher traversal path —
ExpandExec/VarLenExpandExecask the provider for a node's neighbors instead of rebuilding adjacency from a full edge scan per query. - Analyst-verb path —
export_adjacencyis a thin adapter from the same provider to theAdjacencyGraphconsumed by theAlgorithmBackendtrait.
| Node | Description |
|---|---|
VarLenExpandExec |
Iterative BFS over the AdjacencyProvider for *min..max patterns (replaces the per-query in-memory adjacency build) |
ExpandExec |
Adjacency-backed single-hop expansion (#763): chosen at lowering time when the provider reports a hit for a typed relation (any direction; undirected wraps in DISTINCT, mirroring the join path's union+distinct). Exploratory single-hop and uncovered patterns keep the DataFusion join chain. |
Selection is a planner choice, not an IR change. The Graph IR is unchanged: variable-length
traversal is still encoded on Expand { …, min_hops, max_hops }. A lowering rule selects an
adjacency-backed physical node when the provider covers the relation type + direction, and falls
back to the DataFusion hash-join path otherwise. Both paths produce identical results; only
speed differs.
The index is optional and never authoritative: a stale or missing index falls back to
scan-and-build with identical output (the topology_generation counter detects staleness). The
canonical Parquet topology is always the source of truth.
explain annotates each traversal node with adjacency=hit | miss | building, so plan
inspection shows whether the accelerator was used (ExecutionSession::explain_physical
renders the physical plan without executing it). One PersistentAdjacencyProvider lives per
session (= per query) with an interior per-(relation, direction) cache, so multi-expand
queries load each CSR once; lifting the provider to the long-lived facade for cross-query
caching is a planned follow-on.
Analyst Verbs¶
rank(), cluster(), paths(), analyze(), similar(), and find() bypass the Cypher
parser and planner entirely. They accept structured arguments, build their own DataFusion
sub-plans, and return Arrow Tables through the same result contract as execute().
| Method | Bypasses | Uses | Returns |
|---|---|---|---|
rank(label, by=…) |
Parser, planner | Adjacency export → algorithm dispatch → DataFusion project | node properties + score: Float64 |
cluster(label, by=…) |
Parser, planner | Adjacency export → community algorithm → DataFusion project | node properties + community_id: Int64 |
paths(source, target, by=…) |
Parser, planner | Adjacency export → path algorithm → DataFusion project | source_id, target_id, cost, path |
analyze(label, by=…) |
Parser, planner | Adjacency export → structural analysis → DataFusion project | varies by algorithm |
similar(label, by=…) |
Parser, planner | Adjacency export → similarity → DataFusion project | node1_id, node2_id, similarity |
find(query, vector=…) |
Parser, planner | FTS / vector index → RRF fusion → DataFusion project | node properties + score + matched_on |
Key design facts:
- No separate result type. All analyst verbs return Arrow Tables exactly as
execute()does. write_propertyis opt-in mutation.rank(),cluster(), andpaths()are read-only by default.find()indexes lazily. On the first call for a given label, a full-text and/or vector index is built. Callforge.index(label, …)explicitly to control when indexing occurs.- All paths converge at DataFusion. Adjacency export and scoring stages produce Arrow data that DataFusion projects and returns as
RecordBatchstreams.
Full algorithm catalog: Algorithm Verbs.
Arrow Result Contract¶
All results are returned as Arrow RecordBatch streams. The schema for query results carries GraphForge metadata:
fn result_schema(fields: Vec<Field>, query_id: &str, ontology_ver: &str) -> Schema {
let mut meta = HashMap::new();
meta.insert("graphforge.ir_version".into(), "1.0.0".into());
meta.insert("graphforge.ontology_version".into(), ontology_ver.into());
meta.insert("graphforge.query_id".into(), query_id.into());
meta.insert("graphforge.result_kind".into(), "query_result".into());
Schema::new_with_metadata(fields, meta)
}
Topology node schema (the shipped layout — see crates/gf-storage/src/schemas.rs):
// topology/nodes.parquet — identity + type only (the GRAPH LAYER hot path)
let node_schema = Schema::new(vec![
Field::new("node_uuid", DataType::FixedSizeBinary(16), false), // UUIDv7 — canonical identity
Field::new("node_id", DataType::UInt64, false), // local surrogate — join key
Field::new("type_id", DataType::UInt32, false), // ontology entity type
Field::new("created_at", DataType::Timestamp(TimeUnit::Microsecond, Some("UTC".into())), false),
Field::new("updated_at", DataType::Timestamp(TimeUnit::Microsecond, Some("UTC".into())), false),
]);
Layer note (ADR 0006). The topology node row holds only identity and type. Property values live in
properties/ENTITY_TYPE.parquet(graph layer, warm path). Confidence, provenance, and valid-time are knowledge-layer concerns (ADR 0006) — they are recorded in the knowledge layer and attached to graph objects bynode_uuid/edge_uuid, not as columns on topology tables. Bitemporal valid-time lives on epistemic assertions (ADR 0007), not on raw nodes. (Earlier drafts of this doc showedprops_json/confidence/provenance_id/valid_*_tson the node row; that pre-refactor schema has been superseded by the topology/properties split and the layer boundary.)
Typed edge tables carry a confidence: Float64 and a provenance_uuid column (see storage.md).
These are knowledge-layer references persisted alongside the edge for locality; the values are
produced and interpreted by the knowledge layer (provenance/confidence policy), not by graph-native
execution.
Memory Model¶
Arrow is used as the internal execution currency:
- Columnar RecordBatch — the unit of data flowing between operators
- C Data Interface — zero-copy in-process sharing across Rust and Python
- C Stream Interface — batch readers for streaming results
- Arrow IPC — serialized stream for cross-process use (Node, Swift, Kotlin)
For Swift and Kotlin, Arrow IPC bytes (Vec<u8> / ByteArray) cross the UniFFI boundary.
Callers deserialise with their platform Arrow library or the built-in asDictionary() /
asList() convenience methods on GraphForgeResult.
Provenance and Confidence¶
GraphForge tracks two orthogonal concerns:
Provenance is structural and exact:
provenance_events (provenance_id, kind, source_ref, ingest_run_id, query_id, created_at)
provenance_inputs (provenance_id, input_provenance_id, role, weight_local)
Confidence is pluggable. The default policy is conservative_min:
| Operation | Default policy |
|---|---|
| Base facts | Explicit confidence ∈ [0,1], default 1.0 |
| Conjunction (MATCH joins, path composition) | min(inputs) |
| Union-like alternatives | max(inputs) |
| Aggregation | Explicit per-aggregate rule |
| Inference / materialization | Records confidence_model + rule_id |
Error Handling¶
| Layer | Tool | Approach |
|---|---|---|
| Public Rust API | thiserror |
Typed error enums: ParseError, BindError, OntologyError, StorageError, ExecutionError |
| Rust internals | anyhow |
Contextual error accumulation |
| Python binding | pyo3 exception mapping |
Each Rust error category → distinct Python exception class |
| Node binding | napi-rs error conversion |
Error with code, message, and structured details |
| Swift binding | UniFFI error interface | GraphForgeError Swift enum with LocalizedError conformance |
| Kotlin binding | UniFFI error interface | GfException sealed class hierarchy |
References¶
- Architecture Overview — workspace layout and unified API
- Algorithm Verbs — full algorithm catalog
- AST & Planning — compiler pipeline and Graph IR
- Storage — StorageProvider trait, Parquet provider
- ADR 0002: Rust Core — DataFusion and Arrow choice rationale