GraphForge Architecture Overview

Status: Rust Core (active development on main)
Last Updated: 2026-06-07

Implementation status legend (used across architecture docs):

Shipped = implemented and tested on main
Partially built = some paths real, others stubbed
Designed = specified, not yet implemented

Current status: the graph layer's Cypher read/write pipeline and traversal are Shipped; the workbench verbs and the knowledge layer (provenance, confidence, epistemic model) are Designed and owned by dedicated v0.5.0 milestones.

Executive Summary

GraphForge is a Knowledge Analysis Workbench — not a graph database or a graph analytics engine. It optimizes for analyst workflows that begin with uncertainty, discover structure over time, and progressively formalize that structure into ontology, workflows, and repeatable analysis.

The project (not the graph) is the primary unit of work:

Project = Knowledge Graph + Documents + Provenance + Embeddings + Workflows + Artifacts + Sync State

The internal implementation is a Rust core with multi-language bindings. The v0.5.0 release introduces a unified API and a compiler pipeline (DataFusion-backed execution, Arrow as the stable in-memory and FFI contract).

The Python 0.4.x codebase is the reference implementation. All Rust work targets main.

Architecture Principles

Arrow is the wire contract — results cross language boundaries as Arrow RecordBatch streams; no GraphForge-specific buffer protocol
GraphForge owns the semantics — the Cypher compiler, ontology, and Graph IR live in GraphForge-owned Rust crates; no storage provider or binding becomes the semantic owner
DataFusion is the execution backbone — GraphForge extends DataFusion with custom graph operators rather than writing a full executor from scratch
Unified result contract — all methods return Arrow Tables; no surface returns a bespoke result type
Correctness over performance — strict openCypher TCK compliance remains the primary constraint
Ontology is progressive, not required — GraphForge supports three modes: exploratory (no ontology required, all labels accepted), advisory (ontology present, violations are warnings), and strict (ontology enforced, violations are errors). Exploratory analysis is a first-class workflow. See ADR 0004.
Three layers, clean boundaries — graph concerns, knowledge concerns, and workbench concerns are separated; the graph layer stays graph-native and never absorbs the others. See the next section and ADR 0006.
Preserve the evolution of understanding — GraphForge records not just the current state of knowledge but how it evolved: competing hypotheses, superseded conclusions, evidence, and reasoning are preserved, never destroyed. See ADR 0007.
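
The three ontology modes named in the principles above can be illustrated with a small sketch. It is written in plain Python for readability and is not the Rust implementation in gf-ontology; `OntologyMode`, `OntologyViolation`, `LabelCheck`, and `check_label` are hypothetical names introduced here, and the check covers node labels only.

```python
from dataclasses import dataclass, field
from enum import Enum


class OntologyMode(Enum):
    EXPLORATORY = "exploratory"  # no ontology required, all labels accepted
    ADVISORY = "advisory"        # ontology present, violations are warnings
    STRICT = "strict"            # ontology enforced, violations are errors


class OntologyViolation(Exception):
    """Raised in strict mode for an undeclared label (hypothetical type)."""


@dataclass
class LabelCheck:
    accepted: bool
    warnings: list[str] = field(default_factory=list)


def check_label(label: str, known_labels: set[str], mode: OntologyMode) -> LabelCheck:
    """Apply the selected ontology mode to a single node label."""
    if mode is OntologyMode.EXPLORATORY or label in known_labels:
        return LabelCheck(accepted=True)
    message = f"label {label!r} is not declared in the ontology"
    if mode is OntologyMode.ADVISORY:
        # Advisory mode keeps the write and surfaces the problem.
        return LabelCheck(accepted=True, warnings=[message])
    raise OntologyViolation(message)
```

Under these assumptions, the same undeclared label is silently accepted in exploratory mode, accepted with one warning in advisory mode, and rejected with an exception in strict mode.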

Layered Architecture

GraphForge is a knowledge analysis workbench, not just a graph engine. Its architecture separates three layers with strict boundaries (ADR 0006). Lower layers never depend on higher ones, and — critically — the graph layer never absorbs knowledge or workbench concerns.

┌───────────────────────────────────────────────────────────────────────┐
│  WORKBENCH LAYER                                                        │
│  forge.rank / cluster / paths / analyze / similar / find · search ·     │
│  workflows / recipes · exploration · project portability               │
│  — consumes the layers below; holds NO graph-semantic state             │
├───────────────────────────────────────────────────────────────────────┤
│  KNOWLEDGE LAYER                                                        │
│  provenance · confidence · evidence · ontology-inference lineage ·      │
│  epistemic assertions + status + supersession + valid-time (ADR 0007)   │
│  — attaches to graph objects BY UUID REFERENCE ONLY                     │
├───────────────────────────────────────────────────────────────────────┤
│  GRAPH LAYER                                                            │
│  nodes · edges · properties · traversal · pattern matching ·            │
│  graph algorithms · adjacency index (ADR 0005)                          │
│  — graph-native; surrogate-keyed execution; UUID identity              │
│  — stores NO knowledge or workbench semantics                          │
└───────────────────────────────────────────────────────────────────────┘

Layer	Owns	Where it lives
Graph	Nodes, edges, properties, traversal, pattern matching, graph algorithms, adjacency	`gf-cypher`, `gf-ir`, `gf-rel`, `gf-exec`, `gf-storage` (`topology/`, `properties/`, `indexes/adjacency/`)
Knowledge	Provenance, confidence, evidence, epistemic assertions/status/supersession/valid-time	`gf-provenance` + knowledge module; `provenance/`, `knowledge/`
Workbench	Analyst verbs, hybrid search, workflows, exploration, project envelope	`gf-api`, bindings, search modules

Boundary rule: knowledge attaches to the graph by UUID reference, never by embedding columns on graph tables. Cypher/traversal/algorithms read only the graph layer, so the presence or absence of knowledge data never changes a graph-native query result (a tested invariant). This keeps the traversal hot path lean and preserves the lightweight-embedded model.
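
A minimal sketch of the boundary rule, assuming in-memory stand-ins for both layers. `GraphStore`, `KnowledgeStore`, and their methods are hypothetical names for illustration, not GraphForge APIs. The point is that knowledge lives in a side table keyed by graph-object UUID, so a traversal returns the same result whether or not knowledge data exists.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class GraphStore:
    """Graph layer stand-in: topology and properties only."""
    nodes: dict[uuid.UUID, dict] = field(default_factory=dict)
    edges: list[tuple[uuid.UUID, uuid.UUID]] = field(default_factory=list)

    def neighbors(self, node_id: uuid.UUID) -> list[uuid.UUID]:
        # Traversal reads graph-layer data and nothing else.
        return [dst for src, dst in self.edges if src == node_id]


@dataclass
class KnowledgeStore:
    """Knowledge layer stand-in: records keyed by graph-object UUID."""
    confidence: dict[uuid.UUID, float] = field(default_factory=dict)

    def attach_confidence(self, target: uuid.UUID, value: float) -> None:
        # Attaches by reference; the graph store is never touched.
        self.confidence[target] = value


graph = GraphStore()
a, b = uuid.uuid4(), uuid.uuid4()
graph.nodes[a] = {"name": "a"}
graph.nodes[b] = {"name": "b"}
graph.edges.append((a, b))

before = graph.neighbors(a)
knowledge = KnowledgeStore()
knowledge.attach_confidence(b, 0.8)
after = graph.neighbors(a)
```

Here `before` and `after` are equal, and the node's property map gains no confidence column, which is the shape of the invariant the boundary rule describes.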

The project (not the graph) is the unit of work, and the layers map onto the project envelope:

Project = Graph (topology + properties)          ← graph layer
        + Knowledge (provenance + confidence + evidence + epistemic assertions)  ← knowledge layer
        + Workbench assets (documents + embeddings + indexes + workflows + artifacts)  ← workbench layer
        + Sync State
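
The envelope above can be read as a record whose fields are grouped by layer. The following sketch assumes a flat Python dataclass; the `Project` class, its field names, and `components_by_layer` are hypothetical and only mirror the terms in the formula.

```python
from dataclasses import dataclass, field, fields


def _part(layer: str):
    # Each component starts empty and records which layer owns it.
    return field(default_factory=dict, metadata={"layer": layer})


@dataclass
class Project:
    # Graph layer
    topology: dict = _part("graph")
    properties: dict = _part("graph")
    # Knowledge layer
    provenance: dict = _part("knowledge")
    confidence: dict = _part("knowledge")
    evidence: dict = _part("knowledge")
    assertions: dict = _part("knowledge")
    # Workbench layer
    documents: dict = _part("workbench")
    embeddings: dict = _part("workbench")
    indexes: dict = _part("workbench")
    workflows: dict = _part("workbench")
    artifacts: dict = _part("workbench")
    # Sync state is listed separately in the envelope, outside the three layers.
    sync_state: dict = _part("sync")


def components_by_layer() -> dict[str, list[str]]:
    """Group the envelope's component names by owning layer."""
    grouped: dict[str, list[str]] = {}
    for f in fields(Project):
        grouped.setdefault(f.metadata["layer"], []).append(f.name)
    return grouped
```

Calling `components_by_layer()` returns the same grouping as the annotated formula, with graph, knowledge, workbench, and sync components kept apart.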

High-Level Architecture

┌──────────────────────────────────────────────────────────────────────────────────┐
│                               GraphForge API                                      │
│   Python: PyO3/maturin  ·  Node: napi-rs  ·  Swift/Kotlin: UniFFI               │
│                                                                                   │
│  forge.execute(…)  forge.rank(…)   forge.cluster(…)  forge.paths(…)              │
│  forge.analyze(…)  forge.similar(…)  forge.find(…)                               │
└──────────────────────────────────────────────────────────────────────────────────┘
          │                        │                  │                  │
          ▼                        └──────────────────┴──────────────────┘
┌──────────────────┐                                  │
│   Cypher Path    │                                  ▼
│                  │              ┌──────────────────────────────────┐
│  RD+Pratt parser │              │   Analyst Verbs                  │
│       ↓          │              │   rank / cluster / paths /       │
│  Binder +        │              │   analyze / similar / find       │
│  ontology        │              │   — bypass parser/planner        │
│       ↓          │              │   Export adjacency or index      │
│  Graph IR        │              │   Dispatch algorithm or search   │
│       ↓          │              │   Produce scored Arrow batches   │
│  DataFusion      │◄─────────────────────────────────┘
│       ↓          │         (all paths converge to DataFusion)
│  Arrow batches   │
└──────────────────┘
          │
          ▼
┌──────────────────────────────────────┐
│          Storage Provider             │
│              Parquet                  │
└──────────────────────────────────────┘
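
The routing in the diagram can be summarised as a lookup from API method to the stages it passes through. This is a sketch only: the stage names are labels taken from the diagram, and `stages_for` is a hypothetical helper, not part of the GraphForge API.

```python
ANALYST_VERBS = frozenset({"rank", "cluster", "paths", "analyze", "similar", "find"})

# Stage labels copied from the diagram; they are descriptive, not real function names.
CYPHER_STAGES = ("parse", "bind", "graph_ir", "datafusion", "arrow_batches")
VERB_STAGES = (
    "export_adjacency_or_index",
    "dispatch_algorithm_or_search",
    "score_batches",
    "datafusion",
    "arrow_batches",
)


def stages_for(method: str) -> tuple[str, ...]:
    """Return the stages a forge.<method> call passes through in the diagram."""
    if method == "execute":
        return CYPHER_STAGES
    if method in ANALYST_VERBS:
        # Analyst verbs skip the parser and binder but still converge on DataFusion.
        return VERB_STAGES
    raise ValueError(f"unknown method: {method!r}")
```

Under this reading, `stages_for("rank")` contains no parse or bind stage, while both paths end in DataFusion and Arrow batches.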

Rust Workspace Layout

crates/
  gf-core/             # public Rust engine facade
  gf-ast/              # AST + spans + syntax diagnostics
  gf-cypher/           # hand-written lexer + recursive-descent/Pratt parser
  gf-ontology/         # runtime ontology model, validation, migration
  gf-ir/               # graph IR + serde DTOs
  gf-rel/              # graph IR → relational lowering
  gf-plan/             # DataFusion integration, optimizer rules, custom nodes
  gf-exec/             # execution session, algorithms, search, result streaming
  gf-storage/          # StorageProvider trait + Parquet provider
  gf-io/               # CSV/JSON/Parquet/IPC sinks and loaders
  gf-provenance/       # lineage and confidence models
  gf-bindings-py/      # PyO3 + maturin Python binding
  gf-bindings-node/    # napi-rs Node binding
  gf-bindings-uniffi/  # UniFFI shared binding (Swift + Kotlin)
  gf-cli/              # command-line interface

bindings/
  swift/               # Swift Package Manager package + XCFramework
  kotlin/              # Gradle/Kotlin package + JAR

Feature flags: default = ["datafusion", "parquet"]. Optional: polars, python, node, swift, kotlin.

Three Internal Representations

The Cypher compiler maintains three distinct representations — they are not interchangeable:

Representation	Purpose	Stability
AST	Syntax-faithful, span-rich, close to Cypher source text	Internal only — no API guarantee
Graph IR	Semantic and graph-native; the stable plan contract	Semver-versioned
DataFusion logical plan	Relational/physical execution	DataFusion's own contract

The AST is not the cross-language compatibility surface. The stable boundary is the Graph IR envelope and the Arrow result contract.

See AST & Planning for the full compiler pipeline.
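
To make the distinction between the three representations concrete, here is a toy lowering of a single node pattern such as `(n:Person)`. All types and functions are hypothetical stand-ins in Python: the real AST, Graph IR, and DataFusion plan are Rust types, and the `nodes` table name and predicate string are assumptions made for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: int
    end: int


@dataclass(frozen=True)
class AstNodePattern:
    """AST stand-in: syntax-faithful and carries a source span."""
    variable: str
    label: str
    span: Span


@dataclass(frozen=True)
class IrNodeScan:
    """Graph IR stand-in: semantic, versioned, no source spans."""
    variable: str
    label: str
    ir_version: str = "1.0.0"


@dataclass(frozen=True)
class RelScan:
    """Relational plan stand-in: a filtered table scan."""
    table: str
    predicate: str


def bind(ast: AstNodePattern) -> IrNodeScan:
    # Binding drops syntax detail (the span) and keeps meaning.
    return IrNodeScan(variable=ast.variable, label=ast.label)


def lower(ir: IrNodeScan) -> RelScan:
    # Lowering maps the graph operation onto a relational scan.
    return RelScan(table="nodes", predicate=f"label = '{ir.label}'")


ast = AstNodePattern(variable="n", label="Person", span=Span(6, 16))
ir = bind(ast)
plan = lower(ir)
```

Each step discards information the next stage does not need, which is why the three forms are not interchangeable: the span exists only on the AST, and the version tag only on the IR.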

Arrow as the Data Contract

All execution results cross language boundaries as Arrow RecordBatch streams. Arrow provides:

A stable, language-independent columnar memory format
Zero-copy in-process exchange via the C Data Interface
Python interchange via the PyCapsule Interface (no hard PyArrow dependency required)
Node consumption via Arrow IPC and tableFromIPC in Apache Arrow JS
Swift and Kotlin consumption via Arrow IPC bytes over UniFFI (a Rust Vec<u8>, surfaced as Data in Swift and ByteArray in Kotlin)

Arrow schema metadata carries GraphForge-specific annotations:

graphforge.ir_version = "1.0.0"
graphforge.ontology_version = "core-2026.05"
graphforge.result_kind = "node_table"
graphforge.confidence_policy = "conservative_minmax"
graphforge.query_id = "01J..."

These annotations survive IPC serialization and Parquet round-trips, which is why Arrow is the correct contract rather than a Polars or Python-specific result type.
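
A consumer might read these annotations as follows. The sketch takes a mapping of bytes keys to bytes values, which is the shape PyArrow exposes as `Schema.metadata`, so it runs without PyArrow installed. `graphforge_annotations` and `ir_major_version` are hypothetical helpers, and gating on the major version is an assumed compatibility policy, not one stated in this document.

```python
from typing import Mapping, Optional

PREFIX = b"graphforge."


def graphforge_annotations(metadata: Optional[Mapping[bytes, bytes]]) -> dict[str, str]:
    """Extract graphforge.* entries from Arrow schema metadata, prefix removed."""
    if not metadata:
        return {}
    return {
        key[len(PREFIX):].decode("utf-8"): value.decode("utf-8")
        for key, value in metadata.items()
        if key.startswith(PREFIX)
    }


def ir_major_version(annotations: dict[str, str]) -> int:
    """Return the major component of graphforge.ir_version."""
    return int(annotations["ir_version"].split(".")[0])


# Sample metadata using values from the listing above, plus one unrelated key.
sample = {
    b"graphforge.ir_version": b"1.0.0",
    b"graphforge.result_kind": b"node_table",
    b"graphforge.confidence_policy": b"conservative_minmax",
    b"other.tool": b"ignored",
}
annotations = graphforge_annotations(sample)
```

Keys written by other tools are left alone, so GraphForge annotations can coexist with whatever else a writer stores in the schema metadata.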

Multi-Language Bindings

Language	Mechanism	Crate / Package	Result contract
Rust	Native crate API	`gf-core`	`ExecutionResult { schema, batches, stats }`
Python	PyO3 + maturin	`gf-bindings-py`	`pyarrow.Table` or `RecordBatchReader`
Node	napi-rs	`gf-bindings-node`	Arrow IPC `Buffer` → `tableFromIPC(buf)`
Swift	UniFFI	`gf-bindings-uniffi` + `bindings/swift/`	Arrow IPC `Data` → `GraphForgeResult`
Kotlin	UniFFI	`gf-bindings-uniffi` + `bindings/kotlin/`	Arrow IPC `ByteArray` → `GraphForgeResult`

The architectural rule: never let a binding become the semantic owner of the language. Bindings translate results; the Rust core owns the semantics.

Swift and Kotlin bindings are generated by UniFFI from a shared UDL interface definition in gf-bindings-uniffi. Both languages receive Arrow IPC bytes and deserialise with their platform Arrow library. See ADR 0002 for the binding strategy rationale.

Migration Strategy

All Rust work targets main. Merge gates required before the v0.5.0 release:

Gate	Requirement
Parser parity	Existing corpus + syntax goldens pass against RD+Pratt parser
OpenCypher conformance	Agreed TCK subset passes
Ontology runtime	Load/validate/migrate round-trips pass
Data contract	Arrow/Parquet/IPC round-trips pass
Provider baseline	Parquet provider passes core semantics
Binding baseline	Python, Node, Swift, and Kotlin can execute and consume Arrow results
Observability	`explain`, query IDs, provenance IDs, structured errors available

References

AST & Planning — recursive-descent/Pratt parser, three-tier IR, compiler pipeline
Algorithm Verbs — full algorithm catalog across rank/cluster/paths/analyze/similar
Execution Model — DataFusion integration, custom graph operators, Arrow result streams
Storage — StorageProvider trait, Parquet provider
ADR 0002: Rust Core — Decision record for the Rust refactor
ADR 0003: LR(1) Grammar — Parser algorithm decision
ADR 0004: Progressive Ontology — exploration-first ontology modes
ADR 0005: Adjacency Index — graph-layer derived traversal accelerator
ADR 0006: Layered Architecture — graph / knowledge / workbench boundaries
ADR 0007: Epistemic Model — preserving the evolution of understanding
Roadmap — Milestones and timeline