GraphForge Architecture Overview

Status: Rust Core (active development on main)
Last Updated: 2026-06-07

Implementation status legend (used across architecture docs): Shipped = implemented and tested on main; Partially built = some paths real, others stubbed; Designed = specified, not yet implemented. The graph layer's Cypher read/write pipeline and traversal are Shipped; the workbench verbs and the knowledge layer (provenance, confidence, epistemic model) are Designed and owned by dedicated v0.5.0 milestones.


Executive Summary

GraphForge is a Knowledge Analysis Workbench — not a graph database or a graph analytics engine. It optimizes for analyst workflows that begin with uncertainty, discover structure over time, and progressively formalize that structure into ontology, workflows, and repeatable analysis.

The project (not the graph) is the primary unit of work:

Project = Knowledge Graph + Documents + Provenance + Embeddings + Workflows + Artifacts + Sync State

The internal implementation is a Rust core with multi-language bindings. The v0.5.0 release introduces a unified API and a compiler pipeline (DataFusion-backed execution, Arrow as the stable in-memory and FFI contract).

The Python 0.4.x codebase is the reference implementation. All Rust work targets main.


Architecture Principles

  1. Arrow is the wire contract — results cross language boundaries as Arrow RecordBatch streams; no GraphForge-specific buffer protocol
  2. GraphForge owns the semantics — the Cypher compiler, ontology, and Graph IR live in GraphForge-owned Rust crates; no storage provider or binding becomes the semantic owner
  3. DataFusion is the execution backbone — GraphForge extends DataFusion with custom graph operators rather than writing a full executor from scratch
  4. Unified result contract — all methods return Arrow Tables; no surface returns a bespoke result type
  5. Correctness over performance — strict openCypher TCK compliance remains the primary constraint
  6. Ontology is progressive, not required — GraphForge supports three modes: exploratory (no ontology required, all labels accepted), advisory (ontology present, violations are warnings), and strict (ontology enforced, violations are errors). Exploratory analysis is a first-class workflow. See ADR 0004.
  7. Three layers, clean boundaries — graph concerns, knowledge concerns, and workbench concerns are separated; the graph layer stays graph-native and never absorbs the others. See the next section and ADR 0006.
  8. Preserve the evolution of understanding — GraphForge records not just the current state of knowledge but how it evolved: competing hypotheses, superseded conclusions, evidence, and reasoning are preserved, never destroyed. See ADR 0007.
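
The three ontology modes in principle 6 can be illustrated with a small sketch. It assumes a hypothetical `check_label` helper and a plain set of known labels; none of these names come from the GraphForge API, and the real validation lives in the Rust core.

```python
from dataclasses import dataclass
from enum import Enum


class OntologyMode(Enum):
    EXPLORATORY = "exploratory"  # no ontology required, all labels accepted
    ADVISORY = "advisory"        # ontology present, violations are warnings
    STRICT = "strict"            # ontology enforced, violations are errors


class OntologyViolation(Exception):
    """Raised in strict mode when a label is not declared in the ontology."""


@dataclass
class LabelCheck:
    accepted: bool
    warnings: list


def check_label(label, known_labels, mode):
    """Hypothetical helper: apply one of the three modes to a single label."""
    if mode is OntologyMode.EXPLORATORY or label in known_labels:
        return LabelCheck(accepted=True, warnings=[])
    message = f"label {label!r} is not declared in the ontology"
    if mode is OntologyMode.ADVISORY:
        # Advisory mode keeps the write and surfaces the violation as a warning.
        return LabelCheck(accepted=True, warnings=[message])
    raise OntologyViolation(message)


known = {"Person", "Company"}
print(check_label("Vessel", known, OntologyMode.EXPLORATORY))
print(check_label("Vessel", known, OntologyMode.ADVISORY))
```

The same undeclared label is silently accepted in exploratory mode, accepted with a warning in advisory mode, and rejected with an exception in strict mode.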

Layered Architecture

GraphForge is a knowledge analysis workbench, not just a graph engine. Its architecture separates three layers with strict boundaries (ADR 0006). Lower layers never depend on higher ones, and — critically — the graph layer never absorbs knowledge or workbench concerns.

┌───────────────────────────────────────────────────────────────────────┐
│  WORKBENCH LAYER                                                        │
│  forge.rank / cluster / paths / analyze / similar / find · search ·     │
│  workflows / recipes · exploration · project portability               │
│  — consumes the layers below; holds NO graph-semantic state             │
├───────────────────────────────────────────────────────────────────────┤
│  KNOWLEDGE LAYER                                                        │
│  provenance · confidence · evidence · ontology-inference lineage ·      │
│  epistemic assertions + status + supersession + valid-time (ADR 0007)   │
│  — attaches to graph objects BY UUID REFERENCE ONLY                     │
├───────────────────────────────────────────────────────────────────────┤
│  GRAPH LAYER                                                            │
│  nodes · edges · properties · traversal · pattern matching ·            │
│  graph algorithms · adjacency index (ADR 0005)                          │
│  — graph-native; surrogate-keyed execution; UUID identity              │
│  — stores NO knowledge or workbench semantics                          │
└───────────────────────────────────────────────────────────────────────┘
| Layer | Owns | Where it lives |
|---|---|---|
| Graph | Nodes, edges, properties, traversal, pattern matching, graph algorithms, adjacency | gf-cypher, gf-ir, gf-rel, gf-exec, gf-storage (topology/, properties/, indexes/adjacency/) |
| Knowledge | Provenance, confidence, evidence, epistemic assertions/status/supersession/valid-time | gf-provenance + knowledge module; provenance/, knowledge/ |
| Workbench | Analyst verbs, hybrid search, workflows, exploration, project envelope | gf-api, bindings, search modules |
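
The rule that lower layers never depend on higher ones can be checked mechanically. The sketch below takes the crate-to-layer assignment from the table above; the dependency edges are made up for illustration and do not describe the real Cargo dependency graph.

```python
# Layer rank: a lower number means a lower layer.
LAYER_RANK = {"graph": 0, "knowledge": 1, "workbench": 2}

# Crate-to-layer assignment taken from the table above.
CRATE_LAYER = {
    "gf-cypher": "graph",
    "gf-ir": "graph",
    "gf-rel": "graph",
    "gf-exec": "graph",
    "gf-storage": "graph",
    "gf-provenance": "knowledge",
    "gf-api": "workbench",
}


def layering_violations(dependencies):
    """Return (crate, dependency) pairs where a lower layer depends on a higher one."""
    violations = []
    for crate, dep in dependencies:
        if LAYER_RANK[CRATE_LAYER[crate]] < LAYER_RANK[CRATE_LAYER[dep]]:
            violations.append((crate, dep))
    return violations


# Hypothetical (crate, depends_on) edges, for illustration only.
allowed = [("gf-api", "gf-exec"), ("gf-provenance", "gf-storage"), ("gf-rel", "gf-ir")]
forbidden = [("gf-exec", "gf-provenance")]

print(layering_violations(allowed))
print(layering_violations(allowed + forbidden))
```

A check of this shape could run in CI against the output of the build tool's dependency listing, so that a graph-layer crate importing a knowledge-layer crate fails the build.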

Boundary rule: knowledge attaches to the graph by UUID reference, never by embedding columns on graph tables. Cypher/traversal/algorithms read only the graph layer, so the presence or absence of knowledge data never changes a graph-native query result (a tested invariant). This keeps the traversal hot path lean and preserves the lightweight-embedded model.
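
A minimal sketch of the boundary rule, using plain Python dictionaries as stand-ins for the graph tables and the knowledge store. All names and values are illustrative assumptions; the point is only that the knowledge record is keyed by UUID in a separate structure, so a graph-native traversal cannot observe it.

```python
import uuid

# Graph layer: nodes and edges only, keyed by UUID. No knowledge columns.
alice, bob = uuid.uuid4(), uuid.uuid4()
nodes = {
    alice: {"label": "Person", "name": "Alice"},
    bob: {"label": "Person", "name": "Bob"},
}
edges = [(alice, "KNOWS", bob)]

# Knowledge layer: a separate store that points at graph objects by UUID.
confidence_by_uuid = {}


def neighbors(node_id):
    """Graph-native traversal: reads only the graph-layer structures."""
    return sorted(nodes[dst]["name"] for src, _, dst in edges if src == node_id)


before = neighbors(alice)
# Attach knowledge to Bob by reference; the graph tables are not touched.
confidence_by_uuid[bob] = {"confidence": 0.4, "source": "hypothetical-report"}
after = neighbors(alice)

# The invariant from the boundary rule: knowledge data never changes the result.
assert before == after == ["Bob"]
```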

The project (not the graph) is the unit of work, and the layers map onto the project envelope:

Project = Graph (topology + properties)          ← graph layer
        + Knowledge (provenance + confidence + evidence + epistemic assertions)  ← knowledge layer
        + Workbench assets (documents + embeddings + indexes + workflows + artifacts)  ← workbench layer
        + Sync State

High-Level Architecture

┌──────────────────────────────────────────────────────────────────────────────────┐
│                               GraphForge API                                      │
│   Python: PyO3/maturin  ·  Node: napi-rs  ·  Swift/Kotlin: UniFFI               │
│                                                                                   │
│  forge.execute(…)  forge.rank(…)   forge.cluster(…)  forge.paths(…)              │
│  forge.analyze(…)  forge.similar(…)  forge.find(…)                               │
└──────────────────────────────────────────────────────────────────────────────────┘
          │                        │                  │                  │
          ▼                        └──────────────────┴──────────────────┘
┌──────────────────┐                                  │
│   Cypher Path    │                                  ▼
│                  │              ┌──────────────────────────────────┐
│  RD+Pratt parser │              │   Analyst Verbs                  │
│       ↓          │              │   rank / cluster / paths /       │
│  Binder +        │              │   analyze / similar / find       │
│  ontology        │              │   — bypass parser/planner        │
│       ↓          │              │   Export adjacency or index      │
│  Graph IR        │              │   Dispatch algorithm or search   │
│       ↓          │              │   Produce scored Arrow batches   │
│  DataFusion      │◄─────────────────────────────────┘
│       ↓          │         (all paths converge to DataFusion)
│  Arrow batches   │
└──────────────────┘
          │
          ▼
┌──────────────────────────────────────┐
│          Storage Provider             │
│              Parquet                  │
└──────────────────────────────────────┘
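
The two paths in the diagram can be summarized as data. The stage names below are illustrative labels for the boxes in the diagram, not real GraphForge function names, and `stages_for` is a hypothetical helper.

```python
# Stage names mirror the boxes in the diagram; they are labels, not real APIs.
CYPHER_PATH = (
    "parse",
    "bind_with_ontology",
    "lower_to_graph_ir",
    "datafusion",
    "arrow_batches",
)
VERB_PATH = (
    "export_adjacency_or_index",
    "dispatch_algorithm_or_search",
    "scored_arrow_batches",
    "datafusion",
)

ANALYST_VERBS = {"rank", "cluster", "paths", "analyze", "similar", "find"}


def stages_for(method):
    """Return the stages a forge.<method> call passes through in the diagram."""
    if method == "execute":
        return CYPHER_PATH
    if method in ANALYST_VERBS:
        # Analyst verbs bypass the parser and planner entirely.
        return VERB_PATH
    raise ValueError(f"unknown API method: {method}")


for name in ("execute", "rank"):
    print(name, "->", " -> ".join(stages_for(name)))
```

Only `forge.execute` passes through the parser and binder; both paths still share DataFusion, which is what lets every surface return the same Arrow result shape.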

Rust Workspace Layout

crates/
  gf-core/             # public Rust engine facade
  gf-ast/              # AST + spans + syntax diagnostics
  gf-cypher/           # hand-written lexer + recursive-descent/Pratt parser
  gf-ontology/         # runtime ontology model, validation, migration
  gf-ir/               # graph IR + serde DTOs
  gf-rel/              # graph IR → relational lowering
  gf-plan/             # DataFusion integration, optimizer rules, custom nodes
  gf-exec/             # execution session, algorithms, search, result streaming
  gf-storage/          # StorageProvider trait + Parquet provider
  gf-io/               # CSV/JSON/Parquet/IPC sinks and loaders
  gf-provenance/       # lineage and confidence models
  gf-bindings-py/      # PyO3 + maturin Python binding
  gf-bindings-node/    # napi-rs Node binding
  gf-bindings-uniffi/  # UniFFI shared binding (Swift + Kotlin)
  gf-cli/              # command-line interface

bindings/
  swift/               # Swift Package Manager package + XCFramework
  kotlin/              # Gradle/Kotlin package + JAR

Feature flags: default = ["datafusion", "parquet"]. Optional: polars, python, node, swift, kotlin.


Three Internal Representations

The Cypher compiler maintains three distinct representations — they are not interchangeable:

| Representation | Purpose | Stability |
|---|---|---|
| AST | Syntax-faithful, span-rich, close to Cypher source text | Internal only, no API guarantee |
| Graph IR | Semantic and graph-native; the stable plan contract | Semver-versioned |
| DataFusion logical plan | Relational/physical execution | DataFusion's own contract |

The AST is not the cross-language compatibility surface. The stable boundary is the Graph IR envelope and the Arrow result contract.

See AST & Planning for the full compiler pipeline.


Arrow as the Data Contract

All execution results cross language boundaries as Arrow RecordBatch streams. Arrow provides:

  • A stable, language-independent columnar memory format
  • Zero-copy in-process exchange via the C Data Interface
  • Python interchange via the PyCapsule Interface (no hard PyArrow dependency required)
  • Node consumption via Arrow IPC and tableFromIPC in Apache Arrow JS
  • Swift and Kotlin consumption via Arrow IPC (Vec<u8> / ByteArray over UniFFI)

Arrow schema metadata carries GraphForge-specific annotations:

graphforge.ir_version = "1.0.0"
graphforge.ontology_version = "core-2026.05"
graphforge.result_kind = "node_table"
graphforge.confidence_policy = "conservative_minmax"
graphforge.query_id = "01J..."

These annotations survive IPC serialization and Parquet round-trips, which is why Arrow is the correct contract rather than a Polars or Python-specific result type.
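
As a rough illustration of how a consumer might use these annotations, the sketch below models Arrow schema metadata as a dictionary with bytes keys and values (the shape pyarrow exposes through `Schema.metadata`) and uses only the standard library. The helper names, the extra `other.tool` key, and the rule "compatible when the major version matches" are assumptions for illustration, not documented GraphForge behavior.

```python
# Arrow schema metadata is a flat key/value map; a bytes-to-bytes dict imitates it.
schema_metadata = {
    b"graphforge.ir_version": b"1.0.0",
    b"graphforge.ontology_version": b"core-2026.05",
    b"graphforge.result_kind": b"node_table",
    b"graphforge.confidence_policy": b"conservative_minmax",
    b"graphforge.query_id": b"01J...",
    b"other.tool": b"unrelated",  # hypothetical key written by another tool
}

SUPPORTED_IR_MAJOR = 1  # assumption: this consumer understands IR major version 1


def graphforge_annotations(metadata):
    """Extract the graphforge.* entries as a str-to-str dict without the prefix."""
    prefix = b"graphforge."
    return {
        key[len(prefix):].decode(): value.decode()
        for key, value in metadata.items()
        if key.startswith(prefix)
    }


def ir_is_compatible(annotations, supported_major=SUPPORTED_IR_MAJOR):
    """Assumed semver rule: results are readable when the major version matches."""
    major = int(annotations["ir_version"].split(".")[0])
    return major == supported_major


annotations = graphforge_annotations(schema_metadata)
print(annotations["result_kind"], ir_is_compatible(annotations))
```

Because the annotations travel inside the schema, the same check works whether the batches arrived in process, over IPC, or from a Parquet file.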


Multi-Language Bindings

| Language | Mechanism | Crate / Package | Result contract |
|---|---|---|---|
| Rust | Native crate API | gf-core | ExecutionResult { schema, batches, stats } |
| Python | PyO3 + maturin | gf-bindings-py | pyarrow.Table or RecordBatchReader |
| Node | napi-rs | gf-bindings-node | Arrow IPC Buffer → tableFromIPC(buf) |
| Swift | UniFFI | gf-bindings-uniffi + bindings/swift/ | Arrow IPC Data → GraphForgeResult |
| Kotlin | UniFFI | gf-bindings-uniffi + bindings/kotlin/ | Arrow IPC ByteArray → GraphForgeResult |

The architectural rule: never let a binding become the semantic owner of the language. Bindings translate results; the Rust core owns the semantics.

Swift and Kotlin bindings are generated by UniFFI from a shared UDL interface definition in gf-bindings-uniffi. Both languages receive Arrow IPC bytes and deserialise with their platform Arrow library. See ADR 0002 for the binding strategy rationale.


Migration Strategy

All Rust work targets main. Merge gates required before the v0.5.0 release:

| Gate | Requirement |
|---|---|
| Parser parity | Existing corpus + syntax goldens pass against RD+Pratt parser |
| OpenCypher conformance | Agreed TCK subset passes |
| Ontology runtime | Load/validate/migrate round-trips pass |
| Data contract | Arrow/Parquet/IPC round-trips pass |
| Provider baseline | Parquet provider passes core semantics |
| Binding baseline | Python, Node, Swift, and Kotlin can execute and consume Arrow results |
| Observability | explain, query IDs, provenance IDs, structured errors available |

References