GraphForge Architecture Overview
Status: Rust Core — active development on main
Last Updated: 2026-06-07
Implementation status legend (used across architecture docs): Shipped = implemented and tested on main; Partially built = some paths real, others stubbed; Designed = specified, not yet implemented.
The graph layer's Cypher read/write pipeline and traversal are Shipped; the workbench verbs and the knowledge layer (provenance, confidence, epistemic model) are Designed and owned by dedicated v0.5.0 milestones.
Executive Summary
GraphForge is a Knowledge Analysis Workbench — not a graph database or a graph analytics engine. It optimizes for analyst workflows that begin with uncertainty, discover structure over time, and progressively formalize that structure into ontology, workflows, and repeatable analysis.
The project (not the graph) is the primary unit of work:
Project = Knowledge Graph + Documents + Provenance + Embeddings + Workflows + Artifacts + Sync State
The internal implementation is a Rust core with multi-language bindings. The v0.5.0 release introduces a unified API and a compiler pipeline (DataFusion-backed execution, Arrow as the stable in-memory and FFI contract).
The Python 0.4.x codebase is the reference implementation. All Rust work targets main.
Architecture Principles
- Arrow is the wire contract — results cross language boundaries as Arrow RecordBatch streams; no GraphForge-specific buffer protocol
- GraphForge owns the semantics — the Cypher compiler, ontology, and Graph IR live in GraphForge-owned Rust crates; no storage provider or binding becomes the semantic owner
- DataFusion is the execution backbone — GraphForge extends DataFusion with custom graph operators rather than writing a full executor from scratch
- Unified result contract — all methods return Arrow Tables; no surface returns a bespoke result type
- Correctness over performance — strict openCypher TCK compliance remains the primary constraint
- Ontology is progressive, not required — GraphForge supports three modes: exploratory (no ontology required, all labels accepted), advisory (ontology present, violations are warnings), and strict (ontology enforced, violations are errors). Exploratory analysis is a first-class workflow. See ADR 0004.
- Three layers, clean boundaries — graph concerns, knowledge concerns, and workbench concerns are separated; the graph layer stays graph-native and never absorbs the others. See the next section and ADR 0006.
- Preserve the evolution of understanding — GraphForge records not just the current state of knowledge but how it evolved: competing hypotheses, superseded conclusions, evidence, and reasoning are preserved, never destroyed. See ADR 0007.
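The three ontology modes listed above can be illustrated with a single validation function. This is a minimal Python sketch under stated assumptions, not GraphForge's API: `OntologyMode`, `OntologyViolation`, and `check_label` are hypothetical names, and the ontology is reduced to a plain set of known labels.

```python
from enum import Enum
from typing import List, Optional, Set


class OntologyMode(Enum):
    EXPLORATORY = "exploratory"  # no ontology required, all labels accepted
    ADVISORY = "advisory"        # ontology present, violations are warnings
    STRICT = "strict"            # ontology enforced, violations are errors


class OntologyViolation(Exception):
    """Hypothetical error raised in strict mode for an unknown label."""


def check_label(
    label: str,
    mode: OntologyMode,
    known_labels: Optional[Set[str]] = None,
) -> List[str]:
    """Return warnings for `label`; raise OntologyViolation in strict mode."""
    if mode is OntologyMode.EXPLORATORY:
        return []  # every label is accepted, with or without an ontology
    known = known_labels or set()
    if label in known:
        return []
    message = f"label {label!r} is not defined in the ontology"
    if mode is OntologyMode.ADVISORY:
        return [message]  # reported, but the operation proceeds
    raise OntologyViolation(message)
```

The same unknown label is silently accepted in exploratory mode, reported in advisory mode, and rejected in strict mode, which is what lets a project tighten its ontology over time without rewriting earlier exploratory work.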
Layered Architecture
GraphForge is a knowledge analysis workbench, not just a graph engine. Its architecture separates three layers with strict boundaries (ADR 0006). Lower layers never depend on higher ones, and — critically — the graph layer never absorbs knowledge or workbench concerns.
┌───────────────────────────────────────────────────────────────────────┐
│ WORKBENCH LAYER │
│ forge.rank / cluster / paths / analyze / similar / find · search · │
│ workflows / recipes · exploration · project portability │
│ — consumes the layers below; holds NO graph-semantic state │
├───────────────────────────────────────────────────────────────────────┤
│ KNOWLEDGE LAYER │
│ provenance · confidence · evidence · ontology-inference lineage · │
│ epistemic assertions + status + supersession + valid-time (ADR 0007) │
│ — attaches to graph objects BY UUID REFERENCE ONLY │
├───────────────────────────────────────────────────────────────────────┤
│ GRAPH LAYER │
│ nodes · edges · properties · traversal · pattern matching · │
│ graph algorithms · adjacency index (ADR 0005) │
│ — graph-native; surrogate-keyed execution; UUID identity │
│ — stores NO knowledge or workbench semantics │
└───────────────────────────────────────────────────────────────────────┘
| Layer | Owns | Where it lives |
|---|---|---|
| Graph | Nodes, edges, properties, traversal, pattern matching, graph algorithms, adjacency | gf-cypher, gf-ir, gf-rel, gf-exec, gf-storage (topology/, properties/, indexes/adjacency/) |
| Knowledge | Provenance, confidence, evidence, epistemic assertions/status/supersession/valid-time | gf-provenance + knowledge module; provenance/, knowledge/ |
| Workbench | Analyst verbs, hybrid search, workflows, exploration, project envelope | gf-api, bindings, search modules |
Boundary rule: knowledge attaches to the graph by UUID reference, never by embedding columns on graph tables. Cypher/traversal/algorithms read only the graph layer, so the presence or absence of knowledge data never changes a graph-native query result (a tested invariant). This keeps the traversal hot path lean and preserves the lightweight-embedded model.
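A toy model may make the boundary rule concrete. The sketch below is illustrative Python, not GraphForge code: `GraphStore` and `KnowledgeStore` are hypothetical stand-ins for the graph and knowledge layers, and confidence is the only knowledge attribute modelled.

```python
import uuid
from typing import Dict, List, Tuple


class GraphStore:
    """Graph layer stand-in: topology and properties only, keyed by UUID."""

    def __init__(self) -> None:
        self.nodes: Dict[uuid.UUID, Dict[str, object]] = {}
        self.edges: List[Tuple[uuid.UUID, str, uuid.UUID]] = []

    def add_node(self, **props: object) -> uuid.UUID:
        node_id = uuid.uuid4()
        self.nodes[node_id] = dict(props)
        return node_id

    def add_edge(self, src: uuid.UUID, rel: str, dst: uuid.UUID) -> None:
        self.edges.append((src, rel, dst))

    def neighbors(self, src: uuid.UUID, rel: str) -> List[uuid.UUID]:
        # Traversal reads graph-layer data only.
        return [d for s, r, d in self.edges if s == src and r == rel]


class KnowledgeStore:
    """Knowledge layer stand-in: a side table pointing at graph objects by UUID."""

    def __init__(self) -> None:
        self.confidence: Dict[uuid.UUID, float] = {}

    def annotate(self, target: uuid.UUID, confidence: float) -> None:
        self.confidence[target] = confidence


graph = GraphStore()
alice = graph.add_node(name="Alice")
bob = graph.add_node(name="Bob")
graph.add_edge(alice, "KNOWS", bob)

before = graph.neighbors(alice, "KNOWS")
knowledge = KnowledgeStore()
knowledge.annotate(bob, confidence=0.8)  # attaches by reference; graph tables untouched
after = graph.neighbors(alice, "KNOWS")
```

Because the annotation lives in a separate table keyed by UUID, `before` and `after` are identical and the node's property map never gains a confidence field, which is the invariant the boundary rule describes.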
The project (not the graph) is the unit of work, and the layers map onto the project envelope:
Project = Graph (topology + properties) ← graph layer
+ Knowledge (provenance + confidence + evidence + epistemic assertions) ← knowledge layer
+ Workbench assets (documents + embeddings + indexes + workflows + artifacts) ← workbench layer
+ Sync State
High-Level Architecture
┌──────────────────────────────────────────────────────────────────────────────────┐
│ GraphForge API │
│ Python: PyO3/maturin · Node: napi-rs · Swift/Kotlin: UniFFI │
│ │
│ forge.execute(…) forge.rank(…) forge.cluster(…) forge.paths(…) │
│ forge.analyze(…) forge.similar(…) forge.find(…) │
└──────────────────────────────────────────────────────────────────────────────────┘
│ │ │ │
▼ └──────────────────┴──────────────────┘
┌──────────────────┐ │
│ Cypher Path │ ▼
│ │ ┌──────────────────────────────────┐
│ RD+Pratt parser │ │ Analyst Verbs │
│ ↓ │ │ rank / cluster / paths / │
│ Binder + │ │ analyze / similar / find │
│ ontology │ │ — bypass parser/planner │
│ ↓ │ │ Export adjacency or index │
│ Graph IR │ │ Dispatch algorithm or search │
│ ↓ │ │ Produce scored Arrow batches │
│ DataFusion │◄─────────────────────────────────┘
│ ↓ │ (all paths converge to DataFusion)
│ Arrow batches │
└──────────────────┘
│
▼
┌──────────────────────────────────────┐
│ Storage Provider │
│ Parquet │
└──────────────────────────────────────┘
Rust Workspace Layout
crates/
  gf-core/              # public Rust engine facade
  gf-ast/               # AST + spans + syntax diagnostics
  gf-cypher/            # hand-written lexer + recursive-descent/Pratt parser
  gf-ontology/          # runtime ontology model, validation, migration
  gf-ir/                # graph IR + serde DTOs
  gf-rel/               # graph IR → relational lowering
  gf-plan/              # DataFusion integration, optimizer rules, custom nodes
  gf-exec/              # execution session, algorithms, search, result streaming
  gf-storage/           # StorageProvider trait + Parquet provider
  gf-io/                # CSV/JSON/Parquet/IPC sinks and loaders
  gf-provenance/        # lineage and confidence models
  gf-bindings-py/       # PyO3 + maturin Python binding
  gf-bindings-node/     # napi-rs Node binding
  gf-bindings-uniffi/   # UniFFI shared binding (Swift + Kotlin)
  gf-cli/               # command-line interface
bindings/
  swift/                # Swift Package Manager package + XCFramework
  kotlin/               # Gradle/Kotlin package + JAR
Feature flags: default = ["datafusion", "parquet"]. Optional: polars, python, node, swift, kotlin.
Three Internal Representations
The Cypher compiler maintains three distinct representations — they are not interchangeable:
| Representation | Purpose | Stability |
|---|---|---|
| AST | Syntax-faithful, span-rich, close to Cypher source text | Internal only — no API guarantee |
| Graph IR | Semantic and graph-native; the stable plan contract | Semver-versioned |
| DataFusion logical plan | Relational/physical execution | DataFusion's own contract |
The AST is not the cross-language compatibility surface. The stable boundary is the Graph IR envelope and the Arrow result contract.
See AST & Planning for the full compiler pipeline.
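To illustrate why the three representations are not interchangeable, here is a toy Python sketch of a single node pattern moving through them. All type and function names are hypothetical, the relational plan is a plain string standing in for a DataFusion logical plan, and the JSON envelope shape is an assumption (the real Graph IR lives in the gf-ir crate).

```python
import json
from dataclasses import asdict, dataclass
from typing import Tuple

IR_VERSION = "1.0.0"  # assumed value, for illustration only


@dataclass(frozen=True)
class AstNodePattern:
    """AST: syntax-faithful and span-rich. Internal only."""
    variable: str
    label: str
    span: Tuple[int, int]  # (start, end) offsets into the query text


@dataclass(frozen=True)
class IrNodeScan:
    """Graph IR: semantic and graph-native, with no source spans."""
    variable: str
    label: str


def lower_to_ir(ast: AstNodePattern) -> IrNodeScan:
    # Spans describe source text, not semantics, so they are dropped here.
    return IrNodeScan(variable=ast.variable, label=ast.label)


def ir_envelope(op: IrNodeScan) -> str:
    """Serialize the IR together with its version (hypothetical envelope)."""
    payload = {"ir_version": IR_VERSION, "op": "node_scan", **asdict(op)}
    return json.dumps(payload, sort_keys=True)


def lower_to_relational(op: IrNodeScan) -> str:
    """Stand-in for relational lowering: a scan plus a label filter."""
    return f"Scan(nodes) -> Filter(label = '{op.label}') -> Alias({op.variable})"


query = "MATCH (p:Person) RETURN p"
ast = AstNodePattern(variable="p", label="Person", span=(6, 16))
ir = lower_to_ir(ast)
envelope = json.loads(ir_envelope(ir))
plan = lower_to_relational(ir)
```

Only the envelope carries a version, and it contains nothing tied to the query text, which is the property that makes the Graph IR rather than the AST suitable as a stable boundary.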
Arrow as the Data Contract
All execution results cross language boundaries as Arrow RecordBatch streams. Arrow provides:
- A stable, language-independent columnar memory format
- Zero-copy in-process exchange via the C Data Interface
- Python interchange via the PyCapsule Interface (no hard PyArrow dependency required)
- Node consumption via Arrow IPC and tableFromIPC in Apache Arrow JS
- Swift and Kotlin consumption via Arrow IPC (Vec<u8>/ByteArray over UniFFI)
Arrow schema metadata carries GraphForge-specific annotations:
graphforge.ir_version = "1.0.0"
graphforge.ontology_version = "core-2026.05"
graphforge.result_kind = "node_table"
graphforge.confidence_policy = "conservative_minmax"
graphforge.query_id = "01J..."
These annotations survive IPC serialization and Parquet round-trips, which is why Arrow is the correct contract rather than a Polars or Python-specific result type.
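A consumer might read these annotations as follows. The sketch is plain Python over a string-to-string mapping so that it runs without an Arrow library; in practice the mapping would come from the Arrow schema's metadata (some libraries expose keys and values as bytes, which would need decoding first). The helper names are hypothetical, and gating on the major version is the usual semver convention, assumed here rather than documented GraphForge policy.

```python
from typing import Dict, Mapping

PREFIX = "graphforge."

# The annotation values shown in this section, as a plain mapping.
EXAMPLE_METADATA: Dict[str, str] = {
    "graphforge.ir_version": "1.0.0",
    "graphforge.ontology_version": "core-2026.05",
    "graphforge.result_kind": "node_table",
    "graphforge.confidence_policy": "conservative_minmax",
    "graphforge.query_id": "01J...",
    "other.tool": "ignored",  # unrelated metadata added for the example
}


def graphforge_annotations(metadata: Mapping[str, str]) -> Dict[str, str]:
    """Keep only graphforge.* entries and strip the prefix from their keys."""
    return {
        key[len(PREFIX):]: value
        for key, value in metadata.items()
        if key.startswith(PREFIX)
    }


def ir_major(metadata: Mapping[str, str]) -> int:
    """Extract the major component of graphforge.ir_version."""
    return int(graphforge_annotations(metadata)["ir_version"].split(".")[0])


def is_ir_compatible(metadata: Mapping[str, str], supported_major: int) -> bool:
    """Assumed policy: a reader accepts results with the same IR major version."""
    return ir_major(metadata) == supported_major


annotations = graphforge_annotations(EXAMPLE_METADATA)
```

Because the annotations travel inside the schema, a check like this works the same way whether the result arrived in-process, over IPC, or from a Parquet file.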
Multi-Language Bindings
| Language | Mechanism | Crate / Package | Result contract |
|---|---|---|---|
| Rust | Native crate API | gf-core | ExecutionResult { schema, batches, stats } |
| Python | PyO3 + maturin | gf-bindings-py | pyarrow.Table or RecordBatchReader |
| Node | napi-rs | gf-bindings-node | Arrow IPC Buffer → tableFromIPC(buf) |
| Swift | UniFFI | gf-bindings-uniffi + bindings/swift/ | Arrow IPC Data → GraphForgeResult |
| Kotlin | UniFFI | gf-bindings-uniffi + bindings/kotlin/ | Arrow IPC ByteArray → GraphForgeResult |
The architectural rule: never let a binding become the semantic owner of the language. Bindings translate results; the Rust core owns the semantics.
Swift and Kotlin bindings are generated by UniFFI from a shared UDL interface definition in gf-bindings-uniffi. Both languages receive Arrow IPC bytes and deserialise with their platform Arrow library. See ADR 0002 for the binding strategy rationale.
Migration Strategy
All Rust work targets main. Merge gates required before the v0.5.0 release:
| Gate | Requirement |
|---|---|
| Parser parity | Existing corpus + syntax goldens pass against RD+Pratt parser |
| OpenCypher conformance | Agreed TCK subset passes |
| Ontology runtime | Load/validate/migrate round-trips pass |
| Data contract | Arrow/Parquet/IPC round-trips pass |
| Provider baseline | Parquet provider passes core semantics |
| Binding baseline | Python, Node, Swift, and Kotlin can execute and consume Arrow results |
| Observability | explain, query IDs, provenance IDs, structured errors available |
References
- AST & Planning — recursive-descent/Pratt parser, three-tier IR, compiler pipeline
- Algorithm Verbs — full algorithm catalog across rank/cluster/paths/analyze/similar
- Execution Model — DataFusion integration, custom graph operators, Arrow result streams
- Storage — StorageProvider trait, Parquet provider
- ADR 0002: Rust Core — Decision record for the Rust refactor
- ADR 0003: LR(1) Grammar — Parser algorithm decision
- ADR 0004: Progressive Ontology — exploration-first ontology modes
- ADR 0005: Adjacency Index — graph-layer derived traversal accelerator
- ADR 0006: Layered Architecture — graph / knowledge / workbench boundaries
- ADR 0007: Epistemic Model — preserving the evolution of understanding
- Roadmap — Milestones and timeline