ADR 0002: Rust Core¶

Date: 2026-05-25 Status: Accepted Branch: rust-core

Context¶

GraphForge v0.4.x is a pure-Python implementation. It passes 100% of the openCypher TCK, ships three namespaced API surfaces (db.execute, db.gds, db.search), and uses SQLite for persistence. Those surface boundaries were an artifact of the Python architecture — the v0.5.0 API collapses them into seven analyst-intent verbs (forge.execute, forge.rank, forge.cluster, forge.paths, forge.analyze, forge.similar, forge.find) with a unified Arrow result contract.

The next architectural goal is multi-language support — a first-class Rust crate API, a Python binding that returns Arrow results, a Node binding that returns Arrow IPC, and Swift and Kotlin bindings via UniFFI. Three requirements drive this:

Language-independent results — downstream tools should not need to depend on the Python runtime or a Python-specific object model
Performance — variable-length path expansion, provenance tracking, and large-graph algorithms benefit from native code
Extensibility — the current pure-Python executor is difficult to extend with custom operators, storage providers, and planner rules

A secondary goal is to make the architecture explicit: the compiler pipeline, ontology model, and Graph IR should be first-class Rust artifacts rather than implicit Python conventions.

Decision¶

Refactor GraphForge to a Rust core with multi-language bindings.

All Rust work stages on the rust-core branch and merges to main only after the merge gates below are met.

Architecture Summary¶

Parser: Recursive Descent + Pratt expression parser¶

Chosen over LALRPOP, pest, and nom.

The parser is a hand-written recursive-descent clause parser with a Pratt expression parser subroutine — the architecture used by Neo4j's production Cypher parser, PostgreSQL, SQLite, and virtually every other production SQL/query-language parser. It is conflict-free by construction.

See ADR 0003 for the full rationale: LALRPOP was the original plan, but 50 structural shift/reduce conflicts proved unresolvable within LALRPOP's state-merging model. The hand-written parser resolves this permanently and is directly auditable against the openCypher BNF.

LALRPOP was the original approach; rejected because LALRPOP's state-merging produces structural conflicts in the full openCypher grammar that cannot be resolved without compromising correctness — see ADR 0003
pest was rejected: PEG-based, different parsing model, cannot reuse the hand-written Tok lexer
nom was rejected: parser-combinator library optimised for streaming/binary parsing; not a natural fit for a large declarative grammar

Execution: DataFusion + custom graph operators¶

Chosen over Polars-as-executor and a fully custom executor.

DataFusion exposes exactly the extension points GraphForge needs:

Custom LogicalPlan and ExecutionPlan nodes for graph-native operators
Custom TableProvider for Parquet, SQLite, DuckDB
Custom optimizer rules for predicate pushdown and label-constraint hoisting
A QueryPlanner hook so GraphForge IR is lowered directly rather than parsing SQL

Polars is an excellent dataframe engine but does not expose these extension points. A fully custom executor would require implementing scan interfaces, optimizer rules, batch streams, and partitioning from scratch — all of which DataFusion already provides.

Polars is retained as a storage-layer companion for IO/sinks (Parquet, CSV, JSON/NDJSON, IPC) and as an optional convenience layer in the Python binding. It is not the semantic owner of query execution.

Wire contract: Arrow¶

Arrow provides:

A stable, language-independent columnar memory format
Zero-copy in-process exchange via the C Data Interface
Python exchange via the PyCapsule Interface (no hard PyArrow dependency)
Node consumption via Arrow IPC and tableFromIPC in Apache Arrow JS

Arrow schema metadata carries GraphForge-specific annotations (ir_version, ontology_version, query_id, provenance_policy) that survive IPC and Parquet round-trips.

The AST is not the cross-language contract. The Graph IR (semver-versioned) and the Arrow result schema are the stable boundaries.

Storage: Parquet only (initial scope)¶

Parquet is the sole storage provider for the Rust core. Parquet file metadata carries GraphForge provenance annotations. The StorageProvider trait is designed for extension, but no additional backends (SQLite, DuckDB) are in scope for the initial implementation.

User-Facing API¶

v0.5.0 replaces the three namespaced surfaces (db.gds, db.search, db.execute) with seven analyst-intent verbs on a single forge instance:

Method	What it does	Returns
`forge.execute(cypher)`	Run an openCypher query	Arrow Table
`forge.rank(label, by=...)`	Score every node (centrality, structural, link prediction)	Arrow Table + `score`
`forge.cluster(label, by=...)`	Assign community membership	Arrow Table + `community_id`
`forge.paths(source, target, by=...)`	Find paths, compute flow, or traverse	Arrow Table + `cost`, `path`
`forge.analyze(label, by=...)`	Spanning trees, DAG analysis, coloring, matching, embeddings	Arrow Table (varies)
`forge.similar(label, by=...)`	Pairwise node similarity	Arrow Table + `similarity`
`forge.find(query, ...)`	Text/vector/hybrid search with lazy indexing	Arrow Table + `score` + `matched_on`

rank(), cluster(), and paths() accept write_property for opt-in graph mutation. find() indexes lazily on first call; forge.index() is available for explicit control.

Full algorithm catalog: docs/architecture/algorithms.md

Bindings¶

Language	Mechanism	Result type
Python	PyO3 + maturin	`pyarrow.Table` or `RecordBatchReader`
Node	napi-rs	Arrow IPC `Buffer` → `tableFromIPC(buf)`
Swift	UniFFI	Arrow IPC `Data` → `GraphForgeResult`
Kotlin	UniFFI	Arrow IPC `ByteArray` → `GraphForgeResult`

Ontology: runtime-loadable¶

The ontology is a runtime-loadable metadata model, not baked into Rust types at compile time. Authoring format is YAML or JSON (Serde); execution format is Arrow tables compiled at load time; persistence format is Parquet.

Migration Strategy¶

Fork-first¶

All Rust work is done on the rust-core branch. The Python 0.4.x codebase on main is the reference implementation and remains the production release until the merge gates below are met.

Merge gates¶

Gate	Requirement
Parser parity	Existing corpus + syntax goldens pass against recursive-descent + Pratt parser
openCypher conformance	Agreed TCK subset passes end-to-end
Ontology runtime	Load/validate/migrate round-trips pass
Data contract	Arrow/Parquet/IPC round-trips pass
Provider baseline	Parquet provider passes core semantics
Binding baseline	Python, Node, Swift, and Kotlin can execute queries and consume Arrow results
Observability	`explain`, query IDs, provenance IDs, structured errors available

Spike sequence (uncertainty reduction)¶

Parser spike — implement a thin openCypher subset with the recursive-descent + Pratt parser
Ontology spike — load YAML/JSON → Arrow tables → run validator
Execution spike — lower one MATCH query to DataFusion → return Arrow batches
Binding spike — deliver that Arrow payload to Python and Node
SQLite spike — confirm WAL + BEGIN IMMEDIATE + partial indexes + generated columns

Consequences¶

Positive¶

First-class Rust crate API, Python, Node, Swift, and Kotlin bindings sharing a single semantic core
Arrow as a stable, language-independent result contract eliminates per-language serialization
DataFusion's optimizer and execution framework accelerate complex query handling without building a full executor from scratch
Pluggable storage providers enable Parquet, SQLite, and DuckDB without changing query semantics
The ontology as a runtime-loadable model enables schema evolution without recompilation

Negative / Risks¶

The Rust refactor is a significant engineering investment; the rust-core branch will diverge from main for an extended period
The parity corpus and differential testing infrastructure need to be built alongside the new parser
DataFusion's extension API surface is stable but evolving; custom node implementations will need to track upstream changes
The merge gates are strict by design; shipping pressure should not lower them

Mitigations¶

Strict fork-first strategy with defined merge gates avoids "perpetual refactor" drift
Spike tasks are sequenced to collapse uncertainty fastest before committing to the full implementation
The Python 0.4.x release is not blocked; it continues to ship from main during the refactor

Alternatives Considered¶

Alternative	Rejected because
Polars as primary executor	Does not expose compiler extension points (custom plan nodes, optimizer rules, table providers) needed for graph-native semantics
Fully custom Rust executor	Front-loads the highest-risk work (scan interfaces, optimizer, batch streams) that DataFusion already provides
Stay Python	Cannot deliver first-class Rust or Node bindings; native path expansion and provenance tracking remain slow
DuckDB as semantic owner	DuckDB is an excellent accelerator but its extension model is not designed for owning a graph IR and custom operator semantics

References¶

Architecture Overview
AST & Planning
Execution Model
Storage Architecture
ADR 0006: Layered Architecture — the graph/knowledge/workbench layer model and crate-boundary rule that organises the crates listed above