Skip to content

ADR 0002: Rust Core

Date: 2026-05-25 Status: Accepted Branch: rust-core


Context

GraphForge v0.4.x is a pure-Python implementation. It passes 100% of the openCypher TCK, ships three namespaced API surfaces (db.execute, db.gds, db.search), and uses SQLite for persistence. Those surface boundaries were an artifact of the Python architecture — the v0.5.0 API collapses them into seven analyst-intent verbs (forge.execute, forge.rank, forge.cluster, forge.paths, forge.analyze, forge.similar, forge.find) with a unified Arrow result contract.

The next architectural goal is multi-language support — a first-class Rust crate API, a Python binding that returns Arrow results, a Node binding that returns Arrow IPC, and Swift and Kotlin bindings via UniFFI. Three requirements drive this:

  1. Language-independent results — downstream tools should not need to depend on the Python runtime or a Python-specific object model
  2. Performance — variable-length path expansion, provenance tracking, and large-graph algorithms benefit from native code
  3. Extensibility — the current pure-Python executor is difficult to extend with custom operators, storage providers, and planner rules

A secondary goal is to make the architecture explicit: the compiler pipeline, ontology model, and Graph IR should be first-class Rust artifacts rather than implicit Python conventions.


Decision

Refactor GraphForge to a Rust core with multi-language bindings.

All Rust work stages on the rust-core branch and merges to main only after the merge gates below are met.


Architecture Summary

Parser: Recursive Descent + Pratt expression parser

Chosen over LALRPOP, pest, and nom.

The parser is a hand-written recursive-descent clause parser with a Pratt expression parser subroutine — the architecture used by Neo4j's production Cypher parser, PostgreSQL, SQLite, and virtually every other production SQL/query-language parser. It is conflict-free by construction.

See ADR 0003 for the full rationale: LALRPOP was the original plan, but 50 structural shift/reduce conflicts proved unresolvable within LALRPOP's state-merging model. The hand-written parser resolves this permanently and is directly auditable against the openCypher BNF.

  • LALRPOP was the original approach; rejected because LALRPOP's state-merging produces structural conflicts in the full openCypher grammar that cannot be resolved without compromising correctness — see ADR 0003
  • pest was rejected: PEG-based, different parsing model, cannot reuse the hand-written Tok lexer
  • nom was rejected: parser-combinator library optimised for streaming/binary parsing; not a natural fit for a large declarative grammar

Execution: DataFusion + custom graph operators

Chosen over Polars-as-executor and a fully custom executor.

DataFusion exposes exactly the extension points GraphForge needs:

  • Custom LogicalPlan and ExecutionPlan nodes for graph-native operators
  • Custom TableProvider for Parquet, SQLite, DuckDB
  • Custom optimizer rules for predicate pushdown and label-constraint hoisting
  • A QueryPlanner hook so GraphForge IR is lowered directly rather than parsing SQL

Polars is an excellent dataframe engine but does not expose these extension points. A fully custom executor would require implementing scan interfaces, optimizer rules, batch streams, and partitioning from scratch — all of which DataFusion already provides.

Polars is retained as a storage-layer companion for IO/sinks (Parquet, CSV, JSON/NDJSON, IPC) and as an optional convenience layer in the Python binding. It is not the semantic owner of query execution.

Wire contract: Arrow

Arrow provides:

  • A stable, language-independent columnar memory format
  • Zero-copy in-process exchange via the C Data Interface
  • Python exchange via the PyCapsule Interface (no hard PyArrow dependency)
  • Node consumption via Arrow IPC and tableFromIPC in Apache Arrow JS

Arrow schema metadata carries GraphForge-specific annotations (ir_version, ontology_version, query_id, provenance_policy) that survive IPC and Parquet round-trips.

The AST is not the cross-language contract. The Graph IR (semver-versioned) and the Arrow result schema are the stable boundaries.

Storage: Parquet only (initial scope)

Parquet is the sole storage provider for the Rust core. Parquet file metadata carries GraphForge provenance annotations. The StorageProvider trait is designed for extension, but no additional backends (SQLite, DuckDB) are in scope for the initial implementation.

User-Facing API

v0.5.0 replaces the three namespaced surfaces (db.gds, db.search, db.execute) with seven analyst-intent verbs on a single forge instance:

Method What it does Returns
forge.execute(cypher) Run an openCypher query Arrow Table
forge.rank(label, by=...) Score every node (centrality, structural, link prediction) Arrow Table + score
forge.cluster(label, by=...) Assign community membership Arrow Table + community_id
forge.paths(source, target, by=...) Find paths, compute flow, or traverse Arrow Table + cost, path
forge.analyze(label, by=...) Spanning trees, DAG analysis, coloring, matching, embeddings Arrow Table (varies)
forge.similar(label, by=...) Pairwise node similarity Arrow Table + similarity
forge.find(query, ...) Text/vector/hybrid search with lazy indexing Arrow Table + score + matched_on

rank(), cluster(), and paths() accept write_property for opt-in graph mutation. find() indexes lazily on first call; forge.index() is available for explicit control.

Full algorithm catalog: docs/architecture/algorithms.md

Bindings

Language Mechanism Result type
Python PyO3 + maturin pyarrow.Table or RecordBatchReader
Node napi-rs Arrow IPC BuffertableFromIPC(buf)
Swift UniFFI Arrow IPC DataGraphForgeResult
Kotlin UniFFI Arrow IPC ByteArrayGraphForgeResult

Ontology: runtime-loadable

The ontology is a runtime-loadable metadata model, not baked into Rust types at compile time. Authoring format is YAML or JSON (Serde); execution format is Arrow tables compiled at load time; persistence format is Parquet.


Migration Strategy

Fork-first

All Rust work is done on the rust-core branch. The Python 0.4.x codebase on main is the reference implementation and remains the production release until the merge gates below are met.

Merge gates

Gate Requirement
Parser parity Existing corpus + syntax goldens pass against recursive-descent + Pratt parser
openCypher conformance Agreed TCK subset passes end-to-end
Ontology runtime Load/validate/migrate round-trips pass
Data contract Arrow/Parquet/IPC round-trips pass
Provider baseline Parquet provider passes core semantics
Binding baseline Python, Node, Swift, and Kotlin can execute queries and consume Arrow results
Observability explain, query IDs, provenance IDs, structured errors available

Spike sequence (uncertainty reduction)

  1. Parser spike — implement a thin openCypher subset with the recursive-descent + Pratt parser
  2. Ontology spike — load YAML/JSON → Arrow tables → run validator
  3. Execution spike — lower one MATCH query to DataFusion → return Arrow batches
  4. Binding spike — deliver that Arrow payload to Python and Node
  5. SQLite spike — confirm WAL + BEGIN IMMEDIATE + partial indexes + generated columns

Consequences

Positive

  • First-class Rust crate API, Python, Node, Swift, and Kotlin bindings sharing a single semantic core
  • Arrow as a stable, language-independent result contract eliminates per-language serialization
  • DataFusion's optimizer and execution framework accelerate complex query handling without building a full executor from scratch
  • Pluggable storage providers enable Parquet, SQLite, and DuckDB without changing query semantics
  • The ontology as a runtime-loadable model enables schema evolution without recompilation

Negative / Risks

  • The Rust refactor is a significant engineering investment; the rust-core branch will diverge from main for an extended period
  • The parity corpus and differential testing infrastructure need to be built alongside the new parser
  • DataFusion's extension API surface is stable but evolving; custom node implementations will need to track upstream changes
  • The merge gates are strict by design; shipping pressure should not lower them

Mitigations

  • Strict fork-first strategy with defined merge gates avoids "perpetual refactor" drift
  • Spike tasks are sequenced to collapse uncertainty fastest before committing to the full implementation
  • The Python 0.4.x release is not blocked; it continues to ship from main during the refactor

Alternatives Considered

Alternative Rejected because
Polars as primary executor Does not expose compiler extension points (custom plan nodes, optimizer rules, table providers) needed for graph-native semantics
Fully custom Rust executor Front-loads the highest-risk work (scan interfaces, optimizer, batch streams) that DataFusion already provides
Stay Python Cannot deliver first-class Rust or Node bindings; native path expansion and provenance tracking remain slow
DuckDB as semantic owner DuckDB is an excellent accelerator but its extension model is not designed for owning a graph IR and custom operator semantics

References