Skip to content

ADR 0005: Graph-Native Adjacency Index

Date: 2026-06-07 Status: Accepted Milestone: M15: Adjacency Index Baseline for v0.5.0


Context

GraphForge is a Knowledge Analysis Workbench. Its core workloads — neighborhood discovery, variable-length traversal, reachability, path finding, and graph algorithms — are fundamentally adjacency-oriented. The v0.5 storage architecture (refactor-v0.5.md, storage.md) makes analytical scans efficient via typed edge tables, but graph traversal still resolves adjacency through relational execution patterns.

Two independent pieces of code already compute adjacency, ephemerally, and they do so twice:

  1. VarLenExpand (shipped, M14 #580). build_adjacency() in crates/gf-exec/src/lib.rs rebuilds a HashMap<u64, Vec<(edge_id, neighbor_id)>> from a full typed-edge scan on every query. The BFS (bfs_emit) then walks that map. Cost is paid per query and scales with total edges, not with the neighborhood actually visited.

  2. Analyst verbs (planned, M18 #610). export_adjacency(label, via, directed, weight) is specified to read the same typed edge tables into an AdjacencyGraph for the AlgorithmBackend trait. All five verbs (rank, cluster, paths, analyze, similar) depend on it.

Left alone, v0.5 would ship two divergent adjacency readers — one in the Cypher traversal path, one in the analyst-verb path — that must later be reconciled. The opportunity is to unify them behind a single derived, rebuildable index before #610 is implemented.

The key question is not "can GraphForge ship without an adjacency index?" (it can). It is whether introducing one now reduces net complexity. Because the concept already exists in two places, the answer is yes: this decision consolidates scope rather than adding it.


Decision

Introduce a graph-native adjacency index as a derived execution artifact, and route both the Cypher traversal path and the analyst-verb path through a single AdjacencyProvider abstraction.

Canonical vs derived

  • Parquet topology remains the sole source of truth. A graph is fully valid and queryable with no adjacency artifacts present.
  • The adjacency index is derived, rebuildable, versioned, and optional. It is an execution accelerator, never a correctness dependency.
  • A stale or missing index MUST never produce incorrect results — only slower ones. The provider detects staleness (topology generation mismatch) and falls back to scan-and-build.

Surrogate-only

Adjacency structures operate exclusively on node_id / edge_id / src_id / dst_id (UInt64 surrogates). UUIDs are public identity and are resolved only at the API boundary, as they already are throughout execution. This matches the existing build_adjacency contract.

One provider, no duplication

Both consumers obtain adjacency through one AdjacencyProvider:

  • VarLenExpand (#580) stops calling its private build_adjacency() and asks the provider.
  • export_adjacency (#610) becomes a thin adapter from the provider to AdjacencyGraph, not a second Parquet reader.

After this decision there is exactly one adjacency implementation in the codebase.

No Graph IR change

This is deliberately a storage + execution concern, invisible to the semver-versioned Graph IR (M12, shipped). No new GraphOp is introduced. Variable-length traversal continues to be encoded on the existing Expand { …, min_hops, max_hops } operator. Whether a traversal uses the adjacency index or a DataFusion hash join is a lowering/planner choice with identical semantics — selected by a new lowering rule (a follow-on to the closed M13 #576/#577), not by the IR.

Physical accelerator operators for the Cypher path (ShortestPathExec, ReachabilityExec) are explicitly out of scope for v0.5 — they are physical-only and deferred until a Cypher shortest-path surface needs them. The analyst verbs already dispatch to algorithm backends.

Storage location: indexes/adjacency/

The index lives under indexes/, not topology/. Rationale:

  1. topology/ is defined as the canonical, hot-path, identity-and-type-only tier. Derived data there would erode the "topology is the source of truth" invariant.
  2. indexes/ is already the established capability module for derived, rebuildable acceleration structures (Tantivy FTS lives there). Adjacency is the same kind of artifact.
  3. Capability-module semantics fit exactly: absent indexes/adjacency/ = build in-memory on-demand (today's behavior); present = load from disk. Never required.

On-disk layout (full schema in storage.md §Derived Indexes):

indexes/
└── adjacency/
    ├── index_manifest.parquet       # topology_generation, built_at, relation_types[], directions[]
    ├── <REL_TYPE>.out.csr           # CSR: offsets (UInt64) + targets (struct{edge_id, neighbor_id})
    ├── <REL_TYPE>.in.csr            # reverse direction
    └── _all.out.csr                 # union across relation types

CSR (compressed sparse row) is keyed by surrogate node_id. Build = stream typed-edge files once, sort by src_id, write offsets/targets. Staleness = topology_generation counter bumped on every topology write; provider compares and rebuilds (or builds in memory) on mismatch.


Consequences

Positive

  • Net complexity reduction. Two parallel adjacency builders collapse to one before the second (#610) is written.
  • Neighborhood-proportional traversal. Variable-length expansion cost scales with the visited frontier + neighborhood, independent of total edge count — the headline success criterion.
  • No per-query full edge scan. VarLenExpand stops rebuilding adjacency from a full scan on every call.
  • Roadmap acceleration. M18 (Rank and Cluster) is already blocked on adjacency (#610); building the real layer once unblocks five verbs with one foundation.
  • IR stability preserved. No change to the shipped, versioned Graph IR or to closed M13 lowering issues — only additive execution/storage work plus one lowering rule.

Negative / Risks

  • One new capability module + CSR format + manifest + versioning logic to maintain (bounded; mirrors the existing Tantivy precedent under indexes/).
  • A new lowering rule adds planner surface (additive, isolated).
  • Adjacency-backed and join-backed paths must provably agree — requires a differential correctness corpus in CI.

Mitigations

  • Differential correctness corpus (adjacency vs DataFusion join) enforced in CI.
  • Staleness fallback: a stale/missing index falls back to scan-and-build with identical results — the index can only affect speed, never correctness.
  • Incremental rebuild deferred to v0.5.1 with the gap logged; v0.5.0 ships full-rebuild + lazy/explicit build, which is sufficient for the success criteria.
  • Scope discipline: no IR operators; no physical Cypher accelerator operators in v0.5.

Alternatives Considered

Alternative Rejected because
Defer to v0.5.1 M18 would ship a standalone export_adjacency, VarLenExpand keeps its per-query full-scan builder, and v0.5.1 inherits two divergent adjacency paths plus a migration project — strictly more total work than doing it once now.
New IR operators (ShortestPathExec, ReachabilityExec as GraphOps) Leaks physical execution strategy into the stable semantic contract, violating the layering ADR 0002 establishes. Adjacency is a lowering/execution choice.
Store adjacency under topology/adjacency/ Erodes the "topology is identity+type only, and canonical" invariant. indexes/ already owns derived, rebuildable accelerators.
Require the index (make it canonical) Breaks the rebuildable/optional principle and the offline/merge story; a graph must remain valid with no derived artifacts.

References