GraphForge v0.5 Architecture Reference

Status: Architecture Baseline — decision record
Last Updated: 2026-05-31
Milestone: v0.5.0: Architecture Baseline

This document is the authoritative reference for architectural decisions made before and during the v0.5 refactor. Every subsequent milestone should treat it as the primary constraint. When an implementation decision conflicts with something here, update this document first, then update the implementation.

Product positioning: GraphForge is a Knowledge Analysis Workbench — not a graph database and not merely a graph analytics engine. The graph is one asset inside a larger analytical project. The workbench is designed for analysts who build knowledge from uncertain, incomplete, or heterogeneous data over time.

1. Architecture Diagram

┌─────────────────────────────────────────────────────────────────────┐
│                          User Surfaces                               │
│   Python (PyO3)  ·  Node (napi-rs)  ·  Swift/Kotlin (UniFFI)       │
│   forge.execute()  forge.rank()  forge.cluster()  forge.find()      │
└──────────────────────────┬──────────────────────────────────────────┘
                           │
         ┌─────────────────┴──────────────────────────────┐
         │  Cypher / GQL Path                              │  Analyst Verbs Path
         ▼                                                 │
   ┌─────────────┐                                         │  rank / cluster / paths /
   │   Parser    │  (gf-cypher)                            │  analyze / similar / find
   │ RD + Pratt  │                                         │  — bypass parser/binder
   └──────┬──────┘                                         │  — export adjacency or index
          ▼                                                 │  — dispatch algorithm backend
   ┌─────────────┐                                         │  — return scored Arrow batches
   │     AST     │  (gf-ast)                               │
   │  spans +    │                                         │
   │  syntax     │                                         │
   └──────┬──────┘                                         │
          ▼                                                 │
   ┌─────────────┐                                         │
   │   Binder    │  (gf-ir)                                │
   │  ontology   │  Resolves names → TypeId/PropId         │
   │  resolution │  Variable scope validation              │
   └──────┬──────┘                                         │
          ▼                                                 │
   ┌─────────────┐                                         │
   │  Graph IR   │  (gf-ir)  ← stable semantic contract   │
   │  GraphPlan  │  DataFusion-independent                 │
   │  + ExprArena│                                         │
   └──────┬──────┘                                         │
          ▼                                                 │
   ┌─────────────┐                                         │
   │  Relational │  (gf-rel)                               │
   │  Lowering   │  GraphOp → DataFusion LogicalPlan        │
   └──────┬──────┘                                         │
          ▼                                                 │
   ┌──────────────────────────────────────────────────────┐│
   │              DataFusion Execution Engine              ││
   │   LogicalPlan → Optimizer → PhysicalPlan → Execute   ││
   │   Custom operators: VarLenExpand, OptionalMatch,      ││
   │   PathUnique, ProvenanceSemijoin, OntologyInfer,      ││
   │   GraphMerge, TypedEdgeScan (NEW)                     ││
   └──────────────────────────────────────────────────────┘│
          │◄────────────────────────────────────────────────┘
          ▼
   ┌─────────────┐
   │Arrow Batches│  UUID columns in every output
   └──────┬──────┘
          ▼
   ┌──────────────────────────────────────────────────────┐
   │                  Storage Provider                     │
   │  topology/  ·  properties/  ·  documents/            │
   │  embeddings/  ·  provenance/  ·  indexes/             │
   └──────────────────────────────────────────────────────┘

1.5. Layering — Graph / Knowledge / Workbench

GraphForge is organised into three layers with strict boundaries (ADR 0006):

Graph layer — nodes, edges, properties, traversal, pattern matching, graph algorithms, the adjacency index. Lives in topology/ + properties/ + indexes/adjacency/. Graph-native and surrogate-keyed; stores no knowledge or workbench semantics.
Knowledge layer — provenance, confidence, evidence, ontology-inference lineage, and the epistemic model (assertions, status, supersession, reasoning, bitemporal valid-time; ADR 0007). Lives in provenance/ + knowledge/. Attaches to graph objects by UUID reference only.
Workbench layer — the analyst verbs, hybrid search, workflows, exploration, project envelope. Consumes the layers below and returns Arrow; holds no graph-semantic state.

Boundary rule: knowledge attaches by UUID reference, never by embedding columns on graph tables. Graph-native query results never depend on knowledge-layer data (a tested invariant). This is what keeps the traversal hot path lean and the model lightweight-embedded. The directory structure below maps directly onto these layers.

2. Project-Centric Directory Structure

GraphForge is a project-centric knowledge analysis workbench. The graph is one asset inside a larger analytical project — not the whole project.

Capability-Based Structure

Folders are capability modules. An absent folder means that capability is not yet enabled. Projects start simple and grow capabilities over time.

Template	topology	properties	documents	provenance	embeddings	indexes	workflows	artifacts	sync
Minimal	yes	yes	—	—	—	—	—	—	—
Exploratory investigation	yes	yes	yes	yes	—	—	—	—	—
Full workbench	yes	yes	yes	yes	yes	yes	yes	yes	yes
Published canonical	yes	yes	—	yes	—	—	—	—	—
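
As an illustration of the capability rule, the sketch below lists the capabilities a project directory currently has by checking which folders exist. The helper name `enabled_capabilities` is hypothetical and not part of GraphForge; the folder names are taken from the template table above.

```rust
use std::path::Path;

/// Capability folder names, in the column order of the template table.
const CAPABILITIES: [&str; 9] = [
    "topology", "properties", "documents", "provenance", "embeddings",
    "indexes", "workflows", "artifacts", "sync",
];

/// Hypothetical helper: a capability counts as enabled when its folder exists.
/// An absent folder is treated as "not yet enabled", never as an error.
fn enabled_capabilities(project_root: &Path) -> Vec<&'static str> {
    CAPABILITIES
        .iter()
        .copied()
        .filter(|name| project_root.join(name).is_dir())
        .collect()
}
```

A project created from the Minimal template would report only `topology` and `properties`; adding a `provenance/` folder later would extend the list without any other change.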

Full Directory Tree

my-investigation/
├── graphforge.yaml          # project manifest (version, name, ontology path, ontology_mode, ir_version)
├── ontology.yaml            # domain ontology (optional — see Section 3.5)
│
├── topology/                # hot path — identity + type only, no properties
│   ├── nodes.parquet        # node_uuid(BINARY16), node_id(UInt64), type_id(UInt32), created_at, updated_at
│   └── edges/               # typed edge tables — one file per relation type
│       ├── WORKS_AT.parquet    # edge_uuid, src_uuid, dst_uuid, edge_id(UInt64), src_id(UInt64), dst_id(UInt64), created_at, confidence, provenance_uuid
│       ├── OWNS.parquet
│       ├── MENTIONS.parquet
│       └── LOCATED_IN.parquet
│
├── properties/              # warm path — columnar property storage
│   ├── Person.parquet       # node_uuid, name, email, age, ...
│   ├── Organization.parquet # node_uuid, org_name, country, ...
│   └── Document.parquet     # doc_uuid, title, date, content_hash, ...
│
├── documents/               # raw source documents
│   ├── {doc_uuid}.pdf
│   └── {doc_uuid}.json
│
├── embeddings/              # vector stores, one subdirectory per model
│   ├── text-embedding-3/
│   │   └── {embedding_uuid}.parquet  # entity_uuid, vector FIXED_SIZE_LIST(Float32, N)
│   └── custom-model/
│
├── provenance/              # provenance events + lineage graph
│   ├── events.parquet       # provenance_uuid, kind, source_ref, analyst_uuid, created_at, ...
│   └── lineage.parquet      # parent_uuid, child_uuid, role, weight
│
├── indexes/                 # derived indexes: full-text search (Tantivy); adjacency index under indexes/adjacency/ (ADR 0005)
│   └── Person/
│       └── tantivy/
│
├── workflows/               # saved analysis pipelines (serialised GraphPlan JSON)
│   └── {workflow_uuid}.json
│
├── artifacts/               # outputs — ranked lists, clusters, exported reports
│   ├── {artifact_uuid}_rank.parquet
│   └── {artifact_uuid}_cluster.parquet
│
└── sync/                    # changeset-based collaboration (post-v0.5)
    ├── local/               # outbound changeset queue
    ├── inbound/             # received changesets
    ├── checkpoints/         # last-synced state per remote
    └── merge_history/       # reconciliation log

Also present in exploratory mode:

└── topology/
    ├── edges/
    │   └── _exploratory.parquet  # catch-all for unknown relation types
    └── runtime_catalog.parquet   # RuntimeCatalog: observed labels/types/properties

Key Design Decisions

Typed edge tables (topology/edges/TYPENAME.parquet) — each relation type has its own Parquet file. DataFusion can scan WORKS_AT.parquet directly without filtering a unified edges table. Critical for 100M+ edge graphs. In exploratory mode, unknown relation types fall back to _exploratory.parquet.
Topology/properties split — topology/ holds only identity + type; properties/ holds columns. Enables column pruning: a graph traversal never reads property data unless the query explicitly projects it.
Typed property files — one Parquet per entity type. Schema evolution is per-type. Adding a column to Person.parquet does not affect Organization.parquet.
Progressive ontology — ontology.yaml is optional. See Section 3.5.
Capability modules — folders are capabilities. Absent folders are not-yet-enabled, not misconfigured. A project can start as topology/ + properties/ and add provenance/, embeddings/, etc. incrementally.
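
The layout rules above can be sketched as two path helpers. This is a minimal illustration, not the storage provider's API: `edge_table_path`, `property_table_path`, and the `known_types` set (standing in for relation types declared by the ontology or promoted from the RuntimeCatalog) are assumptions made for the example.

```rust
use std::collections::HashSet;
use std::path::PathBuf;

/// Hypothetical helper: pick the Parquet file that holds edges of `rel_type`.
fn edge_table_path(root: &str, rel_type: &str, known_types: &HashSet<&str>) -> PathBuf {
    let file = if known_types.contains(rel_type) {
        // Known relation types get their own typed edge table.
        format!("{}.parquet", rel_type)
    } else {
        // Unknown relation types fall back to the exploratory catch-all.
        "_exploratory.parquet".to_string()
    };
    PathBuf::from(root).join("topology").join("edges").join(file)
}

/// Hypothetical helper: property columns live in one file per entity type.
fn property_table_path(root: &str, entity_type: &str) -> PathBuf {
    PathBuf::from(root)
        .join("properties")
        .join(format!("{}.parquet", entity_type))
}
```

Because the relation type selects the file, a scan for one type never has to open the files of the others.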

3. Ontology Model

The ontology is already implemented in gf-ontology (M10). This section documents how it integrates with the new storage architecture.

The ontology.yaml file in the project root is the schema authority. It drives:

The binder — resolves label strings to TypeId/PropId integers
The relational lowering — tells the planner which typed edge file to scan for a given relation type
The storage provider — maps entity type names to properties/TYPENAME.parquet paths
Validation — load-time validation of the graph against declared constraints

The ontology is not a database schema enforced on write. It is a semantic description that enables optimized planning and validation. Data that violates the ontology can still be stored; the planner and validator will flag violations.

The ontology is optional. See Section 3.5 for progressive ontology modes.

3.5. Progressive Ontology

GraphForge supports three ontology modes to accommodate the full analyst spectrum — from exploratory discovery to production-grade validated graphs. See ADR 0004 for the full rationale and design.

Three modes

Mode	When	Binder behaviour	Violations
`exploratory`	No ontology.yaml present	Creates RuntimeTypeId for unknowns via RuntimeCatalog	None — all labels accepted
`advisory`	ontology.yaml present, mode unset or explicit	Warns on unknowns; adds to RuntimeCatalog	Warnings only
`strict`	`ontology_mode: strict` in graphforge.yaml	BindError on any unknown	Hard errors
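
One way to read the table is as a small resolution function over two inputs: the `ontology_mode` value declared in the manifest (if any) and whether an ontology.yaml is present. The sketch below is hypothetical. In particular, rejecting combinations the table does not list (for example `strict` without an ontology file) is an assumption of this example, not documented behaviour.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum OntologyMode {
    Exploratory,
    Advisory,
    Strict,
}

/// Hypothetical resolver for the mode table.
/// `declared` is the manifest's `ontology_mode`, `ontology_present` says
/// whether an ontology.yaml was found.
fn resolve_mode(declared: Option<&str>, ontology_present: bool) -> Result<OntologyMode, String> {
    match (declared, ontology_present) {
        (None, false) | (Some("exploratory"), false) => Ok(OntologyMode::Exploratory),
        // Advisory is the default once an ontology file exists.
        (None, true) | (Some("advisory"), true) => Ok(OntologyMode::Advisory),
        // Strict must be declared explicitly.
        (Some("strict"), true) => Ok(OntologyMode::Strict),
        // Assumption: anything else is reported instead of silently coerced.
        (Some(other), present) => Err(format!(
            "ontology_mode `{}` is not valid when ontology.yaml present = {}",
            other, present
        )),
    }
}
```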

`graphforge.yaml` manifest

The project manifest includes ontology_mode and an optional ontology path:

# Exploratory project — no ontology required
project_uuid: "0195f3a2-..."
name: "Investigation Alpha"
version: "1"
ontology_mode: exploratory
ontology: null
ir_version: "0.1.0"
graphforge_version: "0.5.0"
capabilities:          # absent = false; only list enabled capabilities
  topology: true
  properties: true
  documents: true
  provenance: true

# Advisory project — ontology present but non-enforcing
project_uuid: "0195f3a3-..."
name: "Entity Graph v2"
version: "1"
ontology_mode: advisory      # default when ontology.yaml present
ontology: "./ontology.yaml"

# Strict project — fully validated
project_uuid: "0195f3a4-..."
name: "Production Knowledge Base"
version: "1"
ontology_mode: strict        # must be explicitly declared
ontology: "./ontology.yaml"

RuntimeCatalog

The RuntimeCatalog is an auto-growing registry present in all modes:

RuntimeCatalog
├── entity_types      label → RuntimeTypeId (UInt32, session-local)
├── relation_types    name  → RuntimeTypeId
├── property_names    name  → RuntimePropId
├── statistics        observation counts, first/last-seen timestamps
└── inference_hints   suggested ontology entries (for export)

The catalog is persisted to topology/runtime_catalog.parquet and can be exported to ontology.yaml via forge.suggest_ontology().

Storage in exploratory mode

Typed edge tables require predefined relation type names. In exploratory mode:

- Unknown relation types write to topology/edges/_exploratory.parquet (includes rel_type_name: Utf8 column)
- Known types (from RuntimeCatalog once promoted) use typed edge tables
- A maintenance operation can reorganise rows into typed files after the ontology is formalised

Binder open-world resolution (exploratory/advisory modes)

bind(label: &str) → TypeId:
1. Check OntologyHandle (if present): return TypeId if found
2. Check RuntimeCatalog: return existing RuntimeTypeId if previously seen
3. New label: RuntimeCatalog::intern(label) → new RuntimeTypeId
4. Return RuntimeTypeId for use in GraphOp

In strict mode, step 3 is replaced by BindError::UnknownLabel.
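
The four steps, plus the strict-mode variant, can be sketched in Rust as follows. The types here are simplified stand-ins invented for the example: the ontology is modelled as a plain `HashMap`, `TypeId` is an enum that distinguishes declared from runtime ids, and the `strict` flag replaces the full mode enum. None of this is the actual gf-ir binder.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TypeId {
    Ontology(u32), // declared in ontology.yaml
    Runtime(u32),  // interned by the RuntimeCatalog
}

#[derive(Debug, PartialEq, Eq)]
enum BindError {
    UnknownLabel(String),
}

#[derive(Default)]
struct RuntimeCatalog {
    entity_types: HashMap<String, u32>,
}

impl RuntimeCatalog {
    /// Return the existing id for `label`, or assign the next free one.
    fn intern(&mut self, label: &str) -> u32 {
        let next = self.entity_types.len() as u32;
        *self.entity_types.entry(label.to_string()).or_insert(next)
    }
}

fn bind(
    label: &str,
    ontology: Option<&HashMap<String, u32>>,
    catalog: &mut RuntimeCatalog,
    strict: bool,
) -> Result<TypeId, BindError> {
    // Step 1: the ontology wins when it declares the label.
    if let Some(id) = ontology.and_then(|o| o.get(label)) {
        return Ok(TypeId::Ontology(*id));
    }
    // Step 2: reuse a runtime id that was assigned earlier.
    if let Some(id) = catalog.entity_types.get(label) {
        return Ok(TypeId::Runtime(*id));
    }
    // Step 3: strict mode rejects the label; other modes intern it.
    if strict {
        return Err(BindError::UnknownLabel(label.to_string()));
    }
    // Step 4: hand the new runtime id back to the caller.
    Ok(TypeId::Runtime(catalog.intern(label)))
}
```

Binding the same unknown label twice returns the same runtime id, which is what lets later queries in a session refer to types discovered by earlier ones.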

IR envelope

GraphPlan.ontology_version is optional:

GraphPlan {
    ir_version: IrVersion,
    dialect: String,
    ontology_version: Option<OntologyVersion>,  // None in exploratory mode
    ontology_mode: OntologyMode,                // exploratory | advisory | strict
    ops: Vec<GraphOp>,
    exprs: ExprArena,
}

Migration path

exploratory → advisory → strict

Transitions are always the analyst's choice. Moving to a stricter mode never deletes data — it only adds validation.

4. Graph IR Model

The Graph IR (gf-ir) is the stable semantic contract between the compiler and the execution engine. It is deliberately DataFusion-independent — the IR can be serialized, stored, inspected, and replayed without DataFusion being present.

Operators

Operator	Description
`NodeScan { var, ty }`	Scan all nodes of a type (or all types if `ty` is `None`)
`TypedEdgeScan { var, rel_ty }`	NEW — scan a specific typed edge table directly
`Expand { src, edge, dst, rel_ty, dir, min_hops, max_hops }`	Traverse edges from a bound node; the hop bounds encode variable-length traversal
`Filter { predicate }`	Apply a boolean expression
`Project { items, distinct }`	Select and alias output columns
`Aggregate { group_by, aggs }`	Grouping aggregations
`Sort { keys }`	ORDER BY
`Limit / Skip`	Pagination
`Optional { child }`	OPTIONAL MATCH
`Union { all, inputs }`	UNION / UNION ALL
`Unwind { list_expr, alias }`	List expansion
`With { items, where_predicate }`	Pipeline stage
`Create / Merge { pattern }`	Write operations (M13)

TypedEdgeScan is the key addition for typed edge table storage — the binder emits this operator when a pattern specifies a relation type, bypassing a full edges scan.

Adjacency note (ADR 0005). The graph-native adjacency index introduces no new IR operator. Variable-length traversal remains encoded on Expand { …, min_hops, max_hops }. Whether a traversal is executed via the adjacency index or a DataFusion hash join is a lowering/planner choice with identical semantics — not part of this stable semantic contract. See Execution Model §Adjacency-Backed Execution.
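
To make the encoding concrete, the sketch below models a trimmed-down, hypothetical subset of the operator set and shows that a variable-length traversal is just an `Expand` with a hop range. The field selection is abbreviated, and the rule "anything other than exactly one hop is variable-length" is an assumption of this example.

```rust
/// A hypothetical subset of the operator table, with abbreviated fields.
#[derive(Debug, Clone, PartialEq)]
enum GraphOp {
    NodeScan { var: String, ty: Option<u32> },
    TypedEdgeScan { var: String, rel_ty: u32 },
    Expand { src: String, dst: String, rel_ty: Option<u32>, min_hops: u32, max_hops: u32 },
}

/// Assumption for this sketch: any hop range other than exactly one hop
/// counts as a variable-length traversal.
fn is_variable_length(op: &GraphOp) -> bool {
    matches!(
        op,
        GraphOp::Expand { min_hops, max_hops, .. } if (*min_hops, *max_hops) != (1, 1)
    )
}
```

Nothing in the operator says how the traversal is executed, which is why the adjacency index can be introduced purely as a lowering choice.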

IR Envelope

GraphPlan {
    ir_version: IrVersion,          // semver, current = 0.1.0
    dialect: String,                // "opencypher-9" | "gql-1"
    ontology_version: Option<OntologyVersion>,  // checksum of ontology used at bind time; None in exploratory mode
    ontology_mode: OntologyMode,    // exploratory | advisory | strict
    feature_flags: Vec<String>,
    ops: Vec<GraphOp>,
    exprs: ExprArena,
}

5. Identity Model

Decision: UUIDv7 as Canonical Identity

Every first-class object in GraphForge has a globally unique immutable identifier.

The chosen format is UUIDv7 (RFC 9562):

- Time-ordered: the leading 48 bits are a Unix millisecond timestamp, so values sort by creation time; ordering within the same millisecond holds only when the generator applies one of the RFC's optional monotonicity methods
- Globally unique without coordination
- 128-bit: fits in Arrow FixedSizeBinary(16)
- Suitable for Parquet row-group statistics (time-ordering enables range pruning)
- Supports offline generation on mobile devices, air-gapped systems, disconnected analysts
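
The bit layout behind these properties can be shown with a small sketch that assembles a UUIDv7 from its parts. It follows the field layout in RFC 9562 (48-bit millisecond timestamp, version nibble, variant bits, random fill). The function names are hypothetical, and the random bytes are passed in by the caller so the example needs no external crate.

```rust
/// Assemble a UUIDv7 from a Unix timestamp in milliseconds and ten
/// caller-supplied random bytes (six of those bits are overwritten below).
fn uuid_v7_from_parts(unix_ms: u64, random: [u8; 10]) -> [u8; 16] {
    let mut bytes = [0u8; 16];
    // Bytes 0..6: low 48 bits of the timestamp, big-endian.
    bytes[..6].copy_from_slice(&unix_ms.to_be_bytes()[2..]);
    // Bytes 6..16: random payload.
    bytes[6..].copy_from_slice(&random);
    // Version 7 goes in the high nibble of byte 6.
    bytes[6] = 0x70 | (bytes[6] & 0x0F);
    // The RFC variant `10` goes in the top two bits of byte 8.
    bytes[8] = 0x80 | (bytes[8] & 0x3F);
    bytes
}

/// Recover the millisecond timestamp from the first six bytes.
fn timestamp_ms(uuid: &[u8; 16]) -> u64 {
    let mut buf = [0u8; 8];
    buf[2..].copy_from_slice(&uuid[..6]);
    u64::from_be_bytes(buf)
}
```

Because the timestamp occupies the most significant bytes, comparing two values byte by byte orders them by creation millisecond, which is the property Parquet row-group statistics rely on.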

Objects Requiring UUID Identity

Every object in the following categories receives a UUID at creation time and retains it for its entire lifecycle, regardless of where it moves or how the project is restructured:

Object	UUID column name	Notes
Node (entity)	`node_uuid`	Canonical entity identity
Edge (relationship)	`edge_uuid`	Canonical relationship identity
Document	`doc_uuid`	Source document
Observation	`obs_uuid`	A recorded observation
Assertion	`assertion_uuid`	A derived or inferred claim
Provenance event	`provenance_uuid`	Lineage record
Analyst / User	`analyst_uuid`	Contributor identity
Project	`project_uuid`	Project-level identity
Workflow	`workflow_uuid`	Saved analysis pipeline
Embedding	`embedding_uuid`	Vector artifact
Source reference	`source_uuid`	External system / document source
Ranking output row	`rank_uuid`	Reproducible result reference
Clustering output row	`cluster_uuid`	Reproducible result reference
Generated artifact	`artifact_uuid`	Any analytical output

UUID + Surrogate Key Pattern

UUIDs are the canonical stable identity. Integer surrogate keys are a per-session execution optimization.

topology/nodes.parquet columns:
  node_uuid   BINARY(16)   -- UUIDv7, globally unique, immutable
  node_id     UInt64       -- local surrogate, assigned at load time
  type_id     UInt32       -- ontology entity type integer
  created_at  Timestamp
  updated_at  Timestamp

topology/edges/WORKS_AT.parquet columns:
  edge_uuid         BINARY(16)   -- UUIDv7
  src_uuid          BINARY(16)   -- references node_uuid
  dst_uuid          BINARY(16)   -- references node_uuid
  edge_id           UInt64       -- local surrogate
  src_id            UInt64       -- local surrogate (for DataFusion joins)
  dst_id            UInt64       -- local surrogate
  created_at        Timestamp
  confidence        Float64
  provenance_uuid   BINARY(16)

The relational lowering layer maps node_uuid → node_id once at scan time, then all DataFusion joins use integer surrogates. Results are projected back to UUIDs before returning to the user.
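
A minimal sketch of that round trip, using in-memory collections in place of Arrow arrays. The `SurrogateMap` type and its methods are hypothetical; the real lowering layer works on DataFusion plans rather than slices.

```rust
use std::collections::HashMap;

type Uuid = [u8; 16];

/// Dense local surrogates assigned in scan order, plus the reverse table
/// used to project results back to UUIDs.
struct SurrogateMap {
    to_id: HashMap<Uuid, u64>,
    to_uuid: Vec<Uuid>,
}

impl SurrogateMap {
    /// Build the mapping once, while scanning the node table.
    fn from_scan(node_uuids: &[Uuid]) -> Self {
        let mut to_id = HashMap::new();
        let mut to_uuid = Vec::new();
        for uuid in node_uuids {
            to_id.entry(*uuid).or_insert_with(|| {
                to_uuid.push(*uuid);
                (to_uuid.len() - 1) as u64
            });
        }
        SurrogateMap { to_id, to_uuid }
    }

    /// Rewrite (src_uuid, dst_uuid) pairs as integer pairs for joins.
    /// Edges whose endpoints were not scanned are skipped in this sketch.
    fn lower_edges(&self, edges: &[(Uuid, Uuid)]) -> Vec<(u64, u64)> {
        edges
            .iter()
            .filter_map(|(s, d)| Some((*self.to_id.get(s)?, *self.to_id.get(d)?)))
            .collect()
    }

    /// Project surrogate ids back to UUIDs; panics on an id that was never assigned.
    fn project(&self, ids: &[u64]) -> Vec<Uuid> {
        ids.iter().map(|&id| self.to_uuid[id as usize]).collect()
    }
}
```

The hash lookup on 16-byte keys happens once per row at scan time; every join after that compares 8-byte integers.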

Architectural Principle

Entity identity is globally unique.
Storage location is temporary.
Ownership is temporary.
Projects are temporary.
Identity is permanent.

6. Provenance Model (Knowledge Layer)

Layer + status (ADR 0006). Provenance is a knowledge-layer concern, not a graph-layer one. Graph objects carry a provenance_uuid reference; the events and lineage themselves live in the knowledge layer and are produced/interpreted there — never as graph semantics. Status: the schema below is Designed; provenance events are not yet written (gf-provenance is a stub, provenance_uuid is currently NULL on write). Making this real — writing events, propagating confidence — is owned by the Knowledge Layer Foundation milestone. The epistemic extension (assertions, status, supersession, evidence, reasoning, bitemporal valid-time) builds on this schema and is specified in ADR 0007.

Every fact (node, edge, assertion, ranking output) carries a provenance_uuid that references an entry in provenance/events.parquet.

provenance/events.parquet:
  provenance_uuid   BINARY(16)    -- UUID of this provenance event
  kind              Utf8           -- "ingestion" | "inference" | "assertion" | "merge"
  source_ref        BINARY(16)    -- UUID of source document or system
  analyst_uuid      BINARY(16)    -- UUID of the analyst who created this
  rule_id           Utf8          -- ontology inference rule ID (nullable)
  confidence        Float64       -- confidence score for this event
  query_id          Utf8          -- UUID of the query that produced this fact
  created_at        Timestamp

provenance/lineage.parquet:
  parent_uuid       BINARY(16)    -- provenance event that contributed to this one
  child_uuid        BINARY(16)    -- downstream provenance event
  role              Utf8          -- "input" | "derived" | "merged"
  weight            Float64       -- contribution weight

The conservative_min confidence policy: confidence of a derived fact = min(confidence of inputs).
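
Expressed as code, the policy is a single fold over the input confidences. The function below is a sketch: returning `None` for a fact with no inputs is an assumption of this example, since the text does not say how that case is handled.

```rust
/// Confidence of a derived fact under the conservative_min policy:
/// the smallest confidence among its inputs.
/// Assumption: a fact with no inputs has no derived confidence (`None`).
fn conservative_min(input_confidences: &[f64]) -> Option<f64> {
    input_confidences.iter().copied().reduce(f64::min)
}
```

For inputs with confidences 0.9, 0.4, and 0.7, the derived fact gets 0.4: one weak input caps the result no matter how strong the others are.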

7. Storage Architecture

Three Tiers

Tier 1: Topology (hot path — small files, frequently scanned)

The topology layer holds only identity and type. Graph traversal reads only this tier.

topology/nodes.parquet          -- all nodes, UUID + type + timestamps
topology/edges/TYPENAME.parquet -- one file per relation type

Tier 2: Properties (warm path — larger files, column-pruned)

Property access requires a join from topology UUID to property file.

properties/ENTITY_TYPE.parquet  -- one file per entity type, UUID + all properties

Tier 3: Analytical Artifacts (cold path — written once, read rarely)

documents/        raw source files
embeddings/       vector stores
provenance/       lineage records
indexes/          FTS indexes
workflows/        saved pipelines
artifacts/        ranked/clustered outputs

Performance Implications by Scale

Scale	Critical optimizations
1M–10M edges	Current design sufficient. Typed edge tables provide modest gains.
10M–100M edges	Typed edge tables become important — avoid scanning all edges to find one type. Topology/properties split reduces traversal memory.
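100M–1B edges	Typed edge tables are mandatory. Partition by a UUID-derived key for parallel scans; take it from the random tail or a hash of the UUID rather than the first byte, because the leading bytes of a UUIDv7 are timestamp bits and stay nearly constant. Surrogate IDs critical (avoid 128-bit joins in hot path). Consider time-range partitioning for temporal queries.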
1B+ edges	Out of scope for v0.5. Would require distributed execution (DataFusion Ballista or similar).

8. Revised Milestone Plan

The v0.5 milestones were renumbered when the adjacency index (ADR 0005) was adopted as a first-class layer. The Architecture Baseline was also moved to its true chronological slot (M11, between Ontology Runtime and Graph IR). GitHub milestone IDs are immutable; the M##: prefix in each title is the canonical sequence below.

Milestone	State	Notes
M11: Architecture Baseline	closed	Decision-record sprint (this document, UUIDv7, typed edges, topology/properties split, progressive ontology). Formerly numbered M18.
M12: Graph IR	closed	`GraphOp` set incl. `TypedEdgeScan`; `Expand { min_hops, max_hops }` encodes variable-length. No `VarLenExpand` IR operator.
M13: Relational Lowering	closed	`TypedEdgeScan` → typed-edge `TableProvider`; UUID→surrogate mapping; `VarLenExpandNode` logical stub (#577).
M14: Execution Baseline	open	Custom physical nodes incl. VarLenExpand (#580). Open issues are an edge-property-persistence cluster, orthogonal to adjacency.
M15: Adjacency Index Baseline	open	NEW. Consolidates the two ephemeral adjacency builders (VarLenExpand `build_adjacency`, analyst-verb `export_adjacency`) into one derived `AdjacencyProvider` + CSR index under `indexes/adjacency/`. No IR change. Gates M18. See ADR 0005.
M16: Bindings Baseline	open	No adjacency dependency.
M17: Conformance Hardening	open	No adjacency dependency (fuzzing may exercise the adjacency path once it lands).
M18: Rank and Cluster	open	Depends on M15. `export_adjacency` (#610) becomes a provider adapter, not a second reader.
M19: Find	open	No adjacency dependency. Tantivy FTS + vector store, also under `indexes/`.
M20: Swift and Kotlin Bindings	open	v0.5.1. No adjacency dependency.

Why a new milestone rather than folding adjacency into M18

Adjacency is already the root dependency of all five M18 verbs (#610) and already exists, ephemerally, inside shipped M14 VarLenExpand (build_adjacency). Building the real layer once, in its own milestone gated before M18, prevents a second throwaway adjacency reader and keeps the IR (M12) and closed lowering issues (M13) untouched. See ADR 0005 for the full tradeoff analysis.

Revised milestone dependency graph

M14 Execution Baseline
   ├──> M16 Bindings        (parallel, no adj dep)
   ├──> M17 Conformance     (parallel, no adj dep)
   ├──> M19 Find            (parallel, no adj dep)
   └──> M15 Adjacency Index ──> M18 Rank and Cluster ──> v0.5.0 close-out ──> M20 Swift/Kotlin (v0.5.1)

9. Migration Plan

There is no production data to migrate. The gf-storage and gf-exec crates are stubs returning NotImplemented. The gf-ontology implementation (M10) is compatible with the new architecture: the ontology drives the binder regardless of storage layout. The gf-ir primitives (M12 #565) are compatible. Milestone numbers here follow the current sequence in Section 8 (Graph IR is M12, Relational Lowering M13, Execution Baseline M14).

Migration path:

1. Write this document (Architecture Baseline milestone)
2. Update Graph IR (M12) issues #566–#568 to reference typed edge tables and UUID identity
3. Implement M12 as planned
4. When implementing M13/M14 storage, adopt the new layout from the start
5. There is no "migration" from old to new storage, because the storage implementation has not shipped yet

10. Performance Analysis

Current Roadmap (flat node_facts/edge_facts)

MATCH (a:Person)-[:WORKS_AT]->(b:Organization)
→ Scan all edges → Filter type = WORKS_AT → Join with nodes

At 100M edges: scanning all edges to find one type wastes I/O. No row-group pruning possible on type column when all types are mixed.

Revised Roadmap (typed edge tables)

MATCH (a:Person)-[:WORKS_AT]->(b:Organization)
→ TypedEdgeScan(WORKS_AT) → read WORKS_AT.parquet only → Join with topology

At 100M edges: if 1% of edges are WORKS_AT, we read 1M rows instead of 100M. 100x I/O reduction.

UUID Impact

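UUID overhead: 16 bytes per ID column, versus 4 bytes for a 32-bit integer ID (4x) or 8 bytes for the UInt64 surrogates used here (2x). Mitigated by:

- Surrogate integers for join columns (only UUID at API surface and for identity)
- Arrow dictionary encoding for repeated values
- Parquet binary column compression with ZSTD (the shared timestamp prefix of time-ordered UUIDv7 values compresses well; the random tail does not)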

Net additional storage: ~15–20% for UUID columns vs integer-only design. Acceptable given the merge/federation benefits.

11. Risks and Tradeoffs

Decision	Benefit	Risk	Mitigation
UUIDv7 identity	Merge-safe, offline-safe, globally unique	16 bytes vs 4 bytes; joins on 128-bit keys are slower than integer joins	Surrogate integer IDs for execution joins; UUID only at API surface
Typed edge tables	Direct scans, no type filter, partition stats	More files, schema evolution per-type	Ontology drives file layout; `_exploratory.parquet` catch-all in exploratory mode
Topology/properties split	Column pruning, less traversal memory	Extra join for property access	DataFusion handles this as a hash join; modern CPUs handle it cheaply
Project-centric structure	Multi-asset, future-proof, federation-ready	More complex than single graph file	graphforge.yaml manifest makes structure discoverable
Conservative min confidence	Simple, predictable, conservative	May undervalue high-quality inferences	Pluggable confidence policies; conservative_min is the safe default
Progressive ontology (3 modes)	Exploratory analysts can start immediately; structured projects get strict validation	Binder/storage provider complexity; advisory warning surface needed	Sensible defaults; RuntimeCatalog handles the exploratory→strict upgrade path

12. Collaboration Architecture (post-v0.5)

Four concepts support multi-analyst collaboration. They are distinct and should remain so.

Changesets — unit of change

A changeset is the atomic unit of knowledge contribution. It carries a UUIDv7, analyst UUID, timestamp, and contains one or more mutations (node/edge additions, property changes, assertions) plus their provenance.

Sync — movement of changesets

Sync moves changesets between projects. It is transport-layer: local filesystem copy, HTTP push/pull, or air-gapped USB transfer.

sync/
├── local/          # outbound changeset queue
├── inbound/        # received changesets pending merge
├── checkpoints/    # last-synced state per remote
└── merge_history/  # reconciliation log

Merge — semantic reconciliation

Merge reconciles overlapping knowledge. It operates on UUID identity and entity resolution, not file diffs. Provenance is preserved through merge; confidence scoring resolves conflicts. Merge is deterministic given the same input changesets regardless of arrival order.
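
Order independence follows naturally when the reconciliation step is commutative and associative. The sketch below illustrates the idea on edge facts keyed by UUID. It assumes, purely for the example, that a conflict on the same edge is resolved by keeping the higher confidence; the actual conflict policy is post-v0.5 work and is not specified here. `EdgeFact` and `merge` are hypothetical names.

```rust
use std::collections::BTreeMap;

type Uuid = [u8; 16];

/// One mutation in a changeset: the edge it asserts and its confidence.
#[derive(Debug, Clone, Copy, PartialEq)]
struct EdgeFact {
    edge_uuid: Uuid,
    confidence: f64,
}

/// Union the facts of all changesets by UUID.
/// Assumed conflict rule for this sketch: keep the higher confidence.
/// `max` is commutative and associative, so arrival order cannot matter.
fn merge(changesets: &[Vec<EdgeFact>]) -> BTreeMap<Uuid, f64> {
    let mut merged = BTreeMap::new();
    for fact in changesets.iter().flatten() {
        merged
            .entry(fact.edge_uuid)
            .and_modify(|c: &mut f64| *c = c.max(fact.confidence))
            .or_insert(fact.confidence);
    }
    merged
}
```

A policy that depended on which changeset arrived first (for example "last writer wins" by receipt time) would not have this property, which is why merge keys on UUID identity rather than on file or arrival order.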

Federation — trust and policy boundary

Federation governs who can sync, what can sync, redaction rules, trust policies, audit requirements, and federated learning policies. Federation is not sync — it is the policy layer above sync.

Readiness assessment (v0.5)

Concept	Ready now	Post-v0.5
Changeset	UUIDv7 identity, provenance model	Formal changeset envelope, signing
Sync	Project directory portable (zip and share)	Protocol, remotes, incremental transfer
Merge	UUID union of topology files	Entity resolution, confidence merge, conflict UI
Federation	—	Auth, trust policy, redaction rules, audit

The foundational decisions (UUIDv7 identity, project-centric structure, typed edge tables, provenance model) make all four concepts architecturally possible. No structural changes should be needed, only new crates/services layered on top.

References

Architecture Overview
AST & Planning
Storage Architecture
Execution Model
Algorithms
RFC 9562 — UUIDv7 specification
GraphAr project — concepts evaluated but not adopted as dependency