GraphForge v0.5 Architecture Reference
Status: Architecture Baseline — decision record
Last Updated: 2026-05-31
Milestone: v0.5.0: Architecture Baseline
This document is the authoritative reference for architectural decisions made before and during the v0.5 refactor. Every subsequent milestone should treat this document as the primary constraint document. When an implementation decision conflicts with something here, update this document first, then update the implementation.
Product positioning: GraphForge is a Knowledge Analysis Workbench — not a graph database and not merely a graph analytics engine. The graph is one asset inside a larger analytical project. The workbench is designed for analysts who build knowledge from uncertain, incomplete, or heterogeneous data over time.
1. Architecture Diagram
┌─────────────────────────────────────────────────────────────────────┐
│                            User Surfaces                            │
│       Python (PyO3) · Node (napi-rs) · Swift/Kotlin (UniFFI)        │
│    forge.execute()  forge.rank()  forge.cluster()  forge.find()     │
└──────────────────────────┬──────────────────────────────────────────┘
                           │
         ┌─────────────────┴──────────────────────────────┐
         │ Cypher / GQL Path                              │ Analyst Verbs Path
         ▼                                                │
  ┌─────────────┐                                         │ rank / cluster / paths /
  │   Parser    │ (gf-cypher)                             │ analyze / similar / find
  │ RD + Pratt  │                                         │ - bypass parser/binder
  └──────┬──────┘                                         │ - export adjacency or index
         ▼                                                │ - dispatch algorithm backend
  ┌─────────────┐                                         │ - return scored Arrow batches
  │     AST     │ (gf-ast)                                │
  │   spans +   │                                         │
  │   syntax    │                                         │
  └──────┬──────┘                                         │
         ▼                                                │
  ┌─────────────┐                                         │
  │   Binder    │ (gf-ir)                                 │
  │  ontology   │ Resolves names → TypeId/PropId          │
  │ resolution  │ Variable scope validation               │
  └──────┬──────┘                                         │
         ▼                                                │
  ┌─────────────┐                                         │
  │  Graph IR   │ (gf-ir) ← stable semantic contract      │
  │  GraphPlan  │ DataFusion-independent                  │
  │ + ExprArena │                                         │
  └──────┬──────┘                                         │
         ▼                                                │
  ┌─────────────┐                                         │
  │ Relational  │ (gf-rel)                                │
  │  Lowering   │ GraphOp → DataFusion LogicalPlan        │
  └──────┬──────┘                                         │
         ▼                                                │
  ┌──────────────────────────────────────────────────────┐│
  │             DataFusion Execution Engine              ││
  │  LogicalPlan → Optimizer → PhysicalPlan → Execute    ││
  │  Custom operators: VarLenExpand, OptionalMatch,      ││
  │  PathUnique, ProvenanceSemijoin, OntologyInfer,      ││
  │  GraphMerge, TypedEdgeScan (NEW)                     ││
  └──────────────────────────────────────────────────────┘│
         │◄───────────────────────────────────────────────┘
         ▼
  ┌─────────────┐
  │Arrow Batches│ UUID columns in every output
  └──────┬──────┘
         ▼
  ┌──────────────────────────────────────────────────────┐
  │                   Storage Provider                   │
  │  topology/ · properties/ · documents/                │
  │  embeddings/ · provenance/ · indexes/                │
  └──────────────────────────────────────────────────────┘
1.5. Layering: Graph / Knowledge / Workbench
GraphForge is organised into three layers with strict boundaries (ADR 0006):
- Graph layer: nodes, edges, properties, traversal, pattern matching, graph algorithms, and the adjacency index. Lives in topology/ + properties/ + indexes/adjacency/. Graph-native and surrogate-keyed; stores no knowledge or workbench semantics.
- Knowledge layer: provenance, confidence, evidence, ontology-inference lineage, and the epistemic model (assertions, status, supersession, reasoning, bitemporal valid-time; ADR 0007). Lives in provenance/ + knowledge/. Attaches to graph objects by UUID reference only.
- Workbench layer: the analyst verbs, hybrid search, workflows, exploration, and the project envelope. Consumes the layers below and returns Arrow; holds no graph-semantic state.
Boundary rule: knowledge attaches by UUID reference, never by embedding columns on graph tables. Graph-native query results never depend on knowledge-layer data (a tested invariant). This is what keeps the traversal hot path lean and the model lightweight-embedded. The directory structure below maps directly onto these layers.
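The boundary rule can be pictured as two separate tables that share only a UUID key. The sketch below is an illustration under assumed names: `neighbours`, `annotate`, and the sample identifiers are hypothetical and are not GraphForge APIs. The graph query never receives knowledge-layer data, so its result cannot depend on it.

```python
# Graph layer: edges keyed by node UUID (short strings stand in for UUIDs).
EDGES = [("n1", "n2"), ("n1", "n3"), ("n2", "n3")]

# Knowledge layer: a separate table that references graph objects by UUID only.
KNOWLEDGE = {"n2": {"confidence": 0.4, "status": "disputed"}}


def neighbours(edges, node):
    """Graph-native query (hypothetical): takes no knowledge-layer input."""
    return sorted(dst for src, dst in edges if src == node)


def annotate(nodes, knowledge):
    """Workbench-side step (hypothetical): attach knowledge by UUID lookup."""
    return [(n, knowledge.get(n)) for n in nodes]


if __name__ == "__main__":
    result = neighbours(EDGES, "n1")
    print(result)
    print(annotate(result, KNOWLEDGE))
```

Because the knowledge table is only consulted after the graph result exists, dropping or emptying it changes the annotations but never the traversal output.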
2. Project-Centric Directory Structure
GraphForge is a project-centric knowledge analysis workbench. The graph is one asset inside a larger analytical project — not the whole project.
Capability-Based Structure
Folders are capability modules. An absent folder means that capability is not yet enabled. Projects start simple and grow capabilities over time.
| Template | topology | properties | documents | provenance | embeddings | indexes | workflows | artifacts | sync |
|---|---|---|---|---|---|---|---|---|---|
| Minimal | yes | yes | — | — | — | — | — | — | — |
| Exploratory investigation | yes | yes | yes | yes | — | — | — | — | — |
| Full workbench | yes | yes | yes | yes | yes | yes | yes | yes | yes |
| Published canonical | yes | yes | — | yes | — | — | — | — | — |
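Since an absent folder simply means a capability is not enabled, a loader could derive the capability set by listing the project directory. The following is a minimal sketch under that assumption; `detect_capabilities` and `make_minimal_project` are hypothetical helpers, and only the folder names come from the table above.

```python
from pathlib import Path
import tempfile

# Capability folder names taken from the template table.
CAPABILITIES = (
    "topology", "properties", "documents", "provenance",
    "embeddings", "indexes", "workflows", "artifacts", "sync",
)


def detect_capabilities(project_root: Path) -> dict[str, bool]:
    """Hypothetical helper: a capability is enabled iff its folder exists."""
    return {name: (project_root / name).is_dir() for name in CAPABILITIES}


def make_minimal_project(root: Path) -> None:
    """Create the folders of the 'Minimal' template (topology + properties)."""
    for name in ("topology", "properties"):
        (root / name).mkdir(parents=True, exist_ok=True)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        make_minimal_project(root)
        enabled = [n for n, on in detect_capabilities(root).items() if on]
        print(enabled)
```

Adding a capability later is then just a matter of creating its folder; nothing in the sketch treats a missing folder as an error.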
Full Directory Tree
my-investigation/
├── graphforge.yaml  # project manifest (version, name, ontology path, ontology_mode, ir_version)
├── ontology.yaml  # domain ontology (optional; see Section 3.5)
│
├── topology/  # hot path: identity + type only, no properties
│   ├── nodes.parquet  # node_uuid(BINARY16), node_id(UInt64), type_id(UInt32), created_at, updated_at
│   └── edges/  # typed edge tables, one file per relation type
│       ├── WORKS_AT.parquet  # edge_uuid, src_uuid, dst_uuid, edge_id(UInt64), src_id(UInt64), dst_id(UInt64), created_at, confidence, provenance_uuid
│       ├── OWNS.parquet
│       ├── MENTIONS.parquet
│       └── LOCATED_IN.parquet
│
├── properties/  # warm path: columnar property storage
│   ├── Person.parquet  # node_uuid, name, email, age, ...
│   ├── Organization.parquet  # node_uuid, org_name, country, ...
│   └── Document.parquet  # doc_uuid, title, date, content_hash, ...
│
├── documents/  # raw source documents
│   ├── {doc_uuid}.pdf
│   └── {doc_uuid}.json
│
├── embeddings/  # vector stores, one subdirectory per model
│   ├── text-embedding-3/
│   │   └── {embedding_uuid}.parquet  # entity_uuid, vector FIXED_SIZE_LIST(Float32, N)
│   └── custom-model/
│
├── provenance/  # provenance events + lineage graph
│   ├── events.parquet  # provenance_uuid, kind, source_ref, analyst_id, created_at
│   └── lineage.parquet  # parent_provenance_uuid, child_provenance_uuid
│
├── indexes/  # full-text search indexes (Tantivy)
│   └── Person/
│       └── tantivy/
│
├── workflows/  # saved analysis pipelines (serialised GraphPlan JSON)
│   └── {workflow_uuid}.json
│
├── artifacts/  # outputs: ranked lists, clusters, exported reports
│   ├── {artifact_uuid}_rank.parquet
│   └── {artifact_uuid}_cluster.parquet
│
└── sync/  # changeset-based collaboration (post-v0.5)
    ├── local/  # outbound changeset queue
    ├── inbound/  # received changesets
    ├── checkpoints/  # last-synced state per remote
    └── merge_history/  # reconciliation log
Also present in exploratory mode:
└── topology/
    ├── edges/
    │   └── _exploratory.parquet  # catch-all for unknown relation types
    └── runtime_catalog.parquet  # RuntimeCatalog: observed labels/types/properties
Key Design Decisions
- Typed edge tables (topology/edges/TYPENAME.parquet): each relation type has its own Parquet file. DataFusion can scan WORKS_AT.parquet directly without filtering a unified edges table. Critical for 100M+ edge graphs. In exploratory mode, unknown relation types fall back to _exploratory.parquet.
- Topology/properties split: topology/ holds only identity + type; properties/ holds columns. Enables column pruning: a graph traversal never reads property data unless the query explicitly projects it.
- Typed property files: one Parquet file per entity type. Schema evolution is per-type. Adding a column to Person.parquet does not affect Organization.parquet.
- Progressive ontology: ontology.yaml is optional. See Section 3.5.
- Capability modules: folders are capabilities. Absent folders are not-yet-enabled, not misconfigured. A project can start as topology/ + properties/ and add provenance/, embeddings/, etc. incrementally.
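The file-per-type layout means a scan target can be computed from a type name alone. The sketch below shows one way that lookup might work, including the exploratory fallback; `edge_table_path` and `property_table_path` are hypothetical helpers, and the `known_types` set stands in for whatever the ontology or RuntimeCatalog would supply.

```python
from pathlib import PurePosixPath

# Catch-all file name for unknown relation types, as named in this document.
EXPLORATORY_TABLE = "_exploratory.parquet"


def edge_table_path(rel_type: str, known_types: set[str]) -> PurePosixPath:
    """Hypothetical helper: the Parquet file a scan of `rel_type` would read."""
    edges_dir = PurePosixPath("topology") / "edges"
    if rel_type in known_types:
        return edges_dir / f"{rel_type}.parquet"
    # Unknown relation types fall back to the shared exploratory table.
    return edges_dir / EXPLORATORY_TABLE


def property_table_path(entity_type: str) -> PurePosixPath:
    """Hypothetical helper: one property file per entity type."""
    return PurePosixPath("properties") / f"{entity_type}.parquet"


if __name__ == "__main__":
    known = {"WORKS_AT", "OWNS"}
    print(edge_table_path("WORKS_AT", known))
    print(edge_table_path("FUNDS", known))
    print(property_table_path("Person"))
```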
3. Ontology Model
The ontology is already implemented in gf-ontology (M10). This section documents how it integrates with the new storage architecture.
The ontology.yaml file in the project root is the schema authority. It drives:
- The binder: resolves label strings to TypeId/PropId integers
- The relational lowering: tells the planner which typed edge file to scan for a given relation type
- The storage provider: maps entity type names to properties/TYPENAME.parquet paths
- Validation: load-time validation of the graph against declared constraints
The ontology is not a database schema enforced on write. It is a semantic description that enables optimized planning and validation. Data that violates the ontology can still be stored; the planner and validator will flag violations.
The ontology is optional. See Section 3.5 for progressive ontology modes.
3.5. Progressive Ontology
GraphForge supports three ontology modes to accommodate the full analyst spectrum — from exploratory discovery to production-grade validated graphs. See ADR 0004 for the full rationale and design.
Three modes
| Mode | When | Binder behaviour | Violations |
|---|---|---|---|
| exploratory | No ontology.yaml present | Creates RuntimeTypeId for unknowns via RuntimeCatalog | None; all labels accepted |
| advisory | ontology.yaml present, mode unset or explicit | Warns on unknowns; adds to RuntimeCatalog | Warnings only |
| strict | ontology_mode: strict in graphforge.yaml | BindError on any unknown | Hard errors |
graphforge.yaml manifest
The project manifest includes ontology_mode and an optional ontology path:
# Exploratory project — no ontology required
project_uuid: "0195f3a2-..."
name: "Investigation Alpha"
version: "1"
ontology_mode: exploratory
ontology: null
ir_version: "0.1.0"
graphforge_version: "0.5.0"
capabilities: # absent = false; only list enabled capabilities
topology: true
properties: true
documents: true
provenance: true
# Advisory project — ontology present but non-enforcing
project_uuid: "0195f3a3-..."
name: "Entity Graph v2"
version: "1"
ontology_mode: advisory # default when ontology.yaml present
ontology: "./ontology.yaml"
# Strict project — fully validated
project_uuid: "0195f3a4-..."
name: "Production Knowledge Base"
version: "1"
ontology_mode: strict # must be explicitly declared
ontology: "./ontology.yaml"
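The defaulting rules implied by the mode table and the three manifests can be written down as a small function. This is a sketch, not the loader's actual logic: `resolve_ontology_mode` is a hypothetical name, and rejecting advisory or strict mode when no ontology file exists is an assumption the source does not state.

```python
from typing import Optional

MODES = ("exploratory", "advisory", "strict")


def resolve_ontology_mode(declared: Optional[str], ontology_present: bool) -> str:
    """Hypothetical helper: pick the effective ontology mode for a project."""
    if declared is None:
        # Mode unset: advisory when ontology.yaml exists, exploratory otherwise.
        return "advisory" if ontology_present else "exploratory"
    if declared not in MODES:
        raise ValueError(f"unknown ontology_mode: {declared!r}")
    # Assumption: advisory and strict both need an ontology to check against.
    if declared != "exploratory" and not ontology_present:
        raise ValueError(f"ontology_mode {declared!r} needs an ontology file")
    return declared


if __name__ == "__main__":
    print(resolve_ontology_mode(None, False))
    print(resolve_ontology_mode(None, True))
    print(resolve_ontology_mode("strict", True))
```

Strict mode is never chosen by default in this sketch, matching the manifest comment that it must be explicitly declared.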
RuntimeCatalog
The RuntimeCatalog is an auto-growing registry present in all modes:
RuntimeCatalog
├── entity_types label → RuntimeTypeId (UInt32, session-local)
├── relation_types name → RuntimeTypeId
├── property_names name → RuntimePropId
├── statistics observation counts, first/last-seen timestamps
└── inference_hints suggested ontology entries (for export)
Persisted to topology/runtime_catalog.parquet. Can be exported to ontology.yaml via forge.suggest_ontology().
Storage in exploratory mode
Typed edge tables require predefined relation type names. In exploratory mode:
- Unknown relation types write to topology/edges/_exploratory.parquet (includes rel_type_name: Utf8 column)
- Known types (from RuntimeCatalog once promoted) use typed edge tables
- A maintenance operation can reorganise rows into typed files after the ontology is formalised
Binder open-world resolution (exploratory/advisory modes)
bind(label: &str) → TypeId:
1. Check OntologyHandle (if present): return TypeId if found
2. Check RuntimeCatalog: return existing RuntimeTypeId if previously seen
3. New label: RuntimeCatalog::intern(label) → new RuntimeTypeId
4. Return RuntimeTypeId for use in GraphOp
In strict mode, step 3 is replaced by BindError::UnknownLabel.
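The four steps, plus the strict-mode variant, can be sketched as follows. This is an illustration only: the real binder is a Rust crate, and every name here (`RuntimeCatalog` as a Python class, `UnknownLabel`, the `warnings` list, the starting id) is a stand-in chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Optional


class UnknownLabel(Exception):
    """Stand-in for BindError::UnknownLabel."""


@dataclass
class RuntimeCatalog:
    ids: dict[str, int] = field(default_factory=dict)
    # Assumption: runtime ids start in a separate range from ontology ids.
    next_id: int = 1_000_000

    def intern(self, label: str) -> int:
        """Return the existing id for `label`, or allocate a new one."""
        if label not in self.ids:
            self.ids[label] = self.next_id
            self.next_id += 1
        return self.ids[label]


def bind(label: str, ontology: Optional[dict[str, int]], catalog: RuntimeCatalog,
         mode: str = "exploratory", warnings: Optional[list[str]] = None) -> int:
    # Step 1: the ontology wins when it knows the label.
    if ontology is not None and label in ontology:
        return ontology[label]
    # Step 2: reuse a runtime id if the label was seen before.
    if label in catalog.ids:
        return catalog.ids[label]
    # Step 3: strict mode rejects; the other modes intern a new runtime id.
    if mode == "strict":
        raise UnknownLabel(label)
    if mode == "advisory" and warnings is not None:
        warnings.append(f"label not in ontology: {label}")
    # Step 4: the returned id is what a GraphOp would carry.
    return catalog.intern(label)
```

Binding the same unknown label twice returns the same id, which is what keeps plans within a session comparable.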
IR envelope
GraphPlan.ontology_version is optional:
GraphPlan {
ir_version: IrVersion,
dialect: String,
ontology_version: Option<OntologyVersion>, // None in exploratory mode
ontology_mode: OntologyMode, // exploratory | advisory | strict
ops: Vec<GraphOp>,
exprs: ExprArena,
}
Migration path
exploratory → advisory → strict
Transitions are always the analyst's choice. Moving to a stricter mode never deletes data — it only adds validation.
4. Graph IR Model
The Graph IR (gf-ir) is the stable semantic contract between the compiler and the execution engine. It is deliberately DataFusion-independent — the IR can be serialized, stored, inspected, and replayed without DataFusion being present.
Operators
| Operator | Description |
|---|---|
| NodeScan { var, ty } | Scan all nodes of a type (or all types if ty is None) |
| TypedEdgeScan { var, rel_ty } | NEW: scan a specific typed edge table directly |
| Expand { src, edge, dst, rel_ty, dir, hops } | Traverse edges from a bound node |
| Filter { predicate } | Apply a boolean expression |
| Project { items, distinct } | Select and alias output columns |
| Aggregate { group_by, aggs } | Grouping aggregations |
| Sort { keys } | ORDER BY |
| Limit / Skip | Pagination |
| Optional { child } | OPTIONAL MATCH |
| Union { all, inputs } | UNION / UNION ALL |
| Unwind { list_expr, alias } | List expansion |
| With { items, where_predicate } | Pipeline stage |
| Create / Merge { pattern } | Write operations (M13) |
TypedEdgeScan is the key addition for typed edge table storage — the binder emits this operator when a pattern specifies a relation type, bypassing a full edges scan.
Adjacency note (ADR 0005). The graph-native adjacency index introduces no new IR operator. Variable-length traversal remains encoded on Expand { …, min_hops, max_hops }. Whether a traversal is executed via the adjacency index or a DataFusion hash join is a lowering/planner choice with identical semantics, not part of this stable semantic contract. See Execution Model §Adjacency-Backed Execution.
IR Envelope
GraphPlan {
ir_version: IrVersion, // semver, current = 0.1.0
dialect: String, // "opencypher-9" | "gql-1"
ontology_version: Option<OntologyVersion>, // checksum of ontology used at bind time; None in exploratory mode
ontology_mode: OntologyMode, // exploratory | advisory | strict
feature_flags: Vec<String>,
ops: Vec<GraphOp>,
exprs: ExprArena,
}
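Because the envelope is plain data, a plan can round-trip through a file without any execution engine loaded. The sketch below mirrors the envelope fields in a Python dataclass and serialises it to JSON; the JSON layout, the `op` helper, and the particular operator list are assumptions made for illustration and do not describe the real gf-ir encoding or the binder's actual output.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import Optional


@dataclass
class GraphPlan:
    ir_version: str
    dialect: str
    ontology_version: Optional[str]  # None when no ontology was used
    ontology_mode: str
    feature_flags: list[str] = field(default_factory=list)
    ops: list[dict] = field(default_factory=list)
    exprs: list[dict] = field(default_factory=list)  # stand-in for ExprArena


def op(kind: str, **fields) -> dict:
    """Hypothetical encoding: one dict per operator, tagged by name."""
    return {"op": kind, **fields}


plan = GraphPlan(
    ir_version="0.1.0",
    dialect="opencypher-9",
    ontology_version=None,
    ontology_mode="exploratory",
    ops=[
        op("NodeScan", var="a", ty="Person"),
        op("TypedEdgeScan", var="r", rel_ty="WORKS_AT"),
        op("Project", items=["a", "r"], distinct=False),
    ],
)

# Serialise, then rebuild the plan from the stored text alone.
encoded = json.dumps(asdict(plan), sort_keys=True)
restored = GraphPlan(**json.loads(encoded))
```

A stored workflow under workflows/ could be inspected or replayed from such a file, which is the practical payoff of keeping the IR independent of DataFusion.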
5. Identity Model
Decision: UUIDv7 as Canonical Identity
Every first-class object in GraphForge has a globally unique immutable identifier.
The chosen format is UUIDv7 (RFC 9562):
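- Time-ordered: a 48-bit Unix-millisecond timestamp prefix makes IDs sort by creation time; ordering within the same millisecond is only guaranteed if the generator uses one of the optional monotonicity methods in RFC 9562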
- Globally unique without coordination
- 128-bit: fits in Arrow FixedSizeBinary(16)
- Suitable for Parquet row-group statistics (time-ordering enables range pruning)
- Supports offline generation on mobile devices, air-gapped systems, disconnected analysts
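The properties in the list above follow directly from the bit layout defined in RFC 9562. The sketch below builds a UUIDv7 by hand to make that layout visible; `uuid7` is a local helper written for this example (not a GraphForge function), and it fills the non-timestamp bits with plain randomness rather than a monotonic counter.

```python
import os
import time
import uuid
from typing import Optional


def uuid7(unix_ms: Optional[int] = None) -> uuid.UUID:
    """Build a UUIDv7: 48-bit ms timestamp, version, 12 + 62 random bits."""
    if unix_ms is None:
        unix_ms = time.time_ns() // 1_000_000
    rand_a = int.from_bytes(os.urandom(2), "big") & 0x0FFF           # 12 bits
    rand_b = int.from_bytes(os.urandom(8), "big") & ((1 << 62) - 1)  # 62 bits
    value = (unix_ms & ((1 << 48) - 1)) << 80  # bits 127..80: timestamp
    value |= 0x7 << 76                         # bits 79..76: version 7
    value |= rand_a << 64                      # bits 75..64: random
    value |= 0b10 << 62                        # bits 63..62: RFC variant
    value |= rand_b                            # bits 61..0: random
    return uuid.UUID(int=value)


if __name__ == "__main__":
    earlier = uuid7(1_700_000_000_000)
    later = uuid7(1_700_000_000_001)
    # Byte-wise comparison follows the timestamp prefix.
    print(earlier.bytes < later.bytes)
    print(len(earlier.bytes))
```

No coordination or network access is needed to generate an ID this way, which is what makes offline and air-gapped creation possible; the 16-byte result is the value a FixedSizeBinary(16) column would hold.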
Objects Requiring UUID Identity
Every object in the following categories receives a UUID at creation time and retains it for its entire lifecycle, regardless of where it moves or how the project is restructured:
| Object | UUID column name | Notes |
|---|---|---|
| Node (entity) | node_uuid | Canonical entity identity |
| Edge (relationship) | edge_uuid | Canonical relationship identity |
| Document | doc_uuid | Source document |
| Observation | obs_uuid | A recorded observation |
| Assertion | assertion_uuid | A derived or inferred claim |
| Provenance event | provenance_uuid | Lineage record |
| Analyst / User | analyst_uuid | Contributor identity |
| Project | project_uuid | Project-level identity |
| Workflow | workflow_uuid | Saved analysis pipeline |
| Embedding | embedding_uuid | Vector artifact |
| Source reference | source_uuid | External system / document source |
| Ranking output row | rank_uuid | Reproducible result reference |
| Clustering output row | cluster_uuid | Reproducible result reference |
| Generated artifact | artifact_uuid | Any analytical output |
UUID + Surrogate Key Pattern
UUIDs are the canonical stable identity. Integer surrogate keys are a per-session execution optimization.
topology/nodes.parquet columns:
node_uuid BINARY(16) -- UUIDv7, globally unique, immutable
node_id UInt64 -- local surrogate, assigned at load time
type_id UInt32 -- ontology entity type integer
created_at Timestamp
updated_at Timestamp
topology/edges/WORKS_AT.parquet columns:
edge_uuid BINARY(16) -- UUIDv7
src_uuid BINARY(16) -- references node_uuid
dst_uuid BINARY(16) -- references node_uuid
edge_id UInt64 -- local surrogate
src_id UInt64 -- local surrogate (for DataFusion joins)
dst_id UInt64 -- local surrogate
created_at Timestamp
confidence Float64
provenance_uuid BINARY(16)
The relational lowering layer maps node_uuid → node_id once at scan time, then all DataFusion joins use integer surrogates. Results are projected back to UUIDs before returning to the user.
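The map-once, join-on-integers, project-back sequence can be sketched in a few lines. This is a toy model over Python lists rather than Arrow batches; `assign_surrogates` and `one_hop` are hypothetical names, and the three sample UUIDs are fixed values chosen for readability.

```python
import uuid


def assign_surrogates(node_uuids):
    """Give each distinct UUID a dense integer id, in first-seen order."""
    to_id = {}
    to_uuid = []
    for u in node_uuids:
        if u not in to_id:
            to_id[u] = len(to_uuid)
            to_uuid.append(u)
    return to_id, to_uuid


def one_hop(edges_by_uuid, to_id, to_uuid, start_uuid):
    """Find direct successors of `start_uuid` using integer comparisons."""
    # Scan time: translate UUID endpoints to integer surrogates once.
    int_edges = [(to_id[s], to_id[d]) for s, d in edges_by_uuid]
    # Execution: the filter compares small integers, not 128-bit keys.
    start = to_id[start_uuid]
    hits = [dst for src, dst in int_edges if src == start]
    # Output: project surrogates back to canonical UUIDs.
    return [to_uuid[i] for i in hits]


ALICE, BOB, ACME = (uuid.UUID(int=n) for n in (1, 2, 3))

if __name__ == "__main__":
    to_id, to_uuid = assign_surrogates([ALICE, BOB, ACME])
    edges = [(ALICE, ACME), (BOB, ACME), (ALICE, BOB)]
    print(one_hop(edges, to_id, to_uuid, ALICE))
```

Callers only ever see UUIDs, so the surrogate assignment can change between loads without affecting stored results.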
Architectural Principle
Entity identity is globally unique.
Storage location is temporary.
Ownership is temporary.
Projects are temporary.
Identity is permanent.
6. Provenance Model (Knowledge Layer)
Layer + status (ADR 0006). Provenance is a knowledge-layer concern, not a graph-layer one. Graph objects carry a provenance_uuid reference; the events and lineage themselves live in the knowledge layer and are produced/interpreted there, never as graph semantics. Status: the schema below is Designed; provenance events are not yet written (gf-provenance is a stub, provenance_uuid is currently NULL on write). Making this real (writing events, propagating confidence) is owned by the Knowledge Layer Foundation milestone. The epistemic extension (assertions, status, supersession, evidence, reasoning, bitemporal valid-time) builds on this schema and is specified in ADR 0007.
Every fact (node, edge, assertion, ranking output) carries a provenance_uuid that references an entry in provenance/events.parquet.
provenance/events.parquet:
provenance_uuid BINARY(16) -- UUID of this provenance event
kind Utf8 -- "ingestion" | "inference" | "assertion" | "merge"
source_ref BINARY(16) -- UUID of source document or system
analyst_uuid BINARY(16) -- UUID of the analyst who created this
rule_id Utf8 -- ontology inference rule ID (nullable)
confidence Float64 -- confidence score for this event
query_id Utf8 -- UUID of the query that produced this fact
created_at Timestamp
provenance/lineage.parquet:
parent_uuid BINARY(16) -- provenance event that contributed to this one
child_uuid BINARY(16) -- downstream provenance event
role Utf8 -- "input" | "derived" | "merged"
weight Float64 -- contribution weight
The conservative_min confidence policy: confidence of a derived fact = min(confidence of inputs).
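Applied along a lineage chain, the policy means a derived fact can never be more confident than its weakest input. The sketch below resolves confidence through a small lineage graph; `derived_confidence` is a hypothetical helper, the event names are invented, and the lineage is assumed to be acyclic.

```python
def conservative_min(input_confidences):
    """Confidence of a derived fact = minimum confidence of its inputs."""
    values = list(input_confidences)
    if not values:
        raise ValueError("a derived fact needs at least one input")
    return min(values)


def derived_confidence(event, source_confidence, parents):
    """Resolve confidence through a lineage DAG (assumed acyclic).

    source_confidence: event -> confidence for directly observed events
    parents:           derived event -> list of contributing events
    """
    if event in source_confidence:
        return source_confidence[event]
    return conservative_min(
        derived_confidence(p, source_confidence, parents) for p in parents[event]
    )


if __name__ == "__main__":
    sources = {"ingest_a": 0.9, "ingest_b": 0.6}
    lineage = {
        "infer_1": ["ingest_a", "ingest_b"],
        "infer_2": ["infer_1", "ingest_a"],
    }
    print(derived_confidence("infer_2", sources, lineage))
```

Here infer_2 inherits 0.6 even though one of its direct inputs scores 0.9, because the weaker ingest_b sits upstream of infer_1.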
7. Storage Architecture
Three Tiers
Tier 1: Topology (hot path — small files, frequently scanned)
The topology layer holds only identity and type. Graph traversal reads only this tier.
topology/nodes.parquet -- all nodes, UUID + type + timestamps
topology/edges/TYPENAME.parquet -- one file per relation type
Tier 2: Properties (warm path — larger files, column-pruned)
Property access requires a join from topology UUID to property file.
properties/ENTITY_TYPE.parquet -- one file per entity type, UUID + all properties
Tier 3: Analytical Artifacts (cold path — written once, read rarely)
documents/ raw source files
embeddings/ vector stores
provenance/ lineage records
indexes/ FTS indexes
workflows/ saved pipelines
artifacts/ ranked/clustered outputs
Performance Implications by Scale
| Scale | Critical optimizations |
|---|---|
| 1M–10M edges | Current design sufficient. Typed edge tables provide modest gains. |
| 10M–100M edges | Typed edge tables become important — avoid scanning all edges to find one type. Topology/properties split reduces traversal memory. |
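| 100M–1B edges | Typed edge tables are mandatory. Partition by a hash of the UUID (or by its trailing random bytes) for parallel scans; the first byte of a UUIDv7 is the top of the timestamp and is effectively constant, so a first-byte prefix would not spread rows. Surrogate IDs critical (avoid 128-bit joins in hot path). Consider time-range partitioning for temporal queries. |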
| 1B+ edges | Out of scope for v0.5. Would require distributed execution (DataFusion Ballista or similar). |
8. Revised Milestone Plan
The v0.5 milestones were renumbered when the adjacency index (ADR 0005) was adopted as a
first-class layer. The Architecture Baseline was also moved to its true chronological slot
(M11, between Ontology Runtime and Graph IR). GitHub milestone IDs are immutable; the M##:
prefix in each title is the canonical sequence below.
| Milestone | State | Notes |
|---|---|---|
| M11: Architecture Baseline | closed | Decision-record sprint (this document, UUIDv7, typed edges, topology/properties split, progressive ontology). Formerly numbered M18. |
| M12: Graph IR | closed | GraphOp set incl. TypedEdgeScan; Expand { min_hops, max_hops } encodes variable-length. No VarLenExpand IR operator. |
| M13: Relational Lowering | closed | TypedEdgeScan → typed-edge TableProvider; UUID→surrogate mapping; VarLenExpandNode logical stub (#577). |
| M14: Execution Baseline | open | Custom physical nodes incl. VarLenExpand (#580). Open issues are an edge-property-persistence cluster, orthogonal to adjacency. |
| M15: Adjacency Index Baseline | open | NEW. Consolidates the two ephemeral adjacency builders (VarLenExpand build_adjacency, analyst-verb export_adjacency) into one derived AdjacencyProvider + CSR index under indexes/adjacency/. No IR change. Gates M18. See ADR 0005. |
| M16: Bindings Baseline | open | No adjacency dependency. |
| M17: Conformance Hardening | open | No adjacency dependency (fuzzing may exercise the adjacency path once it lands). |
| M18: Rank and Cluster | open | Depends on M15. export_adjacency (#610) becomes a provider adapter, not a second reader. |
| M19: Find | open | No adjacency dependency. Tantivy FTS + vector store, also under indexes/. |
| M20: Swift and Kotlin Bindings | open | v0.5.1. No adjacency dependency. |
Why a new milestone rather than folding adjacency into M18
Adjacency is already the root dependency of all five M18 verbs (#610) and already exists,
ephemerally, inside shipped M14 VarLenExpand (build_adjacency). Building the real layer once,
in its own milestone gated before M18, prevents a second throwaway adjacency reader and keeps
the IR (M12) and closed lowering issues (M13) untouched. See ADR 0005 for the full tradeoff
analysis.
Revised milestone dependency graph
M14 Execution Baseline
├──> M16 Bindings (parallel, no adj dep)
├──> M17 Conformance (parallel, no adj dep)
├──> M19 Find (parallel, no adj dep)
└──> M15 Adjacency Index ──> M18 Rank and Cluster ──> v0.5.0 close-out ──> M20 Swift/Kotlin (v0.5.1)
9. Migration Plan
There is no production data to migrate. The gf-storage and gf-exec crates are stubs returning NotImplemented. The gf-ontology implementation (M10) is compatible with the new architecture — the ontology drives the binder regardless of storage layout. The gf-ir primitives (M11 #565) are compatible.
Migration path:
1. Write this document (Architecture Baseline milestone)
2. Update M11 issues #566–#568 to reference typed edge tables and UUID identity
3. Implement M11 as planned
4. When implementing M12/M13 storage, adopt the new layout from the start
5. There is no "migration" from old to new storage, because the storage implementation has not shipped yet
10. Performance Analysis
Current Roadmap (flat node_facts/edge_facts)
MATCH (a:Person)-[:WORKS_AT]->(b:Organization)
→ Scan all edges → Filter type = WORKS_AT → Join with nodes
At 100M edges: scanning all edges to find one type wastes I/O. No row-group pruning possible on type column when all types are mixed.
Revised Roadmap (typed edge tables)
MATCH (a:Person)-[:WORKS_AT]->(b:Organization)
→ TypedEdgeScan(WORKS_AT) → read WORKS_AT.parquet only → Join with topology
At 100M edges: if 1% of edges are WORKS_AT, we read 1M rows instead of 100M. 100x I/O reduction.
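The arithmetic behind the 100x figure is simple enough to check directly. The helper below is a back-of-envelope model written for this example (`rows_scanned` is a hypothetical name): it counts rows only and ignores row-group pruning, compression, and file-open overhead.

```python
def rows_scanned(total_edges: int, type_percent: int, typed_tables: bool) -> int:
    """Rows read to answer a single-relation-type pattern."""
    matching = total_edges * type_percent // 100
    # A flat edge table must be scanned in full; a typed table holds
    # only the rows of the requested relation type.
    return matching if typed_tables else total_edges


if __name__ == "__main__":
    flat = rows_scanned(100_000_000, 1, typed_tables=False)
    typed = rows_scanned(100_000_000, 1, typed_tables=True)
    print(flat, typed, flat // typed)
```

The reduction factor is just the inverse of the type's share of all edges, so rare relation types benefit most and a dominant type benefits least.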
UUID Impact
UUID overhead: 16 bytes vs 4 bytes per ID = 4x per ID column. Mitigated by:
- Surrogate integers for join columns (UUIDs appear only at the API surface and for identity)
- Arrow dictionary encoding for repeated values
- Parquet column compression (the shared timestamp prefix of time-ordered UUIDv7 values compresses well with ZSTD; the random tail does not)
Net additional storage: ~15–20% for UUID columns vs integer-only design. Acceptable given the merge/federation benefits.
11. Risks and Tradeoffs
| Decision | Benefit | Risk | Mitigation |
|---|---|---|---|
| UUIDv7 identity | Merge-safe, offline-safe, globally unique | 16 bytes vs 4 bytes per ID; joins on 128-bit keys are slower than integer joins | Surrogate integer IDs for execution joins; UUID only at API surface |
| Typed edge tables | Direct scans, no type filter, partition stats | More files, schema evolution per-type | Ontology drives file layout; _exploratory.parquet catch-all in exploratory mode |
| Topology/properties split | Column pruning, less traversal memory | Extra join for property access | DataFusion handles this as a hash join; modern CPUs handle it cheaply |
| Project-centric structure | Multi-asset, future-proof, federation-ready | More complex than single graph file | graphforge.yaml manifest makes structure discoverable |
| Conservative min confidence | Simple, predictable, conservative | May undervalue high-quality inferences | Pluggable confidence policies; conservative_min is the safe default |
| Progressive ontology (3 modes) | Exploratory analysts can start immediately; structured projects get strict validation | Binder/storage provider complexity; advisory warning surface needed | Sensible defaults; RuntimeCatalog handles the exploratory→strict upgrade path |
12. Collaboration Architecture (post-v0.5)
Four concepts support multi-analyst collaboration. They are distinct and should remain so.
Changesets: unit of change
A changeset is the atomic unit of knowledge contribution. It carries a UUIDv7, analyst UUID, timestamp, and contains one or more mutations (node/edge additions, property changes, assertions) plus their provenance.
Sync: movement of changesets
Sync moves changesets between projects. It is transport-layer: local filesystem copy, HTTP push/pull, or air-gapped USB transfer.
sync/
├── local/ # outbound changeset queue
├── inbound/ # received changesets pending merge
├── checkpoints/ # last-synced state per remote
└── merge_history/ # reconciliation log
Merge: semantic reconciliation
Merge reconciles overlapping knowledge. It operates on UUID identity and entity resolution, not file diffs. Provenance is preserved through merge; confidence scoring resolves conflicts. Merge is deterministic given the same input changesets regardless of arrival order.
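One way to get arrival-order independence is to resolve each conflict with a rule that is itself order-free. The sketch below is only an illustration of that property: the changeset shape, the `merge` function, and the tie-break on changeset UUID are assumptions, and entity resolution is left out entirely.

```python
from itertools import permutations


def merge(changesets):
    """Order-independent merge sketch (hypothetical).

    For every object UUID, keep the candidate with the highest
    (confidence, changeset_uuid) rank. Taking a maximum does not depend
    on visiting order, so arrival order cannot change the result.
    """
    best = {}
    for cs in changesets:
        for obj_uuid, (value, confidence) in cs["mutations"].items():
            rank = (confidence, cs["changeset_uuid"])
            if obj_uuid not in best or rank > best[obj_uuid][0]:
                best[obj_uuid] = (rank, value)
    return {obj_uuid: value for obj_uuid, (_rank, value) in best.items()}


# Short strings stand in for UUIDv7 values.
CHANGESETS = [
    {"changeset_uuid": "cs-01", "mutations": {"edge-1": ("employer=Acme", 0.6)}},
    {"changeset_uuid": "cs-02", "mutations": {"edge-1": ("employer=Globex", 0.9),
                                               "node-7": ("name=Ada", 0.8)}},
    {"changeset_uuid": "cs-03", "mutations": {"node-7": ("name=Ada L.", 0.8)}},
]

if __name__ == "__main__":
    results = [merge(list(order)) for order in permutations(CHANGESETS)]
    print(all(r == results[0] for r in results))
    print(results[0])
```

Higher confidence wins the edge-1 conflict; the node-7 tie is settled by the changeset identifier, which only works as a tie-break because changeset UUIDs are unique.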
Federation: trust and policy boundary
Federation governs who can sync, what can sync, redaction rules, trust policies, audit requirements, and federated learning policies. Federation is not sync — it is the policy layer above sync.
Readiness assessment (v0.5)
| Concept | Ready now | Post-v0.5 |
|---|---|---|
| Changeset | UUIDv7 identity, provenance model | Formal changeset envelope, signing |
| Sync | Project directory portable (zip and share) | Protocol, remotes, incremental transfer |
| Merge | UUID union of topology files | Entity resolution, confidence merge, conflict UI |
| Federation | — | Auth, trust policy, redaction rules, audit |
The foundational decisions (UUIDv7 identity, project-centric structure, typed edge tables, provenance model) make all four concepts architecturally possible. No structural changes will be needed — only new crates/services layered on top.
References
- Architecture Overview
- AST & Planning
- Storage Architecture
- Execution Model
- Algorithms
- RFC 9562 — UUIDv7 specification
- GraphAr project — concepts evaluated but not adopted as dependency