ADR 0004: Progressive Ontology — Exploration First

Status: Accepted
Date: 2026-05-31
Supersedes: Implicit ontology-required assumption in ADR 0002

Context

GraphForge is a knowledge analysis workbench for discovery-oriented analysts. Its primary workloads — OSINT, intelligence analysis, genealogy, entity resolution, investigative journalism, fraud investigation, due diligence, and knowledge construction — all share a common characteristic:

The analyst begins with uncertainty and discovers structure over time.

The analyst does not know what entity types exist before the analysis begins. They encounter raw data, extract entities, observe relationships, and progressively understand the domain. Imposing an ontology before analysis begins is not just inconvenient — it inverts the natural workflow and blocks the primary use case.

At the same time, GraphForge must support curated knowledge bases and validated production graphs where strict schema enforcement is required for correctness, compliance, and reproducibility.

The existing requirements.md already mandates:

"Data models MUST be optional. MUST NOT be required to create or query graphs. MUST NOT prevent insertion of incomplete or uncertain data by default."

However, the v0.5 architecture as documented implies ontology-first operation throughout:

- The binder produces BindError for unknown labels/types/properties
- Typed edge tables require ontology-driven file layout
- The GraphPlan IR envelope carries ontology_version as a required field
- The project layout documentation assumes ontology.yaml is always present

This ADR resolves the contradiction by formalising three progressive ontology modes.

Decision

Ontology is progressive, not required.

GraphForge supports three ontology modes that determine how the system responds to labels, relation types, and properties that are not (or not yet) defined in a formal ontology.

Mode 1: `exploratory`

No ontology required. All labels and types accepted. No rejections.

Analysts may create arbitrary labels, relation types, and properties
The system accepts heterogeneous and messy data without complaint
A RuntimeCatalog auto-assigns integer IDs for all observed labels/types/properties
The RuntimeCatalog is persisted to topology/runtime_catalog.parquet
Typed edge tables are not used in exploratory mode — all edges go to topology/edges/_exploratory.parquet
The IR envelope carries ontology_version: None
This is the default mode when no ontology.yaml is present

Mode 2: `advisory`

Ontology present, violations are warnings — not errors.

Analysts have defined (or partially defined) an ontology
Known labels/types/properties are resolved via the OntologyHandle as normal
Unknown labels/types/properties produce a warning and are added to the RuntimeCatalog
The RuntimeCatalog tracks schema drift: new types, renamed types, property mismatches
Typed edge tables are used for known relation types; unknown types fall back to _exploratory.parquet
The system may suggest ontology improvements but never blocks data ingestion or queries
This is the default mode when ontology.yaml is present and ontology_mode is not explicitly set

Mode 3: `strict`

Ontology required. Violations produce errors. No unknown types accepted.

Unknown labels produce BindError::UnknownLabel
Unknown relation types produce BindError::UnknownRelationType
Unknown properties produce BindError::UnknownProperty
All edge types must be declared in the ontology; typed edge tables cover all edges
The IR envelope carries a required ontology_version
Appropriate for production deployments, validated knowledge graphs, compliance workflows
Must be explicitly enabled via ontology_mode: strict in graphforge.yaml
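The three modes differ mainly in how the binder treats a name that the ontology does not define. The following is a minimal Rust sketch of that policy, not the actual binder. Only the `BindError` variant names come from this ADR; `UnknownKind`, `Disposition`, and `on_unknown` are hypothetical names introduced for illustration.

```rust
/// Ontology modes defined by this ADR.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum OntologyMode {
    Exploratory,
    Advisory,
    Strict,
}

/// Which kind of name failed to resolve (hypothetical helper type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum UnknownKind {
    Label,
    RelationType,
    Property,
}

/// Strict-mode errors; the variant names are taken from the ADR.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum BindError {
    UnknownLabel(String),
    UnknownRelationType(String),
    UnknownProperty(String),
}

/// Non-fatal outcome for an unknown name (hypothetical).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Disposition {
    /// Exploratory: accept silently; the caller interns the name.
    Accept,
    /// Advisory: accept and intern, but surface a warning.
    AcceptWithWarning(String),
}

/// Decide what happens to a name that is not in the ontology.
pub fn on_unknown(
    mode: OntologyMode,
    kind: UnknownKind,
    name: &str,
) -> Result<Disposition, BindError> {
    match mode {
        OntologyMode::Exploratory => Ok(Disposition::Accept),
        OntologyMode::Advisory => Ok(Disposition::AcceptWithWarning(format!(
            "{:?} '{}' is not defined in the ontology",
            kind, name
        ))),
        OntologyMode::Strict => Err(match kind {
            UnknownKind::Label => BindError::UnknownLabel(name.to_string()),
            UnknownKind::RelationType => BindError::UnknownRelationType(name.to_string()),
            UnknownKind::Property => BindError::UnknownProperty(name.to_string()),
        }),
    }
}
```

Keeping the policy in one function means the rest of the binder can stay mode-agnostic: it resolves known names as usual and consults this function only on a miss.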

Default mode logic

| Condition | Effective mode |
| --- | --- |
| No `ontology.yaml`, no `ontology_mode` in manifest | `exploratory` (default) |
| `ontology.yaml` present, no `ontology_mode` in manifest | `advisory` (default) |
| `ontology_mode: strict` in manifest | `strict` |
| `ontology_mode: exploratory` in manifest | `exploratory` (override) |
| `ontology_mode: advisory` in manifest | `advisory` (override) |
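The table reduces to a single rule: an explicit `ontology_mode` always wins, otherwise the presence of `ontology.yaml` picks the default. A minimal Rust sketch of that rule follows; the function name `effective_mode` and its parameters are assumptions made for illustration.

```rust
/// Ontology modes defined by this ADR.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum OntologyMode {
    Exploratory,
    Advisory,
    Strict,
}

/// Resolve the mode a project runs in.
///
/// `manifest_mode` is the value of `ontology_mode` in graphforge.yaml, if set.
/// `ontology_present` is true when the project has an ontology.yaml.
pub fn effective_mode(manifest_mode: Option<OntologyMode>, ontology_present: bool) -> OntologyMode {
    match (manifest_mode, ontology_present) {
        // An explicit setting overrides both defaults.
        (Some(mode), _) => mode,
        // Ontology present, nothing declared: warn but never block.
        (None, true) => OntologyMode::Advisory,
        // No ontology at all: accept everything.
        (None, false) => OntologyMode::Exploratory,
    }
}
```

Note that `strict` can never be reached by default; it appears only through the first arm, which matches the requirement that it be explicitly enabled.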

RuntimeCatalog

The RuntimeCatalog is an auto-growing registry of observed entity types, relation types, and property names that the system has encountered but that may not be formally defined in an ontology.

RuntimeCatalog
├── entity_types      label → RuntimeTypeId (UInt32, auto-assigned)
├── relation_types    name → RuntimeTypeId
├── property_names    name → RuntimePropId
├── statistics        observation counts, first-seen, last-seen timestamps
└── inference_hints   suggested ontology entries based on observed patterns

Key properties:

- Always present, even in strict mode, where it records what was observed
- Persisted to topology/runtime_catalog.parquet
- Exportable to ontology.yaml via forge.export_ontology() (future feature)
- Survives project merges (UUIDs for RuntimeCatalog entries follow the same UUID identity model)
- RuntimeTypeIds are local integers, not globally stable like UUID identity; they are suitable for query planning within a session, not for cross-project references

RuntimeCatalog in the binder (exploratory/advisory modes):

bind(label: &str) → TypeId:
1. Check OntologyHandle (if present): return TypeId if found
2. Check RuntimeCatalog: reuse the existing RuntimeTypeId if the label was seen before
3. Otherwise: RuntimeCatalog::intern(label) → new RuntimeTypeId
4. Record the observation (new or repeated): observation count += 1
5. Return RuntimeTypeId for use in GraphOp
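The binder steps above can be sketched in Rust as follows. This is an illustration under stated assumptions, not the GraphForge implementation: the two-variant `TypeId`, the map-backed `OntologyHandle`, and the field names are all hypothetical, and the sketch assumes that every observation of a runtime label is counted, not only the first.

```rust
use std::collections::HashMap;

/// Identifier returned by the binder. Splitting it into two variants is an
/// assumption; the ADR only names `TypeId` and `RuntimeTypeId`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TypeId {
    Ontology(u32),
    Runtime(u32),
}

/// Stand-in for the real OntologyHandle: a fixed label -> id map.
pub struct OntologyHandle {
    pub labels: HashMap<String, u32>,
}

/// Simplified catalog covering entity types only.
#[derive(Default)]
pub struct RuntimeCatalog {
    entity_types: HashMap<String, u32>,
    observations: HashMap<u32, u64>,
    next_id: u32,
}

impl RuntimeCatalog {
    /// Return the existing id for `label`, or assign the next free one.
    pub fn intern(&mut self, label: &str) -> u32 {
        if let Some(&id) = self.entity_types.get(label) {
            return id;
        }
        let id = self.next_id;
        self.next_id += 1;
        self.entity_types.insert(label.to_string(), id);
        id
    }

    /// How many times the binder has seen this runtime id.
    pub fn observation_count(&self, id: u32) -> u64 {
        self.observations.get(&id).copied().unwrap_or(0)
    }
}

/// Bind a label in exploratory or advisory mode.
pub fn bind(
    label: &str,
    ontology: Option<&OntologyHandle>,
    catalog: &mut RuntimeCatalog,
) -> TypeId {
    // Step 1: the formal ontology wins when it defines the label.
    if let Some(&id) = ontology.and_then(|o| o.labels.get(label)) {
        return TypeId::Ontology(id);
    }
    // Steps 2-3: reuse the runtime id or intern a new one.
    let id = catalog.intern(label);
    // Step 4: record the observation.
    *catalog.observations.entry(id).or_insert(0) += 1;
    // Step 5: hand the runtime id back for use in the plan.
    TypeId::Runtime(id)
}
```

Because `intern` is idempotent, binding the same unknown label twice yields the same runtime id while the observation count keeps growing, which is the signal a later ontology suggestion could draw on.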

Storage consequences

Exploratory mode storage

Without a predefined ontology, the typed edge table layout cannot be pre-determined. Exploratory mode uses a unified fallback:

topology/edges/_exploratory.parquet   — all edges, includes rel_type_name: Utf8 column

As the RuntimeCatalog grows, the system can optionally reorganise into typed edge files via a maintenance operation (not blocking the write path).

Advisory mode storage

Known relation types use typed edge tables as normal. Unknown relation types append to _exploratory.parquet. When the ontology is later extended to cover a previously-unknown type, a migration operation can promote rows from _exploratory.parquet into the typed file.

Strict mode storage

All edge types are pre-declared; typed edge tables cover all edges; _exploratory.parquet is empty.
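The three storage rules amount to a routing decision on the write path. The Rust sketch below shows one way to express it. The typed file naming (`topology/edges/<rel_type>.parquet`) and the function name `edge_file` are assumptions, since this ADR does not specify them. In a real pipeline the strict-mode rejection would presumably surface earlier as `BindError::UnknownRelationType`; a plain string error stands in for it here.

```rust
use std::collections::HashSet;

/// Ontology modes defined by this ADR.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum OntologyMode {
    Exploratory,
    Advisory,
    Strict,
}

/// Catch-all edge file named in the ADR.
pub const EXPLORATORY_FILE: &str = "topology/edges/_exploratory.parquet";

/// Pick the file an edge of `rel_type` is appended to.
/// `declared` holds the relation types defined in the ontology.
pub fn edge_file(
    mode: OntologyMode,
    rel_type: &str,
    declared: &HashSet<String>,
) -> Result<String, String> {
    // Hypothetical naming scheme for typed edge tables.
    let typed = || format!("topology/edges/{}.parquet", rel_type);
    match mode {
        // Exploratory: typed tables are not used at all.
        OntologyMode::Exploratory => Ok(EXPLORATORY_FILE.to_string()),
        // Declared types go to their typed table in advisory and strict mode.
        OntologyMode::Advisory | OntologyMode::Strict if declared.contains(rel_type) => Ok(typed()),
        // Advisory: undeclared types fall back to the catch-all file.
        OntologyMode::Advisory => Ok(EXPLORATORY_FILE.to_string()),
        // Strict: undeclared types are rejected, so the catch-all stays empty.
        OntologyMode::Strict => Err(format!("unknown relation type: {}", rel_type)),
    }
}
```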

`graphforge.yaml` manifest changes

The project manifest gains an ontology_mode field:

# Exploratory project (no ontology)
project_uuid: "0195f3a2-..."
name: "OSINT Investigation Alpha"
version: "1"
ontology_mode: exploratory   # default when no ontology.yaml present
ontology: null

# Advisory project (ontology present, non-enforcing)
project_uuid: "0195f3a2-..."
name: "Entity Graph v2"
version: "1"
ontology_mode: advisory      # default when ontology.yaml present
ontology: "./ontology.yaml"

# Strict project (fully validated)
project_uuid: "0195f3a2-..."
name: "Production Knowledge Base"
version: "1"
ontology_mode: strict        # must be explicitly declared
ontology: "./ontology.yaml"

IR envelope changes

GraphPlan.ontology_version changes from required to optional:

GraphPlan {
    ir_version: IrVersion,
    dialect: String,
    ontology_version: Option<OntologyVersion>,  // None in exploratory mode
    ontology_mode: OntologyMode,                // new field
    feature_flags: Vec<String>,
    ops: Vec<GraphOp>,
    exprs: ExprArena,
}
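Making `ontology_version` optional moves a compile-time guarantee to a runtime invariant that depends on the mode. A minimal Rust sketch of such a check follows. `check_envelope`, `EnvelopeError`, and the string-backed `OntologyVersion` are hypothetical. The ADR states only that exploratory plans carry `None` and strict plans require a version, so advisory mode is assumed here to accept either.

```rust
/// Ontology modes defined by this ADR.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum OntologyMode {
    Exploratory,
    Advisory,
    Strict,
}

/// Placeholder for the real version type (assumed to wrap a string).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct OntologyVersion(pub String);

/// Ways the mode and version fields can disagree (hypothetical).
#[derive(Debug, PartialEq, Eq)]
pub enum EnvelopeError {
    /// Strict mode requires an ontology version.
    MissingOntologyVersion,
    /// Exploratory mode carries no ontology version.
    UnexpectedOntologyVersion,
}

/// Validate the mode/version pair of a plan envelope.
pub fn check_envelope(
    mode: OntologyMode,
    version: Option<&OntologyVersion>,
) -> Result<(), EnvelopeError> {
    match (mode, version) {
        (OntologyMode::Strict, None) => Err(EnvelopeError::MissingOntologyVersion),
        (OntologyMode::Exploratory, Some(_)) => Err(EnvelopeError::UnexpectedOntologyVersion),
        // Advisory is assumed to accept either; the ADR does not say.
        _ => Ok(()),
    }
}
```

Running a check like this when a plan is constructed or deserialised would keep consumers of the IR from having to re-derive the rule.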

Exploratory Analyst Persona

The Exploratory Analyst is a first-class design target for GraphForge.

Characteristics:

- Does not know the ontology ahead of time
- Works with messy, incomplete, or heterogeneous data
- Progressively discovers entity types and relationships during analysis
- Frequently renames concepts during investigation
- Creates temporary entity types and relationship types
- Builds graphs incrementally from multiple data sources
- Begins with maximum uncertainty; refines toward structure

Representative users:

- Intelligence analyst
- OSINT investigator
- Journalist
- Genealogist
- Academic researcher
- Due diligence analyst
- Fraud investigator
- Cybersecurity analyst
- Entity resolution engineer

See docs/guide/exploratory-analyst.md for the full persona and journey documentation.

Consequences

Positive:

- GraphForge remains a workbench for discovery-oriented analysis
- Onboarding friction is eliminated: no ontology is required to start
- Exploratory and structured workflows coexist without compromising either
- RuntimeCatalog enables gradual schema discovery and promotes iterative refinement

Negative / risks:

- Binder implementation is more complex: it must handle three modes
- Storage provider must handle the _exploratory.parquet fallback
- RuntimeCatalog adds a persistent artifact to manage
- Advisory mode warnings need a UX surface (CLI, API, Python binding)

Mitigations:

- Mode defaults are sensible: analysts doing exploration get exploration; structured projects get structure
- Callers who only use strict mode never need to consult the RuntimeCatalog; it still records what was observed but does not affect binding
- The fallback _exploratory.parquet is a simple single-file catch-all with a rel_type_name: Utf8 column

References

docs/development/requirements.md — "Data models MUST be optional"
docs/architecture/refactor-v0.5.md — storage architecture, IR design
docs/architecture/storage.md — typed edge tables, topology/properties layout
docs/guide/exploratory-analyst.md — persona and journey documentation
676 — project-centric directory structure
677 — architecture milestone issue sweep