Knowledge Graph Construction — Research Findings¶

Issue: #449
Branch: main (v0.3.9 / v0.3.10-dev)
Date: 2026-05-04

Executive Summary¶

GraphForge's core MERGE-based KG construction pattern is solid: idempotent entity and relationship loading, provenance tracking, two-pass confidence updates, and the full export suite (to_dicts, to_networkx, to_igraph, to_dataframe) all work correctly in a 20-entity / 26-relationship pipeline test at comfortable speed (~28 ms initial load, 296 K nodes/sec bulk ingest). Two gaps blocked a complete use-case experience at the time of testing; both are now resolved in v0.3.10 (see FP-2 and FP-6). First, shortestPath raised a parse-time SyntaxError instead of a clear not-implemented error, even though the use-case doc shows it as working (issue #468). Second, four public API methods that users would reach for first (gf.labels(), gf.relationship_types(), gf.node_count(), gf.relationship_count()) did not exist, despite the underlying data being readily available via gf.graph.get_statistics() and private indexes.

Code Snippet Pass/Fail Matrix¶

All snippets were taken from docs/use-cases/knowledge-graph-construction.md and run in isolation against current main.

#	Snippet	Status	Notes
S1	Entity MERGE with `.format(label=...)`	PASS	—
S2	Relationship MERGE after MATCH	PASS	—
S3	`datetime()` in `ON CREATE SET`	PASS	—
S4	`CASE` + `coalesce` confidence update	PASS	—
S5	Entity dedup (edge migration + DETACH DELETE)	PASS¹	Two-query pattern in doc is correct
S6	Find all entities of a type	PASS	—
S7	Traverse + `type(r)`	PASS	—
S8	`shortestPath(...)`	RESOLVED	Now raises `NotImplementedError` with BFS workaround hint (PR #497)
S9	Aggregate by rel type	PASS	—
S10	2-hop var-length subgraph	PASS	—
S11	Complete worked example	PASS	—

¹ The doc's two-execute() transaction pattern is correct. A single-query variant that folds edge migration and DETACH DELETE into one query using WITH alias, canon, r, target WHERE r IS NOT NULL silently skips the delete when the alias has no edges. This is standard openCypher WITH/WHERE semantics (the filter leaves no rows for DETACH DELETE to run on), not a GraphForge bug. The doc should flag this as a footgun.

Extended Pipeline Test¶

Mock dataset: 8 entity types (Person, Organization, Technology, Concept, Document, Location, Event, License), 20 nodes, 26 relationships.

Results¶

Stage	Checks	Status	Timing
1 — Initial load (MERGE + ON CREATE SET)	2/2	PASS	28 ms
2 — Second pass (ON MATCH SET confirmations++)	2/2	PASS	16 ms
3 — Confidence update (CASE + coalesce)	2/2	PASS	< 1 ms
4 — Entity dedup (alias merge + DETACH DELETE)	2/2	PASS	< 1 ms
5 — Querying (traverse, aggregate, 2-hop, shared-dep, low-conf filter)	8/8	PASS	< 1 ms each
6 — Export (to_dicts, to_networkx, to_igraph, to_dataframe)	13/13	PASS	< 5 ms
7 — Bulk ingest (10 K nodes)	2/2	PASS	34 ms (296 K nodes/sec)

Total: 31/31 checks — all pass.

Runnable pipeline¶

from graphforge import GraphForge

ENTITIES = [
    {"label": "Person",       "name": "Guido van Rossum", "description": "Creator of Python"},
    {"label": "Organization", "name": "Google",           "description": "Tech giant"},
    {"label": "Technology",   "name": "Python",           "description": "High-level language"},
    # … 17 more
]
RELATIONSHIPS = [
    {"from": "Guido van Rossum", "to": "Python",  "type": "CREATED",    "conf": 1.0},
    {"from": "Google",           "to": "Python",  "type": "DEPENDS_ON", "conf": 0.9},
    # … 24 more
]

gf = GraphForge()
SOURCE, MODEL = "doc:research:pipeline-test", "gpt-4o-mock"

gf.begin()
for ent in ENTITIES:
    gf.execute(
        """
        MERGE (e:{label} {{name: $name}})
        ON CREATE SET e.description = $desc,
                      e.source      = $source,
                      e.extractedBy = $model,
                      e.extractedAt = datetime()
        """.format(label=ent["label"]),
        {"name": ent["name"], "desc": ent["description"], "source": SOURCE, "model": MODEL},
    )
for rel in RELATIONSHIPS:
    gf.execute(
        """
        MATCH (a {{name: $from_name}})
        MATCH (b {{name: $to_name}})
        MERGE (a)-[r:{rel_type}]->(b)
        ON CREATE SET r.confidence = $conf, r.confirmations = 1, r.extractedAt = datetime()
        """.format(rel_type=rel["type"]),
        {"from_name": rel["from"], "to_name": rel["to"], "conf": rel["conf"]},
    )
gf.commit()

# Export
nodes_df = gf.to_dataframe("MATCH (n) RETURN labels(n)[0] AS label, n.name AS name")
G = gf.to_networkx()

Friction Points¶

FP-1 — Dynamic labels require unsafe string interpolation (→ #471)¶

Problem: openCypher grammar restricts labels to IDENTIFIER terminals. $label in label position is a SyntaxError. Users must build queries via string formatting with no sanitization layer.

# FAILS — parameterized label not supported
gf.execute("MATCH (n:$label) RETURN n", {"label": "Person"})
# SyntaxError: Unexpected token Token('DOLLAR', '$') at line 1, column 10.

# Required workaround — injection risk if label is user-supplied
label = "Person"
gf.execute(f"MATCH (n:{label}) RETURN n")

Impact: High. Every LLM extraction pipeline uses dynamic entity types. Users coming from Neo4j (which also lacks parameterized labels) already know this pattern, but it remains a frequently hit footgun during graph DB onboarding.

Neo4j / KùzuDB: Neither supports parameterized labels in Cypher either. Neo4j's apoc.merge.node() accepts a list of labels as a parameter. KùzuDB requires schema DDL (CREATE NODE TABLE) so the label is always static.

FP-2 — No schema introspection API (→ #469) ✅ Resolved in v0.3.10¶

Resolution (PR #499): All four methods now exist:

gf.labels()              # ['Organization', 'Person', 'Technology']
gf.relationship_types()  # ['CREATED', 'DEPENDS_ON']
gf.node_count()          # 20
gf.node_count("Person")  # 8
gf.relationship_count()  # 26

Original problem: gf.labels(), gf.relationship_types(), gf.node_count(), and gf.relationship_count() did not exist. The workaround required reaching into private _label_index / _type_index structures and calling gf.graph.get_statistics() with undocumented field names.

FP-3 — No constraint enforcement (→ #473)¶

Problem: CREATE happily creates duplicate nodes. MERGE is the only idempotency mechanism, and it requires per-query discipline.

gf.execute("CREATE (:Person {name: 'Alice'})")
gf.execute("CREATE (:Person {name: 'Alice'})")  # no error — two Alice nodes now exist
count = gf.execute("MATCH (n:Person {name:'Alice'}) RETURN count(n) AS c")[0]["c"].value
# count == 2

# Workaround: always use MERGE
gf.execute("MERGE (:Person {name: 'Alice'})")
gf.execute("MERGE (:Person {name: 'Alice'})")  # idempotent — one node

Impact: Medium. Users familiar with openCypher know to use MERGE. New users and LLM-generated code both default to CREATE. No uniqueness constraint DDL (CREATE CONSTRAINT) means the burden is entirely on the query author.

Neo4j: CREATE CONSTRAINT ON (n:Person) ASSERT n.name IS UNIQUE (Neo4j 4.x syntax; 5.x spells it CREATE CONSTRAINT FOR (n:Person) REQUIRE n.name IS UNIQUE)
KùzuDB: Schema DDL (CREATE NODE TABLE) enforces primary keys.
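Until constraint DDL exists, one way to carry part of that burden in Python is to check extraction output for repeated keys before any CREATE runs. The following is a minimal sketch, assuming entities arrive as dicts with `label` and `name` keys as in the pipeline above; `find_duplicate_keys` is a hypothetical helper, not part of GraphForge.

```python
from collections import Counter


def find_duplicate_keys(entities: list[dict]) -> list[tuple[str, str]]:
    """Return the (label, name) pairs that occur more than once in a batch."""
    counts = Counter((e["label"], e["name"]) for e in entities)
    return sorted(key for key, n in counts.items() if n > 1)


batch = [
    {"label": "Person", "name": "Alice"},
    {"label": "Person", "name": "Alice"},        # repeats the first record
    {"label": "Technology", "name": "Python"},
]
duplicates = find_duplicate_keys(batch)
```

This only sees duplicates inside one batch. Nodes that already exist in the graph still need MERGE to stay unique.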

FP-4 — Export asymmetry: no `to_json()` / `from_json()` (→ #470) ✅ Resolved in v0.3.10¶

Resolution (PR #500): Both methods now exist as first-class API:

gf.to_json("kg.json", metadata={"name": "My KG"})   # export
gf2 = GraphForge.from_json("kg.json")                # round-trip import

Original problem: to_json() and from_json() did not exist. The workaround required manually importing JSONGraphExporter, constructing it, and calling .export() — four steps with no round-trip import path at all.

FP-5 — Bulk ingest requires a context manager loop, no vectorized call (→ #472) ✅ Resolved in v0.3.10¶

Resolution (PR #502): add_graph_documents() accepts a list of LangChain GraphDocument objects or plain dicts in a single call:

gf.add_graph_documents([
    {
        "nodes": [{"id": "Python", "type": "Technology", "properties": {}}],
        "relationships": [],
    },
    ...
])

Nodes are merged by id + label (idempotent); relationships deduplicated on (source, target, type, properties). All labels and relationship types validated before any writes.
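The dedup keys described above can be illustrated in plain Python. This is a sketch of the stated semantics only, not GraphForge's implementation: `dedupe_documents` is a hypothetical name, the relationship dict shape (`source` and `target` as node ids) is an assumption because the source only shows an empty `relationships` list, and property values are assumed to be flat, hashable scalars.

```python
from typing import Any


def _freeze(props: dict[str, Any]) -> tuple:
    """Order-independent, hashable key for a flat properties dict."""
    return tuple(sorted(props.items()))


def dedupe_documents(documents: list[dict]) -> tuple[list[dict], list[dict]]:
    """Keep the first node per (id, type) and the first relationship per
    (source, target, type, properties), in order of first appearance."""
    nodes: dict[tuple, dict] = {}
    rels: dict[tuple, dict] = {}
    for doc in documents:
        for node in doc.get("nodes", []):
            nodes.setdefault((node["id"], node["type"]), node)
        for rel in doc.get("relationships", []):
            key = (rel["source"], rel["target"], rel["type"],
                   _freeze(rel.get("properties", {})))
            rels.setdefault(key, rel)
    return list(nodes.values()), list(rels.values())


documents = [
    {"nodes": [{"id": "Python", "type": "Technology", "properties": {}}],
     "relationships": []},
    {
        "nodes": [
            {"id": "Python", "type": "Technology", "properties": {}},   # repeat
            {"id": "PyTorch", "type": "Technology", "properties": {}},
        ],
        "relationships": [
            {"source": "PyTorch", "target": "Python", "type": "DEPENDS_ON", "properties": {}},
            {"source": "PyTorch", "target": "Python", "type": "DEPENDS_ON", "properties": {}},
        ],
    },
]
unique_nodes, unique_rels = dedupe_documents(documents)
```

Running the same documents through twice produces the same two nodes and one relationship, which is the idempotency property the resolution note describes.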

Original problem: There was no batch API for LangChain-style extraction output. The with bulk_ingest(): create_node_bulk(...) pattern was fast but required boilerplate, and had no idempotency guarantees.

FP-6 — `shortestPath` raises `SyntaxError` at parse time ✅ Resolved in v0.3.10¶

Resolution (PR #497): shortestPath() and allShortestPaths() now parse correctly and raise NotImplementedError with a BFS workaround hint:

gf.execute("MATCH p = shortestPath((a)-[*]-(b)) RETURN p")
# NotImplementedError: shortestPath() is not yet implemented.
# Use the BFS workaround: MATCH path = (a)-[*1..N]-(b)
#   RETURN length(path) ORDER BY length(path) LIMIT 1

# Workaround:
result = gf.execute(
    "MATCH path = (a:Technology {name: $src})-[*1..6]-(b:Technology {name: $dst}) "
    "RETURN length(path) AS hops ORDER BY hops LIMIT 1",
    {"src": "Python", "dst": "SQLite"},
)

Original problem: The parser had no grammar rule for shortestPath. Calling it raised a confusing SyntaxError mentioning LPAR, not a clear "not implemented" message.
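If the variable-length Cypher workaround is not convenient, the same hop count can be computed client-side with a plain breadth-first search. The sketch below assumes the edges have already been pulled out of the graph as (source name, target name) pairs, for example from a `MATCH (a)-[r]->(b) RETURN a.name, b.name` query; `shortest_hops` and the sample edges are hypothetical.

```python
from collections import deque


def shortest_hops(edges, src, dst):
    """Breadth-first search over an undirected edge list.

    Returns the number of hops on a shortest path, or None if dst is unreachable.
    """
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)   # undirected, like -[*]- in the query
    if src == dst:
        return 0
    seen = {src}
    queue = deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in adjacency.get(node, ()):
            if nxt == dst:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


edges = [("Python", "SQLAlchemy"), ("SQLAlchemy", "SQLite"), ("Python", "NumPy")]
hops = shortest_hops(edges, "Python", "SQLite")
```

Unlike the `[*1..6]` pattern, this search has no hop cap, so it also finds paths longer than six hops.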

Competitive Analysis¶

Capability	GraphForge	Neo4j	KùzuDB	LangChain integration
Parameterized labels	No (string interpolation only)	No (APOC workaround)	No (schema DDL)	N/A
Schema introspection	`gf.labels()`, `gf.relationship_types()` ✅	`db.labels()`, `db.schema`	`CALL show_tables()`	`graph.get_schema`
Uniqueness constraints	No DDL — MERGE is sole mechanism	`CREATE CONSTRAINT`	Primary key in DDL	N/A
JSON export	`gf.to_json()` / `gf.from_json()` ✅	`apoc.export.json.all()`	`COPY TO`	`graph.to_json()`
Vectorized node batch	`add_graph_documents()` ✅	`UNWIND $batch CREATE`	`COPY FROM` CSV	`add_graph_documents()`
Shortest path	`NotImplementedError` + BFS workaround ✅	`shortestPath()` + GDS library	`shortestPath()`	N/A
LangChain `GraphDocument` ingest	`add_graph_documents()` ✅	`Neo4jGraph.add_graph_documents()`	No	Direct

What Neo4j does better for KG construction¶

UNWIND $batch for batched writes, with apoc.merge.node() supplying dynamic labels
Constraint DDL ensures uniqueness at the DB level, not query level
db.schema.visualization() returns a full schema with cardinality info
shortestPath() plus the Graph Data Science library for full path algorithms

What KùzuDB does better¶

Schema-first DDL (CREATE NODE TABLE, CREATE REL TABLE) eliminates duplicate nodes by design; every node type has a typed primary key
COPY FROM csv_file TO NodeTable for bulk ingest without Python loop boilerplate

What LangChain does better for ecosystem fit¶

# LangChain GraphDocument standard — one call, structured output
from langchain_community.graphs.graph_document import GraphDocument, Node, Relationship

doc = GraphDocument(
    nodes=[
        Node(id="python", type="Technology", properties={"description": "..."}),
        Node(id="pytorch", type="Technology", properties={"description": "..."}),
    ],
    relationships=[
        Relationship(source=Node(id="pytorch", type="Technology"),
                     target=Node(id="python",  type="Technology"),
                     type="DEPENDS_ON"),
    ],
)
graph.add_graph_documents([doc])  # Neo4jGraph, MemgraphGraph, etc.

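Before v0.3.10, GraphForge had no equivalent, and users switching from LangChain's Neo4j integration had to rewrite their extraction pipeline entirely. add_graph_documents() (FP-5, PR #502) now accepts a list of GraphDocument objects or plain dicts.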

Prioritized Recommendations¶

Priority is based on: (1) how often a KG construction user hits this, (2) implementation effort, (3) competitive gap size.

R-1 — Schema introspection methods (Priority: High / Effort: Low) — issue #469¶

The data already exists in gf.graph.get_statistics() and _label_index / _type_index. This is a thin public API wrapper.

# Proposed signatures
def labels(self) -> list[str]: ...
def relationship_types(self) -> list[str]: ...
def node_count(self, label: str | None = None) -> int: ...
def relationship_count(self, rel_type: str | None = None) -> int: ...

Implementation: In src/graphforge/api.py, delegate to:
- sorted(self.graph._label_index.keys())
- sorted(self.graph._type_index.keys())
- self.graph.get_statistics().total_nodes (or label-filtered len of _label_index[label])
- self.graph.get_statistics().total_edges (or type-filtered)

R-2 — `to_json()` / `from_json()` methods (Priority: Medium / Effort: Low) — issue #470¶

Wraps the existing JSONGraphExporter into the same calling style as to_networkx().

def to_json(
    self,
    path: Path | str,
    metadata: dict[str, Any] | None = None,
) -> None: ...

@classmethod
def from_json(cls, path: Path | str) -> "GraphForge": ...

Implementation: to_json delegates to JSONGraphExporter().export(self, Path(path), metadata). from_json uses the matching JSONGraphImporter if it exists, or raises NotImplementedError as a placeholder. Adds to_json/from_json to the export family in api.py.

R-3 — `shortestPath` graceful degradation (Priority: High / Effort: Low–Medium) — issue #468¶

At minimum, parse shortestPath(...) as a recognized construct and raise NotImplementedError with a clear message. Full BFS implementation is a separate story.

# In cypher.lark — add to path_pattern rule:
path_function: "shortestPath" "(" pattern_element ")"
             | "allShortestPaths" "(" pattern_element ")"

Implementation (graceful degradation):
1. Add grammar rule in cypher.lark
2. Add transformer method in parser.py producing ShortestPathCall AST node
3. In planner: emit a ShortestPath operator
4. In executor: raise NotImplementedError("shortestPath is not yet implemented — see issue #468")

Full BFS implementation is tracked in #468.

R-4 — Merge helper for dynamic labels (Priority: Medium / Effort: Medium) — issue #471¶

Eliminates unsafe string interpolation for the most common KG operation.

def merge_node(
    self,
    labels: list[str],
    match_on: dict[str, Any],
    on_create: dict[str, Any] | None = None,
    on_match: dict[str, Any] | None = None,
) -> NodeRef:
    """Idempotent node upsert with safe label injection."""

Implementation: Validates each label against re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", label) (raising ValueError on invalid input), then builds and executes the MERGE query internally. This makes labels safe for dynamic use without exposing raw string interpolation to callers.

R-5 — LangChain-compatible `add_graph_documents()` (Priority: Low–Medium / Effort: Medium) — issue #472¶

Closes the ecosystem gap for users migrating from LangChain + Neo4j.

from graphforge.integrations.langchain import GraphDocument, Node, Relationship

def add_graph_documents(
    self,
    documents: list[GraphDocument],
    *,
    base_entity_label: bool = False,
    include_source: bool = False,
) -> None: ...

Implementation: Translate each Node/Relationship into MERGE calls internally, reusing merge_node() from R-4 for safe label injection. Mirror LangChain's Neo4jGraph.add_graph_documents() signature for drop-in compatibility.

Dedup Pattern Clarification (for doc update)¶

The single-query dedup pattern with WITH ... WHERE r IS NOT NULL before DETACH DELETE is a footgun: when the alias has no outgoing edges, the WHERE filters out all rows and the delete never fires. The correct pattern uses two separate execute() calls inside a transaction (as shown in the current doc), or uses a CALL {} subquery to scope the edge migration independently:

# ✅ Correct — two-query transaction (current doc pattern)
gf.begin()
gf.execute("""
    MATCH (alias:Technology {name: 'OldName'})
    MATCH (canon:Technology {name: 'NewName'})
    WITH alias, canon
    MATCH (alias)-[r]->(target)
    MERGE (canon)-[:IMPLEMENTED_IN {confidence: r.confidence}]->(target)
""")
gf.execute("MATCH (alias:Technology {name: 'OldName'}) DETACH DELETE alias")
gf.commit()

# ✅ Also correct — CALL subquery scopes edge migration independently
gf.execute("""
    MATCH (alias:Technology {name: 'OldName'}), (canon:Technology {name: 'NewName'})
    OPTIONAL MATCH (alias)-[r]->(target)
    WITH alias, canon, r, target
    CALL {
        WITH alias, canon, r, target
        WITH alias, canon, r, target WHERE r IS NOT NULL
        MERGE (canon)-[:IMPLEMENTED_IN]->(target)
    }
    DETACH DELETE alias
""")

# ❌ Incorrect: plain MATCH yields no rows when alias has no edges, so DETACH DELETE is silently skipped
gf.execute("""
    MATCH (alias:Technology {name: 'OldName'})
    MATCH (canon:Technology {name: 'NewName'})
    WITH alias, canon
    MATCH (alias)-[r]->(target)           // ← empty result when alias has no edges
    MERGE (canon)-[:IMPLEMENTED_IN]->(target)
    WITH alias
    DETACH DELETE alias                    // ← never reached
""")

The doc should add a note pointing users to the two-query transaction pattern and warning that folding delete into an edge-migration query requires OPTIONAL MATCH.
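The footgun follows from Cypher's row-driven execution: each clause runs once per row produced by the clause before it, so zero rows means zero deletes. A toy Python model makes the difference between MATCH and OPTIONAL MATCH visible. It is an illustration only, not GraphForge's executor, and all function names are hypothetical.

```python
def match(rows, edges):
    """Plain MATCH: one output row per (alias, target); an alias without edges emits nothing."""
    return [{**row, "target": t} for row in rows for t in edges.get(row["alias"], [])]


def optional_match(rows, edges):
    """OPTIONAL MATCH: an alias without edges still emits one row, with target=None."""
    out = []
    for row in rows:
        targets = edges.get(row["alias"], [])
        out.extend({**row, "target": t} for t in (targets or [None]))
    return out


def detach_delete(rows):
    """DETACH DELETE runs once per incoming row; returns the set of deleted aliases."""
    return {row["alias"] for row in rows}


edges = {"OldName": []}                  # the alias node has no outgoing edges
rows = [{"alias": "OldName"}]

deleted_plain = detach_delete(match(rows, edges))
deleted_optional = detach_delete(optional_match(rows, edges))
```

With plain MATCH the delete receives no rows and removes nothing; with OPTIONAL MATCH the alias row survives and the delete fires.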

Ontology-Constrained KG Construction¶

Question: Can GraphForge be used in a workflow where an ontology constrains what entities and relationships can be created — ensuring the resulting KG conforms to a schema? The intent is not to add this as a first-party GF feature but to validate that GF works well as the store within this pattern.

Short answer: Yes, and all three viable approaches work correctly today. The constraint logic lives entirely outside GF in each case; GF is the store. Pydantic is the right tool for fail-fast validation at the Python layer.

Approach A — Pydantic validates before ingest¶

Pattern: Define Pydantic models that mirror the ontology. All LLM extraction output is validated against those models before any gf.execute() call runs. If validation fails, the whole batch is rejected — nothing reaches the graph.

from pydantic import BaseModel, Field, model_validator
from typing import Any, Optional
from graphforge import GraphForge

# Ontology defined as Python dicts (single source of truth)
CLASSES = {
    "Person":       {"required": ["name"],               "optional": ["description","url","confidence"]},
    "Technology":   {"required": ["name","description"], "optional": ["version","confidence"]},
    "Concept":      {"required": ["name"],               "optional": ["description","confidence"]},
}
RELS = {
    "DEPENDS_ON": {"domain": ["Technology"], "range": ["Technology"]},
    "RESEARCHES": {"domain": ["Person"],     "range": ["Concept"]},
    "CREATED":    {"domain": ["Person"],     "range": ["Technology"]},
}

class NodeRecord(BaseModel):
    type: str
    name: str
    description: Optional[str] = None
    confidence: float = Field(default=1.0, ge=0.0, le=1.0)

    @model_validator(mode="after")
    def check_class_and_required(self) -> "NodeRecord":
        if self.type not in CLASSES:
            raise ValueError(f"Unknown class '{self.type}'. Allowed: {sorted(CLASSES)}")
        for req in CLASSES[self.type]["required"]:
            if getattr(self, req, None) is None:
                raise ValueError(f"Class '{self.type}' requires property '{req}'")
        return self

class RelRecord(BaseModel):
    from_name: str
    from_type: str
    to_name: str
    to_type: str
    rel_type: str

    @model_validator(mode="after")
    def check_rel(self) -> "RelRecord":
        c = RELS.get(self.rel_type)
        if c is None:
            raise ValueError(f"Unknown rel type '{self.rel_type}'")
        if self.from_type not in c["domain"]:
            raise ValueError(f"'{self.rel_type}' domain must be {c['domain']}, got '{self.from_type}'")
        if self.to_type not in c["range"]:
            raise ValueError(f"'{self.rel_type}' range must be {c['range']}, got '{self.to_type}'")
        return self

class ExtractionBatch(BaseModel):
    nodes: list[NodeRecord]
    relationships: list[RelRecord]
    source: str

# The nodes below are valid, but the relationship violates the ontology,
# so Pydantic raises ValidationError here and nothing is ingested
batch = ExtractionBatch(
    source="doc:example",
    nodes=[
        NodeRecord(type="Technology", name="PyTorch", description="ML framework"),
        NodeRecord(type="Person",     name="Yann LeCun"),
        NodeRecord(type="Concept",    name="Deep Learning"),
    ],
    relationships=[
        RelRecord(from_name="PyTorch", from_type="Technology",
                  to_name="Deep Learning", to_type="Concept", rel_type="DEPENDS_ON"),
        # ↑ FAIL: DEPENDS_ON range is Technology, not Concept; Pydantic catches this
    ],
)

# Other invalid records caught before ingest:
# NodeRecord(type="Animal", name="Cat")          → "Unknown class 'Animal'. Allowed: ['Concept', 'Person', 'Technology']"
# NodeRecord(type="Technology", name="X")        → "Class 'Technology' requires property 'description'"
# RelRecord(..., rel_type="HATES")               → "Unknown rel type 'HATES'"
# RelRecord(..., from_type="Concept", rel_type="DEPENDS_ON") → "'DEPENDS_ON' domain must be ['Technology'], got 'Concept'"

Characteristics:
- Validation is instantaneous: no graph I/O happens before the check
- All-or-nothing: one bad record in a batch of 100 rejects the whole batch (Pydantic's default)
- The CLASSES/RELS dicts are the single source of truth for both Pydantic and ingest logic
- Works naturally with LangChain structured output, instructor, and pydantic-ai: LLMs can be prompted to produce ExtractionBatch-shaped JSON directly
- No ontology persistence: the schema lives only in Python

Limitation: The ontology is not queryable at runtime. You cannot ask the graph "what classes are defined?" or "what relationships are valid for Technology?" — that information exists only in the Python module.

Approach B — Ontology-as-graph in GF, conformance via Cypher¶

Pattern: Load the ontology itself into GF as nodes and edges using reserved labels (:OntClass, :OntProp, :OntRel). Instance data lives in the same graph with normal labels. Run Cypher conformance queries after ingest to detect violations.

# Load ontology schema into GF
def load_ontology(gf: GraphForge, classes: dict, rels: dict) -> None:
    gf.begin()
    for cls_name, cls_def in classes.items():
        gf.execute("MERGE (:OntClass {name: $name})", {"name": cls_name})
        for prop in cls_def["required"]:
            gf.execute("""
                MERGE (p:OntProp {name: $prop, required: true, appliesTo: $cls})
                WITH p MATCH (c:OntClass {name: $cls}) MERGE (c)-[:REQUIRES]->(p)
            """, {"prop": prop, "cls": cls_name})
    for rel_name, rel_def in rels.items():
        gf.execute("MERGE (:OntRel {name: $name})", {"name": rel_name})
        for d in rel_def["domain"]:
            gf.execute("MATCH (rt:OntRel {name:$rt}),(c:OntClass {name:$cls}) MERGE (rt)-[:DOMAIN]->(c)",
                       {"rt": rel_name, "cls": d})
        for r in rel_def["range"]:
            gf.execute("MATCH (rt:OntRel {name:$rt}),(c:OntClass {name:$cls}) MERGE (rt)-[:RANGE]->(c)",
                       {"rt": rel_name, "cls": r})
    gf.commit()

# Query the ontology at runtime
valid_from_tech = gf.execute("""
    MATCH (c:OntClass {name: 'Technology'})<-[:DOMAIN]-(rt:OntRel)
    RETURN rt.name AS rel_type ORDER BY rel_type
""")
# → rel_type values: 'DEPENDS_ON', 'EMBODIES', 'IMPLEMENTS'
#   (the abbreviated RELS dict shown in Approach A would yield only 'DEPENDS_ON')

# Post-ingest conformance check
def find_domain_violations(gf: GraphForge, rt: str, allowed_domains: list[str]) -> list[dict]:
    rows = gf.execute(f"""
        MATCH (a)-[r:{rt}]->(b)
        WHERE NOT a:OntClass AND NOT a:OntProp AND NOT a:OntRel
        RETURN a.name AS from_node, labels(a)[0] AS from_label
    """)
    return [
        {"rel": rt, "from": r["from_node"].value, "label": r["from_label"].value}
        for r in rows if r["from_label"].value not in allowed_domains
    ]

Characteristics:
- The ontology is a first-class citizen of the graph, queryable via Cypher like any other data
- Enables dynamic workflows: "fetch all valid rel types for this class before building a prompt"
- Catches violations in data loaded by any path (raw Cypher, bulk_ingest, scripts)
- GF limitation: EXISTS {} subqueries are not supported (planner validation error), so conformance queries must use explicit per-class/per-rel-type loops in Python rather than single declarative Cypher
- Schema and instance data share the same graph, which requires label discipline (:OntClass etc.) to avoid mixing ontology nodes with instance nodes in queries

Key limitation found: The most natural Cypher conformance pattern uses EXISTS {} subqueries:

WHERE NOT EXISTS { MATCH (rt)-[:DOMAIN]->(d:OntClass) WHERE d.name IN labels(a) }

This raises a ValidationError in the GraphForge planner (PropertyAccess requires either 'variable' or 'base'). Workaround: run conformance checks as Python loops over per-class or per-rel-type queries rather than a single declarative Cypher statement.
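The loop-based workaround might look like the sketch below. To keep it runnable without GraphForge, the per-rel-type query is abstracted behind a `fetch_sources` callable that returns (node name, first label) pairs; in practice it would wrap a query like the one in find_domain_violations. `audit_domains` and `fake_rows` are hypothetical.

```python
from typing import Callable

Row = tuple[str, str]   # (source node name, its first label)


def audit_domains(
    rels: dict[str, dict],
    fetch_sources: Callable[[str], list[Row]],
) -> list[dict]:
    """Run one query per relationship type and collect domain violations in Python."""
    violations = []
    for rel_type, constraint in rels.items():
        for name, label in fetch_sources(rel_type):
            if label not in constraint["domain"]:
                violations.append({"rel": rel_type, "from": name, "label": label})
    return violations


RELS = {
    "DEPENDS_ON": {"domain": ["Technology"], "range": ["Technology"]},
    "CREATED":    {"domain": ["Person"],     "range": ["Technology"]},
}
# Stand-in for query results; a Concept node wrongly has an outgoing DEPENDS_ON edge.
fake_rows = {
    "DEPENDS_ON": [("PyTorch", "Technology"), ("Deep Learning", "Concept")],
    "CREATED":    [("Guido van Rossum", "Person")],
}
violations = audit_domains(RELS, lambda rel_type: fake_rows.get(rel_type, []))
```

A range check would follow the same shape, fetching target nodes instead of source nodes.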

Approach C — Hybrid: Pydantic + GF ontology graph (recommended)¶

Pattern: CLASSES/RELS are defined once as Python dicts. Pydantic models are generated from them for fail-fast validation. The same dicts also populate a GF ontology graph for runtime introspection and post-ingest auditing. Both layers reference the same source of truth so they cannot diverge.

# ── 1. Define ontology once ────────────────────────────────────────────────────
CLASSES = {
    "Technology": {"required": ["name","description"], "optional": ["version","confidence"]},
    "Person":     {"required": ["name"],               "optional": ["description","confidence"]},
    # ...
}
RELS = {
    "DEPENDS_ON": {"domain": ["Technology"], "range": ["Technology"]},
    "RESEARCHES": {"domain": ["Person"],     "range": ["Concept"]},
    # ...
}

# ── 2. Pydantic layer: fail-fast before ingest ─────────────────────────────────
batch = ExtractionBatch(nodes=[...], relationships=[...], source="doc:run-1")
# ValidationError raised here if anything violates the ontology — nothing reaches GF

# ── 3. Ingest validated data ──────────────────────────────────────────────────
ingest_batch(gf, batch)

# ── 4. GF ontology: runtime introspection ─────────────────────────────────────
schema = get_schema_for_class(gf, "Technology")
# {"required": ["description","name"], "optional": [...],
#  "outgoing_rels": ["DEPENDS_ON","EMBODIES","IMPLEMENTS"],
#  "incoming_rels": ["CREATED","DEPENDS_ON","MAINTAINS"]}

# ── 5. GF ontology: post-ingest audit (catches raw-Cypher bypasses) ────────────
violations = post_ingest_audit(gf)
# {"missing_required_properties": [{"class":"Technology","node":"SketchyLib","property":"description"}],
#  "domain_violations": [...], "unknown_rel_types": [...]}

# ── 6. GF ontology: generate LLM system prompt from live schema ────────────────
for row in gf.execute("MATCH (c:OntClass) RETURN c.name AS name ORDER BY name"):
    cls = row["name"].value
    s = get_schema_for_class(gf, cls)
    print(f"  - {cls} (required: {', '.join(s['required'])})")
# - Concept (required: name)
# - Technology (required: description, name)
# ...

Characteristics:
- Single source of truth: add a class or rel type in one dict, and both layers update automatically
- Pydantic catches the bulk of violations before any graph I/O
- The GF ontology graph catches what remains: data written by scripts, imports, or other tools that bypassed Pydantic validation
- The ontology graph also enables generating LLM prompts dynamically from the live schema: the graph drives what the LLM is asked to extract, so adding a new class automatically updates all downstream prompts
- Same EXISTS {} limitation as Approach B applies; the conformance audit uses Python loops
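The ingest_batch() call in step 3 is not defined in this document. One possible shape is sketched below, assuming the batch exposes nodes, relationships, and source attributes like ExtractionBatch and that the graph object offers begin(), execute(), and commit() as used in the pipeline above. `RecordingGraph` is a stand-in so the sketch runs without GraphForge; it and `_ident` are hypothetical.

```python
import re
from types import SimpleNamespace

_IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def _ident(value: str) -> str:
    """Reject anything that is not a plain identifier before it is interpolated."""
    if not _IDENT.fullmatch(value):
        raise ValueError(f"Unsafe identifier: {value!r}")
    return value


def ingest_batch(gf, batch) -> None:
    """Write an already-validated batch inside one transaction."""
    # Build every statement first so a bad identifier aborts before any write.
    statements = []
    for node in batch.nodes:
        statements.append((
            f"MERGE (e:{_ident(node.type)} {{name: $name}}) "
            "ON CREATE SET e.description = $desc, e.source = $source",
            {"name": node.name, "desc": node.description, "source": batch.source},
        ))
    for rel in batch.relationships:
        statements.append((
            f"MATCH (a:{_ident(rel.from_type)} {{name: $from_name}}) "
            f"MATCH (b:{_ident(rel.to_type)} {{name: $to_name}}) "
            f"MERGE (a)-[r:{_ident(rel.rel_type)}]->(b)",
            {"from_name": rel.from_name, "to_name": rel.to_name},
        ))
    gf.begin()
    for query, params in statements:
        gf.execute(query, params)
    gf.commit()


class RecordingGraph:
    """Stand-in for GraphForge that records calls instead of touching a graph."""

    def __init__(self):
        self.calls = []

    def begin(self):
        self.calls.append("BEGIN")

    def execute(self, query, params=None):
        self.calls.append((query, params))

    def commit(self):
        self.calls.append("COMMIT")


batch = SimpleNamespace(
    source="doc:run-1",
    nodes=[
        SimpleNamespace(type="Technology", name="PyTorch", description="ML framework"),
        SimpleNamespace(type="Technology", name="Python", description="High-level language"),
    ],
    relationships=[
        SimpleNamespace(from_name="PyTorch", from_type="Technology",
                        to_name="Python", to_type="Technology", rel_type="DEPENDS_ON"),
    ],
)
gf = RecordingGraph()
ingest_batch(gf, batch)
```

Matching endpoints by label as well as name is a design choice here: it avoids linking to a same-named node of a different class.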

Comparison¶

Dimension	A: Pydantic only	B: GF ontology only	C: Hybrid (recommended)
Fail-fast validation	Yes — before any I/O	No — post-ingest only	Yes
Catches non-Pydantic writes	No	Yes	Yes
Ontology queryable at runtime	No	Yes	Yes
LLM prompt generation from schema	No	Yes	Yes
Single source of truth	Yes (Python dicts)	No (must keep GF and code in sync)	Yes (Python dicts → both layers)
`EXISTS {}` subquery support needed	No	Yes (workaround needed)	Yes (workaround needed)
GF version required	Current	Current	Current
Complexity	Low	Medium	Medium

Recommendation: Approach C. The extra cost over A is loading the ontology into GF once at startup (~50ms for a typical ontology). The payoff is: runtime schema introspection, post-ingest audit for any bypass path, and dynamic LLM prompt generation — all via the same Cypher interface as the instance data.

GF limitation surfaced: `EXISTS {}` subquery in planner¶

The most natural Cypher for conformance checking (single declarative query finding all violations) requires EXISTS {} subqueries with outer variable binding:

MATCH (a)-[r]->(b)
WHERE NOT a:OntClass
  AND NOT EXISTS { MATCH (rt:OntRel {name: type(r)})-[:DOMAIN]->(d:OntClass) WHERE d.name IN labels(a) }
RETURN type(r), labels(a), a.name

This raises ValidationError: PropertyAccess requires either 'variable' or 'base' in the planner. The workaround — Python loops calling per-class/per-rel-type Cypher — is functional but verbose. A fully declarative conformance query would be cleaner. Tracked in issue #474.