
Knowledge Graph Construction — Research Findings

Issue: #449
Branch: main (v0.3.9 / v0.3.10-dev)
Date: 2026-05-04


Executive Summary

GraphForge's core MERGE-based KG construction pattern is solid: idempotent entity and relationship loading, provenance tracking, two-pass confidence updates, and the full export suite (to_dicts, to_networkx, to_igraph, to_dataframe) all work correctly in a 20-entity / 26-relationship pipeline test at comfortable speed (~28 ms initial load, 296 K nodes/sec bulk ingest). Two gaps blocked a complete use-case experience when this research began. First, shortestPath raised a parse-time SyntaxError instead of a clear "not implemented" error, even though the use-case doc shows it as working (issue #468). Second, four public API methods that users would reach for first (gf.labels(), gf.relationship_types(), gf.node_count(), gf.relationship_count()) did not exist, despite the underlying data being readily available via gf.graph.get_statistics() and private indexes. Both gaps are resolved in v0.3.10 (see FP-6 and FP-2).


Code Snippet Pass/Fail Matrix

All snippets taken from docs/use-cases/knowledge-graph-construction.md and run in isolation against current main.

| # | Snippet | Status | Notes |
|---|---------|--------|-------|
| S1 | Entity MERGE with .format(label=...) | PASS | |
| S2 | Relationship MERGE after MATCH | PASS | |
| S3 | datetime() in ON CREATE SET | PASS | |
| S4 | CASE + coalesce confidence update | PASS | |
| S5 | Entity dedup (edge migration + DETACH DELETE) | PASS¹ | Two-query pattern in doc is correct |
| S6 | Find all entities of a type | PASS | |
| S7 | Traverse + type(r) | PASS | |
| S8 | shortestPath(...) | RESOLVED | Now raises NotImplementedError with BFS workaround hint (PR #497) |
| S9 | Aggregate by rel type | PASS | |
| S10 | 2-hop var-length subgraph | PASS | |
| S11 | Complete worked example | PASS | |

¹ The doc's two-execute() transaction pattern is correct. A single-query variant that folds edge migration and DETACH DELETE into one query with WITH alias, canon, r, target WHERE r IS NOT NULL silently skips the delete when the alias has no edges: the WHERE drops every row, so DETACH DELETE never runs. This is standard openCypher WITH/WHERE semantics rather than a GraphForge bug, but the doc should note it as a footgun.


Extended Pipeline Test

Mock dataset: 8 entity types (Person, Organization, Technology, Concept, Document, Location, Event, License), 20 nodes, 26 relationships.

Results

| Stage | Checks | Status | Timing |
|-------|--------|--------|--------|
| 1. Initial load (MERGE + ON CREATE SET) | 2/2 | PASS | 28 ms |
| 2. Second pass (ON MATCH SET confirmations++) | 2/2 | PASS | 16 ms |
| 3. Confidence update (CASE + coalesce) | 2/2 | PASS | < 1 ms |
| 4. Entity dedup (alias merge + DETACH DELETE) | 2/2 | PASS | < 1 ms |
| 5. Querying (traverse, aggregate, 2-hop, shared-dep, low-conf filter) | 8/8 | PASS | < 1 ms each |
| 6. Export (to_dicts, to_networkx, to_igraph, to_dataframe) | 13/13 | PASS | < 5 ms |
| 7. Bulk ingest (10 K nodes) | 2/2 | PASS | 34 ms (296 K nodes/sec) |

Total: 31/31 checks — all pass.

Runnable pipeline

from graphforge import GraphForge

ENTITIES = [
    {"label": "Person",       "name": "Guido van Rossum", "description": "Creator of Python"},
    {"label": "Organization", "name": "Google",           "description": "Tech giant"},
    {"label": "Technology",   "name": "Python",           "description": "High-level language"},
    # … 17 more
]
RELATIONSHIPS = [
    {"from": "Guido van Rossum", "to": "Python",  "type": "CREATED",    "conf": 1.0},
    {"from": "Google",           "to": "Python",  "type": "DEPENDS_ON", "conf": 0.9},
    # … 24 more
]

gf = GraphForge()
SOURCE, MODEL = "doc:research:pipeline-test", "gpt-4o-mock"

gf.begin()
for ent in ENTITIES:
    gf.execute(
        """
        MERGE (e:{label} {{name: $name}})
        ON CREATE SET e.description = $desc,
                      e.source      = $source,
                      e.extractedBy = $model,
                      e.extractedAt = datetime()
        """.format(label=ent["label"]),
        {"name": ent["name"], "desc": ent["description"], "source": SOURCE, "model": MODEL},
    )
for rel in RELATIONSHIPS:
    gf.execute(
        """
        MATCH (a {{name: $from_name}})
        MATCH (b {{name: $to_name}})
        MERGE (a)-[r:{rel_type}]->(b)
        ON CREATE SET r.confidence = $conf, r.confirmations = 1, r.extractedAt = datetime()
        """.format(rel_type=rel["type"]),
        {"from_name": rel["from"], "to_name": rel["to"], "conf": rel["conf"]},
    )
gf.commit()

# Export
nodes_df = gf.to_dataframe("MATCH (n) RETURN labels(n)[0] AS label, n.name AS name")
G = gf.to_networkx()

Friction Points

FP-1 — Dynamic labels require unsafe string interpolation (→ #471)

Problem: openCypher grammar restricts labels to IDENTIFIER terminals. $label in label position is a SyntaxError. Users must build queries via string formatting with no sanitization layer.

# FAILS — parameterized label not supported
gf.execute("MATCH (n:$label) RETURN n", {"label": "Person"})
# SyntaxError: Unexpected token Token('DOLLAR', '$') at line 1, column 10.

# Required workaround — injection risk if label is user-supplied
label = "Person"
gf.execute(f"MATCH (n:{label}) RETURN n")

Impact: High. Every LLM extraction pipeline uses dynamic entity types. Users who come from Neo4j (which also lacks parameterized labels) know this pattern, but it's still the most-asked-about footgun in graph DB onboarding.

Neo4j / KùzuDB: Neither supports parameterized labels in Cypher either. Neo4j's apoc.merge.node() accepts a list of labels as a parameter. KùzuDB requires schema DDL (CREATE NODE TABLE) so the label is always static.


FP-2 — No schema introspection API (→ #469) ✅ Resolved in v0.3.10

Resolution (PR #499): All four methods now exist:

gf.labels()              # ['Organization', 'Person', 'Technology']
gf.relationship_types()  # ['CREATED', 'DEPENDS_ON']
gf.node_count()          # 20
gf.node_count("Person")  # 8
gf.relationship_count()  # 26

Original problem: gf.labels(), gf.relationship_types(), gf.node_count(), and gf.relationship_count() did not exist. The workaround required reaching into private _label_index / _type_index structures and calling gf.graph.get_statistics() with undocumented field names.


FP-3 — No constraint enforcement (→ #473)

Problem: CREATE happily creates duplicate nodes. MERGE is the only idempotency mechanism, and it requires per-query discipline.

gf.execute("CREATE (:Person {name: 'Alice'})")
gf.execute("CREATE (:Person {name: 'Alice'})")  # no error — two Alice nodes now exist
count = gf.execute("MATCH (n:Person {name:'Alice'}) RETURN count(n) AS c")[0]["c"].value
# count == 2

# Workaround: always use MERGE
gf.execute("MERGE (:Person {name: 'Alice'})")
gf.execute("MERGE (:Person {name: 'Alice'})")  # idempotent — one node

Impact: Medium. Users familiar with openCypher know to use MERGE. New users and LLM-generated code both default to CREATE. No uniqueness constraint DDL (CREATE CONSTRAINT) means the burden is entirely on the query author.

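Neo4j: CREATE CONSTRAINT FOR (n:Person) REQUIRE n.name IS UNIQUE (current syntax; older 4.x releases use CREATE CONSTRAINT ON (n:Person) ASSERT n.name IS UNIQUE).

KùzuDB: Schema DDL (CREATE NODE TABLE) enforces primary keys.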


FP-4 — Export asymmetry: no to_json() / from_json() (→ #470) ✅ Resolved in v0.3.10

Resolution (PR #500): Both methods now exist as first-class API:

gf.to_json("kg.json", metadata={"name": "My KG"})   # export
gf2 = GraphForge.from_json("kg.json")                # round-trip import

Original problem: to_json() and from_json() did not exist. The workaround required manually importing JSONGraphExporter, constructing it, and calling .export() — four steps with no round-trip import path at all.


FP-5 — Bulk ingest requires a context manager loop, no vectorized call (→ #472) ✅ Resolved in v0.3.10

Resolution (PR #502): add_graph_documents() accepts a list of LangChain GraphDocument objects or plain dicts in a single call:

gf.add_graph_documents([
    {
        "nodes": [{"id": "Python", "type": "Technology", "properties": {}}],
        "relationships": [],
    },
    ...
])

Nodes are merged by id + label (idempotent); relationships deduplicated on (source, target, type, properties). All labels and relationship types validated before any writes.

Original problem: There was no batch API for LangChain-style extraction output. The with bulk_ingest(): create_node_bulk(...) pattern was fast but required boilerplate, and had no idempotency guarantees.
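
The merge and dedup rules described in the resolution above can be illustrated with a stand-alone sketch. It is not GraphForge's implementation: the helper name plan_graph_documents, the identifier regex used for validation, and the relationship dict shape (string source/target ids) are all assumptions made for illustration.

```python
import json
import re

# Assumption: labels and relationship types must look like plain identifiers.
_IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def plan_graph_documents(documents):
    """Validate every label/type first, then merge nodes and dedupe relationships."""
    # Pass 1: validate everything before planning a single write.
    for doc in documents:
        for item in [*doc["nodes"], *doc["relationships"]]:
            if not _IDENT.fullmatch(item["type"]):
                raise ValueError(f"invalid label or relationship type: {item['type']!r}")

    # Pass 2: nodes merge on (id, label); later properties overwrite earlier ones.
    nodes = {}
    rels = {}
    for doc in documents:
        for node in doc["nodes"]:
            merged = nodes.setdefault((node["id"], node["type"]), {})
            merged.update(node.get("properties") or {})
        for rel in doc["relationships"]:
            props = rel.get("properties") or {}
            # Properties are serialized so they can take part in the dedup key.
            key = (rel["source"], rel["target"], rel["type"],
                   json.dumps(props, sort_keys=True))
            rels.setdefault(key, rel)
    return nodes, list(rels.values())
```

Because validation runs as a separate first pass, a bad label in the last document rejects the whole call before anything is planned, which mirrors the "validated before any writes" behaviour.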


FP-6 — shortestPath raises SyntaxError at parse time ✅ Resolved in v0.3.10

Resolution (PR #497): shortestPath() and allShortestPaths() now parse correctly and raise NotImplementedError with a BFS workaround hint:

gf.execute("MATCH p = shortestPath((a)-[*]-(b)) RETURN p")
# NotImplementedError: shortestPath() is not yet implemented.
# Use the BFS workaround: MATCH path = (a)-[*1..N]-(b)
#   RETURN length(path) ORDER BY length(path) LIMIT 1

# Workaround:
result = gf.execute(
    "MATCH path = (a:Technology {name: $src})-[*1..6]-(b:Technology {name: $dst}) "
    "RETURN length(path) AS hops ORDER BY hops LIMIT 1",
    {"src": "Python", "dst": "SQLite"},
)

Original problem: The parser had no grammar rule for shortestPath. Calling it raised a confusing SyntaxError mentioning LPAR, not a clear "not implemented" message.
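
The variable-length workaround enumerates every path up to the hop limit before sorting. Where that becomes expensive, a client-side breadth-first search over an exported edge list is one possible alternative. The sketch below is hypothetical: bfs_hops is not a GraphForge function, and it assumes the edges have already been pulled out of the graph as (source, target) name pairs. Edges are treated as undirected to mirror the -[*]- pattern.

```python
from collections import deque


def bfs_hops(edges, src, dst):
    """Return the minimum hop count between src and dst, or None if unreachable."""
    # Build an undirected adjacency map from (source, target) pairs.
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    if src == dst:
        return 0
    seen = {src}
    queue = deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbour in adjacency.get(node, ()):
            if neighbour == dst:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


# Illustrative edge list, not taken from the pipeline dataset.
sample_edges = [("PyTorch", "Python"), ("Python", "SQLite"), ("Rust", "LLVM")]
hops = bfs_hops(sample_edges, "PyTorch", "SQLite")
```

Unlike the Cypher workaround, this visits each node at most once and needs no hop limit, at the cost of exporting the edges first.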


Competitive Analysis

| Capability | GraphForge | Neo4j | KùzuDB | LangChain integration |
|------------|------------|-------|--------|-----------------------|
| Parameterized labels | No (string interpolation only) | No (APOC workaround) | No (schema DDL) | N/A |
| Schema introspection | gf.labels(), gf.relationship_types() | db.labels(), db.schema | CALL show_tables() | graph.get_schema |
| Uniqueness constraints | No DDL; MERGE is sole mechanism | CREATE CONSTRAINT | Primary key in DDL | N/A |
| JSON export | gf.to_json() / gf.from_json() | apoc.export.json.all() | COPY TO | graph.to_json() |
| Vectorized node batch | add_graph_documents() | UNWIND $batch CREATE | COPY FROM CSV | add_graph_documents() |
| Shortest path | NotImplementedError + BFS workaround | ✅ shortestPath() + GDS library | shortestPath() | N/A |
| LangChain GraphDocument ingest | add_graph_documents() | Neo4jGraph.add_graph_documents() | No | Direct |

What Neo4j does better for KG construction

  • UNWIND $batch combined with apoc.merge.node() merges a whole batch in one query, with the labels passed as parameters
  • Constraint DDL ensures uniqueness at the DB level, not query level
  • db.schema.visualization() returns a full schema with cardinality info
  • shortestPath() plus the Graph Data Science library for full path algorithms

What KùzuDB does better

  • Schema-first DDL (CREATE NODE TABLE, CREATE REL TABLE) eliminates duplicate nodes by design; every node type has a typed primary key
  • COPY FROM csv_file TO NodeTable for bulk ingest without Python loop boilerplate

What LangChain does better for ecosystem fit

# LangChain GraphDocument standard — one call, structured output
from langchain_community.graphs.graph_document import GraphDocument, Node, Relationship

doc = GraphDocument(
    nodes=[
        Node(id="python", type="Technology", properties={"description": "..."}),
        Node(id="pytorch", type="Technology", properties={"description": "..."}),
    ],
    relationships=[
        Relationship(source=Node(id="pytorch", type="Technology"),
                     target=Node(id="python",  type="Technology"),
                     type="DEPENDS_ON"),
    ],
)
graph.add_graph_documents([doc])  # Neo4jGraph, MemgraphGraph, etc.

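Before v0.3.10, GraphForge had no equivalent, so users switching from LangChain's Neo4j integration had to rewrite their extraction pipeline entirely. As of PR #502 (see FP-5), add_graph_documents() accepts the same GraphDocument objects.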


Prioritized Recommendations

Priority is based on: (1) how often a KG construction user hits this, (2) implementation effort, (3) competitive gap size.

R-1 — Schema introspection methods (Priority: High / Effort: Low) — issue #469

The data already exists in gf.graph.get_statistics() and _label_index / _type_index. This is a thin public API wrapper.

# Proposed signatures
def labels(self) -> list[str]: ...
def relationship_types(self) -> list[str]: ...
def node_count(self, label: str | None = None) -> int: ...
def relationship_count(self, rel_type: str | None = None) -> int: ...

Implementation: In src/graphforge/api.py, delegate to:

  • sorted(self.graph._label_index.keys())
  • sorted(self.graph._type_index.keys())
  • self.graph.get_statistics().total_nodes (or the label-filtered len of _label_index[label])
  • self.graph.get_statistics().total_edges (or the type-filtered equivalent)
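
The delegation described above might look roughly like the following sketch. It runs against a stand-in object rather than the real graph: the class name IntrospectionSketch and the assumption that each index maps a name to a sized collection of ids are illustrative, not confirmed details of GraphForge's internals.

```python
from types import SimpleNamespace


class IntrospectionSketch:
    """Thin public wrappers over the private indexes (hypothetical layout)."""

    def __init__(self, graph):
        self.graph = graph

    def labels(self):
        return sorted(self.graph._label_index.keys())

    def relationship_types(self):
        return sorted(self.graph._type_index.keys())

    def node_count(self, label=None):
        if label is None:
            return self.graph.get_statistics().total_nodes
        # Unknown labels count as zero rather than raising KeyError.
        return len(self.graph._label_index.get(label, ()))

    def relationship_count(self, rel_type=None):
        if rel_type is None:
            return self.graph.get_statistics().total_edges
        return len(self.graph._type_index.get(rel_type, ()))


# Stand-in graph used only to exercise the wrappers.
fake_graph = SimpleNamespace(
    _label_index={"Technology": {3}, "Person": {1, 2}},
    _type_index={"CREATED": {10}},
    get_statistics=lambda: SimpleNamespace(total_nodes=3, total_edges=1),
)
api = IntrospectionSketch(fake_graph)
```

Returning 0 for an unknown label is a design choice in this sketch; raising an error would also be defensible.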


R-2 — to_json() / from_json() methods (Priority: Medium / Effort: Low) — issue #470

Wraps the existing JSONGraphExporter into the same calling style as to_networkx().

def to_json(
    self,
    path: Path | str,
    metadata: dict[str, Any] | None = None,
) -> None: ...

@classmethod
def from_json(cls, path: Path | str) -> "GraphForge": ...

Implementation: to_json delegates to JSONGraphExporter().export(self, Path(path), metadata). from_json uses the matching JSONGraphImporter if it exists, or raises NotImplementedError as a placeholder. Adds to_json/from_json to the export family in api.py.


R-3 — shortestPath graceful degradation (Priority: High / Effort: Low–Medium) — issue #468

At minimum, parse shortestPath(...) as a recognized construct and raise NotImplementedError with a clear message. Full BFS implementation is a separate story.

# In cypher.lark — add to path_pattern rule:
path_function: "shortestPath" "(" pattern_element ")"
             | "allShortestPaths" "(" pattern_element ")"

Implementation (graceful degradation):

  1. Add grammar rule in cypher.lark
  2. Add transformer method in parser.py producing ShortestPathCall AST node
  3. In planner: emit a ShortestPath operator
  4. In executor: raise NotImplementedError("shortestPath is not yet implemented — see issue #468")

Full BFS implementation is tracked in #468.


R-4 — Merge helper for dynamic labels (Priority: Medium / Effort: Medium) — issue #471

Eliminates unsafe string interpolation for the most common KG operation.

def merge_node(
    self,
    labels: list[str],
    match_on: dict[str, Any],
    on_create: dict[str, Any] | None = None,
    on_match: dict[str, Any] | None = None,
) -> NodeRef:
    """Idempotent node upsert with safe label injection."""

Implementation: Validates each label against re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", label) (raising ValueError on invalid input), then builds and executes the MERGE query internally. This makes labels safe for dynamic use without exposing raw string interpolation to callers.
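
A minimal sketch of the validate-then-build step follows. The helper name build_merge_query and the parameter prefixes (m_, c_, u_) are hypothetical, and the sketch only constructs the query text and parameter dict; executing it and returning a NodeRef is left out. Property keys are validated too, since they are interpolated into the query just like labels. Multi-label MERGE support in GraphForge is assumed, not verified.

```python
import re

_IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def build_merge_query(labels, match_on, on_create=None, on_match=None):
    """Return (query, params) for an idempotent upsert with validated identifiers."""
    on_create = on_create or {}
    on_match = on_match or {}
    if not labels or not match_on:
        raise ValueError("at least one label and one match property are required")
    # Everything interpolated into the query text must be a plain identifier.
    for name in [*labels, *match_on, *on_create, *on_match]:
        if not _IDENT.fullmatch(name):
            raise ValueError(f"unsafe identifier: {name!r}")

    params = {}

    def bind(prefix, props):
        # Values always travel as parameters; only validated keys reach the text.
        pairs = []
        for key, value in props.items():
            params[f"{prefix}_{key}"] = value
            pairs.append((key, f"${prefix}_{key}"))
        return pairs

    match_map = ", ".join(f"{k}: {p}" for k, p in bind("m", match_on))
    query = f"MERGE (n:{':'.join(labels)} {{{match_map}}})"
    if on_create:
        query += " ON CREATE SET " + ", ".join(
            f"n.{k} = {p}" for k, p in bind("c", on_create))
    if on_match:
        query += " ON MATCH SET " + ", ".join(
            f"n.{k} = {p}" for k, p in bind("u", on_match))
    return query, params


query, params = build_merge_query(
    ["Person"], {"name": "Alice"}, on_create={"source": "doc:1"})
```

Prefixing parameter names keeps a property that appears in both match_on and on_create from colliding in the parameter dict.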


R-5 — LangChain-compatible add_graph_documents() (Priority: Low–Medium / Effort: Medium) — issue #472

Closes the ecosystem gap for users migrating from LangChain + Neo4j.

from graphforge.integrations.langchain import GraphDocument, Node, Relationship

def add_graph_documents(
    self,
    documents: list[GraphDocument],
    *,
    base_entity_label: bool = False,
    include_source: bool = False,
) -> None: ...

Implementation: Translate each Node/Relationship into MERGE calls internally, reusing merge_node() from R-4 for safe label injection. Mirror LangChain's Neo4jGraph.add_graph_documents() signature for drop-in compatibility.


Dedup Pattern Clarification (for doc update)

The single-query dedup pattern with WITH ... WHERE r IS NOT NULL before DETACH DELETE is a footgun: when the alias has no outgoing edges, the WHERE filters out all rows and the delete never fires. The correct pattern uses two separate execute() calls inside a transaction (as shown in the current doc), or uses a CALL {} subquery to scope the edge migration independently:

# ✅ Correct — two-query transaction (current doc pattern)
gf.begin()
gf.execute("""
    MATCH (alias:Technology {name: 'OldName'})
    MATCH (canon:Technology {name: 'NewName'})
    WITH alias, canon
    MATCH (alias)-[r]->(target)
    MERGE (canon)-[:IMPLEMENTED_IN {confidence: r.confidence}]->(target)
""")
gf.execute("MATCH (alias:Technology {name: 'OldName'}) DETACH DELETE alias")
gf.commit()

# ✅ Also correct — CALL subquery scopes edge migration independently
gf.execute("""
    MATCH (alias:Technology {name: 'OldName'}), (canon:Technology {name: 'NewName'})
    OPTIONAL MATCH (alias)-[r]->(target)
    WITH alias, canon, r, target
    CALL {
        WITH alias, canon, r, target
        WITH alias, canon, r, target WHERE r IS NOT NULL
        MERGE (canon)-[:IMPLEMENTED_IN]->(target)
    }
    DETACH DELETE alias
""")

# ❌ Incorrect: the plain MATCH yields no rows when alias has no edges, so DETACH DELETE is silently skipped
gf.execute("""
    MATCH (alias:Technology {name: 'OldName'})
    MATCH (canon:Technology {name: 'NewName'})
    WITH alias, canon
    MATCH (alias)-[r]->(target)           // ← empty result when alias has no edges
    MERGE (canon)-[:IMPLEMENTED_IN]->(target)
    WITH alias
    DETACH DELETE alias                    // ← never reached when alias has no edges
""")

The doc should add a note pointing users to the two-query transaction pattern and warning that folding delete into an edge-migration query requires OPTIONAL MATCH.
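
The row-by-row behaviour behind this footgun can be illustrated with a toy model in plain Python. This is not how GraphForge executes queries; it only mimics the openCypher rule that each clause runs once per incoming row. The function names and the (source, target) edge tuples are invented for the illustration.

```python
def match_out_edges(rows, edges, optional=False):
    """Expand each row to one row per outgoing edge of row['alias']."""
    expanded = []
    for row in rows:
        targets = [dst for src, dst in edges if src == row["alias"]]
        if targets:
            expanded.extend({**row, "target": dst} for dst in targets)
        elif optional:
            # OPTIONAL MATCH keeps the row and binds the missing variable to null.
            expanded.append({**row, "target": None})
        # Plain MATCH drops the row entirely when nothing matches.
    return expanded


def simulate_dedup(edges, optional):
    rows = [{"alias": "OldName", "canon": "NewName"}]
    rows = match_out_edges(rows, edges, optional)
    # MERGE runs per row with a real target; DETACH DELETE runs per surviving row.
    migrated = [(r["canon"], r["target"]) for r in rows if r["target"] is not None]
    deleted = {r["alias"] for r in rows}
    return migrated, deleted


no_edges_plain = simulate_dedup([], optional=False)
no_edges_optional = simulate_dedup([], optional=True)
```

With no edges, the plain MATCH variant leaves zero rows, so nothing is deleted; the OPTIONAL MATCH variant keeps one row and the delete still fires.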


Ontology-Constrained KG Construction

Question: Can GraphForge be used in a workflow where an ontology constrains what entities and relationships can be created — ensuring the resulting KG conforms to a schema? The intent is not to add this as a first-party GF feature but to validate that GF works well as the store within this pattern.

Short answer: Yes, and all three viable approaches work correctly today. The constraint logic lives entirely outside GF in each case; GF is the store. Pydantic is the right tool for fail-fast validation at the Python layer.


Approach A — Pydantic validates before ingest

Pattern: Define Pydantic models that mirror the ontology. All LLM extraction output is validated against those models before any gf.execute() call runs. If validation fails, the whole batch is rejected — nothing reaches the graph.

from pydantic import BaseModel, Field, model_validator
from typing import Any, Optional
from graphforge import GraphForge

# Ontology defined as Python dicts (single source of truth)
CLASSES = {
    "Person":       {"required": ["name"],               "optional": ["description","url","confidence"]},
    "Technology":   {"required": ["name","description"], "optional": ["version","confidence"]},
    "Concept":      {"required": ["name"],               "optional": ["description","confidence"]},
}
RELS = {
    "DEPENDS_ON": {"domain": ["Technology"], "range": ["Technology"]},
    "RESEARCHES": {"domain": ["Person"],     "range": ["Concept"]},
    "CREATED":    {"domain": ["Person"],     "range": ["Technology"]},
}

class NodeRecord(BaseModel):
    type: str
    name: str
    description: Optional[str] = None
    confidence: float = Field(default=1.0, ge=0.0, le=1.0)

    @model_validator(mode="after")
    def check_class_and_required(self) -> "NodeRecord":
        if self.type not in CLASSES:
            raise ValueError(f"Unknown class '{self.type}'. Allowed: {sorted(CLASSES)}")
        for req in CLASSES[self.type]["required"]:
            if getattr(self, req, None) is None:
                raise ValueError(f"Class '{self.type}' requires property '{req}'")
        return self

class RelRecord(BaseModel):
    from_name: str
    from_type: str
    to_name: str
    to_type: str
    rel_type: str

    @model_validator(mode="after")
    def check_rel(self) -> "RelRecord":
        c = RELS.get(self.rel_type)
        if c is None:
            raise ValueError(f"Unknown rel type '{self.rel_type}'")
        if self.from_type not in c["domain"]:
            raise ValueError(f"'{self.rel_type}' domain must be {c['domain']}, got '{self.from_type}'")
        if self.to_type not in c["range"]:
            raise ValueError(f"'{self.rel_type}' range must be {c['range']}, got '{self.to_type}'")
        return self

class ExtractionBatch(BaseModel):
    nodes: list[NodeRecord]
    relationships: list[RelRecord]
    source: str

# Example batch: the nodes are valid, but the relationship violates the ontology,
# so RelRecord(...) raises ValidationError and nothing reaches the graph
batch = ExtractionBatch(
    source="doc:example",
    nodes=[
        NodeRecord(type="Technology", name="PyTorch", description="ML framework"),
        NodeRecord(type="Person",     name="Yann LeCun"),
        NodeRecord(type="Concept",    name="Deep Learning"),
    ],
    relationships=[
        RelRecord(from_name="PyTorch", from_type="Technology",
                  to_name="Deep Learning", to_type="Concept", rel_type="DEPENDS_ON"),
        # ↑ FAIL: DEPENDS_ON range is Technology, not Concept; Pydantic catches this
    ],
)

# Other invalid records caught before ingest:
# NodeRecord(type="Animal", name="Cat")          → "Unknown class 'Animal'"
# NodeRecord(type="Technology", name="X")        → "Class 'Technology' requires property 'description'"
# RelRecord(..., rel_type="HATES")               → "Unknown rel type 'HATES'"
# RelRecord(..., from_type="Concept", rel_type="DEPENDS_ON") → "'DEPENDS_ON' domain must be ['Technology'], got 'Concept'"

Characteristics:

  • Validation is instantaneous: no graph I/O happens before the check
  • All-or-nothing: one bad record in a batch of 100 rejects the whole batch (Pydantic's default)
  • The CLASSES/RELS dicts are the single source of truth for both Pydantic and ingest logic
  • Works naturally with LangChain structured output, instructor, and pydantic-ai; LLMs can be prompted to produce ExtractionBatch-shaped JSON directly
  • No ontology persistence: the schema lives only in Python

Limitation: The ontology is not queryable at runtime. You cannot ask the graph "what classes are defined?" or "what relationships are valid for Technology?" — that information exists only in the Python module.


Approach B — Ontology-as-graph in GF, conformance via Cypher

Pattern: Load the ontology itself into GF as nodes and edges using reserved labels (:OntClass, :OntProp, :OntRel). Instance data lives in the same graph with normal labels. Run Cypher conformance queries after ingest to detect violations.

# Load ontology schema into GF
def load_ontology(gf: GraphForge, classes: dict, rels: dict) -> None:
    gf.begin()
    for cls_name, cls_def in classes.items():
        gf.execute("MERGE (:OntClass {name: $name})", {"name": cls_name})
        for prop in cls_def["required"]:
            gf.execute("""
                MERGE (p:OntProp {name: $prop, required: true, appliesTo: $cls})
                WITH p MATCH (c:OntClass {name: $cls}) MERGE (c)-[:REQUIRES]->(p)
            """, {"prop": prop, "cls": cls_name})
    for rel_name, rel_def in rels.items():
        gf.execute("MERGE (:OntRel {name: $name})", {"name": rel_name})
        for d in rel_def["domain"]:
            gf.execute("MATCH (rt:OntRel {name:$rt}),(c:OntClass {name:$cls}) MERGE (rt)-[:DOMAIN]->(c)",
                       {"rt": rel_name, "cls": d})
        for r in rel_def["range"]:
            gf.execute("MATCH (rt:OntRel {name:$rt}),(c:OntClass {name:$cls}) MERGE (rt)-[:RANGE]->(c)",
                       {"rt": rel_name, "cls": r})
    gf.commit()

# Query the ontology at runtime
valid_from_tech = gf.execute("""
    MATCH (c:OntClass {name: 'Technology'})<-[:DOMAIN]-(rt:OntRel)
    RETURN rt.name AS rel_type ORDER BY rel_type
""")
# → ['DEPENDS_ON', 'EMBODIES', 'IMPLEMENTS']

# Post-ingest conformance check
def find_domain_violations(gf: GraphForge, rt: str, allowed_domains: list[str]) -> list[dict]:
    rows = gf.execute(f"""
        MATCH (a)-[r:{rt}]->(b)
        WHERE NOT a:OntClass AND NOT a:OntProp AND NOT a:OntRel
        RETURN a.name AS from_node, labels(a)[0] AS from_label
    """)
    return [
        {"rel": rt, "from": r["from_node"].value, "label": r["from_label"].value}
        for r in rows if r["from_label"].value not in allowed_domains
    ]

Characteristics:

  • The ontology is a first-class citizen of the graph, queryable via Cypher like any other data
  • Enables dynamic workflows: "fetch all valid rel types for this class before building a prompt"
  • Catches violations in data loaded by any path (raw Cypher, bulk_ingest, scripts)
  • GF limitation: EXISTS {} subqueries are not supported (planner validation error), so conformance queries must use explicit per-class/per-rel-type loops in Python rather than single declarative Cypher
  • Schema and instance data share the same graph, which requires label discipline (:OntClass etc.) to avoid mixing ontology nodes with instance nodes in queries

Key limitation found: The most natural Cypher conformance pattern uses EXISTS {} subqueries:

WHERE NOT EXISTS { MATCH (rt)-[:DOMAIN]->(d:OntClass) WHERE d.name IN labels(a) }

This raises a ValidationError in the GraphForge planner (PropertyAccess requires either 'variable' or 'base'). Workaround: run conformance checks as Python loops over per-class or per-rel-type queries rather than a single declarative Cypher statement.
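
The Python-loop workaround can be sketched as a pure function over edge records that have already been fetched from the graph. Everything here is illustrative: audit_edges is a hypothetical helper, and the record keys (from_label, rel_type, to_label) assume a prior query that returned labels(a)[0], type(r), and labels(b)[0] for each instance edge.

```python
def audit_edges(edges, rels):
    """Check each edge record against the domain/range rules in rels."""
    report = {
        "unknown_rel_types": [],
        "domain_violations": [],
        "range_violations": [],
    }
    for edge in edges:
        rule = rels.get(edge["rel_type"])
        if rule is None:
            # No rule at all: domain/range cannot be checked for this edge.
            report["unknown_rel_types"].append(edge)
            continue
        if edge["from_label"] not in rule["domain"]:
            report["domain_violations"].append(edge)
        if edge["to_label"] not in rule["range"]:
            report["range_violations"].append(edge)
    return report


RELS_SAMPLE = {
    "DEPENDS_ON": {"domain": ["Technology"], "range": ["Technology"]},
    "RESEARCHES": {"domain": ["Person"], "range": ["Concept"]},
}
sample_edges = [
    {"from_label": "Technology", "rel_type": "DEPENDS_ON", "to_label": "Concept"},
    {"from_label": "Person", "rel_type": "RESEARCHES", "to_label": "Concept"},
    {"from_label": "Person", "rel_type": "HATES", "to_label": "Person"},
]
report = audit_edges(sample_edges, RELS_SAMPLE)
```

Keeping the check separate from the query makes it easy to unit-test without a graph, at the cost of pulling every instance edge into Python.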


Approach C — Hybrid (Pydantic + GF ontology graph)

Pattern: CLASSES/RELS are defined once as Python dicts. Pydantic models are generated from them for fail-fast validation. The same dicts also populate a GF ontology graph for runtime introspection and post-ingest auditing. Both layers reference the same source of truth so they cannot diverge.

# ── 1. Define ontology once ────────────────────────────────────────────────────
CLASSES = {
    "Technology": {"required": ["name","description"], "optional": ["version","confidence"]},
    "Person":     {"required": ["name"],               "optional": ["description","confidence"]},
    # ...
}
RELS = {
    "DEPENDS_ON": {"domain": ["Technology"], "range": ["Technology"]},
    "RESEARCHES": {"domain": ["Person"],     "range": ["Concept"]},
    # ...
}

# ── 2. Pydantic layer: fail-fast before ingest ─────────────────────────────────
batch = ExtractionBatch(nodes=[...], relationships=[...], source="doc:run-1")
# ValidationError raised here if anything violates the ontology — nothing reaches GF

# ── 3. Ingest validated data ──────────────────────────────────────────────────
ingest_batch(gf, batch)

# ── 4. GF ontology: runtime introspection ─────────────────────────────────────
schema = get_schema_for_class(gf, "Technology")
# {"required": ["description","name"], "optional": [...],
#  "outgoing_rels": ["DEPENDS_ON","EMBODIES","IMPLEMENTS"],
#  "incoming_rels": ["CREATED","DEPENDS_ON","MAINTAINS"]}

# ── 5. GF ontology: post-ingest audit (catches raw-Cypher bypasses) ────────────
violations = post_ingest_audit(gf)
# {"missing_required_properties": [{"class":"Technology","node":"SketchyLib","property":"description"}],
#  "domain_violations": [...], "unknown_rel_types": [...]}

# ── 6. GF ontology: generate LLM system prompt from live schema ────────────────
for row in gf.execute("MATCH (c:OntClass) RETURN c.name AS name ORDER BY name"):
    cls = row["name"].value
    s = get_schema_for_class(gf, cls)
    print(f"  - {cls} (required: {', '.join(s['required'])})")
# - Concept (required: name)
# - Technology (required: description, name)
# ...

Characteristics:

  • Single source of truth: add a class or rel type in one dict, both layers update automatically
  • Pydantic handles 99% of violations at zero cost before any graph I/O
  • GF ontology graph handles the remaining 1%: data written by scripts, imports, or other tools that bypassed Pydantic validation
  • The ontology graph also enables a powerful bonus: generating LLM prompts dynamically from the live schema. The graph drives what the LLM is asked to extract, so adding a new class automatically updates all downstream prompts
  • Same EXISTS {} limitation as Approach B applies; conformance audit uses Python loops
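
The prompt-generation idea can also be shown without a graph. The sketch below builds the schema section of a prompt directly from the CLASSES/RELS dicts instead of querying the ontology graph, so it is a simplified variant of step 6, not the same mechanism. The function name schema_prompt and the output layout are assumptions.

```python
def schema_prompt(classes, rels):
    """Render the ontology dicts as a plain-text schema section for an LLM prompt."""
    lines = ["Entity types:"]
    for name in sorted(classes):
        required = ", ".join(sorted(classes[name]["required"]))
        lines.append(f"  - {name} (required: {required})")
    lines.append("Relationship types:")
    for name in sorted(rels):
        domain = "|".join(rels[name]["domain"])
        range_ = "|".join(rels[name]["range"])
        lines.append(f"  - ({domain})-[:{name}]->({range_})")
    return "\n".join(lines)


CLASSES_SAMPLE = {
    "Technology": {"required": ["name", "description"], "optional": ["version", "confidence"]},
    "Person": {"required": ["name"], "optional": ["description", "confidence"]},
}
RELS_SAMPLE = {
    "DEPENDS_ON": {"domain": ["Technology"], "range": ["Technology"]},
    "RESEARCHES": {"domain": ["Person"], "range": ["Concept"]},
}
prompt_text = schema_prompt(CLASSES_SAMPLE, RELS_SAMPLE)
```

Sorting class and relationship names keeps the prompt stable across runs, which helps when prompts are cached or diffed.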


Comparison

| Dimension | A: Pydantic only | B: GF ontology only | C: Hybrid (recommended) |
|-----------|------------------|---------------------|-------------------------|
| Fail-fast validation | Yes (before any I/O) | No (post-ingest only) | Yes |
| Catches non-Pydantic writes | No | Yes | Yes |
| Ontology queryable at runtime | No | Yes | Yes |
| LLM prompt generation from schema | No | Yes | Yes |
| Single source of truth | Yes (Python dicts) | No (must keep GF and code in sync) | Yes (Python dicts → both layers) |
| EXISTS {} subquery support needed | No | Yes (workaround needed) | Yes (workaround needed) |
| GF version required | Current | Current | Current |
| Complexity | Low | Medium | Medium |

Recommendation: Approach C. The extra cost over A is loading the ontology into GF once at startup (~50ms for a typical ontology). The payoff is: runtime schema introspection, post-ingest audit for any bypass path, and dynamic LLM prompt generation — all via the same Cypher interface as the instance data.

GF limitation surfaced: EXISTS {} subquery in planner

The most natural Cypher for conformance checking (single declarative query finding all violations) requires EXISTS {} subqueries with outer variable binding:

MATCH (a)-[r]->(b)
WHERE NOT a:OntClass
  AND NOT EXISTS { MATCH (rt:OntRel {name: type(r)})-[:DOMAIN]->(d:OntClass) WHERE d.name IN labels(a) }
RETURN type(r), labels(a), a.name

This raises ValidationError: PropertyAccess requires either 'variable' or 'base' in the planner. The workaround — Python loops calling per-class/per-rel-type Cypher — is functional but verbose. A fully declarative conformance query would be cleaner. Tracked in issue #474.