Knowledge Graph Construction: Research Findings
Issue: #449
Branch: main (v0.3.9 / v0.3.10-dev)
Date: 2026-05-04
Executive Summary
GraphForge's core MERGE-based KG construction pattern is solid: idempotent entity and
relationship loading, provenance tracking, two-pass confidence updates, and the full
export suite (to_dicts, to_networkx, to_igraph, to_dataframe) all work correctly
in a 20-entity / 26-relationship pipeline test at comfortable speed (~28 ms initial load,
296 K nodes/sec bulk ingest). Two gaps blocked a complete use-case experience when this
research was run. First, shortestPath raised a parse-time SyntaxError instead of a clear
"not implemented" error, even though the use-case doc shows it as working (issue #468).
Second, four public API methods that users would reach for first (gf.labels(),
gf.relationship_types(), gf.node_count(), gf.relationship_count()) did not exist, despite
the underlying data being readily available via gf.graph.get_statistics() and private
indexes. Both gaps are resolved in v0.3.10 (see FP-2 and FP-6).
Code Snippet Pass/Fail Matrix
All snippets taken from docs/use-cases/knowledge-graph-construction.md and run in
isolation against current main.
| # | Snippet | Status | Notes |
|---|---|---|---|
| S1 | Entity MERGE with .format(label=...) | PASS | - |
| S2 | Relationship MERGE after MATCH | PASS | - |
| S3 | datetime() in ON CREATE SET | PASS | - |
| S4 | CASE + coalesce confidence update | PASS | - |
| S5 | Entity dedup (edge migration + DETACH DELETE) | **PASS**¹ | Two-query pattern in doc is correct |
| S6 | Find all entities of a type | PASS | - |
| S7 | Traverse + type(r) | PASS | - |
| S8 | shortestPath(...) | RESOLVED | Now raises NotImplementedError with BFS workaround hint (PR #497) |
| S9 | Aggregate by rel type | PASS | - |
| S10 | 2-hop var-length subgraph | PASS | - |
| S11 | Complete worked example | PASS | - |
¹ The doc's two-execute() transaction pattern is correct. A single-query variant that
folds edge migration and DETACH DELETE into one query behind WITH alias, canon, r, target
WHERE r IS NOT NULL silently skips the delete when the alias has no edges: the WHERE
leaves zero rows, so DETACH DELETE never runs (standard openCypher WITH/WHERE semantics).
The doc should note this as a footgun.
Extended Pipeline Test
Mock dataset: 8 entity types (Person, Organization, Technology, Concept, Document, Location, Event, License), 20 nodes, 26 relationships.
Results
| Stage | Checks | Status | Timing |
|---|---|---|---|
| 1 — Initial load (MERGE + ON CREATE SET) | 2/2 | PASS | 28 ms |
| 2 — Second pass (ON MATCH SET confirmations++) | 2/2 | PASS | 16 ms |
| 3 — Confidence update (CASE + coalesce) | 2/2 | PASS | < 1 ms |
| 4 — Entity dedup (alias merge + DETACH DELETE) | 2/2 | PASS | < 1 ms |
| 5 — Querying (traverse, aggregate, 2-hop, shared-dep, low-conf filter) | 8/8 | PASS | < 1 ms each |
| 6 — Export (to_dicts, to_networkx, to_igraph, to_dataframe) | 13/13 | PASS | < 5 ms |
| 7 — Bulk ingest (10 K nodes) | 2/2 | PASS | 34 ms (296 K nodes/sec) |
Total: 31/31 checks — all pass.
Runnable pipeline
from graphforge import GraphForge

ENTITIES = [
    {"label": "Person", "name": "Guido van Rossum", "description": "Creator of Python"},
    {"label": "Organization", "name": "Google", "description": "Tech giant"},
    {"label": "Technology", "name": "Python", "description": "High-level language"},
    # … 17 more
]
RELATIONSHIPS = [
    {"from": "Guido van Rossum", "to": "Python", "type": "CREATED", "conf": 1.0},
    {"from": "Google", "to": "Python", "type": "DEPENDS_ON", "conf": 0.9},
    # … 24 more
]

gf = GraphForge()
SOURCE, MODEL = "doc:research:pipeline-test", "gpt-4o-mock"

gf.begin()
# Entities: the label is interpolated with .format(); property values are parameters
for ent in ENTITIES:
    gf.execute(
        """
        MERGE (e:{label} {{name: $name}})
        ON CREATE SET e.description = $desc,
                      e.source = $source,
                      e.extractedBy = $model,
                      e.extractedAt = datetime()
        """.format(label=ent["label"]),
        {"name": ent["name"], "desc": ent["description"], "source": SOURCE, "model": MODEL},
    )
# Relationships: match both endpoints by name, then MERGE the typed edge
for rel in RELATIONSHIPS:
    gf.execute(
        """
        MATCH (a {{name: $from_name}})
        MATCH (b {{name: $to_name}})
        MERGE (a)-[r:{rel_type}]->(b)
        ON CREATE SET r.confidence = $conf, r.confirmations = 1, r.extractedAt = datetime()
        """.format(rel_type=rel["type"]),
        {"from_name": rel["from"], "to_name": rel["to"], "conf": rel["conf"]},
    )
gf.commit()

# Export
nodes_df = gf.to_dataframe("MATCH (n) RETURN labels(n)[0] AS label, n.name AS name")
G = gf.to_networkx()
Friction Points
FP-1: Dynamic labels require unsafe string interpolation (→ #471)
Problem: openCypher grammar restricts labels to IDENTIFIER terminals.
$label in label position is a SyntaxError. Users must build queries via
string formatting with no sanitization layer.
# FAILS — parameterized label not supported
gf.execute("MATCH (n:$label) RETURN n", {"label": "Person"})
# SyntaxError: Unexpected token Token('DOLLAR', '$') at line 1, column 10.
# Required workaround — injection risk if label is user-supplied
label = "Person"
gf.execute(f"MATCH (n:{label}) RETURN n")
Impact: High. Every LLM extraction pipeline uses dynamic entity types. Users who come from Neo4j (which also lacks parameterized labels) know this pattern, but it's still the most-asked-about footgun in graph DB onboarding.
Neo4j / KùzuDB: Neither supports parameterized labels in Cypher either.
Neo4j's apoc.merge.node() accepts a list of labels as a parameter.
KùzuDB requires schema DDL (CREATE NODE TABLE) so the label is always static.
FP-2: No schema introspection API (→ #469) ✅ Resolved in v0.3.10
Resolution (PR #499): All four methods now exist:
gf.labels() # ['Organization', 'Person', 'Technology']
gf.relationship_types() # ['CREATED', 'DEPENDS_ON']
gf.node_count() # 20
gf.node_count("Person") # 8
gf.relationship_count() # 26
Original problem: gf.labels(), gf.relationship_types(), gf.node_count(), and
gf.relationship_count() did not exist. The workaround required reaching into private
_label_index / _type_index structures and calling gf.graph.get_statistics() with
undocumented field names.
FP-3: No constraint enforcement (→ #473)
Problem: CREATE happily creates duplicate nodes. MERGE is the only
idempotency mechanism, and it requires per-query discipline.
gf.execute("CREATE (:Person {name: 'Alice'})")
gf.execute("CREATE (:Person {name: 'Alice'})") # no error — two Alice nodes now exist
count = gf.execute("MATCH (n:Person {name:'Alice'}) RETURN count(n) AS c")[0]["c"].value
# count == 2
# Workaround: always use MERGE
gf.execute("MERGE (:Person {name: 'Alice'})")
gf.execute("MERGE (:Person {name: 'Alice'})") # idempotent — one node
Impact: Medium. Users familiar with openCypher know to use MERGE. New users
and LLM-generated code both default to CREATE. No uniqueness constraint DDL
(CREATE CONSTRAINT) means the burden is entirely on the query author.
Neo4j: CREATE CONSTRAINT FOR (n:Person) REQUIRE n.name IS UNIQUE (Neo4j 5 syntax; older releases used ON ... ASSERT)
KùzuDB: Schema DDL (CREATE NODE TABLE) enforces primary keys.
FP-4: Export asymmetry, no to_json() / from_json() (→ #470) ✅ Resolved in v0.3.10
Resolution (PR #500): Both methods now exist as first-class API:
gf.to_json("kg.json", metadata={"name": "My KG"}) # export
gf2 = GraphForge.from_json("kg.json") # round-trip import
Original problem: to_json() and from_json() did not exist. The workaround
required manually importing JSONGraphExporter, constructing it, and calling
.export() — four steps with no round-trip import path at all.
FP-5: Bulk ingest requires a context manager loop, no vectorized call (→ #472) ✅ Resolved in v0.3.10
Resolution (PR #502): add_graph_documents() accepts a list of LangChain
GraphDocument objects or plain dicts in a single call:
gf.add_graph_documents([
    {
        "nodes": [{"id": "Python", "type": "Technology", "properties": {}}],
        "relationships": [],
    },
    ...
])
Nodes are merged by id + label (idempotent); relationships deduplicated on
(source, target, type, properties). All labels and relationship types validated
before any writes.
Original problem: There was no batch API for LangChain-style extraction output.
The with bulk_ingest(): create_node_bulk(...) pattern was fast but required
boilerplate, and had no idempotency guarantees.
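To make the deduplication rule in the resolution concrete, the sketch below shows one way to derive the keys in plain Python. It is an illustration under assumptions, not the code from PR #502: the helper names (`node_key`, `relationship_key`, `deduplicate`) are hypothetical, and the relationship dict shape (`source` and `target` as node dicts, plus `type` and `properties`) is assumed to mirror LangChain's `Relationship`, since only the node shape is shown above.

```python
import json


def node_key(node):
    """Nodes are considered identical when id and label (type) match."""
    return (node["id"], node["type"])


def relationship_key(rel):
    """Relationships are keyed on (source, target, type, properties)."""
    # Dicts are unhashable, so the properties are serialised canonically;
    # sort_keys makes the key independent of property insertion order.
    props = json.dumps(rel.get("properties") or {}, sort_keys=True, default=str)
    return (rel["source"]["id"], rel["target"]["id"], rel["type"], props)


def deduplicate(documents):
    """Collapse repeated nodes and relationships across all documents."""
    nodes, rels = {}, {}
    for doc in documents:
        for node in doc.get("nodes", []):
            nodes.setdefault(node_key(node), node)  # first occurrence wins
        for rel in doc.get("relationships", []):
            rels.setdefault(relationship_key(rel), rel)
    return list(nodes.values()), list(rels.values())


# Hypothetical input: the same node twice, and the same edge with reordered properties.
python = {"id": "Python", "type": "Technology", "properties": {}}
pytorch = {"id": "PyTorch", "type": "Technology", "properties": {}}
docs = [
    {"nodes": [python, pytorch], "relationships": [
        {"source": pytorch, "target": python, "type": "DEPENDS_ON",
         "properties": {"confidence": 0.9, "source": "doc:1"}},
    ]},
    {"nodes": [python], "relationships": [
        {"source": pytorch, "target": python, "type": "DEPENDS_ON",
         "properties": {"source": "doc:1", "confidence": 0.9}},
        {"source": pytorch, "target": python, "type": "DEPENDS_ON",
         "properties": {"confidence": 0.5, "source": "doc:2"}},
    ]},
]
unique_nodes, unique_rels = deduplicate(docs)
```

Keying on the serialised properties means two edges that differ only in confidence stay distinct, which matches the stated rule but may not be what every pipeline wants.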
FP-6: shortestPath raises SyntaxError at parse time ✅ Resolved in v0.3.10
Resolution (PR #497): shortestPath() and allShortestPaths() now parse
correctly and raise NotImplementedError with a BFS workaround hint:
gf.execute("MATCH p = shortestPath((a)-[*]-(b)) RETURN p")
# NotImplementedError: shortestPath() is not yet implemented.
# Use the BFS workaround: MATCH path = (a)-[*1..N]-(b)
# RETURN length(path) ORDER BY length(path) LIMIT 1
# Workaround:
result = gf.execute(
    "MATCH path = (a:Technology {name: $src})-[*1..6]-(b:Technology {name: $dst}) "
    "RETURN length(path) AS hops ORDER BY hops LIMIT 1",
    {"src": "Python", "dst": "SQLite"},
)
Original problem: The parser had no grammar rule for shortestPath. Calling it
raised a confusing SyntaxError mentioning LPAR, not a clear "not implemented" message.
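When only one shortest path or its hop count is needed, the same answer can also be computed client-side with a breadth-first search over exported edges. The sketch below is a minimal illustration, not part of GraphForge: `bfs_shortest_path` and the `edges` list of `(source, target)` name pairs are hypothetical, and how the edge list is pulled out of the graph is left as an assumption.

```python
from collections import deque


def bfs_shortest_path(edges, src, dst):
    """Return one shortest undirected path from src to dst as a list of names, or None."""
    # Build an undirected adjacency map, mirroring the undirected -[*]- pattern.
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    if src == dst:
        return [src]

    parents = {src: None}  # doubles as the visited set
    queue = deque([src])
    while queue:
        node = queue.popleft()
        # sorted() only makes the choice among equally short paths deterministic
        for neighbour in sorted(adjacency.get(node, ())):
            if neighbour in parents:
                continue
            parents[neighbour] = node
            if neighbour == dst:
                # Walk the parent chain back to src to rebuild the path.
                path = [dst]
                while parents[path[-1]] is not None:
                    path.append(parents[path[-1]])
                return path[::-1]
            queue.append(neighbour)
    return None  # dst is unreachable from src


# Hypothetical edge list; the names are illustrative only.
edges = [
    ("Python", "CPython"),
    ("CPython", "SQLite"),
    ("Python", "Django"),
    ("Django", "SQLite"),
]
path = bfs_shortest_path(edges, "Python", "SQLite")
hops = len(path) - 1
```

Because BFS visits nodes in order of distance, the first time `dst` is reached the path is guaranteed to be a shortest one, without the fixed upper bound that the `[*1..6]` pattern needs.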
Competitive Analysis
| Capability | GraphForge | Neo4j | KùzuDB | LangChain integration |
|---|---|---|---|---|
| Parameterized labels | No (string interpolation only) | No (APOC workaround) | No (schema DDL) | N/A |
| Schema introspection | gf.labels(), gf.relationship_types() ✅ | db.labels(), db.schema | CALL show_tables() | graph.get_schema |
| Uniqueness constraints | No DDL; MERGE is sole mechanism | CREATE CONSTRAINT | Primary key in DDL | N/A |
| JSON export | gf.to_json() / gf.from_json() ✅ | apoc.export.json.all() | COPY TO | graph.to_json() |
| Vectorized node batch | add_graph_documents() ✅ | UNWIND $batch CREATE | COPY FROM CSV | add_graph_documents() |
| Shortest path | NotImplementedError + BFS workaround ✅ | shortestPath() + GDS library | shortestPath() | N/A |
| LangChain GraphDocument ingest | add_graph_documents() ✅ | Neo4jGraph.add_graph_documents() | No | Direct |
What Neo4j does better for KG construction
- UNWIND $batch CREATE (n:$label) with apoc.merge.node() for dynamic labels
- Constraint DDL ensures uniqueness at the DB level, not query level
- db.schema.visualization() returns a full schema with cardinality info
- shortestPath() plus the Graph Data Science library for full path algorithms
What KùzuDB does better
- Schema-first DDL (CREATE NODE TABLE, CREATE REL TABLE) eliminates duplicate nodes by design; every node type has a typed primary key
- COPY FROM csv_file TO NodeTable for bulk ingest without Python loop boilerplate
What LangChain does better for ecosystem fit
# LangChain GraphDocument standard — one call, structured output
from langchain_community.graphs.graph_document import GraphDocument, Node, Relationship
doc = GraphDocument(
    nodes=[
        Node(id="python", type="Technology", properties={"description": "..."}),
        Node(id="pytorch", type="Technology", properties={"description": "..."}),
    ],
    relationships=[
        Relationship(
            source=Node(id="pytorch", type="Technology"),
            target=Node(id="python", type="Technology"),
            type="DEPENDS_ON",
        ),
    ],
)
graph.add_graph_documents([doc])  # Neo4jGraph, MemgraphGraph, etc.
When this research was run, GraphForge had no equivalent, and users switching from LangChain's Neo4j integration had to rewrite their extraction pipeline entirely. add_graph_documents() (PR #502, see FP-5) now covers this case.
Prioritized Recommendations
Priority is based on: (1) how often a KG construction user hits this, (2) implementation effort, (3) competitive gap size.
R-1: Schema introspection methods (Priority: High / Effort: Low), issue #469
The data already exists in gf.graph.get_statistics() and _label_index /
_type_index. This is a thin public API wrapper.
# Proposed signatures
def labels(self) -> list[str]: ...
def relationship_types(self) -> list[str]: ...
def node_count(self, label: str | None = None) -> int: ...
def relationship_count(self, rel_type: str | None = None) -> int: ...
Implementation: In src/graphforge/api.py, delegate to:
- sorted(self.graph._label_index.keys())
- sorted(self.graph._type_index.keys())
- self.graph.get_statistics().total_nodes (or label-filtered len of _label_index[label])
- self.graph.get_statistics().total_edges (or type-filtered)
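The delegation described above can be sketched against a stand-in graph object. This is only an illustration of the wrapper's shape: `ToyGraph` and `Introspection` are hypothetical, the index layout (a dict mapping each label or type to a set of ids) is an assumption about the private structures, and the `total_nodes` / `total_edges` attributes stand in for the values reported by `get_statistics()`.

```python
from __future__ import annotations


class ToyGraph:
    """Stand-in for the internal graph; the index layout is an assumption."""

    def __init__(self) -> None:
        self._label_index: dict[str, set[int]] = {}
        self._type_index: dict[str, set[int]] = {}
        self.total_nodes = 0
        self.total_edges = 0

    def add_node(self, node_id: int, label: str) -> None:
        self._label_index.setdefault(label, set()).add(node_id)
        self.total_nodes += 1

    def add_edge(self, edge_id: int, rel_type: str) -> None:
        self._type_index.setdefault(rel_type, set()).add(edge_id)
        self.total_edges += 1


class Introspection:
    """Thin public wrapper that only reads from the indexes."""

    def __init__(self, graph: ToyGraph) -> None:
        self.graph = graph

    def labels(self) -> list[str]:
        return sorted(self.graph._label_index)

    def relationship_types(self) -> list[str]:
        return sorted(self.graph._type_index)

    def node_count(self, label: str | None = None) -> int:
        if label is None:
            return self.graph.total_nodes
        # An unknown label counts as zero rather than raising KeyError.
        return len(self.graph._label_index.get(label, ()))

    def relationship_count(self, rel_type: str | None = None) -> int:
        if rel_type is None:
            return self.graph.total_edges
        return len(self.graph._type_index.get(rel_type, ()))


graph = ToyGraph()
graph.add_node(1, "Person")
graph.add_node(2, "Person")
graph.add_node(3, "Technology")
graph.add_edge(10, "CREATED")
api = Introspection(graph)
```

Returning sorted lists keeps the output stable across runs, which matters when the result is embedded in prompts or snapshot tests.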
R-2: to_json() / from_json() methods (Priority: Medium / Effort: Low), issue #470
Wraps the existing JSONGraphExporter into the same calling style as to_networkx().
def to_json(
    self,
    path: Path | str,
    metadata: dict[str, Any] | None = None,
) -> None: ...

@classmethod
def from_json(cls, path: Path | str) -> "GraphForge": ...
Implementation: to_json delegates to JSONGraphExporter().export(self, Path(path), metadata).
from_json uses the matching JSONGraphImporter if it exists, or raises NotImplementedError
as a placeholder. Adds to_json/from_json to the export family in api.py.
R-3: shortestPath graceful degradation (Priority: High / Effort: Low–Medium), issue #468
At minimum, parse shortestPath(...) as a recognized construct and raise
NotImplementedError with a clear message. Full BFS implementation is a separate story.
# In cypher.lark — add to path_pattern rule:
path_function: "shortestPath" "(" pattern_element ")"
             | "allShortestPaths" "(" pattern_element ")"
Implementation (graceful degradation):
1. Add grammar rule in cypher.lark
2. Add transformer method in parser.py producing ShortestPathCall AST node
3. In planner: emit a ShortestPath operator
4. In executor: raise NotImplementedError("shortestPath is not yet implemented — see issue #468")
Full BFS implementation is tracked in #468.
R-4: Merge helper for dynamic labels (Priority: Medium / Effort: Medium), issue #471
Eliminates unsafe string interpolation for the most common KG operation.
def merge_node(
    self,
    labels: list[str],
    match_on: dict[str, Any],
    on_create: dict[str, Any] | None = None,
    on_match: dict[str, Any] | None = None,
) -> NodeRef:
    """Idempotent node upsert with safe label injection."""
Implementation: Validates each label against re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", label)
(raising ValueError on invalid input), then builds and executes the MERGE query internally.
This makes labels safe for dynamic use without exposing raw string interpolation to callers.
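The validation step can be sketched without touching the database: validate every identifier that has to be interpolated, keep all values as parameters, and return the query text plus its parameter dict. The helper names (`validate_identifier`, `build_merge_query`) and the parameter-naming scheme are hypothetical; the regex is the one quoted above. Property keys are validated with the same pattern here, which is an added assumption, since keys also end up in the query text.

```python
import re

_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def validate_identifier(name):
    """Reject anything that is not a plain identifier before it reaches the query text."""
    if not isinstance(name, str) or _IDENTIFIER.fullmatch(name) is None:
        raise ValueError(f"Invalid identifier: {name!r}")
    return name


def build_merge_query(labels, match_on, on_create=None, on_match=None):
    """Return (query, params) for an idempotent node upsert."""
    if not labels:
        raise ValueError("at least one label is required")
    if not match_on:
        raise ValueError("match_on must contain at least one property")

    label_part = "".join(f":{validate_identifier(label)}" for label in labels)
    params = {}

    def bind(prefix, props):
        # Values travel as parameters; only validated keys are interpolated.
        pairs = []
        for key, value in props.items():
            validate_identifier(key)
            param = f"{prefix}_{key}"
            params[param] = value
            pairs.append((key, param))
        return pairs

    match_part = ", ".join(f"{k}: ${p}" for k, p in bind("match", match_on))
    query = f"MERGE (n{label_part} {{{match_part}}})"
    if on_create:
        sets = ", ".join(f"n.{k} = ${p}" for k, p in bind("create", on_create))
        query += f" ON CREATE SET {sets}"
    if on_match:
        sets = ", ".join(f"n.{k} = ${p}" for k, p in bind("update", on_match))
        query += f" ON MATCH SET {sets}"
    return query, params


query, params = build_merge_query(
    ["Technology"],
    {"name": "Python"},
    on_create={"description": "High-level language"},
)
```

A caller would pass the pair straight to `gf.execute(query, params)`; a label such as `"Person) DETACH DELETE (n"` fails validation before any query text is built.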
R-5: LangChain-compatible add_graph_documents() (Priority: Low–Medium / Effort: Medium), issue #472
Closes the ecosystem gap for users migrating from LangChain + Neo4j.
from graphforge.integrations.langchain import GraphDocument, Node, Relationship
def add_graph_documents(
    self,
    documents: list[GraphDocument],
    *,
    base_entity_label: bool = False,
    include_source: bool = False,
) -> None: ...
Implementation: Translate each Node/Relationship into MERGE calls internally,
reusing merge_node() from R-4 for safe label injection. Mirror LangChain's
Neo4jGraph.add_graph_documents() signature for drop-in compatibility.
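The translation step can be sketched as a pure function that turns plain-dict documents into `(query, params)` pairs, following the MATCH-then-MERGE pattern from the runnable pipeline. This is a sketch under assumptions rather than the shipped implementation: `documents_to_statements` is a hypothetical name, the relationship dict shape is assumed to mirror LangChain's `Relationship`, and the `base_entity_label` / `include_source` options are not handled.

```python
import re

_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def _ident(name):
    """Labels, relationship types and property keys must be plain identifiers."""
    if not isinstance(name, str) or _IDENTIFIER.fullmatch(name) is None:
        raise ValueError(f"Invalid identifier: {name!r}")
    return name


def documents_to_statements(documents):
    """Translate dict-shaped graph documents into a list of (query, params) pairs."""
    statements = []
    for doc in documents:
        # Nodes first, so that the relationship MATCH clauses can find them.
        for node in doc.get("nodes", []):
            query = f"MERGE (n:{_ident(node['type'])} {{id: $id}})"
            params = {"id": node["id"]}
            props = node.get("properties") or {}
            if props:
                sets = ", ".join(f"n.{_ident(k)} = $prop_{k}" for k in props)
                query += f" SET {sets}"
                params.update({f"prop_{k}": v for k, v in props.items()})
            statements.append((query, params))
        for rel in doc.get("relationships", []):
            src, dst = rel["source"], rel["target"]
            query = (
                f"MATCH (a:{_ident(src['type'])} {{id: $src}}) "
                f"MATCH (b:{_ident(dst['type'])} {{id: $dst}}) "
                f"MERGE (a)-[r:{_ident(rel['type'])}]->(b)"
            )
            statements.append((query, {"src": src["id"], "dst": dst["id"]}))
    return statements


# Hypothetical document mirroring the LangChain example above.
doc = {
    "nodes": [
        {"id": "python", "type": "Technology", "properties": {}},
        {"id": "pytorch", "type": "Technology", "properties": {"description": "ML framework"}},
    ],
    "relationships": [
        {
            "source": {"id": "pytorch", "type": "Technology"},
            "target": {"id": "python", "type": "Technology"},
            "type": "DEPENDS_ON",
        },
    ],
}
statements = documents_to_statements([doc])
```

Because the whole list is built before anything executes, an invalid label in any document raises before the first write, and the caller can run the statements inside a single `gf.begin()` / `gf.commit()` pair.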
Dedup Pattern Clarification (for doc update)
The single-query dedup pattern with WITH ... WHERE r IS NOT NULL before DETACH DELETE
is a footgun: when the alias has no outgoing edges, the WHERE filters out all rows and
the delete never fires. The correct pattern uses two separate execute() calls inside a
transaction (as shown in the current doc), or uses a CALL {} subquery to scope the edge
migration independently:
# ✅ Correct: two-query transaction (current doc pattern)
gf.begin()
gf.execute("""
    MATCH (alias:Technology {name: 'OldName'})
    MATCH (canon:Technology {name: 'NewName'})
    WITH alias, canon
    MATCH (alias)-[r]->(target)
    MERGE (canon)-[:IMPLEMENTED_IN {confidence: r.confidence}]->(target)
""")
gf.execute("MATCH (alias:Technology {name: 'OldName'}) DETACH DELETE alias")
gf.commit()

# ✅ Also correct: CALL subquery scopes edge migration independently
gf.execute("""
    MATCH (alias:Technology {name: 'OldName'}), (canon:Technology {name: 'NewName'})
    OPTIONAL MATCH (alias)-[r]->(target)
    WITH alias, canon, r, target
    CALL {
        WITH alias, canon, r, target
        WITH alias, canon, r, target WHERE r IS NOT NULL
        MERGE (canon)-[:IMPLEMENTED_IN]->(target)
    }
    DETACH DELETE alias
""")

# ❌ Incorrect: the second MATCH below is not OPTIONAL, so it yields zero rows when
# the alias has no edges. Every later clause then runs zero times and DETACH DELETE
# is silently skipped. A WITH ... WHERE r IS NOT NULL filter has the same effect.
gf.execute("""
    MATCH (alias:Technology {name: 'OldName'})
    MATCH (canon:Technology {name: 'NewName'})
    WITH alias, canon
    MATCH (alias)-[r]->(target)
    MERGE (canon)-[:IMPLEMENTED_IN]->(target)
    WITH alias
    DETACH DELETE alias
""")
The doc should add a note pointing users to the two-query transaction pattern and
warning that folding delete into an edge-migration query requires OPTIONAL MATCH.
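The row semantics behind this footgun can be modelled in a few lines of plain Python. The sketch below is a toy model of how clauses pass rows along, not how GraphForge executes queries; `match_edges` and `detach_delete` are hypothetical names.

```python
def match_edges(rows, edges, optional=False):
    """Expand each row with the alias's outgoing edges, like MATCH / OPTIONAL MATCH."""
    out = []
    for row in rows:
        targets = [t for (s, t) in edges if s == row["alias"]]
        if targets:
            out.extend({**row, "target": t} for t in targets)
        elif optional:
            # OPTIONAL MATCH keeps the row and binds the missing variable to null.
            out.append({**row, "target": None})
        # A plain MATCH with no hit drops the row entirely.
    return out


def detach_delete(rows):
    """A write clause runs once per incoming row; zero rows means zero deletes."""
    return {row["alias"] for row in rows}


rows = [{"alias": "OldName", "canon": "NewName"}]
edges = []  # the alias has no outgoing edges

deleted_plain = detach_delete(match_edges(rows, edges))
deleted_optional = detach_delete(match_edges(rows, edges, optional=True))
```

With a plain MATCH the delete set is empty, while the OPTIONAL MATCH variant still carries one row through to the delete, which is why the single-query form needs OPTIONAL MATCH.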
Ontology-Constrained KG Construction
Question: Can GraphForge be used in a workflow where an ontology constrains what entities and relationships can be created — ensuring the resulting KG conforms to a schema? The intent is not to add this as a first-party GF feature but to validate that GF works well as the store within this pattern.
Short answer: Yes, and all three viable approaches work correctly today. The constraint logic lives entirely outside GF in each case; GF is the store. Pydantic is the right tool for fail-fast validation at the Python layer.
Approach A: Pydantic validates before ingest
Pattern: Define Pydantic models that mirror the ontology. All LLM extraction output is
validated against those models before any gf.execute() call runs. If validation fails,
the whole batch is rejected — nothing reaches the graph.
from pydantic import BaseModel, Field, model_validator
from typing import Optional
from graphforge import GraphForge

# Ontology defined as Python dicts (single source of truth)
CLASSES = {
    "Person": {"required": ["name"], "optional": ["description", "url", "confidence"]},
    "Technology": {"required": ["name", "description"], "optional": ["version", "confidence"]},
    "Concept": {"required": ["name"], "optional": ["description", "confidence"]},
}
RELS = {
    "DEPENDS_ON": {"domain": ["Technology"], "range": ["Technology"]},
    "RESEARCHES": {"domain": ["Person"], "range": ["Concept"]},
    "CREATED": {"domain": ["Person"], "range": ["Technology"]},
}

class NodeRecord(BaseModel):
    type: str
    name: str
    description: Optional[str] = None
    confidence: float = Field(default=1.0, ge=0.0, le=1.0)

    @model_validator(mode="after")
    def check_class_and_required(self) -> "NodeRecord":
        # The class must exist in the ontology and carry all required properties
        if self.type not in CLASSES:
            raise ValueError(f"Unknown class '{self.type}'. Allowed: {sorted(CLASSES)}")
        for req in CLASSES[self.type]["required"]:
            if getattr(self, req, None) is None:
                raise ValueError(f"Class '{self.type}' requires property '{req}'")
        return self

class RelRecord(BaseModel):
    from_name: str
    from_type: str
    to_name: str
    to_type: str
    rel_type: str

    @model_validator(mode="after")
    def check_rel(self) -> "RelRecord":
        # The rel type must exist and both endpoints must satisfy domain and range
        c = RELS.get(self.rel_type)
        if c is None:
            raise ValueError(f"Unknown rel type '{self.rel_type}'")
        if self.from_type not in c["domain"]:
            raise ValueError(f"'{self.rel_type}' domain must be {c['domain']}, got '{self.from_type}'")
        if self.to_type not in c["range"]:
            raise ValueError(f"'{self.rel_type}' range must be {c['range']}, got '{self.to_type}'")
        return self

class ExtractionBatch(BaseModel):
    nodes: list[NodeRecord]
    relationships: list[RelRecord]
    source: str

# Example batch: the nodes are valid, but the relationship violates the ontology,
# so construction raises ValidationError and nothing reaches the graph
batch = ExtractionBatch(
    source="doc:example",
    nodes=[
        NodeRecord(type="Technology", name="PyTorch", description="ML framework"),
        NodeRecord(type="Person", name="Yann LeCun"),
        NodeRecord(type="Concept", name="Deep Learning"),
    ],
    relationships=[
        RelRecord(from_name="PyTorch", from_type="Technology",
                  to_name="Deep Learning", to_type="Concept", rel_type="DEPENDS_ON"),
        # ↑ FAIL: DEPENDS_ON range is Technology, not Concept; Pydantic catches this
    ],
)

# Other invalid records caught before ingest:
# NodeRecord(type="Animal", name="Cat") → "Unknown class 'Animal'. Allowed: [...]"
# NodeRecord(type="Technology", name="X") → "Class 'Technology' requires property 'description'"
# RelRecord(..., rel_type="HATES") → "Unknown rel type 'HATES'"
# RelRecord(..., from_type="Concept", rel_type="DEPENDS_ON") → "'DEPENDS_ON' domain must be ['Technology'], got 'Concept'"
Characteristics:
- Validation is instantaneous — no graph I/O before the check
- All-or-nothing: one bad record in a batch of 100 rejects the whole batch (Pydantic's default)
- The CLASSES/RELS dicts are the single source of truth for both Pydantic and ingest logic
- Works naturally with LangChain structured output, instructor, and pydantic-ai — LLMs can be prompted to produce ExtractionBatch-shaped JSON directly
- No ontology persistence — schema lives only in Python
Limitation: The ontology is not queryable at runtime. You cannot ask the graph "what classes are defined?" or "what relationships are valid for Technology?" — that information exists only in the Python module.
Approach B: Ontology-as-graph in GF, conformance via Cypher
Pattern: Load the ontology itself into GF as nodes and edges using reserved labels
(:OntClass, :OntProp, :OntRel). Instance data lives in the same graph with normal
labels. Run Cypher conformance queries after ingest to detect violations.
# Load ontology schema into GF
def load_ontology(gf: GraphForge, classes: dict, rels: dict) -> None:
    gf.begin()
    for cls_name, cls_def in classes.items():
        gf.execute("MERGE (:OntClass {name: $name})", {"name": cls_name})
        for prop in cls_def["required"]:
            gf.execute("""
                MERGE (p:OntProp {name: $prop, required: true, appliesTo: $cls})
                WITH p MATCH (c:OntClass {name: $cls}) MERGE (c)-[:REQUIRES]->(p)
            """, {"prop": prop, "cls": cls_name})
    for rel_name, rel_def in rels.items():
        gf.execute("MERGE (:OntRel {name: $name})", {"name": rel_name})
        for d in rel_def["domain"]:
            gf.execute(
                "MATCH (rt:OntRel {name:$rt}),(c:OntClass {name:$cls}) MERGE (rt)-[:DOMAIN]->(c)",
                {"rt": rel_name, "cls": d},
            )
        for r in rel_def["range"]:
            gf.execute(
                "MATCH (rt:OntRel {name:$rt}),(c:OntClass {name:$cls}) MERGE (rt)-[:RANGE]->(c)",
                {"rt": rel_name, "cls": r},
            )
    gf.commit()

# Query the ontology at runtime
valid_from_tech = gf.execute("""
    MATCH (c:OntClass {name: 'Technology'})<-[:DOMAIN]-(rt:OntRel)
    RETURN rt.name AS rel_type ORDER BY rel_type
""")
# → ['DEPENDS_ON', 'EMBODIES', 'IMPLEMENTS']

# Post-ingest conformance check
def find_domain_violations(gf: GraphForge, rt: str, allowed_domains: list[str]) -> list[dict]:
    # rt is interpolated into the query text, so it must come from the trusted ontology
    rows = gf.execute(f"""
        MATCH (a)-[r:{rt}]->(b)
        WHERE NOT a:OntClass AND NOT a:OntProp AND NOT a:OntRel
        RETURN a.name AS from_node, labels(a)[0] AS from_label
    """)
    return [
        {"rel": rt, "from": r["from_node"].value, "label": r["from_label"].value}
        for r in rows if r["from_label"].value not in allowed_domains
    ]
Characteristics:
- The ontology is a first-class citizen of the graph — queryable via Cypher like any other data
- Enables dynamic workflows: "fetch all valid rel types for this class before building a prompt"
- Catches violations in data loaded by any path (raw Cypher, bulk_ingest, scripts)
- GF limitation: EXISTS {} subqueries are not supported (planner validation error) — conformance
queries must use explicit per-class/per-rel-type loops in Python rather than single declarative Cypher
- Schema and instance data share the same graph — requires label discipline (:OntClass etc.)
to avoid mixing ontology nodes with instance nodes in queries
Key limitation found: The most natural Cypher conformance pattern uses EXISTS {} subqueries:
WHERE NOT EXISTS { MATCH (rt)-[:DOMAIN]->(d:OntClass) WHERE d.name IN labels(a) }
This raises a ValidationError in the GraphForge planner (PropertyAccess requires either
'variable' or 'base'). Workaround: run conformance checks as Python loops over per-class
or per-rel-type queries rather than a single declarative Cypher statement.
Approach C: Hybrid of Pydantic + GF ontology graph (recommended)
Pattern: CLASSES/RELS are defined once as Python dicts. Pydantic models are generated from them for fail-fast validation. The same dicts also populate a GF ontology graph for runtime introspection and post-ingest auditing. Both layers reference the same source of truth so they cannot diverge.
# ingest_batch, get_schema_for_class and post_ingest_audit are helper functions
# that are not shown in this document; ExtractionBatch is the model from Approach A.

# ── 1. Define ontology once ────────────────────────────────────────────────────
CLASSES = {
    "Technology": {"required": ["name", "description"], "optional": ["version", "confidence"]},
    "Person": {"required": ["name"], "optional": ["description", "confidence"]},
    # ...
}
RELS = {
    "DEPENDS_ON": {"domain": ["Technology"], "range": ["Technology"]},
    "RESEARCHES": {"domain": ["Person"], "range": ["Concept"]},
    # ...
}

# ── 2. Pydantic layer: fail-fast before ingest ─────────────────────────────────
batch = ExtractionBatch(nodes=[...], relationships=[...], source="doc:run-1")
# ValidationError raised here if anything violates the ontology; nothing reaches GF

# ── 3. Ingest validated data ──────────────────────────────────────────────────
ingest_batch(gf, batch)

# ── 4. GF ontology: runtime introspection ─────────────────────────────────────
schema = get_schema_for_class(gf, "Technology")
# {"required": ["description","name"], "optional": [...],
#  "outgoing_rels": ["DEPENDS_ON","EMBODIES","IMPLEMENTS"],
#  "incoming_rels": ["CREATED","DEPENDS_ON","MAINTAINS"]}

# ── 5. GF ontology: post-ingest audit (catches raw-Cypher bypasses) ────────────
violations = post_ingest_audit(gf)
# {"missing_required_properties": [{"class":"Technology","node":"SketchyLib","property":"description"}],
#  "domain_violations": [...], "unknown_rel_types": [...]}

# ── 6. GF ontology: generate LLM system prompt from live schema ────────────────
for row in gf.execute("MATCH (c:OntClass) RETURN c.name AS name ORDER BY name"):
    cls = row["name"].value
    s = get_schema_for_class(gf, cls)
    print(f" - {cls} (required: {', '.join(s['required'])})")
# - Concept (required: name)
# - Technology (required: description, name)
# ...
Characteristics:
- Single source of truth: add a class or rel type in one dict, both layers update automatically
- Pydantic catches violations in all data that passes through the validated ingest path, before any graph I/O
- GF ontology graph covers the rest: data written by scripts, imports, or other tools
that bypassed Pydantic validation
- The ontology graph also enables a powerful bonus: generating LLM prompts dynamically from
the live schema — the graph drives what the LLM is asked to extract, so adding a new class
automatically updates all downstream prompts
- Same EXISTS {} limitation as Approach B applies; conformance audit uses Python loops
Comparison
| Dimension | A: Pydantic only | B: GF ontology only | C: Hybrid (recommended) |
|---|---|---|---|
| Fail-fast validation | Yes — before any I/O | No — post-ingest only | Yes |
| Catches non-Pydantic writes | No | Yes | Yes |
| Ontology queryable at runtime | No | Yes | Yes |
| LLM prompt generation from schema | No | Yes | Yes |
| Single source of truth | Yes (Python dicts) | No (must keep GF and code in sync) | Yes (Python dicts → both layers) |
| EXISTS {} subquery support needed | No | Yes (workaround needed) | Yes (workaround needed) |
| GF version required | Current | Current | Current |
| Complexity | Low | Medium | Medium |
Recommendation: Approach C. The extra cost over A is loading the ontology into GF once at startup (~50ms for a typical ontology). The payoff is: runtime schema introspection, post-ingest audit for any bypass path, and dynamic LLM prompt generation — all via the same Cypher interface as the instance data.
GF limitation surfaced: EXISTS {} subquery in planner
The most natural Cypher for conformance checking (single declarative query finding all
violations) requires EXISTS {} subqueries with outer variable binding:
MATCH (a)-[r]->(b)
WHERE NOT a:OntClass
AND NOT EXISTS { MATCH (rt:OntRel {name: type(r)})-[:DOMAIN]->(d:OntClass) WHERE d.name IN labels(a) }
RETURN type(r), labels(a), a.name
This raises ValidationError: PropertyAccess requires either 'variable' or 'base' in the
planner. The workaround — Python loops calling per-class/per-rel-type Cypher — is functional
but verbose. A fully declarative conformance query would be cleaner. Tracked in issue #474.
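The Python-loop workaround can be sketched as a driver that iterates over the ontology's relationship types and checks each edge's start label against the allowed domain. To keep the sketch independent of GraphForge, the query is injected as a callable: `audit_domains` and `fetch_sources` are hypothetical names, and in practice `fetch_sources` would wrap a per-rel-type Cypher query like the one in `find_domain_violations` from Approach B.

```python
def audit_domains(fetch_sources, rels):
    """Collect domain violations with one lookup per relationship type.

    fetch_sources(rel_type) must return an iterable of (node_name, node_label)
    pairs, one per edge of that type, describing the edge's start node.
    """
    violations = []
    for rel_type, constraint in sorted(rels.items()):
        allowed = set(constraint["domain"])
        for name, label in fetch_sources(rel_type):
            if label not in allowed:
                violations.append({"rel": rel_type, "from": name, "label": label})
    return violations


# Ontology excerpt in the same shape as the RELS dict used above.
RELS = {
    "DEPENDS_ON": {"domain": ["Technology"], "range": ["Technology"]},
    "RESEARCHES": {"domain": ["Person"], "range": ["Concept"]},
}

# Fake query results standing in for the graph; the data is illustrative only.
FAKE_EDGES = {
    "DEPENDS_ON": [("PyTorch", "Technology"), ("Deep Learning", "Concept")],
    "RESEARCHES": [("Yann LeCun", "Person")],
}

violations = audit_domains(lambda rel_type: FAKE_EDGES.get(rel_type, []), RELS)
```

The cost is one query per relationship type instead of one query overall, which stays small as long as the ontology has few relationship types.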