Knowledge Graph Construction
Overview
Knowledge graphs give structure to information that starts life as unstructured text: research papers, support tickets, product catalogues, codebases, or any corpus where entities and their relationships matter. GraphForge is well suited to this pattern because it is embedded (no server, no network hop), openCypher-compatible (with gaps such as shortestPath(), covered under Querying Patterns), and provides MERGE semantics that make incremental, idempotent graph construction natural.
When to use this pattern
- LLM extraction pipelines — run a language model over documents to pull out entities and relationships, then load each batch into the graph.
- Document analysis — index papers, reports, or web pages; connect concepts across documents; answer "what is linked to what?" queries.
- Ontology building — define a class hierarchy, populate instances from data, iterate as understanding of the domain evolves.
- Multi-source fusion — merge extractions from several models or tools into a single coherent graph, tracking provenance per node and edge.
Schema Design
Consistent naming makes Cypher queries readable and avoids accidental duplication.
Label conventions
Use PascalCase for labels. One label per logical entity type is a good default; add secondary labels when you need to tag provenance or status.
Person Technology Document Concept Organisation
Property keys
Use camelCase for property keys. Reserve name as the primary human-readable
identifier on every node. Add id when you need a stable external key (e.g. a
Wikidata QID or DOI).
name description id url createdAt confidence source
Relationship naming
Use SCREAMING_SNAKE_CASE for relationship types. Prefer verb phrases that read naturally from left to right.
MENTIONS RELATES_TO DEPENDS_ON PART_OF WROTE CITES IMPLEMENTS
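Extraction models rarely emit names in these conventions, so a normalisation step before loading can help. The following is a minimal sketch; the helper names (to_label, to_property_key, to_relationship_type) are hypothetical and not part of GraphForge, and the word-splitting rule (runs of ASCII letters and digits) is an assumption that will not suit every corpus.

```python
import re

_WORD = re.compile(r"[A-Za-z0-9]+")


def to_label(raw: str) -> str:
    """'programming language' -> 'ProgrammingLanguage' (PascalCase)."""
    # Capitalise only the first character so 'GraphForge' keeps its inner capital.
    return "".join(w[:1].upper() + w[1:] for w in _WORD.findall(raw))


def to_property_key(raw: str) -> str:
    """'created at' -> 'createdAt' (camelCase)."""
    pascal = to_label(raw)
    return pascal[:1].lower() + pascal[1:]


def to_relationship_type(raw: str) -> str:
    """'depends on' -> 'DEPENDS_ON' (SCREAMING_SNAKE_CASE)."""
    return "_".join(w.upper() for w in _WORD.findall(raw))


if __name__ == "__main__":
    print(to_label("programming language"))
    print(to_property_key("extracted_by"))
    print(to_relationship_type("part-of"))
```

Input that is already camelCase (for example "createdAt") is treated as a single word by to_relationship_type, and input with no letters or digits yields an empty string, so callers should reject empty results before building a query.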
LangChain Integration — add_graph_documents()
If you are using LangChain's extraction pipeline, GraphForge accepts
GraphDocument objects directly. No manual translation required:
from graphforge import GraphForge
# from langchain_community.graphs.graph_document import GraphDocument, Node, Relationship
gf = GraphForge("knowledge_graph.db")
# LangChain extraction output (or duck-typed equivalents)
# doc = GraphDocument(nodes=[...], relationships=[...], source=document)
# gf.add_graph_documents([doc], include_source=True)
add_graph_documents() is idempotent: calling it twice with the same data
produces no duplicate nodes or edges. Nodes are merged on id + label;
relationships are deduplicated on (source, target, type, properties).
Label safety: all node labels and relationship types are validated before any data is written. A single invalid identifier aborts the entire call with ValueError and leaves the graph unchanged.
Plain dicts are also accepted:
gf.add_graph_documents([
{
"nodes": [
{"id": "python", "type": "Language", "properties": {"description": "Dynamic language"}},
{"id": "graphforge", "type": "Library", "properties": {"description": "Embedded graph DB"}},
],
"relationships": [
{"source_id": "graphforge", "target_id": "python", "type": "RUNS_ON", "properties": {}},
],
}
])
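To preview what a batch will look like after merging, the deduplication keys described above can be applied to a plain-dict document before it is loaded. This is a sketch of those keys only (id + type for nodes; source, target, type, and properties for relationships), not GraphForge's actual implementation; dedupe_graph_document is a hypothetical helper.

```python
import json


def dedupe_graph_document(doc: dict) -> dict:
    """Drop repeated nodes and relationships from a plain-dict graph document."""
    nodes = {}
    for node in doc.get("nodes", []):
        # Nodes are keyed on id + label ("type" in the dict format).
        nodes.setdefault((node["id"], node["type"]), node)

    rels = {}
    for rel in doc.get("relationships", []):
        # Serialise properties with sorted keys so dict ordering does not matter.
        props = json.dumps(rel.get("properties", {}), sort_keys=True)
        key = (rel["source_id"], rel["target_id"], rel["type"], props)
        rels.setdefault(key, rel)

    return {"nodes": list(nodes.values()), "relationships": list(rels.values())}


if __name__ == "__main__":
    node = {"id": "python", "type": "Language", "properties": {}}
    edge = {"source_id": "graphforge", "target_id": "python",
            "type": "RUNS_ON", "properties": {}}
    batch = {"nodes": [node, dict(node)], "relationships": [edge, dict(edge)]}
    cleaned = dedupe_graph_document(batch)
    print(len(cleaned["nodes"]), len(cleaned["relationships"]))
```

The first occurrence of each key wins, so later duplicates with different node properties are discarded rather than merged.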
Entity Extraction Pattern
The canonical Cypher MERGE pattern for loading extracted entities — use this when
you need fine-grained control over ON CREATE SET / ON MATCH SET semantics or
when you are not using LangChain.
CREATE vs MERGE:
CREATE always inserts a new node, even if an identical one already exists. Always use MERGE (or add_graph_documents()) for extraction pipelines; running the same extraction twice with CREATE will double your graph.
from graphforge import GraphForge
db = GraphForge("knowledge_graph.db") # persistent SQLite
# --- raw extraction output (e.g. from an LLM) ---
entities = [
{"type": "Technology", "name": "Python", "description": "General-purpose language"},
{"type": "Technology", "name": "GraphForge","description": "Embedded graph database"},
{"type": "Technology", "name": "Python", "description": "General-purpose language"}, # duplicate
]
# Load all entities idempotently
for entity in entities:
    db.execute(
        """
        MERGE (e:{label} {{name: $name}})
        ON CREATE SET e.description = $description,
                      e.createdAt = datetime()
        """.format(label=entity["type"]),
        {"name": entity["name"], "description": entity["description"]},
    )
# Confirm: only two Technology nodes despite three inputs
results = db.execute("MATCH (t:Technology) RETURN t.name AS name ORDER BY name")
print([row["name"].value for row in results])
# ['GraphForge', 'Python']
MERGE on name ensures the second Python entry is silently skipped rather than
inserted again. ON CREATE SET populates properties only when the node is new, so
a second pass does not overwrite values that may have been updated manually.
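Cypher parameters can bind values but not labels, which is why the label is formatted into the query text. When that label comes from model output, it is worth checking it before interpolation. The sketch below assumes a conservative identifier rule (ASCII letters, digits, and underscores, not starting with a digit); merge_entity_query is a hypothetical helper, not a GraphForge function.

```python
import re

# Assumed rule: a safe label looks like a plain identifier.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def merge_entity_query(label: str) -> str:
    """Return the MERGE query for `label`, refusing anything that is not an identifier."""
    if not _IDENTIFIER.fullmatch(label):
        raise ValueError(f"unsafe label: {label!r}")
    # Only the label is interpolated; name and description stay as parameters.
    return (
        f"MERGE (e:{label} {{name: $name}})\n"
        "ON CREATE SET e.description = $description,\n"
        "              e.createdAt = datetime()"
    )


if __name__ == "__main__":
    print(merge_entity_query("Technology"))
    try:
        merge_entity_query("Technology) DETACH DELETE (n")
    except ValueError as exc:
        print("rejected:", exc)
```

The returned string can be passed to db.execute() with the same parameter dict as in the loop above.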
Relationship Extraction Pattern
Once entities are in the graph, link them with MERGE to avoid duplicate edges. Always MATCH both endpoints first — if either is missing the relationship is skipped rather than creating a dangling reference.
relationships = [
{"from": "GraphForge", "to": "Python", "type": "IMPLEMENTED_IN", "confidence": 0.98},
{"from": "GraphForge", "to": "Python", "type": "IMPLEMENTED_IN", "confidence": 0.98}, # dup
]
for rel in relationships:
    db.execute(
        """
        MATCH (a {{name: $from_name}})
        MATCH (b {{name: $to_name}})
        MERGE (a)-[r:{rel_type}]->(b)
        ON CREATE SET r.confidence = $confidence,
                      r.createdAt = datetime()
        """.format(rel_type=rel["type"]),
        {
            "from_name": rel["from"],
            "to_name": rel["to"],
            "confidence": rel["confidence"],
        },
    )
results = db.execute(
"MATCH (:Technology)-[r:IMPLEMENTED_IN]->(:Technology) RETURN count(r) AS total"
)
print(results[0]["total"].value) # 1 — not 2
Iterative Refinement
Knowledge graphs are rarely built in a single pass. A typical pipeline runs multiple extraction rounds, each adding detail or correcting earlier mistakes.
Add provenance and source tracking
# Tag each node with the document it came from
db.execute(
"""
MATCH (e:Technology {name: $name})
SET e.source = $source,
e.extractedAt = datetime()
""",
{"name": "Python", "source": "doc:arxiv:2301.00001"},
)
Update confidence scores
When a second extraction confirms an existing relationship, raise its confidence:
db.execute(
"""
MATCH (a {name: $from_name})-[r:IMPLEMENTED_IN]->(b {name: $to_name})
SET r.confidence = CASE WHEN r.confidence < $new_conf
THEN $new_conf
ELSE r.confidence
END,
r.confirmations = coalesce(r.confirmations, 1) + 1
""",
{"from_name": "GraphForge", "to_name": "Python", "new_conf": 0.99},
)
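The update rule in that query (keep the higher confidence, count one more confirmation) can be checked outside the database. The function below is a plain-Python restatement under the assumption that the edge already has a confidence value; confirm is a hypothetical name used only for illustration.

```python
def confirm(edge: dict, new_conf: float) -> dict:
    """Apply the confirmation rule to a dict of edge properties."""
    updated = dict(edge)
    # CASE WHEN r.confidence < $new_conf THEN $new_conf ELSE r.confidence END
    updated["confidence"] = max(edge["confidence"], new_conf)
    # coalesce(r.confirmations, 1) + 1: a missing count is treated as one prior sighting.
    previous = edge.get("confirmations")
    updated["confirmations"] = (1 if previous is None else previous) + 1
    return updated


if __name__ == "__main__":
    edge = {"confidence": 0.98}
    edge = confirm(edge, 0.99)   # higher score replaces the stored one
    edge = confirm(edge, 0.50)   # lower score only bumps the counter
    print(edge)
```

A lower-scoring confirmation never reduces the stored confidence under this rule; pipelines that want disagreement to lower confidence would need a different formula.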
Merge duplicate entities
When two nodes represent the same real-world entity (detected by a fuzzy match or a human review step), redirect all edges to the canonical node and delete the alias.
# Suppose 'GraphForge DB' and 'GraphForge' are the same thing.
# The query is built with str.replace(), not str.format(), so Cypher braces stay single.
db.begin()
db.execute(
    """
    MATCH (alias:Technology {name: 'GraphForge DB'})
    MATCH (canon:Technology {name: 'GraphForge'})
    WITH alias, canon
    MATCH (alias)-[r:{rel_type}]->(target)
    MERGE (canon)-[:{rel_type} {confidence: r.confidence}]->(target)
    """.replace("{rel_type}", "IMPLEMENTED_IN")  # repeat per rel type, and again for incoming edges
)
# Only delete the alias once every edge type has been copied to the canonical node
db.execute("MATCH (alias:Technology {name: 'GraphForge DB'}) DETACH DELETE alias")
db.commit()
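The fuzzy-match step that finds alias candidates is not shown above. One possible approach, sketched here with the standard library's difflib, compares every pair of names and reports those above a similarity threshold. The 0.8 threshold and the alias_candidates helper are assumptions for illustration; the results are candidates for human review, not automatic merges.

```python
from difflib import SequenceMatcher
from itertools import combinations


def alias_candidates(names, threshold=0.8):
    """Return (name_a, name_b, score) for pairs whose similarity meets the threshold."""
    pairs = []
    for a, b in combinations(sorted(set(names)), 2):
        # Compare case-insensitively; ratio() is 1.0 for identical strings.
        score = SequenceMatcher(None, a.casefold(), b.casefold()).ratio()
        if score >= threshold:
            pairs.append((a, b, score))
    # Most similar pairs first, so reviewers see the likeliest duplicates at the top.
    return sorted(pairs, key=lambda pair: pair[2], reverse=True)


if __name__ == "__main__":
    names = ["GraphForge", "GraphForge DB", "Python", "SQLite"]
    for a, b, score in alias_candidates(names):
        print(f"{a!r} ~ {b!r} ({score:.2f})")
```

Pairwise comparison is quadratic in the number of names, so for large graphs it is usually restricted to names that share a label or a first token.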
Querying Patterns
Find all entities of a type
results = db.execute(
"""
MATCH (t:Technology)
RETURN t.name AS name, t.description AS description
ORDER BY t.name
"""
)
for row in results:
    print(f"{row['name'].value}: {row['description'].value}")
Traverse relationships from a starting node
results = db.execute(
"""
MATCH (start:Technology {name: $name})-[r]->(related)
RETURN type(r) AS rel, related.name AS target, r.confidence AS conf
ORDER BY conf DESC
""",
{"name": "GraphForge"},
)
for row in results:
    print(f"-[{row['rel'].value}]-> {row['target'].value} ({row['conf'].value:.2f})")
Find connections between two concepts (shortest path)
shortestPath() raises NotImplementedError — use a variable-length pattern with
ORDER BY and LIMIT 1 as a BFS equivalent:
results = db.execute(
"""
MATCH path = (a:Technology {name: $from_name})-[*1..6]-(b:Technology {name: $to_name})
RETURN length(path) AS hops
ORDER BY hops ASC LIMIT 1
""",
{"from_name": "Python", "to_name": "SQLite"},
)
if results:
    print(f"Connected by {results[0]['hops'].value} hops")
Note: This enumerates all paths up to max_hops (the upper bound of the variable-length pattern, 6 in the query above) before returning the shortest. Keep max_hops ≤ 3 for graphs above ~1 K nodes.
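When the variable-length pattern becomes too expensive, one alternative is to fetch the edges once and run a breadth-first search in Python, which stops at the first path found instead of enumerating all of them. This is a sketch under the assumption that the edges have already been read into (source, target) name pairs; shortest_hops is a hypothetical helper.

```python
from collections import deque


def shortest_hops(edges, start, goal, max_hops=6):
    """Return the fewest hops between start and goal, or None if none within max_hops."""
    # Treat edges as undirected, matching the -[*1..6]- pattern.
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    if start == goal:
        return 0  # unlike the Cypher pattern, which requires at least one hop

    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in adjacency.get(node, ()):
            if neighbour == goal:
                return hops + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, hops + 1))
    return None


if __name__ == "__main__":
    edges = [("Python", "GraphForge"), ("GraphForge", "SQLite")]
    print(shortest_hops(edges, "Python", "SQLite"))
```

Because each node is visited at most once, the cost grows with the number of edges rather than the number of paths.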
Aggregate by relationship type
results = db.execute(
"""
MATCH ()-[r]->()
RETURN type(r) AS rel_type, count(r) AS total
ORDER BY total DESC
"""
)
for row in results:
    print(f"{row['rel_type'].value}: {row['total'].value}")
Provenance and Confidence
Store extraction metadata directly on nodes and edges so that every fact in the graph can be traced back to its origin.
Recommended provenance properties
| Property | Type | Description |
|---|---|---|
| source | String | Document ID or URL |
| extractedAt | Datetime | When the fact was extracted |
| extractedBy | String | Model or tool name (e.g. "gpt-4o") |
| confidence | Float | Model confidence or human review score [0, 1] |
| confirmations | Integer | How many independent extractions agree |
db.execute(
"""
MERGE (a:Technology {name: $from_name})
ON CREATE SET a.source = $source, a.extractedBy = $model
MERGE (b:Technology {name: $to_name})
ON CREATE SET b.source = $source, b.extractedBy = $model
MERGE (a)-[r:DEPENDS_ON]->(b)
ON CREATE SET r.confidence = $confidence,
r.source = $source,
r.extractedAt = datetime(),
r.extractedBy = $model,
r.confirmations = 1
""",
{
"from_name": "graphforge",
"to_name": "lark",
"source": "doc:github:graphforge/pyproject.toml",
"model": "gpt-4o",
"confidence": 0.97,
},
)
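Provenance values are easy to get subtly wrong (a confidence of 97 instead of 0.97, an empty source), so it can help to validate them once before they reach the query. The sketch below builds the same parameter dict as the call above; the Provenance class and edge_params helper are hypothetical, not part of GraphForge.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Provenance:
    source: str        # document ID or URL
    model: str         # stored on the graph as extractedBy
    confidence: float  # expected to lie in [0, 1]

    def __post_init__(self):
        if not self.source:
            raise ValueError("source must not be empty")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError(f"confidence out of range: {self.confidence}")


def edge_params(from_name: str, to_name: str, prov: Provenance) -> dict:
    """Build the parameter dict expected by the provenance MERGE query."""
    return {"from_name": from_name, "to_name": to_name, **asdict(prov)}


if __name__ == "__main__":
    prov = Provenance("doc:github:graphforge/pyproject.toml", "gpt-4o", 0.97)
    print(edge_params("graphforge", "lark", prov))
```

Validation happens when the Provenance object is constructed, so a bad value fails before any query runs.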
Query to surface low-confidence facts for human review:
results = db.execute(
"""
MATCH (a)-[r]->(b)
WHERE r.confidence < 0.7
RETURN a.name AS from_node, type(r) AS rel, b.name AS to_node,
r.confidence AS conf, r.source AS src
ORDER BY conf ASC
LIMIT 20
"""
)
for row in results:
    print(
        f"{row['from_node'].value} -[{row['rel'].value}]-> {row['to_node'].value}"
        f" conf={row['conf'].value:.2f} src={row['src'].value}"
    )
Exporting a Focused Subgraph
When sharing results or feeding a downstream model, extract a coherent slice of the graph rather than dumping everything.
# Pull all nodes and edges within 2 hops of a seed concept
results = db.execute(
"""
MATCH path = (seed:Technology {name: $seed})-[*1..2]-(related)
WITH nodes(path) AS ns
UNWIND ns AS n
WITH DISTINCT n
RETURN n.name AS node, n.description AS desc,
labels(n)[0] AS label
ORDER BY label, node
""",
{"seed": "GraphForge"},
)
subgraph_nodes = [
{"name": row["node"].value, "label": row["label"].value, "desc": row["desc"].value}
for row in results
]
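For a downstream model, the node list usually needs to be serialised. A minimal sketch, assuming the list has the name/label/desc shape built above, groups nodes by label and emits JSON; subgraph_to_json is a hypothetical helper and the sample data is illustrative.

```python
import json
from collections import defaultdict


def subgraph_to_json(nodes) -> str:
    """Group node dicts by label and serialise them as indented JSON."""
    grouped = defaultdict(list)
    for node in nodes:
        grouped[node["label"]].append({"name": node["name"], "desc": node["desc"]})
    # sort_keys gives a stable layout; ensure_ascii=False keeps non-ASCII names readable.
    return json.dumps(grouped, indent=2, sort_keys=True, ensure_ascii=False)


if __name__ == "__main__":
    sample_nodes = [
        {"name": "GraphForge", "label": "Technology", "desc": "Embedded graph database"},
        {"name": "Python", "label": "Technology", "desc": "General-purpose language"},
    ]
    print(subgraph_to_json(sample_nodes))
```

Nodes with a missing description serialise as null, which most consumers handle without special casing.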
Complete Worked Example
Extract Python ecosystem entities from a mock LLM output, build a knowledge graph, then query it.
from graphforge import GraphForge
# -----------------------------------------------------------------
# 1. Mock LLM extraction output
# -----------------------------------------------------------------
llm_output = {
"entities": [
{"label": "Language", "name": "Python", "description": "Dynamic, interpreted language"},
{"label": "Framework", "name": "FastAPI", "description": "Async web framework"},
{"label": "Framework", "name": "Django", "description": "Batteries-included web framework"},
{"label": "Library", "name": "Pydantic", "description": "Data validation using type hints"},
{"label": "Library", "name": "GraphForge", "description": "Embedded graph database"},
{"label": "Library", "name": "SQLAlchemy", "description": "SQL toolkit and ORM"},
{"label": "Library", "name": "Lark", "description": "Parsing toolkit for Python"},
],
"relationships": [
{"from": "FastAPI", "to": "Python", "type": "RUNS_ON", "confidence": 1.0},
{"from": "Django", "to": "Python", "type": "RUNS_ON", "confidence": 1.0},
{"from": "GraphForge", "to": "Python", "type": "RUNS_ON", "confidence": 1.0},
{"from": "FastAPI", "to": "Pydantic", "type": "DEPENDS_ON", "confidence": 0.98},
{"from": "Django", "to": "SQLAlchemy", "type": "DEPENDS_ON", "confidence": 0.85},
{"from": "GraphForge", "to": "Lark", "type": "DEPENDS_ON", "confidence": 0.99},
{"from": "GraphForge", "to": "Pydantic", "type": "DEPENDS_ON", "confidence": 0.99},
],
}
# -----------------------------------------------------------------
# 2. Build the graph
# -----------------------------------------------------------------
db = GraphForge() # in-memory for this example; use GraphForge("kg.db") to persist
SOURCE = "doc:example:python-ecosystem"
MODEL = "gpt-4o-mock"
db.begin()
# Load entities
for ent in llm_output["entities"]:
    db.execute(
        """
        MERGE (e:{label} {{name: $name}})
        ON CREATE SET e.description = $description,
                      e.source = $source,
                      e.extractedBy = $model,
                      e.extractedAt = datetime()
        """.format(label=ent["label"]),
        {
            "name": ent["name"],
            "description": ent["description"],
            "source": SOURCE,
            "model": MODEL,
        },
    )
# Load relationships
for rel in llm_output["relationships"]:
    db.execute(
        """
        MATCH (a {{name: $from_name}})
        MATCH (b {{name: $to_name}})
        MERGE (a)-[r:{rel_type}]->(b)
        ON CREATE SET r.confidence = $confidence,
                      r.source = $source,
                      r.extractedAt = datetime(),
                      r.extractedBy = $model,
                      r.confirmations = 1
        """.format(rel_type=rel["type"]),
        {
            "from_name": rel["from"],
            "to_name": rel["to"],
            "confidence": rel["confidence"],
            "source": SOURCE,
            "model": MODEL,
        },
    )
db.commit()
# -----------------------------------------------------------------
# 3. Query the graph
# -----------------------------------------------------------------
# Which frameworks and libraries run on Python?
print("=== Runs on Python ===")
rows = db.execute(
"""
MATCH (thing)-[:RUNS_ON]->(lang:Language {name: 'Python'})
RETURN labels(thing)[0] AS kind, thing.name AS name
ORDER BY kind, name
"""
)
for row in rows:
    print(f" [{row['kind'].value}] {row['name'].value}")
# What does GraphForge depend on?
print("\n=== GraphForge dependencies ===")
rows = db.execute(
"""
MATCH (gf:Library {name: 'GraphForge'})-[r:DEPENDS_ON]->(dep)
RETURN dep.name AS dependency, r.confidence AS conf
ORDER BY conf DESC, dependency
"""
)
for row in rows:
    print(f" {row['dependency'].value} (conf={row['conf'].value:.2f})")
# Which libraries share a common dependency?
print("\n=== Libraries sharing a dependency ===")
rows = db.execute(
"""
MATCH (a)-[:DEPENDS_ON]->(shared)<-[:DEPENDS_ON]-(b)
WHERE a.name < b.name
RETURN a.name AS lib_a, b.name AS lib_b, shared.name AS via
ORDER BY lib_a, lib_b
"""
)
for row in rows:
    print(f" {row['lib_a'].value} and {row['lib_b'].value} both depend on {row['via'].value}")
# Summarise by label
print("\n=== Node counts by label ===")
rows = db.execute(
"""
MATCH (n)
RETURN labels(n)[0] AS label, count(n) AS total
ORDER BY total DESC
"""
)
for row in rows:
    print(f" {row['label'].value}: {row['total'].value}")
Expected output:
=== Runs on Python ===
[Framework] Django
[Framework] FastAPI
[Library] GraphForge
=== GraphForge dependencies ===
Lark (conf=0.99)
Pydantic (conf=0.99)
=== Libraries sharing a dependency ===
FastAPI and GraphForge both depend on Pydantic
=== Node counts by label ===
Library: 4
Framework: 2
Language: 1
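The same mock output could also be loaded through add_graph_documents() by first reshaping it into the plain-dict format shown in the LangChain section. The converter below is a sketch: to_graph_document is a hypothetical helper, and using the entity name as the node id is an assumption (the source does not say whether add_graph_documents() also stores a name property).

```python
def to_graph_document(llm_output: dict) -> dict:
    """Reshape {'entities': [...], 'relationships': [...]} into a plain-dict graph document."""
    nodes = [
        {
            "id": ent["name"],  # assumption: the name doubles as the stable id
            "type": ent["label"],
            "properties": {"description": ent["description"]},
        }
        for ent in llm_output["entities"]
    ]
    relationships = [
        {
            "source_id": rel["from"],
            "target_id": rel["to"],
            "type": rel["type"],
            "properties": {"confidence": rel["confidence"]},
        }
        for rel in llm_output["relationships"]
    ]
    return {"nodes": nodes, "relationships": relationships}


if __name__ == "__main__":
    sample = {
        "entities": [
            {"label": "Framework", "name": "FastAPI", "description": "Async web framework"},
            {"label": "Language", "name": "Python", "description": "Dynamic, interpreted language"},
        ],
        "relationships": [
            {"from": "FastAPI", "to": "Python", "type": "RUNS_ON", "confidence": 1.0},
        ],
    }
    print(to_graph_document(sample))
```

The resulting dict could then be passed as a one-element list to add_graph_documents(), subject to the id assumption above.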
Deduplication with db.search.text()
MERGE handles exact-name duplicates, but extracted entities often have surface-form variation:
"United States", "US", "United States of America". db.search.text() provides fuzzy deduplication
before insertion.
Check-Before-Create Pattern
Index each entity as it is created, then check for close matches before creating a new one:
def find_or_create_entity(db, name: str, entity_type: str) -> int:
    """Return the node ID for `name`, merging with a close existing entity if found."""
    # Search for existing entities with similar names
    hits = db.search.text(name, top_k=3)
    for hit in hits:
        existing_name = hit.ref.properties.get("name")
        if existing_name and hit.score > 0.6:
            # Close enough: treat as the same entity
            return hit.ref.id
    # No close match: create a new entity
    db.execute(
        f"CREATE (e:{entity_type} {{name: $name}})",
        {"name": name},
    )
    # Look the node up by label as well as name, so a same-named node of another type is not returned
    rows = db.execute(
        f"MATCH (e:{entity_type}) WHERE e.name = $name RETURN id(e) AS nid",
        {"name": name},
    )
    nid = rows[0]["nid"].value
    # Index the new entity for future deduplication checks
    db.search.index_node(nid, name)
    return nid
Applying the Pattern
# Instead of bare MERGE, route through find_or_create_entity
entities = [
("United States", "Country"),
("US", "Country"), # will match "United States" if score > 0.6
("France", "Country"),
("OpenAI", "Organisation"),
]
for name, etype in entities:
    nid = find_or_create_entity(db, name, etype)
    print(f"{name} → node {nid}")
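Text similarity alone may not connect an abbreviation such as "US" to its long form, since the two share few characters. One complementary step, sketched here, is to resolve known aliases to a canonical name before calling find_or_create_entity. The ALIASES table and canonical_name helper are hypothetical and would need to be curated for the domain.

```python
import re

# Hypothetical alias table: normalised surface form -> canonical name.
ALIASES = {
    "us": "United States",
    "usa": "United States",
    "united states of america": "United States",
}


def canonical_name(name: str, aliases=ALIASES) -> str:
    """Map a surface form to its canonical name, or return it trimmed if unknown."""
    # Normalise: drop punctuation, fold case, collapse whitespace.
    key = re.sub(r"[^\w\s]", "", name).casefold()
    key = " ".join(key.split())
    return aliases.get(key, name.strip())


if __name__ == "__main__":
    for raw in ["US", "U.S.", "United States of America", " France "]:
        print(repr(raw), "->", repr(canonical_name(raw)))
```

Names that are not in the table pass through unchanged, so the fuzzy search still handles everything the alias list does not cover.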
Bulk Indexing
For large graphs, index_all() rebuilds the FTS index in one pass:
# Index all Entity nodes on name + description properties
db.search.index_all(node_label="Entity", properties=["name", "description"])
Next Steps
- Add a second extraction pass over a different document and observe MERGE preventing duplicates while confirmations counts accumulate.
- Introduce a Document node and connect every extracted entity to it with a MENTIONED_IN relationship for full document-level provenance.
- Persist the graph to disk (GraphForge("kg.db")) and reload it across sessions without re-running extraction.
- Combine this pattern with the AI Agent Grounding guide to build an agent that reasons over a dynamically constructed knowledge graph.