Knowledge Graph Construction¶

Overview¶

Knowledge graphs give structure to information that starts life as unstructured text — research papers, support tickets, product catalogues, codebases, or any corpus where entities and their relationships matter. GraphForge is well suited to this pattern because it is embedded (no server, no network hop), fully openCypher-compatible, and provides MERGE semantics that make incremental, idempotent graph construction natural.

When to use this pattern¶

LLM extraction pipelines — run a language model over documents to pull out entities and relationships, then load each batch into the graph.
Document analysis — index papers, reports, or web pages; connect concepts across documents; answer "what is linked to what?" queries.
Ontology building — define a class hierarchy, populate instances from data, iterate as understanding of the domain evolves.
Multi-source fusion — merge extractions from several models or tools into a single coherent graph, tracking provenance per node and edge.

Schema Design¶

Consistent naming makes Cypher queries readable and avoids accidental duplication.

Label conventions¶

Use PascalCase for labels. One label per logical entity type is a good default; add secondary labels when you need to tag provenance or status.

Person  Technology  Document  Concept  Organisation

Property keys¶

Use camelCase for property keys. Reserve name as the primary human-readable identifier on every node. Add id when you need a stable external key (e.g. a Wikidata QID or DOI).

name  description  id  url  createdAt  confidence  source

Relationship naming¶

Use SCREAMING_SNAKE_CASE for relationship types. Prefer verb phrases that read naturally from left to right.

MENTIONS  RELATES_TO  DEPENDS_ON  PART_OF  WROTE  CITES  IMPLEMENTS
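
The three conventions above are easy to check mechanically before anything is written to the graph. The following is a minimal sketch using only the standard library; the regular expressions and the helper names (to_label, to_rel_type, check_names) are illustrative assumptions, not part of GraphForge.

```python
import re

# Illustrative patterns for the conventions described above (assumed, not a GraphForge API).
LABEL_RE = re.compile(r"[A-Z][a-z0-9]*(?:[A-Z][a-z0-9]*)*")      # PascalCase
PROPERTY_RE = re.compile(r"[a-z][a-z0-9]*(?:[A-Z][a-z0-9]*)*")   # camelCase
REL_TYPE_RE = re.compile(r"[A-Z][A-Z0-9]*(?:_[A-Z0-9]+)*")       # SCREAMING_SNAKE_CASE


def to_label(phrase: str) -> str:
    """Turn a free-text type such as 'programming language' into ProgrammingLanguage."""
    words = re.findall(r"[A-Za-z0-9]+", phrase)
    return "".join(word[:1].upper() + word[1:] for word in words)


def to_rel_type(phrase: str) -> str:
    """Turn a free-text verb phrase such as 'depends on' into DEPENDS_ON."""
    words = re.findall(r"[A-Za-z0-9]+", phrase)
    return "_".join(word.upper() for word in words)


def check_names(label: str, rel_type: str, property_keys: list) -> list:
    """Return a list of human-readable convention violations (empty if none)."""
    problems = []
    if not LABEL_RE.fullmatch(label):
        problems.append(f"label {label!r} is not PascalCase")
    if not REL_TYPE_RE.fullmatch(rel_type):
        problems.append(f"relationship type {rel_type!r} is not SCREAMING_SNAKE_CASE")
    for key in property_keys:
        if not PROPERTY_RE.fullmatch(key):
            problems.append(f"property key {key!r} is not camelCase")
    return problems


print(to_label("programming language"), to_rel_type("depends on"))
print(check_names("technology", "dependsOn", ["created_at"]))
```

Running a check like this on extractor output catches near-duplicates such as DEPENDS_ON versus dependsOn before they become two separate relationship types in the graph.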

LangChain Integration — `add_graph_documents()`¶

If you are using LangChain's extraction pipeline, GraphForge accepts GraphDocument objects directly. No manual translation required:

from graphforge import GraphForge
# from langchain_community.graphs.graph_document import GraphDocument, Node, Relationship

gf = GraphForge("knowledge_graph.db")

# LangChain extraction output (or duck-typed equivalents)
# doc = GraphDocument(nodes=[...], relationships=[...], source=document)
# gf.add_graph_documents([doc], include_source=True)

add_graph_documents() is idempotent: calling it twice with the same data produces no duplicate nodes or edges. Nodes are merged on id + label; relationships are deduplicated on (source, target, type, properties).

Label safety: all node labels and relationship types are validated before any data is written. A single invalid identifier aborts the entire call with ValueError and leaves the graph unchanged.

Plain dicts are also accepted:

gf.add_graph_documents([
    {
        "nodes": [
            {"id": "python",     "type": "Language",  "properties": {"description": "Dynamic language"}},
            {"id": "graphforge", "type": "Library",   "properties": {"description": "Embedded graph DB"}},
        ],
        "relationships": [
            {"source_id": "graphforge", "target_id": "python", "type": "RUNS_ON", "properties": {}},
        ],
    }
])
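
If an extractor emits the flat entities/relationships shape used later in this guide, it can be reshaped into the plain-dict form shown above. This is a minimal sketch: the helper name to_graph_document is hypothetical, and using the lower-cased entity name as the node id is an assumption made for illustration.

```python
def to_graph_document(entities, relationships):
    """Reshape flat extraction output into the plain-dict document form.

    `entities` items carry "type", "name" and optional extra keys;
    `relationships` items carry "from", "to", "type" and optional extra keys.
    """
    def node_id(name):
        # Assumption: a lower-cased name is a good enough stable id for this sketch.
        return name.strip().lower()

    nodes = {}
    for ent in entities:
        props = {k: v for k, v in ent.items() if k not in ("type", "name")}
        props["name"] = ent["name"]
        # Keyed by (id, type) so repeated entities collapse to one node.
        nodes[(node_id(ent["name"]), ent["type"])] = {
            "id": node_id(ent["name"]),
            "type": ent["type"],
            "properties": props,
        }

    rels = []
    for rel in relationships:
        props = {k: v for k, v in rel.items() if k not in ("from", "to", "type")}
        rels.append({
            "source_id": node_id(rel["from"]),
            "target_id": node_id(rel["to"]),
            "type": rel["type"],
            "properties": props,
        })
    return {"nodes": list(nodes.values()), "relationships": rels}


doc = to_graph_document(
    [
        {"type": "Language", "name": "Python", "description": "Dynamic language"},
        {"type": "Library", "name": "GraphForge", "description": "Embedded graph DB"},
        {"type": "Language", "name": "Python", "description": "Dynamic language"},  # duplicate
    ],
    [{"from": "GraphForge", "to": "Python", "type": "RUNS_ON", "confidence": 1.0}],
)
print(len(doc["nodes"]), doc["relationships"][0])
```

The resulting dict has the same keys as the example above, so it should be usable as an element of the list passed to add_graph_documents().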

Entity Extraction Pattern¶

This is the canonical Cypher MERGE pattern for loading extracted entities. Use it when you need fine-grained control over ON CREATE SET / ON MATCH SET semantics or when you are not using LangChain.

CREATE vs MERGE: CREATE always inserts a new node, even if an identical one already exists. Always use MERGE (or add_graph_documents()) for extraction pipelines — running the same extraction twice with CREATE will double your graph.

from graphforge import GraphForge

db = GraphForge("knowledge_graph.db")   # persistent SQLite

# --- raw extraction output (e.g. from an LLM) ---
entities = [
    {"type": "Technology", "name": "Python",    "description": "General-purpose language"},
    {"type": "Technology", "name": "GraphForge","description": "Embedded graph database"},
    {"type": "Technology", "name": "Python",    "description": "General-purpose language"},  # duplicate
]

# Load all entities idempotently
for entity in entities:
    db.execute(
        """
        MERGE (e:{label} {{name: $name}})
        ON CREATE SET e.description = $description,
                      e.createdAt   = datetime()
        """.format(label=entity["type"]),
        {"name": entity["name"], "description": entity["description"]},
    )

# Confirm: only two Technology nodes despite three inputs
results = db.execute("MATCH (t:Technology) RETURN t.name AS name ORDER BY name")
print([row["name"].value for row in results])
# ['GraphForge', 'Python']

MERGE on name ensures the second Python entry is silently skipped rather than inserted again. ON CREATE SET populates properties only when the node is new, so a second pass does not overwrite values that may have been updated manually.
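
The loader above places entity["type"] into the query text with str.format, because the label is part of the pattern rather than a parameter. When labels come from a language model, it is worth validating them before interpolation. The sketch below shows one way to do that; the identifier rule and the function name merge_entity_query are assumptions of this example, not GraphForge's own validation.

```python
import re

# Conservative identifier check (an assumption of this sketch, not GraphForge's own rule).
_IDENTIFIER_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def merge_entity_query(label: str) -> str:
    """Build the MERGE statement for `label`, refusing anything that is not a plain identifier."""
    if not _IDENTIFIER_RE.fullmatch(label):
        raise ValueError(f"unsafe label: {label!r}")
    return (
        f"MERGE (e:{label} {{name: $name}}) "
        "ON CREATE SET e.description = $description, e.createdAt = datetime()"
    )


print(merge_entity_query("Technology"))

try:
    merge_entity_query("Technology) DETACH DELETE (n")
except ValueError as exc:
    print("rejected:", exc)
```

Values such as name and description still travel as query parameters; only the label, which cannot be parameterised, goes through the guard.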

Relationship Extraction Pattern¶

Once entities are in the graph, link them with MERGE to avoid duplicate edges. Always MATCH both endpoints first — if either is missing the relationship is skipped rather than creating a dangling reference.

relationships = [
    {"from": "GraphForge", "to": "Python", "type": "IMPLEMENTED_IN", "confidence": 0.98},
    {"from": "GraphForge", "to": "Python", "type": "IMPLEMENTED_IN", "confidence": 0.98},  # dup
]

for rel in relationships:
    db.execute(
        """
        MATCH (a {{name: $from_name}})
        MATCH (b {{name: $to_name}})
        MERGE (a)-[r:{rel_type}]->(b)
        ON CREATE SET r.confidence = $confidence,
                      r.createdAt  = datetime()
        """.format(rel_type=rel["type"]),
        {
            "from_name":  rel["from"],
            "to_name":    rel["to"],
            "confidence": rel["confidence"],
        },
    )

results = db.execute(
    "MATCH (:Technology)-[r:IMPLEMENTED_IN]->(:Technology) RETURN count(r) AS total"
)
print(results[0]["total"].value)   # 1 — not 2

Iterative Refinement¶

Knowledge graphs are rarely built in a single pass. A typical pipeline runs multiple extraction rounds, each adding detail or correcting earlier mistakes.

Add provenance and source tracking¶

# Tag each node with the document it came from
db.execute(
    """
    MATCH (e:Technology {name: $name})
    SET e.source     = $source,
        e.extractedAt = datetime()
    """,
    {"name": "Python", "source": "doc:arxiv:2301.00001"},
)

Update confidence scores¶

When a second extraction confirms an existing relationship, raise its confidence:

db.execute(
    """
    MATCH (a {name: $from_name})-[r:IMPLEMENTED_IN]->(b {name: $to_name})
    SET r.confidence    = CASE WHEN r.confidence < $new_conf
                               THEN $new_conf
                               ELSE r.confidence
                          END,
        r.confirmations = coalesce(r.confirmations, 1) + 1
    """,
    {"from_name": "GraphForge", "to_name": "Python", "new_conf": 0.99},
)
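
The same rule (keep the highest confidence, count confirmations) can also be applied in Python before loading, so that each distinct edge in a batch is written once. This is a sketch under the assumption that relationships use the from/to/type/confidence shape from the earlier examples; the function name consolidate is hypothetical.

```python
def consolidate(relationships):
    """Collapse repeated (from, to, type) triples, keeping max confidence and a count."""
    merged = {}
    for rel in relationships:
        key = (rel["from"], rel["to"], rel["type"])
        if key not in merged:
            # First sighting: copy the record and start the counter at 1.
            merged[key] = {**rel, "confirmations": 1}
        else:
            entry = merged[key]
            entry["confidence"] = max(entry["confidence"], rel["confidence"])
            entry["confirmations"] += 1
    return list(merged.values())


batch = [
    {"from": "GraphForge", "to": "Python", "type": "IMPLEMENTED_IN", "confidence": 0.98},
    {"from": "GraphForge", "to": "Python", "type": "IMPLEMENTED_IN", "confidence": 0.99},
    {"from": "GraphForge", "to": "SQLite", "type": "DEPENDS_ON", "confidence": 0.90},
]
for rel in consolidate(batch):
    print(rel["type"], rel["confidence"], rel["confirmations"])
```

Pre-aggregating this way reduces the number of statements per batch; the Cypher update above is still needed when confirmations arrive in separate runs.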

Merge duplicate entities¶

When two nodes represent the same real-world entity (detected by a fuzzy match or a human review step), redirect all edges to the canonical node and delete the alias.

# Suppose 'GraphForge DB' and 'GraphForge' are the same thing
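REL_TYPE = "IMPLEMENTED_IN"  # repeat the copy step once per relationship type

db.begin()
# Copy the alias's outgoing edges of this type onto the canonical node.
# Relationship types cannot be passed as parameters, so the type is substituted
# into the query text; braces stay single because str.format is not used here.
db.execute(
    """
    MATCH (alias:Technology {name: 'GraphForge DB'})
    MATCH (canon:Technology {name: 'GraphForge'})
    MATCH (alias)-[r:REL_TYPE]->(target)
    MERGE (canon)-[:REL_TYPE {confidence: r.confidence}]->(target)
    """.replace("REL_TYPE", REL_TYPE)
)
# Incoming edges need the mirrored query: (source)-[r:REL_TYPE]->(alias).
db.execute("MATCH (alias:Technology {name: 'GraphForge DB'}) DETACH DELETE alias")
db.commit()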
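
The fuzzy match that nominates alias pairs can be as simple as a string-similarity pass over the node names. The following sketch uses difflib from the standard library; the 0.8 threshold, the choice of the shorter name as canonical, and the function name alias_candidates are assumptions for illustration, and the output is meant to feed a review step rather than an automatic merge.

```python
from difflib import SequenceMatcher
from itertools import combinations


def alias_candidates(names, threshold=0.8):
    """Return (canonical, alias, score) for name pairs whose similarity meets the threshold."""
    pairs = []
    for a, b in combinations(sorted(set(names)), 2):
        score = SequenceMatcher(None, a.casefold(), b.casefold()).ratio()
        if score >= threshold:
            # Assumption: treat the shorter surface form as the canonical one.
            canon, alias = sorted((a, b), key=len)
            pairs.append((canon, alias, round(score, 2)))
    return pairs


names = ["GraphForge", "GraphForge DB", "Python", "SQLite"]
for canon, alias, score in alias_candidates(names):
    print(f"{alias!r} may be an alias of {canon!r} (score={score})")
```

The comparison is quadratic in the number of names, so for large graphs it is usually run per label or per blocking key rather than over every node.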

Querying Patterns¶

Find all entities of a type¶

results = db.execute(
    """
    MATCH (t:Technology)
    RETURN t.name AS name, t.description AS description
    ORDER BY t.name
    """
)
for row in results:
    print(f"{row['name'].value}: {row['description'].value}")

Traverse relationships from a starting node¶

results = db.execute(
    """
    MATCH (start:Technology {name: $name})-[r]->(related)
    RETURN type(r) AS rel, related.name AS target, r.confidence AS conf
    ORDER BY conf DESC
    """,
    {"name": "GraphForge"},
)
for row in results:
    print(f"-[{row['rel'].value}]-> {row['target'].value} ({row['conf'].value:.2f})")

Find connections between two concepts (shortest path)¶

shortestPath() raises NotImplementedError. As a substitute, match a variable-length pattern and keep the shortest result with ORDER BY and LIMIT 1:

results = db.execute(
    """
    MATCH path = (a:Technology {name: $from_name})-[*1..6]-(b:Technology {name: $to_name})
    RETURN length(path) AS hops
    ORDER BY hops ASC LIMIT 1
    """,
    {"from_name": "Python", "to_name": "SQLite"},
)
if results:
    print(f"Connected by {results[0]['hops'].value} hops")

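Note: This enumerates all paths up to the upper hop bound (6 in the pattern above) before returning the shortest, so the cost grows quickly with the bound. For graphs above ~1 K nodes, keep the bound at 3 or lower, for example [*1..3].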
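
When a larger hop bound is needed, one alternative is to fetch the edge list once and run a breadth-first search in Python, which visits each node at most once. The sketch below assumes the edges have already been exported as (source name, target name) pairs, for example from a query returning a.name and b.name; the function name shortest_hops is hypothetical.

```python
from collections import deque


def shortest_hops(edges, start, goal):
    """Breadth-first search over an undirected edge list; returns the hop count or None."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, hops = queue.popleft()
        for neighbour in adjacency.get(node, ()):
            if neighbour == goal:
                return hops + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, hops + 1))
    return None  # no path between the two names


edges = [("GraphForge", "Python"), ("GraphForge", "SQLite"), ("FastAPI", "Python")]
print(shortest_hops(edges, "Python", "SQLite"))
print(shortest_hops(edges, "FastAPI", "Django"))
```

Like the undirected pattern in the query above, this ignores edge direction; drop the second setdefault line to follow edges only from source to target.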

Aggregate by relationship type¶

results = db.execute(
    """
    MATCH ()-[r]->()
    RETURN type(r) AS rel_type, count(r) AS total
    ORDER BY total DESC
    """
)
for row in results:
    print(f"{row['rel_type'].value}: {row['total'].value}")

Provenance and Confidence¶

Store extraction metadata directly on nodes and edges so that every fact in the graph can be traced back to its origin.

Recommended provenance properties

Property	Type	Description
`source`	String	Document ID or URL
`extractedAt`	Datetime	When the fact was extracted
`extractedBy`	String	Model or tool name (e.g. `"gpt-4o"`)
`confidence`	Float	Model confidence or human review score [0, 1]
`confirmations`	Integer	How many independent extractions agree

db.execute(
    """
    MERGE (a:Technology {name: $from_name})
      ON CREATE SET a.source = $source, a.extractedBy = $model
    MERGE (b:Technology {name: $to_name})
      ON CREATE SET b.source = $source, b.extractedBy = $model
    MERGE (a)-[r:DEPENDS_ON]->(b)
      ON CREATE SET r.confidence   = $confidence,
                    r.source       = $source,
                    r.extractedAt  = datetime(),
                    r.extractedBy  = $model,
                    r.confirmations = 1
    """,
    {
        "from_name":  "GraphForge",   # match the casing used elsewhere; MERGE on name is exact
        "to_name":    "Lark",
        "source":     "doc:github:graphforge/pyproject.toml",
        "model":      "gpt-4o",
        "confidence": 0.97,
    },
)
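
The parameter dict for a statement like the one above can be built from a small record type that validates the provenance fields before they reach the graph. This is a sketch only: the Provenance class and the edge_params helper are hypothetical names, and the field names are chosen to line up with the $source, $model and $confidence parameters used in the query.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Provenance:
    source: str        # document ID or URL
    model: str         # model or tool name
    confidence: float  # expected to lie in [0, 1]

    def __post_init__(self):
        if not self.source:
            raise ValueError("source must not be empty")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError(f"confidence must be in [0, 1], got {self.confidence}")


def edge_params(from_name: str, to_name: str, prov: Provenance) -> dict:
    """Build the parameter dict for a MERGE statement that takes these names."""
    return {"from_name": from_name, "to_name": to_name, **asdict(prov)}


prov = Provenance("doc:github:graphforge/pyproject.toml", "gpt-4o", 0.97)
print(edge_params("GraphForge", "Lark", prov))
```

Rejecting an out-of-range confidence at construction time keeps review queries that filter on r.confidence meaningful.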

Query to surface low-confidence facts for human review:

results = db.execute(
    """
    MATCH (a)-[r]->(b)
    WHERE r.confidence < 0.7
    RETURN a.name AS from_node, type(r) AS rel, b.name AS to_node,
           r.confidence AS conf, r.source AS src
    ORDER BY conf ASC
    LIMIT 20
    """
)
for row in results:
    print(
        f"{row['from_node'].value} -[{row['rel'].value}]-> {row['to_node'].value}"
        f"  conf={row['conf'].value:.2f}  src={row['src'].value}"
    )

Exporting a Focused Subgraph¶

When sharing results or feeding a downstream model, extract a coherent slice of the graph rather than dumping everything.

# Pull all nodes within 2 hops of a seed concept (the seed itself included)
results = db.execute(
    """
    MATCH path = (seed:Technology {name: $seed})-[*1..2]-(related)
    UNWIND nodes(path) AS n
    WITH DISTINCT n
    RETURN n.name AS node, n.description AS description,
           labels(n)[0] AS label
    ORDER BY label, node
    """,
    {"seed": "GraphForge"},
)

# The column alias is `description` rather than `desc` to avoid the reserved word DESC.
subgraph_nodes = [
    {"name": row["node"].value, "label": row["label"].value, "desc": row["description"].value}
    for row in results
]
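
Once the slice is a list of plain dicts, it can be serialised for a downstream model or a colleague. The sketch below groups nodes by label and emits deterministic JSON; the function name subgraph_to_json and the output layout are assumptions, and the sample records stand in for the subgraph_nodes list built above.

```python
import json


def subgraph_to_json(nodes, seed):
    """Serialise exported node records deterministically, grouped by label."""
    by_label = {}
    for node in nodes:
        by_label.setdefault(node["label"], []).append(
            {"name": node["name"], "desc": node["desc"]}
        )
    payload = {"seed": seed, "nodes": by_label}
    # sort_keys makes the output stable across runs, which helps with diffing.
    return json.dumps(payload, indent=2, sort_keys=True, ensure_ascii=False)


sample_nodes = [
    {"name": "GraphForge", "label": "Technology", "desc": "Embedded graph database"},
    {"name": "Python", "label": "Technology", "desc": "General-purpose language"},
]
print(subgraph_to_json(sample_nodes, seed="GraphForge"))
```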

Complete Worked Example¶

Extract Python ecosystem entities from a mock LLM output, build a knowledge graph, then query it.

from graphforge import GraphForge

# -----------------------------------------------------------------
# 1. Mock LLM extraction output
# -----------------------------------------------------------------
llm_output = {
    "entities": [
        {"label": "Language",  "name": "Python",       "description": "Dynamic, interpreted language"},
        {"label": "Framework", "name": "FastAPI",       "description": "Async web framework"},
        {"label": "Framework", "name": "Django",        "description": "Batteries-included web framework"},
        {"label": "Library",   "name": "Pydantic",      "description": "Data validation using type hints"},
        {"label": "Library",   "name": "GraphForge",    "description": "Embedded graph database"},
        {"label": "Library",   "name": "SQLAlchemy",    "description": "SQL toolkit and ORM"},
        {"label": "Library",   "name": "Lark",          "description": "Parsing toolkit for Python"},
    ],
    "relationships": [
        {"from": "FastAPI",    "to": "Python",      "type": "RUNS_ON",    "confidence": 1.0},
        {"from": "Django",     "to": "Python",      "type": "RUNS_ON",    "confidence": 1.0},
        {"from": "GraphForge", "to": "Python",      "type": "RUNS_ON",    "confidence": 1.0},
        {"from": "FastAPI",    "to": "Pydantic",    "type": "DEPENDS_ON", "confidence": 0.98},
        {"from": "Django",     "to": "SQLAlchemy",  "type": "DEPENDS_ON", "confidence": 0.85},
        {"from": "GraphForge", "to": "Lark",        "type": "DEPENDS_ON", "confidence": 0.99},
        {"from": "GraphForge", "to": "Pydantic",    "type": "DEPENDS_ON", "confidence": 0.99},
    ],
}

# -----------------------------------------------------------------
# 2. Build the graph
# -----------------------------------------------------------------
db = GraphForge()   # in-memory for this example; use GraphForge("kg.db") to persist

SOURCE  = "doc:example:python-ecosystem"
MODEL   = "gpt-4o-mock"

db.begin()

# Load entities
for ent in llm_output["entities"]:
    db.execute(
        """
        MERGE (e:{label} {{name: $name}})
        ON CREATE SET e.description  = $description,
                      e.source       = $source,
                      e.extractedBy  = $model,
                      e.extractedAt  = datetime()
        """.format(label=ent["label"]),
        {
            "name":        ent["name"],
            "description": ent["description"],
            "source":      SOURCE,
            "model":       MODEL,
        },
    )

# Load relationships
for rel in llm_output["relationships"]:
    db.execute(
        """
        MATCH (a {{name: $from_name}})
        MATCH (b {{name: $to_name}})
        MERGE (a)-[r:{rel_type}]->(b)
        ON CREATE SET r.confidence   = $confidence,
                      r.source       = $source,
                      r.extractedAt  = datetime(),
                      r.extractedBy  = $model,
                      r.confirmations = 1
        """.format(rel_type=rel["type"]),
        {
            "from_name":  rel["from"],
            "to_name":    rel["to"],
            "confidence": rel["confidence"],
            "source":     SOURCE,
            "model":      MODEL,
        },
    )

db.commit()

# -----------------------------------------------------------------
# 3. Query the graph
# -----------------------------------------------------------------

# Which frameworks and libraries run on Python?
print("=== Runs on Python ===")
rows = db.execute(
    """
    MATCH (thing)-[:RUNS_ON]->(lang:Language {name: 'Python'})
    RETURN labels(thing)[0] AS kind, thing.name AS name
    ORDER BY kind, name
    """
)
for row in rows:
    print(f"  [{row['kind'].value}] {row['name'].value}")

# What does GraphForge depend on?
print("\n=== GraphForge dependencies ===")
rows = db.execute(
    """
    MATCH (gf:Library {name: 'GraphForge'})-[r:DEPENDS_ON]->(dep)
    RETURN dep.name AS dependency, r.confidence AS conf
    ORDER BY conf DESC, dependency
    """
)
for row in rows:
    print(f"  {row['dependency'].value}  (conf={row['conf'].value:.2f})")

# Which libraries share a common dependency?
print("\n=== Libraries sharing a dependency ===")
rows = db.execute(
    """
    MATCH (a)-[:DEPENDS_ON]->(shared)<-[:DEPENDS_ON]-(b)
    WHERE a.name < b.name
    RETURN a.name AS lib_a, b.name AS lib_b, shared.name AS via
    ORDER BY lib_a, lib_b
    """
)
for row in rows:
    print(f"  {row['lib_a'].value} and {row['lib_b'].value} both depend on {row['via'].value}")

# Summarise by label
print("\n=== Node counts by label ===")
rows = db.execute(
    """
    MATCH (n)
    RETURN labels(n)[0] AS label, count(n) AS total
    ORDER BY total DESC
    """
)
for row in rows:
    print(f"  {row['label'].value}: {row['total'].value}")

Expected output:

=== Runs on Python ===
  [Framework] Django
  [Framework] FastAPI
  [Library] GraphForge

=== GraphForge dependencies ===
  Lark  (conf=0.99)
  Pydantic  (conf=0.99)

=== Libraries sharing a dependency ===
  FastAPI and GraphForge both depend on Pydantic

=== Node counts by label ===
  Library: 4
  Framework: 2
  Language: 1

Deduplication with db.search.text()¶

MERGE handles exact-name duplicates, but extracted entities often have surface-form variation: "United States", "US", "United States of America". db.search.text() provides fuzzy deduplication before insertion.

Check-Before-Create Pattern¶

Search for a close match before creating an entity, and index each new entity as it is created so that later checks can find it:

def find_or_create_entity(db, name: str, entity_type: str) -> int:
    """Return the node ID for `name`, merging with a close existing entity if found."""
    # Search for existing entities with similar names
    hits = db.search.text(name, top_k=3)
    for hit in hits:
        existing_name = hit.ref.properties.get("name")
        if existing_name and hit.score > 0.6:
            # Close enough — treat as the same entity
            return hit.ref.id

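    # No close match: create a new entity, then look it up by label and name
    # so that a same-named node of another type is never returned by mistake
    db.execute(
        f"CREATE (e:{entity_type} {{name: $name}})",
        {"name": name},
    )
    rows = db.execute(
        f"MATCH (e:{entity_type} {{name: $name}}) RETURN id(e) AS nid",
        {"name": name},
    )
    nid = rows[0]["nid"].value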

    # Index the new entity for future deduplication checks
    db.search.index_node(nid, name)
    return nid

Applying the Pattern¶

# Instead of bare MERGE, route through find_or_create_entity
entities = [
    ("United States", "Country"),
    ("US",  "Country"),           # will match "United States" if score > 0.6
    ("France", "Country"),
    ("OpenAI", "Organisation"),
]
for name, etype in entities:
    nid = find_or_create_entity(db, name, etype)
    print(f"{name} → node {nid}")
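
Abbreviations such as "US" share few characters with "United States", so whether the search score clears the threshold depends on how the index scores short queries. A small alias table applied before the search is one way to make such cases deterministic. The following is a sketch; the ALIASES table and the function name canonical_name are hypothetical and would be curated per domain.

```python
import re

# Hypothetical alias table; keys are case-folded, punctuation-free surface forms.
ALIASES = {
    "us": "United States",
    "usa": "United States",
    "united states of america": "United States",
}


def canonical_name(raw: str) -> str:
    """Normalise whitespace and punctuation, then map known aliases to one surface form."""
    cleaned = re.sub(r"[^\w\s]", "", raw)           # drop punctuation such as the dots in "U.S."
    cleaned = re.sub(r"\s+", " ", cleaned).strip()  # collapse runs of whitespace
    return ALIASES.get(cleaned.casefold(), cleaned)


for raw in ["United States", "US", "U.S.", " United  States of America ", "France"]:
    print(f"{raw!r} -> {canonical_name(raw)!r}")
```

With this in place, the loop above could call find_or_create_entity(db, canonical_name(name), etype), leaving the fuzzy search to handle only the variation the table does not cover.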

Bulk Indexing¶

For large graphs, index_all() rebuilds the FTS index in one pass:

# Index all Entity nodes on name + description properties
db.search.index_all(node_label="Entity", properties=["name", "description"])

Next Steps¶

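Add a second extraction pass over a different document and observe MERGE preventing duplicates. To make confirmations accumulate, increment the counter with an ON MATCH SET clause or with the "Update confidence scores" query, since ON CREATE SET runs only when the edge is new.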
Introduce a Document node and connect every extracted entity to it with a MENTIONED_IN relationship for full document-level provenance.
Persist the graph to disk (GraphForge("kg.db")) and reload it across sessions without re-running extraction.
Combine this pattern with the AI Agent Grounding guide to build an agent that reasons over a dynamically constructed knowledge graph.
