
Research: LLM-Powered Extraction-Storage-Retrieval Workflow Validation

Issue: #452
Date: 2026-05-05
Branch: docs/452-llm-workflows-research
Scope: Findings-only. No library code changes.


1. Executive Summary

Every code snippet in docs/use-cases/llm-workflows.md was run against GraphForge v0.3.9 (main as of 2026-05-05). The core extract → store → query → synthesise pipeline works correctly. Two bugs were confirmed and two friction points require doc corrections.

Key findings:

  • Core pipeline solid: MERGE with ON CREATE SET / ON MATCH SET, provenance tracking, variable-length neighbourhood queries, UNWIND, and iterative refinement all pass cleanly.
  • shortestPath() was broken in v0.3.9: SyntaxError at parse time, while the doc presented it as working. As of v0.3.10 (PR #497) it parses and raises NotImplementedError with a BFS hint (see FP-1). The BFS workaround is impractical above ~2K nodes. Tracks #468.
  • UNION trailing ORDER BY is a doc bug: ORDER BY after UNION applies only to the second branch, not the combined result. The build_context_for_question() function will silently return incorrectly ordered context for entities with both inbound and outbound edges.
  • bulk_ingest() provides negligible speedup for the doc's MERGE-based ingestion pattern (~1.0×; two runs measured 1.13× and 0.79×, within noise). It only defers per-edge statistics tracking; the dominant cost is parse → plan → execute per MERGE call. The name is misleading. The lower-level create_node_bulk() / create_relationship_bulk() APIs are ~780× faster by skipping Cypher parsing entirely.
  • begin() overhead scales linearly with graph size (deepcopy): 4 ms at 100 nodes, 42 ms at 1K, 265 ms at 5K. Per-document transactions add ~239% overhead vs no-transaction ingestion. Batch transactions (begin() once for the whole ingestion) are effectively free (−3% vs no-tx) and are the recommended pattern.
  • f-string label/rel_type interpolation in S2, S6a, and S7 is a Cypher injection risk when values come from LLM extraction output. Malformed labels cause parse errors that crash the pipeline silently.
  • to_dicts() is unused throughout — the doc uses .value on every column. to_dicts() is equivalent, already shipped, and reduces boilerplate.
  • Neighbourhood queries are fast: 2–9 ms p50 for 3-hop on 1K–5K node graphs.

5 issues recommended (2 doc bugs, 1 doc enhancement, 1 API naming, 1 feature).


2. Code Snippet Pass/Fail Matrix

All snippets were run with scripts/validate_llm_snippets.py against main.

| ID | Section | Description | Status | Root Cause |
|---|---|---|---|---|
| S1 | 1. Schema | MERGE multi-label, datetime(), MENTIONS edges | ✅ PASS | |
| S2a | 2. Entity Dedup | store_entity: f-string label, ON CREATE/MATCH | ✅ PASS | Injection risk if label from LLM |
| S2b | 2. Entity Dedup | store_relationship: f-string rel_type, ON MATCH | ✅ PASS | Injection risk if rel_type from LLM |
| S2c | 2. Injection | Bad label via f-string → SyntaxError | ✅ PASS | Confirms injection risk is real |
| S3a | 3. Provenance | MERGE on rels with confidence/model/datetime() | ✅ PASS | |
| S3b | 3. Provenance | type(r), WHERE conf>=0.85, ORDER BY DESC | ⚠️ PARTIAL | Functional; doc uses .value throughout, to_dicts() is simpler |
| S4a | 4. Neighbourhood | [r*1..N], RETURN DISTINCT, labels() | ✅ PASS | |
| S4b | 4. shortestPath | shortestPath(...) | ✅ RESOLVED | Now raises NotImplementedError with BFS hint (PR #497); doc already uses BFS workaround |
| S5 | 5. Context Builder | UNION + trailing ORDER BY confidence DESC | ❌ FAIL | ORDER BY only applies to second branch (doc bug) |
| S6a | 6. Refinement | SET, f-string rel_type, datetime() | ✅ PASS | Injection risk on rel_type |
| S6b | 6. Filtered Query | WHERE conf>=0.8 AND verified=true, type(r) | ✅ PASS | |
| S7 | 7. Mini-Pipeline | Full ingest + summarise, UNWIND labels(), idempotency | ✅ PASS | No tx boundary; f-string injection risk |
| S8 | 8. LangChain | Cypher query (framework not in env) | ⚠️ PARTIAL | Cypher PASS; langchain.schema may be deprecated |
| S9 | 8. LlamaIndex | toLower(), CONTAINS, LIMIT 1 | ⚠️ PARTIAL | Cypher PASS; framework not in env |
| S10 | 8. Generic | Pure Python prompt construction | ✅ PASS | |
| ADD-1 | Safety | Label safety regex ^[A-Za-z_][A-Za-z0-9_]*$ | ✅ PASS | |
| ADD-2 | Ergonomics | to_dicts() equivalent to execute() + .value | ✅ PASS | Doc should prefer to_dicts() |

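Summary: 12 PASS / 3 PARTIAL / 2 FAIL at test time. S8 and S9 are PARTIAL for environment reasons only (Cypher correct); S3b is functional but verbose. Of the two failures, S4b has since been resolved in v0.3.10 and S5 remains a doc bug.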


3. Root Cause Analysis

FP-1: shortestPath() — SyntaxError (S4b) ✅ Resolved in v0.3.10 (PR #497)

Resolution: shortestPath() and allShortestPaths() now parse correctly and raise NotImplementedError with a clear BFS workaround hint. llm-workflows.md already uses the BFS pattern; the section heading now accurately says "not yet supported".

Original problem: The parser had zero grammar rules for shortestPath. Any query using it failed immediately at parse time with a confusing SyntaxError:

SyntaxError: Unexpected token Token('IDENTIFIER', 'shortestPath') at line 1, column 11.
Expected one of: ...

Previous doc claim (now corrected): docs/use-cases/llm-workflows.md Section 4 had presented this as a working pattern:

rows = db.execute("""
    MATCH path = shortestPath(
        (a:Entity {canonical: 'alice-chen'})-[*]-(b:Entity {canonical: 'decisionnerd'})
    )
    RETURN [n IN nodes(path) | n.name] AS chain
""")

The summary table also lists shortestPath(…) as a supported feature.

Workaround: BFS via variable-length path with ORDER BY length:

rows = db.execute("""
    MATCH path = (a:Entity {canonical: $src})-[*1..5]-(b:Entity {canonical: $dst})
    RETURN length(path) AS hops
    ORDER BY hops ASC
    LIMIT 1
""", {"src": "alice-chen", "dst": "decisionnerd"})
if rows:
    print(f"Shortest path: {rows[0]['hops'].value} hops")

Caveat: This workaround enumerates all paths up to the max-hop limit. On dense graphs it triggers combinatorial explosion — impractical above ~2K nodes with 3+ hops (see Section 5). Use with a tight hop limit (≤ 3) or on sparse graphs only.
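
To give a feel for how quickly path enumeration grows, here is a back-of-the-envelope sketch that bounds the number of paths a [*1..N] pattern may visit from a single start node. It assumes a tree-like graph in which every node has the same degree and a path never reuses the edge it just arrived on. estimate_path_count is a hypothetical helper, not part of GraphForge, and real graphs will deviate from this bound.

```python
def estimate_path_count(avg_degree: int, max_hops: int) -> int:
    """Rough upper bound on paths of length 1..max_hops from one start node.

    Assumption: every node has `avg_degree` edges and a path never reuses
    the edge it just arrived on, so hop 1 has d choices and each later hop
    has (d - 1).
    """
    if avg_degree < 1 or max_hops < 1:
        return 0
    total = 0
    paths_at_hop = avg_degree  # d choices for the first hop
    for _ in range(max_hops):
        total += paths_at_hop
        paths_at_hop *= avg_degree - 1
    return total


for degree, hops in [(4, 3), (10, 3), (10, 5)]:
    print(f"degree={degree}, hops={hops}: ~{estimate_path_count(degree, hops):,} paths")
```

Under these assumptions, moving from 3 to 5 hops at degree 10 raises the bound from hundreds to tens of thousands of paths per start node, which is why a tight hop limit matters more than graph size.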

Tracking: #468 (parser fix). The network-analysis research doc (line 465) already noted this and flagged the llm-workflows doc. Issue #477 was filed to fix the doc.


FP-2: UNION Trailing ORDER BY Only Sorts Second Branch (S5)

Problem: build_context_for_question() uses a UNION to combine outbound and inbound edges for a given entity, then applies ORDER BY confidence DESC LIMIT $limit after the second branch:

rows = db.execute(
    "MATCH (src:Entity {canonical: $canonical})-[r]->(dst:Entity) "
    "RETURN src.name AS subject, type(r) AS predicate, dst.name AS object, "
    "       r.confidence AS confidence "
    "UNION "
    "MATCH (src:Entity)-[r]->(dst:Entity {canonical: $canonical}) "
    "RETURN src.name AS subject, type(r) AS predicate, dst.name AS object, "
    "       r.confidence AS confidence "
    "ORDER BY confidence DESC "
    "LIMIT $limit",
    {"canonical": entity_canonical, "limit": max_facts},
)

Root cause: The grammar defines union_statement: query (union_clause query)+. Each query is clause+. ORDER BY is a clause within the second query, not a post-UNION operator. The _execute_union() method in executor.py (line ~4042) concatenates branch results without post-sort. Results are:

[branch_1_results_unsorted] ++ [branch_2_results_sorted_by_confidence]

Confirmed by experiment:

# outbound from seed: conf 0.3, 0.9 (insertion order)
# inbound to seed:    conf 0.6
# Expected with global sort: [0.9, 0.6, 0.3]
# Actual:                    [0.3, 0.9, 0.6]  ← branch1 unordered ++ branch2 sorted

The bug is invisible only when the outbound edges happen to come back in descending confidence order and all of them outrank the inbound edges (results then look correct). It surfaces as soon as any outbound edge has lower confidence than an inbound one, which is typical for high-confidence MAINTAINS or FOUNDED relationships held by hub entities.

Fix — sort in Python:

rows = db.execute(
    "MATCH (src:Entity {canonical: $canonical})-[r]->(dst:Entity) "
    "RETURN src.name AS subject, type(r) AS predicate, dst.name AS object, "
    "       r.confidence AS confidence "
    "UNION "
    "MATCH (src:Entity)-[r]->(dst:Entity {canonical: $canonical}) "
    "RETURN src.name AS subject, type(r) AS predicate, dst.name AS object, "
    "       r.confidence AS confidence",
    {"canonical": entity_canonical},
)
# Sort combined results in Python, then limit
rows = sorted(rows, key=lambda r: r["confidence"].value, reverse=True)[:max_facts]

Or use to_dicts() and sort the plain dicts:

rows = db.to_dicts(
    "MATCH (src:Entity {canonical: $canonical})-[r]->(dst:Entity) "
    "RETURN src.name AS subject, type(r) AS predicate, dst.name AS object, "
    "       r.confidence AS confidence "
    "UNION "
    "MATCH (src:Entity)-[r]->(dst:Entity {canonical: $canonical}) "
    "RETURN src.name AS subject, type(r) AS predicate, dst.name AS object, "
    "       r.confidence AS confidence",
    {"canonical": entity_canonical},
)
rows = sorted(rows, key=lambda r: r["confidence"], reverse=True)[:max_facts]

FP-3: f-String Label Injection Risk (S2, S6a, S7)

Problem: The doc shows label and relationship type values interpolated directly into Cypher strings via f-strings:

# S2 — label from extraction result
db.execute(
    f"MERGE (e:Entity:{label} {{canonical: $canonical}}) "
    "ON CREATE SET e.name = $name ...",
    ...
)

# S6a — rel_type from extraction result
db.execute(
    f"MATCH (src:Entity {{canonical: $src}})-[r:{rel_type}]->(dst:Entity {{canonical: $dst}}) "
    "SET r.confidence = $conf ...",
    ...
)

In a real pipeline, label and rel_type come from LLM extraction output. LLMs hallucinate arbitrary strings. A value like "Bad Label" (with a space) causes a SyntaxError that crashes the pipeline silently — the document node was already created but the entity MERGE fails, leaving an incomplete ingestion with no error surfaced to the caller.

Labels cannot be parameterized in openCypher (it is a language-level limitation, not a GraphForge limitation). The fix is to validate before interpolation:

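import re

# Plain identifier: a letter or underscore, then letters, digits, or underscores
_SAFE_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")

def safe_label(label: str) -> str:
    """Validate a label or relationship type before Cypher interpolation."""
    # fullmatch() rejects a trailing newline, which match() with "$" would accept
    if not _SAFE_IDENTIFIER.fullmatch(label):
        raise ValueError(f"Unsafe Cypher identifier: {label!r}")
    return label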

# Usage
db.execute(
    f"MERGE (e:Entity:{safe_label(label)} {{canonical: $canonical}}) ...",
    ...
)

Tested: Labels with spaces ("Bad Label"), SQL-style keywords ("DROP TABLE"), and numeric starts ("123Start") all fail the regex, while "Person", "Org", "MyEntity_2" pass.
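
The accept/reject behaviour listed above can be reproduced with a short standalone check. This is a minimal sketch: is_safe_identifier is a hypothetical name, and it uses fullmatch() so that a value with a trailing newline is also rejected (in Python, "$" combined with match() accepts one).

```python
import re

# Same character rules as the regex above; fullmatch() anchors both ends
_SAFE_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def is_safe_identifier(value: str) -> bool:
    """Return True only if the whole string is a plain Cypher identifier."""
    return _SAFE_IDENTIFIER.fullmatch(value) is not None


candidates = ["Person", "Org", "MyEntity_2", "Bad Label", "DROP TABLE", "123Start", "Person\n"]
accepted = [c for c in candidates if is_safe_identifier(c)]
rejected = [c for c in candidates if not is_safe_identifier(c)]
print("accepted:", accepted)
print("rejected:", rejected)
```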


FP-4: to_dicts() Unused Throughout

Every snippet in the doc uses the verbose .value accessor pattern:

for row in rows:
    subj = row["subject"].value
    pred = row["predicate"].value.replace("_", " ").lower()
    obj  = row["object"].value
    conf = row["confidence"].value

to_dicts() (shipped in v0.3.7) auto-unwraps all CypherValue wrappers and returns plain Python dicts. It has ≤0.8% overhead vs manual .value (measured in the agent-grounding research). All snippets could be simplified:

rows = db.to_dicts(
    "MATCH (src:Entity)-[r]->(dst:Entity) WHERE r.confidence >= 0.85 "
    "RETURN src.name AS subject, type(r) AS predicate, dst.name AS object, "
    "       r.confidence AS confidence ORDER BY r.confidence DESC LIMIT 20"
)
for row in rows:                      # row is a plain dict, values are Python types
    print(f"{row['subject']:25s}  --{row['predicate']}-->  {row['object']:25s}")

4. UNION Semantics Deep-Dive

How GraphForge Processes UNION

The grammar defines:

union_statement: query (union_clause query)+
union_clause: "UNION"i "ALL"i? 
query: clause+

Each query is a self-contained pipeline of clauses (MATCH, WHERE, RETURN, ORDER BY, LIMIT, SKIP). An ORDER BY that appears after the second UNION branch is parsed as a clause within that branch, not as a post-UNION operator.

_execute_union() in executor.py:

  1. Plans each branch independently (each gets its own Sort operator if it has ORDER BY)
  2. Executes each branch in order
  3. Concatenates the result lists: results = branch1_results + branch2_results
  4. If UNION (not UNION ALL), deduplicates the concatenated list

There is no mechanism to attach a sort to the combined output. The Union operator class has no sort_items field.
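
The behaviour described above can be modelled in a few lines of plain Python. This is an illustrative model only, not the actual _execute_union() code: execute_union and the sample rows are made up for the example, and each branch is assumed to arrive already executed (with any per-branch ORDER BY applied).

```python
def execute_union(branches, union_all=False):
    """Model of the described behaviour: concatenate, then optionally dedupe.

    `branches` is a list of already-executed branch results (lists of dicts).
    Any ORDER BY has been applied inside a branch before it gets here.
    """
    combined = [row for branch in branches for row in branch]
    if union_all:
        return combined
    seen, unique = set(), []
    for row in combined:
        key = tuple(sorted(row.items()))  # hashable fingerprint of the row
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Branch 1 (outbound) has no ORDER BY, so it keeps insertion order.
outbound = [{"object": "B", "confidence": 0.3}, {"object": "C", "confidence": 0.9}]
# Branch 2 (inbound) carries the trailing ORDER BY confidence DESC.
inbound = sorted([{"object": "A", "confidence": 0.6}],
                 key=lambda r: r["confidence"], reverse=True)

rows = execute_union([outbound, inbound])
print([r["confidence"] for r in rows])             # [0.3, 0.9, 0.6]

globally_sorted = sorted(rows, key=lambda r: r["confidence"], reverse=True)
print([r["confidence"] for r in globally_sorted])  # [0.9, 0.6, 0.3]
```

The first print reproduces the order observed in the FP-2 experiment; the second shows the order a global sort would give.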

Impact on build_context_for_question()

The function retrieves both outgoing and incoming edges for an entity, intending to rank all facts by confidence for the LLM context window. The bug means:

  • Branch 1 (outbound) results arrive in arbitrary insertion order
  • Branch 2 (inbound) results arrive sorted by confidence DESC
  • The LLM receives branch-1 facts first, regardless of their confidence values

For most test data the bug is invisible (branch-1 happens to be higher confidence than branch-2). In realistic knowledge graphs where hub entities accumulate many low-confidence outbound edges from early ingestion, the highest-confidence inbound edges (human-verified facts) will appear after the low-confidence outbound ones.

Alternatives Considered

| Approach | Supported | Global sort | Notes |
|---|---|---|---|
| ORDER BY in each branch | ✅ | ✗ (per-branch only) | Not useful for combined ranking |
| ORDER BY after UNION | ✅ (applies to 2nd branch) | ✗ | The current doc bug |
| Python sorted() post-query | ✅ | ✅ | Recommended |
| CALL { UNION } RETURN ... ORDER BY | ✗ not supported | n/a | Would require subquery support |

Python-side sort is the correct fix. It adds negligible overhead (sorting 30 rows in Python is microseconds).
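
One detail worth guarding against when sorting in Python is a relationship without a confidence value. The sketch below assumes rows are plain dicts (as returned by to_dicts()) and that a missing property surfaces as None, which is an assumption rather than something verified here. rank_facts is a hypothetical helper name.

```python
def rank_facts(rows, max_facts=30):
    """Sort plain-dict rows by confidence (highest first) and keep the top N.

    Assumption: a missing or None confidence is treated as 0.0 so that
    unscored facts sink to the bottom instead of raising TypeError.
    """
    def confidence(row):
        value = row.get("confidence")
        return 0.0 if value is None else float(value)

    return sorted(rows, key=confidence, reverse=True)[:max_facts]


rows = [
    {"subject": "Alice", "predicate": "WORKS_AT", "object": "Acme", "confidence": 0.3},
    {"subject": "Alice", "predicate": "FOUNDED", "object": "Beta", "confidence": 0.9},
    {"subject": "Carol", "predicate": "KNOWS", "object": "Alice", "confidence": None},
    {"subject": "Dave", "predicate": "MANAGES", "object": "Alice", "confidence": 0.6},
]
top = rank_facts(rows, max_facts=3)
print([r["confidence"] for r in top])  # [0.9, 0.6, 0.3]
```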


5. Neighbourhood Expansion Scalability

Query Pattern

The get_neighbourhood() function from the doc:

rows = db.execute(
    "MATCH (seed:Entity {canonical: $canonical})-[r*1.." + str(hops) + "]-(neighbour:Entity) "
    "WHERE neighbour.canonical <> $canonical "
    "RETURN DISTINCT neighbour.canonical AS canonical, neighbour.name AS name, "
    "       labels(neighbour) AS labels",
    {"canonical": canonical},
)

This is an undirected variable-length pattern — it traverses both directions at each hop. The DISTINCT prevents duplicate nodes from multiple paths.

Latency Results

Benchmarked with scripts/benchmark_neighbourhood.py against synthetic random graphs (uniform random edge assignment, avg degree ~4 for sparse and ~10 for medium). Seed node is e0. Latencies are p50 over 10 iterations; n is the result-set size.

Sparse topology (avg degree ~4):

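| Config (nodes / edges) | 1-hop p50 | 1-hop n | 2-hop p50 | 2-hop n | 3-hop p50 | 3-hop n |
|---|---|---|---|---|---|---|
| 1K / 2K edges | 2.0 ms | 0 | 1.9 ms | 0 | 2.1 ms | 0 |
| 5K / 10K edges | 9.3 ms | 5 | 9.4 ms | 20 | 9.5 ms | 69 |
| 10K / 20K edges | 20.1 ms | 4 | 19.2 ms | 15 | 20.2 ms | 53 |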

Medium topology (avg degree ~10):

| Config (nodes / edges) | 1-hop p50 | 1-hop n | 2-hop p50 | 2-hop n | 3-hop p50 | 3-hop n |
|---|---|---|---|---|---|---|
| 1K / 5K edges | 2.1 ms | 6 | 2.2 ms | 68 | 4.5 ms | 496 |
| 5K / 25K edges | 9.5 ms | 12 | 10.0 ms | 125 | 14.4 ms | 1106 |

Observations:

  1. In the 1K sparse graph, the seed node e0 has zero reachable neighbours at 1–3 hops (the random wiring happens to isolate e0), so the latency there is scan overhead, not traversal. For hub nodes with real connections, expect similar latency until result sets grow large.
  2. Latency is dominated by graph size (scan cost), not result-set size, at sparse densities. At medium density (5K/25K edges, 3-hop n=1106) it rises to ~14 ms, which is still acceptable for a RAG context step.
  3. At 10K nodes (sparse), 3-hop latency is ~20 ms p50 — still fast for a RAG step. The doc's claim of "10,000+ nodes without configuration changes" holds for neighbourhood queries.
  4. Dense graphs returning 1000+ neighbours in 3 hops start to stress DISTINCT (5K/25K, 3-hop, n=1106 → ~14 ms). Add a LIMIT clause to bound result size before it reaches tens of thousands of rows.
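
One way to apply the LIMIT advice is to build the neighbourhood query through a small helper that also constrains the interpolated hop count. The sketch below is hypothetical: build_neighbourhood_query, the 1 to 5 hop bound, and the default limit of 200 are illustrative choices, and it relies on LIMIT accepting a parameter as in the doc's build_context_for_question() query.

```python
def build_neighbourhood_query(canonical: str, hops: int = 2, limit: int = 200):
    """Return (query, params) for a bounded n-hop neighbourhood lookup.

    `hops` is interpolated into the pattern, so it is forced to a small
    positive int first; `limit` travels as a normal query parameter.
    """
    hops = int(hops)
    if not 1 <= hops <= 5:
        raise ValueError(f"hops must be between 1 and 5, got {hops}")
    query = (
        f"MATCH (seed:Entity {{canonical: $canonical}})-[r*1..{hops}]-(neighbour:Entity) "
        "WHERE neighbour.canonical <> $canonical "
        "RETURN DISTINCT neighbour.canonical AS canonical, neighbour.name AS name, "
        "       labels(neighbour) AS labels "
        "LIMIT $limit"
    )
    return query, {"canonical": canonical, "limit": int(limit)}


query, params = build_neighbourhood_query("alice-chen", hops=3, limit=100)
print(query)
print(params)
```

The returned pair can be passed straight to db.execute(query, params) or db.to_dicts(query, params).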

Build Strategy Comparison (1K nodes / 2K edges)

1K nodes / 2K edges comparison:

| Strategy | Build time | Throughput | Notes |
|---|---|---|---|
| A: execute(MERGE), the doc pattern | 7.5 s | ~360 ops/s | Parse+plan+exec per call |
| B: create_node_bulk + create_relationship_bulk in bulk_ingest() | 0.01 s | ~300,000 ops/s | Skips Cypher parse + Pydantic validation |
| C: execute(CREATE), no dedup | 7.4 s | ~370 ops/s | Similar to A; tiny MATCH scan saving |

Build time by scale (Strategy A, execute(MERGE)):

| Config (nodes / edges) | Build time | Throughput |
|---|---|---|
| 1K / 2K edges | 8 s | ~360 ops/s |
| 1K / 5K edges | 21 s | ~290 ops/s |
| 5K / 10K edges | 200 s | ~75 ops/s |
| 5K / 25K edges | 500 s | ~60 ops/s |
| 10K / 20K edges | 800 s | ~37 ops/s |

Critical finding: Build throughput degrades significantly with graph size due to the MATCH-based deduplication in MERGE requiring a full label scan. At 10K nodes, ingestion via execute(MERGE) takes ~13 minutes for 30K operations. Strategy B (bulk API) is ~780× faster by skipping Cypher parsing and Pydantic validation entirely.

However, Strategy B requires pre-collected NodeRef objects (returned from create_node_bulk()), making it unsuitable for incremental document-by-document ingestion where entity references arrive piecemeal. Strategy A is the natural pattern for LLM extraction workflows; Strategy B is appropriate for batch ETL (load a pre-assembled dataset in one pass).

BFS shortestPath Workaround

rows = db.execute(
    "MATCH path = (a:Entity {canonical: $src})-[*1..5]-(b:Entity {canonical: $dst}) "
    "RETURN length(path) AS hops "
    "ORDER BY hops ASC LIMIT 1",
    {"src": "alice", "dst": "decisionnerd"},
)

| Config | max_hops | p50 | p95 | Found |
|---|---|---|---|---|
| 1K sparse | 3 | 2.3 ms | 3.1 ms | not found (e0→e100 not connected) |
| 1K sparse | 5 | 2.2 ms | 2.4 ms | not found |
| 5K sparse | 3 | 10.9 ms | 11.1 ms | not found |
| 5K sparse | 5 | 15.9 ms | 16.1 ms | not found |

The query executes quickly when no path exists (early termination). The synthetic graph used for benchmarking uses uniform random wiring; e0 and e100 happen not to be connected at ≤5 hops in a sparse (avg degree 4) graph. In a real knowledge graph with hub entities (high-degree canonical entities), paths would be found and latency would reflect result materialisation rather than early exit.

Verdict: The BFS query itself adds ≤16 ms overhead at 5K nodes. The practical concern is result set explosion, not raw latency — a query like [*1..5] on a high-degree graph can enumerate millions of paths. Use with an explicit LIMIT 1 and only on graphs where you control the maximum degree. For production path-finding, wait for #468 (native shortestPath).
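
Where the graph is small enough to export its edges, a true breadth-first search in Python visits each node at most once and so avoids enumerating every path. This is a different technique from the Cypher variable-length workaround above, offered only as a sketch: bfs_hops is a hypothetical helper, and it assumes the edge pairs have already been fetched (for example as canonical IDs from a single MATCH ... RETURN query).

```python
from collections import deque


def bfs_hops(edges, src, dst, max_hops=5):
    """Return the hop count of the shortest undirected path, or None.

    `edges` is an iterable of (a, b) pairs, e.g. canonical IDs exported
    from the graph beforehand (an assumption about how you obtain them).
    """
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    if src == dst:
        return 0
    visited = {src}
    queue = deque([(src, 0)])
    while queue:
        node, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in adjacency.get(node, ()):
            if neighbour == dst:
                return hops + 1
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append((neighbour, hops + 1))
    return None


edges = [("alice", "acme"), ("acme", "bob"), ("bob", "decisionnerd"), ("alice", "carol")]
print(bfs_hops(edges, "alice", "decisionnerd"))     # 3
print(bfs_hops(edges, "alice", "decisionnerd", 2))  # None
print(bfs_hops(edges, "alice", "nobody"))           # None
```

The trade-off is memory: the adjacency map holds the whole edge set, so this suits graphs that already fit comfortably in the Python process.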


6. Transaction and Bulk Ingestion Semantics

How Transactions Work

begin() takes a full deep copy of the entire in-memory graph state via copy.deepcopy() (storage/memory.py, snapshot() at line ~486). This copies all nodes, edges, adjacency lists, label/type/property indexes, and statistics counters.

Measured overhead (scripts/benchmark_neighbourhood.py, Phase 3 section):

| Graph size | begin() ms (p50) | commit() ms (p50) | rollback() ms (p50) |
|---|---|---|---|
| 0 nodes / 0 edges | 0.00 | 0.00 | 0.00 |
| 100 nodes / 150 edges | 3.83 | 0.07 | 0.07 |
| 1K nodes / 1.5K edges | 41.83 | 1.02 | 0.94 |
| 5K nodes / 7.5K edges | 265.34 | 7.82 | 8.33 |

begin() is O(n+e) — linear with graph size. At 5K nodes it takes 265 ms. commit() and rollback() are fast (sub-10 ms) because they only write/restore the snapshot, not re-derive it.
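
Because the cost is linear, the measured points can be used to estimate begin() at sizes that were not benchmarked. The following is a rough extrapolation, not a measurement: it fits a line through the origin to the three non-empty rows of the table and assumes the same edge-to-node ratio (1.5) as the benchmark graphs. The helper names are made up for this sketch.

```python
# (nodes, begin() p50 in ms) taken from the table above
MEASURED = [(100, 3.83), (1_000, 41.83), (5_000, 265.34)]


def ms_per_node(samples):
    """Least-squares slope through the origin: sum(x*y) / sum(x*x)."""
    return sum(n * ms for n, ms in samples) / sum(n * n for n, _ in samples)


def estimate_begin_ms(nodes: int) -> float:
    """Linear estimate of one begin() call at the given graph size."""
    return ms_per_node(MEASURED) * nodes


per_call = estimate_begin_ms(10_000)
print(f"begin() at 10K nodes: ~{per_call:.0f} ms")
print(f"100 per-document transactions: ~{per_call * 100 / 1000:.0f} s")
```

Read the output as an order-of-magnitude guide; actual cost will depend on property sizes and index contents.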

Implication: begin() doubles the memory footprint of the graph. Extrapolating linearly from the table, a single begin() at 10K nodes costs roughly 0.5 s, so calling begin() per document in a 100-document ingestion loop would add on the order of 50 s from repeated deep copies alone.

Rollback Correctness

Verified: ingest 2 nodes inside a transaction, rollback, graph reverts to pre-transaction state. The snapshot is a complete copy — not a diff or WAL — so rollback is always total.

Ingestion Strategy Benchmarks (100 documents)

100 documents, each with 3–5 entity MERGEs and 2–4 relationship MERGEs:

| Strategy | Time (ms) | Overhead vs (a) |
|---|---|---|
| (a) No transaction | 637 | |
| (b) One begin()/commit() per document | 2158 | +239% |
| (c) One begin()/commit() for entire batch | 618 | −3% |

Findings:

  • Per-document transactions add ~239% overhead due to repeated deep-copy snapshots (100 × begin() on a growing graph).
  • A batch transaction (the entire ingestion wrapped in one begin()/commit()) measured 3% faster than no transaction, i.e. effectively free; the single snapshot is taken once on an empty graph.
  • Recommendation: use a single batch transaction per ingestion job, not per document. If atomicity per document is needed, consider checkpointing by document batch (every 10–20 documents).
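
The checkpointing idea can be sketched as follows. Only begin(), commit() and rollback() are taken from the report; ingest_in_checkpoints, chunked, the ingest_one callback and the RecordingDB stand-in are hypothetical names used to keep the example runnable without a database.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def ingest_in_checkpoints(db, documents, ingest_one, batch_size=20):
    """Wrap every `batch_size` documents in one begin()/commit() pair.

    A failure rolls back only the current batch; earlier batches stay
    committed. `ingest_one(db, doc)` is supplied by the caller.
    """
    committed = 0
    for batch in chunked(documents, batch_size):
        db.begin()
        try:
            for doc in batch:
                ingest_one(db, doc)
            db.commit()
            committed += len(batch)
        except Exception:
            db.rollback()
            raise
    return committed


class RecordingDB:
    """Stand-in that only records transaction calls, for demonstration."""

    def __init__(self):
        self.calls = []

    def begin(self):
        self.calls.append("begin")

    def commit(self):
        self.calls.append("commit")

    def rollback(self):
        self.calls.append("rollback")


db = RecordingDB()
docs = [{"id": i} for i in range(100)]
done = ingest_in_checkpoints(db, docs, ingest_one=lambda db, doc: None, batch_size=20)
print(done, db.calls.count("begin"), db.calls.count("commit"))  # 100 5 5
```

With 100 documents and a batch size of 20 this takes five snapshots instead of one hundred, at the cost of coarser rollback granularity.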

bulk_ingest() Assessment

bulk_ingest() (defined in api.py, lines 548–567) only sets _defer_stats = True, which skips per-edge avg_degree and unique_source tracking. On exit it calls _flush_deferred_stats() which rebuilds these in O(E) time.

It does not:

  • Batch SQLite writes (no SQLite interaction during individual execute() calls)
  • Defer label/property/type indexing (these update per-insert)
  • Disable Pydantic validation on execute()-based ingestion

Measured speedup when wrapping execute(MERGE) calls: ~1.0× (no consistent benefit — two independent runs yielded 1.13× and 0.79×, within measurement noise)

The dominant cost is parse → plan → execute per Cypher string. Stats deferral is a tiny fraction of that. bulk_ingest() benefits only when using the lower-level create_node_bulk() / create_relationship_bulk() APIs, which skip Pydantic validation and avoid Cypher parsing entirely.

Naming issue: The name bulk_ingest() implies batch write optimization. The actual benefit is limited to statistics tracking deferral. Users combining it with execute(MERGE) calls will see no improvement and may be misled into assuming atomicity (it is not a transaction wrapper).


7. Integration Patterns Assessment

LangChain (S8)

The doc imports from langchain.schema import Document as LCDocument. This import path was deprecated in LangChain 0.1.x; the current path is from langchain_core.documents import Document. The Cypher query pattern is correct:

rows = db.execute(
    "MATCH (e:Entity {canonical: $canonical})-[r]->(other:Entity) "
    "RETURN e.name AS subject, type(r) AS predicate, other.name AS object",
    {"canonical": query_entity},
)

Replace .value access with to_dicts() for cleaner integration.
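
As an illustration of the cleaner integration, the sketch below turns plain-dict rows into LangChain Document objects. rows_to_documents is a hypothetical helper; the import falls back to a minimal stand-in class so the example also runs where LangChain is not installed, and the sample row is invented.

```python
try:
    # Current location of Document in LangChain
    from langchain_core.documents import Document
except ImportError:
    # Minimal stand-in so the sketch runs without LangChain installed
    from dataclasses import dataclass, field

    @dataclass
    class Document:
        page_content: str
        metadata: dict = field(default_factory=dict)


def rows_to_documents(rows, source="graphforge"):
    """Turn plain-dict fact rows into one Document per fact."""
    docs = []
    for row in rows:
        predicate = row["predicate"].replace("_", " ").lower()
        text = f"{row['subject']} {predicate} {row['object']}"
        docs.append(Document(page_content=text, metadata={"source": source, **row}))
    return docs


rows = [{"subject": "Alice Chen", "predicate": "WORKS_AT", "object": "Acme"}]
docs = rows_to_documents(rows)
print(docs[0].page_content)  # Alice Chen works at Acme
```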

LlamaIndex (S9)

The GraphForgeRetriever pattern (inheriting BaseRetriever) is correct for llama_index.core >= 0.10. The Cypher query for entity lookup:

rows = self._db.execute(
    "MATCH (e:Entity) WHERE toLower(e.name) CONTAINS toLower($q) "
    "RETURN e.canonical AS canonical LIMIT 1",
    {"q": query_bundle.query_str},
)

This is a keyword match. For production, replace it with an embedding-based lookup (the doc notes this). Chaining the resulting canonical value into build_context_for_question() works.

Generic Pattern (S10)

The prompt construction pattern is framework-agnostic and correct:

context = build_context_for_question(db, entity_canonical)
messages = [
    {"role": "system",  "content": "Answer using only the facts provided."},
    {"role": "user",    "content": f"{context}\n\nQuestion: {question}"},
]

Once build_context_for_question() is fixed (Python-side sort), this pattern works with any LLM client (OpenAI, Anthropic, Cohere, Ollama, etc.).

Should neighbourhood() or ContextBuilder Ship?

Assessment: No — not as core API.

get_neighbourhood() and build_context_for_question() are schema-dependent: they assume Entity labels and a canonical property. Shipping them in the library would bake in a specific schema that many users will not use. The patterns are short enough (10–20 lines each) to copy from the doc.

Recommendation: Keep them as documented recipes in the use-case doc. Add a separate examples/llm_workflows/ notebook. If demand emerges, consider a graphforge.recipes module (not part of the core API surface).


8. Recommendations

| # | Recommendation | Priority | Effort | Type |
|---|---|---|---|---|
| R-1 | Fix build_context_for_question(): replace trailing ORDER BY with Python sorted() | High | Low | Doc fix |
| R-2 | Fix shortestPath() example: replace with BFS workaround + caveat, reference #468 | High | Low | Doc fix |
| R-3 | Add safe_label() / safe_identifier() helper to all f-string interpolation examples | High | Low | Doc enhancement |
| R-4 | Replace all .value access in the doc with to_dicts() | Medium | Low | Doc enhancement |
| R-5 | Update bulk_ingest() docstring to clarify it only defers statistics | Medium | Low | API clarification |
| R-6 | Add transaction guidance: recommend batch transaction (one per ingestion job) over per-document | Medium | Low | Doc enhancement |
| R-7 | Update LangChain import from langchain.schema to langchain_core.documents | Low | Low | Doc fix |
| R-8 | Implement shortestPath() in parser/planner/executor | Medium | High | Feature (#468) |
| R-9 | Add post-UNION ORDER BY support (global sort after UNION) | Low | Medium | Feature |

9. Issues Filed

Issue Title Priority
#488 docs: fix build_context_for_question() — UNION trailing ORDER BY sorts only second branch High
#489 docs: replace shortestPath() in llm-workflows.md — parser raises SyntaxError High
#490 docs: add safe_identifier() pattern for f-string label/rel_type injection Medium
#491 api: clarify bulk_ingest() docstring — defers statistics only, not batch writes Medium
#492 feat: gf.neighbourhood(canonical, hops) convenience method for n-hop expansion Low

Appendix A: Corrected Snippets

A-1: get_neighbourhood() — no changes needed (works as documented)

def get_neighbourhood(db, canonical: str, hops: int = 2) -> list[dict]:
    """Return all entities reachable within `hops` steps from `canonical`."""
    return db.to_dicts(
        "MATCH (seed:Entity {canonical: $canonical})-[r*1.." + str(hops) + "]-(neighbour:Entity) "
        "WHERE neighbour.canonical <> $canonical "
        "RETURN DISTINCT "
        "    neighbour.name AS name, "
        "    neighbour.canonical AS canonical, "
        "    labels(neighbour) AS labels",
        {"canonical": canonical},
    )

A-2: build_context_for_question() — fix UNION ORDER BY

def build_context_for_question(db, entity_canonical: str, max_facts: int = 30) -> str:
    """Build a compact context string from the graph for use in an LLM prompt."""
    # Fetch combined outbound + inbound edges (no ORDER BY — apply in Python)
    rows = db.to_dicts(
        "MATCH (src:Entity {canonical: $canonical})-[r]->(dst:Entity) "
        "RETURN src.name AS subject, type(r) AS predicate, dst.name AS object, "
        "       r.confidence AS confidence "
        "UNION "
        "MATCH (src:Entity)-[r]->(dst:Entity {canonical: $canonical}) "
        "RETURN src.name AS subject, type(r) AS predicate, dst.name AS object, "
        "       r.confidence AS confidence",
        {"canonical": entity_canonical},
    )

    if not rows:
        return "No relevant facts found in the knowledge graph."

    # Sort combined results globally, then limit
    rows = sorted(rows, key=lambda r: r["confidence"], reverse=True)[:max_facts]

    lines = ["Relevant facts from the knowledge graph:\n"]
    for row in rows:
        pred = row["predicate"].replace("_", " ").lower()
        conf = row["confidence"]
        lines.append(f"  - {row['subject']} {pred} {row['object']}  (confidence: {conf:.0%})")

    return "\n".join(lines)

A-3: shortestPath() BFS workaround

import re

_SAFE_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")

def safe_identifier(value: str) -> str:
    """Validate a label or relationship type before Cypher f-string interpolation."""
    if not _SAFE_IDENTIFIER.fullmatch(value):  # fullmatch() also rejects a trailing newline
        raise ValueError(f"Unsafe Cypher identifier: {value!r}")
    return value


def find_shortest_path(db, src_canonical: str, dst_canonical: str, max_hops: int = 4):
    """
    BFS-based shortest path workaround for shortestPath() (#468).
    Only practical for small graphs (< 2K nodes) or tight hop limits (<= 3).
    """
    rows = db.to_dicts(
        f"MATCH path = (a:Entity {{canonical: $src}})-[*1..{max_hops}]-(b:Entity {{canonical: $dst}}) "
        "RETURN length(path) AS hops "
        "ORDER BY hops ASC LIMIT 1",
        {"src": src_canonical, "dst": dst_canonical},
    )
    return rows[0]["hops"] if rows else None

A-4: ingest_documents() with a batch transaction and validated identifiers

import re
from graphforge import GraphForge

_SAFE_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")

def safe_id(value: str) -> str:
    if not _SAFE_IDENTIFIER.fullmatch(value):  # fullmatch() also rejects a trailing newline
        raise ValueError(f"Unsafe Cypher identifier: {value!r}")
    return value


def ingest_documents(db: GraphForge, documents: list[dict], model: str = "mock-v1"):
    """
    Process documents with a single batch transaction for atomicity.
    Validates label/rel_type identifiers before interpolation.
    """
    db.begin()
    try:
        for doc in documents:
            db.execute(
                "MERGE (d:Document {id: $id}) "
                "ON CREATE SET d.title = $title, d.ingested = datetime() "
                "ON MATCH  SET d.title = $title",
                {"id": doc["id"], "title": doc["title"]},
            )
            extraction = extract(doc["text"])  # your LLM call here

            for ent in extraction["entities"]:
                label = safe_id(ent["label"])   # validate before interpolation
                db.execute(
                    f"MERGE (e:Entity:{label} {{canonical: $canonical}}) "
                    "ON CREATE SET e.name = $name, e.created = datetime() "
                    "ON MATCH  SET e.name = $name",
                    {"canonical": ent["canonical"], "name": ent["name"]},
                )
                db.execute(
                    "MATCH (d:Document {id: $doc_id}), (e:Entity {canonical: $canonical}) "
                    "MERGE (d)-[:MENTIONS]->(e)",
                    {"doc_id": doc["id"], "canonical": ent["canonical"]},
                )

            for rel in extraction["relationships"]:
                rel_type = safe_id(rel["type"])  # validate before interpolation
                db.execute(
                    "MATCH (src:Entity {canonical: $src}), (dst:Entity {canonical: $dst}) "
                    f"MERGE (src)-[r:{rel_type}]->(dst) "
                    "ON CREATE SET r.confidence = $conf, r.model = $model, "
                    "             r.created = datetime() "
                    "ON MATCH  SET r.confidence = $conf, r.model = $model, "
                    "             r.updated = datetime()",
                    {
                        "src":   rel["src"],
                        "dst":   rel["dst"],
                        "conf":  rel["confidence"],
                        "model": model,
                    },
                )
        db.commit()
    except Exception:
        db.rollback()
        raise

Appendix B: Benchmark Methodology

Environment: macOS 15.4, Apple Silicon. All timings are wall-clock (single process, no parallelism). GraphForge is in-memory (GraphForge()) unless noted.

Neighbourhood benchmark: scripts/benchmark_neighbourhood.py. Graphs are random Erdős–Rényi with fixed seed=42. Latency is median (p50) over 10 iterations from the same seed node (e0).

Transaction benchmark: Inline script (not committed — see Phase 3 probe). 100 synthetic documents with 3–5 entity MERGEs and 2–4 relationship MERGEs per document. Entities drawn from a pool of 50 canonical IDs (many collisions, exercises ON MATCH path). Timing is total wall-clock for all 100 documents.

bulk_ingest() speedup: 500 nodes + 500 edges, execute(MERGE) pattern (not create_node_bulk). Measured with and without bulk_ingest() context manager.