Research: LLM-Powered Extraction-Storage-Retrieval Workflow Validation

Issue: #452
Date: 2026-05-05
Branch: docs/452-llm-workflows-research
Scope: Findings-only. No library code changes.

1. Executive Summary

Every code snippet in docs/use-cases/llm-workflows.md was run against GraphForge v0.3.9 (main as of 2026-05-05). The core extract → store → query → synthesise pipeline works correctly. Two bugs were confirmed and two friction points require doc corrections.

Key findings:

Core pipeline solid: MERGE with ON CREATE SET / ON MATCH SET, provenance tracking, variable-length neighbourhood queries, UNWIND, and iterative refinement all pass cleanly.
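shortestPath() was broken at test time: it failed with a SyntaxError at parse time, while the doc presented it as working. As of v0.3.10 (PR #497) it parses and raises NotImplementedError with a BFS hint. The BFS workaround exists but is impractical above ~2K nodes. Native support tracks #468.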
UNION trailing ORDER BY is a doc bug — ORDER BY after UNION applies only to the second branch, not the combined result. The build_context_for_question() function will silently return incorrectly ordered context for entities with both inbound and outbound edges.
bulk_ingest() provides no consistent speedup for the doc's MERGE-based ingestion pattern (two runs measured 1.13× and 0.79×, within measurement noise). It only defers per-edge statistics tracking; the dominant cost is parse → plan → execute per MERGE call. The name is misleading. The lower-level create_node_bulk() / create_relationship_bulk() APIs are ~780× faster because they skip Cypher parsing entirely.
begin() overhead scales linearly with graph size (deepcopy): 4 ms at 100 nodes, 42 ms at 1K, 265 ms at 5K. Per-document transactions add ~239% overhead vs no-transaction ingestion. Batch transactions (begin() once for the whole ingestion) are effectively free (−3% vs no-tx) and are the recommended pattern.
f-string label/rel_type interpolation in S2, S6a, and S7 is a Cypher injection risk when values come from LLM extraction output. A malformed label raises a SyntaxError partway through ingestion and leaves the document partially ingested.
to_dicts() is unused throughout — the doc uses .value on every column. to_dicts() is equivalent, already shipped, and reduces boilerplate.
Neighbourhood queries are fast: 2–14 ms p50 for 3-hop on 1K–5K node graphs (sparse and medium density).

5 issues recommended (2 doc bugs, 1 doc enhancement, 1 API naming, 1 feature).

2. Code Snippet Pass/Fail Matrix

All snippets were run with scripts/validate_llm_snippets.py against main.

ID	Section	Description	Status	Root Cause
S1	1. Schema	MERGE multi-label, datetime(), MENTIONS edges	✅ PASS	—
S2a	2. Entity Dedup	`store_entity` — f-string label, ON CREATE/MATCH	✅ PASS	Injection risk if label from LLM
S2b	2. Entity Dedup	`store_relationship` — f-string rel_type, ON MATCH	✅ PASS	Injection risk if rel_type from LLM
S2c	2. Injection	Bad label via f-string → SyntaxError	✅ PASS	Confirms injection risk is real
S3a	3. Provenance	MERGE on rels with confidence/model/datetime()	✅ PASS	—
S3b	3. Provenance	`type(r)`, WHERE conf>=0.85, ORDER BY DESC	⚠️ PARTIAL	Functional; doc uses `.value` throughout — `to_dicts()` is simpler
S4a	4. Neighbourhood	`[r*1..N]`, RETURN DISTINCT, `labels()`	✅ PASS	—
S4b	4. shortestPath	`shortestPath(...)`	✅ RESOLVED	Now raises `NotImplementedError` with BFS hint (PR #497); doc already uses BFS workaround
S5	5. Context Builder	UNION + trailing `ORDER BY confidence DESC`	❌ FAIL	ORDER BY only applies to second branch (doc bug)
S6a	6. Refinement	`SET`, f-string rel_type, `datetime()`	✅ PASS	Injection risk on rel_type
S6b	6. Filtered Query	`WHERE conf>=0.8 AND verified=true`, `type(r)`	✅ PASS	—
S7	7. Mini-Pipeline	Full ingest + summarise, UNWIND labels(), idempotency	✅ PASS	No tx boundary; f-string injection risk
S8	8. LangChain	Cypher query (framework not in env)	⚠️ PARTIAL	Cypher PASS; `langchain.schema` may be deprecated
S9	8. LlamaIndex	`toLower()`, `CONTAINS`, `LIMIT 1`	⚠️ PARTIAL	Cypher PASS; framework not in env
S10	8. Generic	Pure Python prompt construction	✅ PASS	—
ADD-1	Safety	Label safety regex `^[A-Za-z_][A-Za-z0-9_]*$`	✅ PASS	—
ADD-2	Ergonomics	`to_dicts()` ≡ `execute()+.value`	✅ PASS	Doc should prefer `to_dicts()`

Summary: 12 PASS / 3 PARTIAL (S8 and S9 env-only with correct Cypher; S3b functional but verbose) / 1 FAIL (S5) / 1 RESOLVED (S4b, a FAIL when first tested)

3. Root Cause Analysis

FP-1: `shortestPath()` SyntaxError (S4b) ✅ Resolved in v0.3.10 (PR #497)

Resolution: shortestPath() and allShortestPaths() now parse correctly and raise NotImplementedError with a clear BFS workaround hint. llm-workflows.md already uses the BFS pattern; the section heading now accurately says "not yet supported".

Original problem: The parser had zero grammar rules for shortestPath. Any query using it failed immediately at parse time with a confusing SyntaxError:

SyntaxError: Unexpected token Token('IDENTIFIER', 'shortestPath') at line 1, column 11.
Expected one of: ...

Previous doc claim (now corrected): docs/use-cases/llm-workflows.md Section 4 had presented this as a working pattern:

rows = db.execute("""
    MATCH path = shortestPath(
        (a:Entity {canonical: 'alice-chen'})-[*]-(b:Entity {canonical: 'decisionnerd'})
    )
    RETURN [n IN nodes(path) | n.name] AS chain
""")

The summary table also listed shortestPath(…) as a supported feature.

Workaround: BFS via variable-length path with ORDER BY length:

rows = db.execute("""
    MATCH path = (a:Entity {canonical: $src})-[*1..5]-(b:Entity {canonical: $dst})
    RETURN length(path) AS hops
    ORDER BY hops ASC
    LIMIT 1
""", {"src": "alice-chen", "dst": "decisionnerd"})
if rows:
    print(f"Shortest path: {rows[0]['hops'].value} hops")

Caveat: This workaround enumerates all paths up to the max-hop limit. On dense graphs it triggers combinatorial explosion — impractical above ~2K nodes with 3+ hops (see Section 5). Use with a tight hop limit (≤ 3) or on sparse graphs only.

Tracking: #468 (parser fix). The network-analysis research doc (line 465) already noted this and flagged the llm-workflows doc. Issue #477 was filed to fix the doc.

FP-2: UNION Trailing ORDER BY Only Sorts Second Branch (S5)

Problem: build_context_for_question() uses a UNION to combine outbound and inbound edges for a given entity, then applies ORDER BY confidence DESC LIMIT $limit after the second branch:

rows = db.execute(
    "MATCH (src:Entity {canonical: $canonical})-[r]->(dst:Entity) "
    "RETURN src.name AS subject, type(r) AS predicate, dst.name AS object, "
    "       r.confidence AS confidence "
    "UNION "
    "MATCH (src:Entity)-[r]->(dst:Entity {canonical: $canonical}) "
    "RETURN src.name AS subject, type(r) AS predicate, dst.name AS object, "
    "       r.confidence AS confidence "
    "ORDER BY confidence DESC "
    "LIMIT $limit",
    {"canonical": entity_canonical, "limit": max_facts},
)

Root cause: The grammar defines union_statement: query (union_clause query)+. Each query is clause+. ORDER BY is a clause within the second query, not a post-UNION operator, and by the same rule the trailing LIMIT caps only the second branch. The _execute_union() method in executor.py (line ~4042) concatenates branch results without a post-sort. Results are:

[branch_1_results_unsorted] ++ [branch_2_results_sorted_by_confidence]

Confirmed by experiment:

# outbound from seed: conf 0.3, 0.9 (insertion order)
# inbound to seed:    conf 0.6
# Expected with global sort: [0.9, 0.6, 0.3]
# Actual:                    [0.3, 0.9, 0.6]  ← branch1 unordered ++ branch2 sorted

For entities where outbound edges happen to have higher confidence than inbound edges, the bug is invisible (results look correct). It only surfaces when outbound edges have lower confidence than inbound, which is typical for high-confidence MAINTAINS or FOUNDED relationships held by hub entities.

Fix — sort in Python:

rows = db.execute(
    "MATCH (src:Entity {canonical: $canonical})-[r]->(dst:Entity) "
    "RETURN src.name AS subject, type(r) AS predicate, dst.name AS object, "
    "       r.confidence AS confidence "
    "UNION "
    "MATCH (src:Entity)-[r]->(dst:Entity {canonical: $canonical}) "
    "RETURN src.name AS subject, type(r) AS predicate, dst.name AS object, "
    "       r.confidence AS confidence",
    {"canonical": entity_canonical},
)
# Sort combined results in Python, then limit
rows = sorted(rows, key=lambda r: r["confidence"].value, reverse=True)[:max_facts]

Or use to_dicts() and sort the plain dicts:

rows = db.to_dicts(
    "MATCH (src:Entity {canonical: $canonical})-[r]->(dst:Entity) "
    "RETURN src.name AS subject, type(r) AS predicate, dst.name AS object, "
    "       r.confidence AS confidence "
    "UNION "
    "MATCH (src:Entity)-[r]->(dst:Entity {canonical: $canonical}) "
    "RETURN src.name AS subject, type(r) AS predicate, dst.name AS object, "
    "       r.confidence AS confidence",
    {"canonical": entity_canonical},
)
rows = sorted(rows, key=lambda r: r["confidence"], reverse=True)[:max_facts]

FP-3: f-String Label Injection Risk (S2, S6a, S7)

Problem: The doc shows label and relationship type values interpolated directly into Cypher strings via f-strings:

# S2 — label from extraction result
db.execute(
    f"MERGE (e:Entity:{label} {{canonical: $canonical}}) "
    "ON CREATE SET e.name = $name ...",
    ...
)

# S6a — rel_type from extraction result
db.execute(
    f"MATCH (src:Entity {{canonical: $src}})-[r:{rel_type}]->(dst:Entity {{canonical: $dst}}) "
    "SET r.confidence = $conf ...",
    ...
)

In a real pipeline, label and rel_type come from LLM extraction output, and LLMs hallucinate arbitrary strings. A value like "Bad Label" (with a space) raises a SyntaxError partway through ingestion: the document node has already been created when the entity MERGE fails, so the graph is left with an incomplete ingestion that nothing marks as incomplete.

Labels cannot be parameterized in openCypher (it is a language-level limitation, not a GraphForge limitation). The fix is to validate before interpolation:

import re
_SAFE_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")

def safe_label(label: str) -> str:
    """Validate a label or relationship type before Cypher interpolation."""
    # fullmatch, not match: "$" also matches just before a trailing newline,
    # so match() would accept "Person\n".
    if not _SAFE_IDENTIFIER.fullmatch(label):
        raise ValueError(f"Unsafe Cypher identifier: {label!r}")
    return label

# Usage
db.execute(
    f"MERGE (e:Entity:{safe_label(label)} {{canonical: $canonical}}) ...",
    ...
)

Tested: Labels with spaces ("Bad Label"), SQL-style keywords ("DROP TABLE"), and numeric starts ("123Start") all fail the regex, while "Person", "Org", "MyEntity_2" pass.
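The tested cases can be reproduced with a short self-check. This is a minimal sketch: `is_safe_identifier` is an illustrative name, not a GraphForge API. It uses `fullmatch` rather than `match` because in Python `$` also matches just before a trailing newline, so a value such as "Person\n" would pass a `match`-based check.

```python
import re

# Same character classes as the safe_label() pattern; the function name is illustrative.
_SAFE_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def is_safe_identifier(value: str) -> bool:
    """Return True if `value` is safe to interpolate as a label or rel type."""
    # fullmatch requires the whole string to match, including any trailing newline.
    return _SAFE_IDENTIFIER.fullmatch(value) is not None


rejected = ["Bad Label", "DROP TABLE", "123Start", "Person\n", ""]
accepted = ["Person", "Org", "MyEntity_2"]

for value in rejected:
    assert not is_safe_identifier(value), repr(value)
for value in accepted:
    assert is_safe_identifier(value), repr(value)
```

A rejected value is best handled before any Cypher is built, for example by skipping the entity and logging it, so that one hallucinated label does not abort the whole document.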

FP-4: `to_dicts()` Unused Throughout

Every snippet in the doc uses the verbose .value accessor pattern:

for row in rows:
    subj = row["subject"].value
    pred = row["predicate"].value.replace("_", " ").lower()
    obj  = row["object"].value
    conf = row["confidence"].value

to_dicts() (shipped in v0.3.7) auto-unwraps all CypherValue wrappers and returns plain Python dicts. It has ≤0.8% overhead vs manual .value (measured in the agent-grounding research). All snippets could be simplified:

rows = db.to_dicts(
    "MATCH (src:Entity)-[r]->(dst:Entity) WHERE r.confidence >= 0.85 "
    "RETURN src.name AS subject, type(r) AS predicate, dst.name AS object, "
    "       r.confidence AS confidence ORDER BY r.confidence DESC LIMIT 20"
)
for row in rows:                      # row is a plain dict, values are Python types
    print(f"{row['subject']:25s}  --{row['predicate']}-->  {row['object']:25s}")

4. UNION Semantics Deep-Dive

How GraphForge Processes UNION

The grammar defines:

union_statement: query (union_clause query)+
union_clause: "UNION"i "ALL"i? 
query: clause+

Each query is a self-contained pipeline of clauses (MATCH, WHERE, RETURN, ORDER BY, LIMIT, SKIP). An ORDER BY that appears after the second UNION branch is parsed as a clause within that branch, not as a post-UNION operator.

_execute_union() in executor.py:

1. Plans each branch independently (each gets its own Sort operator if it has ORDER BY)
2. Executes each branch in order
3. Concatenates the result lists: results = branch1_results + branch2_results
4. If UNION (not UNION ALL), deduplicates the concatenated list

There is no mechanism to attach a sort to the combined output. The Union operator class has no sort_items field.
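The behaviour can be modelled in a few lines of plain Python. This is a simplified sketch of the steps described above, not GraphForge's executor code; `run_branch` and `union` are hypothetical names, and the confidence values are the ones from the FP-2 experiment.

```python
def run_branch(rows, order_by_desc=None):
    """Execute one UNION branch; sort only if the branch has its own ORDER BY."""
    rows = list(rows)
    if order_by_desc is not None:
        rows.sort(key=lambda r: r[order_by_desc], reverse=True)
    return rows


def union(branch1, branch2, distinct=True):
    """Concatenate branch results, deduplicating for UNION (not UNION ALL)."""
    combined = branch1 + branch2
    if not distinct:
        return combined
    seen, unique = set(), []
    for row in combined:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Confidence values from the FP-2 experiment (object names are placeholders).
outbound = [{"object": "x", "confidence": 0.3}, {"object": "y", "confidence": 0.9}]
inbound = [{"object": "seed", "confidence": 0.6}]

# The trailing ORDER BY belongs to the second branch only.
actual = union(run_branch(outbound), run_branch(inbound, order_by_desc="confidence"))
# A global sort has to happen after the concatenation.
expected = sorted(actual, key=lambda r: r["confidence"], reverse=True)

print([r["confidence"] for r in actual])    # [0.3, 0.9, 0.6]
print([r["confidence"] for r in expected])  # [0.9, 0.6, 0.3]
```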

Impact on `build_context_for_question()`

The function retrieves both outgoing and incoming edges for an entity, intending to rank all facts by confidence for the LLM context window. The bug means:

Branch 1 (outbound) results arrive in arbitrary insertion order
Branch 2 (inbound) results arrive sorted by confidence DESC
The LLM receives branch-1 facts first, regardless of their confidence values

For most test data the bug is invisible (branch-1 happens to be higher confidence than branch-2). In realistic knowledge graphs where hub entities accumulate many low-confidence outbound edges from early ingestion, the highest-confidence inbound edges (human-verified facts) will appear after the low-confidence outbound ones.

Alternatives Considered

Approach	Supported	Global sort	Notes
`ORDER BY` in each branch	✅	✗ (per-branch only)	Not useful for combined ranking
`ORDER BY` after UNION	✅ (applies to 2nd branch only)	✗	The current doc bug
Python `sorted()` post-query	✅	✅	Recommended
`CALL { UNION } RETURN ... ORDER BY`	✗ not supported	✅	Would require subquery support

Python-side sort is the correct fix. It adds negligible overhead (sorting 30 rows in Python is microseconds).

5. Neighbourhood Expansion Scalability

Query Pattern

The get_neighbourhood() function from the doc:

rows = db.execute(
    "MATCH (seed:Entity {canonical: $canonical})-[r*1.." + str(hops) + "]-(neighbour:Entity) "
    "WHERE neighbour.canonical <> $canonical "
    "RETURN DISTINCT neighbour.canonical AS canonical, neighbour.name AS name, "
    "       labels(neighbour) AS labels",
    {"canonical": canonical},
)

This is an undirected variable-length pattern — it traverses both directions at each hop. The DISTINCT prevents duplicate nodes from multiple paths.

Latency Results

Benchmarked with scripts/benchmark_neighbourhood.py against synthetic random graphs (uniform random edge assignment, avg degree ~4 for sparse and ~10 for medium). Seed node is e0. Latencies are p50 (and p95 where shown) over 10 iterations; n is the result-set size.

Sparse topology (avg degree ~4):

Config	1-hop p50	1-hop n	2-hop p50	2-hop n	3-hop p50	3-hop n
1K / 2K edges	2.0 ms	0	1.9 ms	0	2.1 ms	0
5K / 10K edges	9.3 ms	5	9.4 ms	20	9.5 ms	69
10K / 20K edges	20.1 ms	4	19.2 ms	15	20.2 ms	53

Medium topology (avg degree ~10):

Config	1-hop p50	1-hop n	2-hop p50	2-hop n	3-hop p50	3-hop n
1K / 5K edges	2.1 ms	6	2.2 ms	68	4.5 ms	496
5K / 25K edges	9.5 ms	12	10.0 ms	125	14.4 ms	1106

Observations:

In the 1K sparse graph the seed node e0 has zero reachable neighbours at 1–3 hops (the random wiring happens to isolate e0), so the ~2 ms measured there is scan overhead, not traversal. For hub nodes with real connections, expect similar latency until result sets grow large.
Latency is dominated by graph size (scan cost), not result-set size, at sparse densities. At medium density (5K nodes / 25K edges, 3-hop n=1106) it rises to ~14 ms, still acceptable for a RAG context step.
At 10K nodes (sparse), 3-hop latency is ~20 ms p50 — still fast for a RAG step. The doc's claim of "10,000+ nodes without configuration changes" holds for neighbourhood queries.
Dense graphs returning 1000+ neighbours in 3 hops start to stress DISTINCT (5K/25K, 3-hop, n=1106 → ~14 ms). Add a LIMIT clause to bound result size before it reaches tens of thousands of rows.
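A bounded variant of the neighbourhood query might look like the sketch below. `build_neighbourhood_query` is a hypothetical helper, the 1..5 hop range and the default limit of 500 are arbitrary assumptions, and it assumes `LIMIT $limit` is accepted after `RETURN DISTINCT` in the same way the S5 snippet passes `$limit`.

```python
def build_neighbourhood_query(canonical: str, hops: int, limit: int = 500):
    """Return (query, params) for an n-hop neighbourhood lookup with a row cap."""
    # The hop count is concatenated into the pattern text, as in get_neighbourhood(),
    # so insist on a small positive integer before building the string.
    if isinstance(hops, bool) or not isinstance(hops, int) or not 1 <= hops <= 5:
        raise ValueError(f"hops must be an int in 1..5, got {hops!r}")
    if isinstance(limit, bool) or not isinstance(limit, int) or limit < 1:
        raise ValueError(f"limit must be a positive int, got {limit!r}")
    query = (
        f"MATCH (seed:Entity {{canonical: $canonical}})-[r*1..{hops}]-(neighbour:Entity) "
        "WHERE neighbour.canonical <> $canonical "
        "RETURN DISTINCT neighbour.canonical AS canonical, neighbour.name AS name, "
        "       labels(neighbour) AS labels "
        "LIMIT $limit"
    )
    return query, {"canonical": canonical, "limit": limit}


query, params = build_neighbourhood_query("alice-chen", hops=3, limit=200)
print(query)
# With a GraphForge instance: rows = db.to_dicts(query, params)
```

Without an ORDER BY, which rows survive the cap is not guaranteed, so the limit is a safety bound rather than a ranking.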

Build Strategy Comparison (1K nodes / 2K edges)

Strategy	Build time	Throughput	Notes
A: `execute(MERGE)` — doc pattern	7.5 s	~360 ops/s	Parse+plan+exec per call
B: `create_node_bulk` + `create_relationship_bulk` in `bulk_ingest()`	0.01 s	~300,000 ops/s	Skips Cypher parse + Pydantic validation
C: `execute(CREATE)` — no dedup	7.4 s	~370 ops/s	Similar to A; tiny MATCH scan saving

Build time by scale (Strategy A, execute(MERGE)):

Config	Build time	Throughput
1K / 2K edges	8 s	~360 ops/s
1K / 5K edges	21 s	~290 ops/s
5K / 10K edges	200 s	~75 ops/s
5K / 25K edges	500 s	~60 ops/s
10K / 20K edges	800 s	~37 ops/s

Critical finding: Build throughput degrades significantly with graph size due to the MATCH-based deduplication in MERGE requiring a full label scan. At 10K nodes, ingestion via execute(MERGE) takes ~13 minutes for 30K operations. Strategy B (bulk API) is ~780× faster by skipping Cypher parsing and Pydantic validation entirely.

However, Strategy B requires pre-collected NodeRef objects (returned from create_node_bulk()), making it unsuitable for incremental document-by-document ingestion where entity references arrive piecemeal. Strategy A is the natural pattern for LLM extraction workflows; Strategy B is appropriate for batch ETL (load a pre-assembled dataset in one pass).
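The throughput figures in the Strategy A table follow from simple arithmetic. The sketch below assumes one MERGE per node and one per edge, which matches the "30K operations" quoted for the 10K / 20K configuration; the build times are the rounded values from the table.

```python
# (nodes, edges, build seconds) from the Strategy A build-time table.
builds = {
    "1K / 2K": (1_000, 2_000, 8),
    "1K / 5K": (1_000, 5_000, 21),
    "5K / 10K": (5_000, 10_000, 200),
    "5K / 25K": (5_000, 25_000, 500),
    "10K / 20K": (10_000, 20_000, 800),
}


def throughput(nodes: int, edges: int, seconds: float) -> float:
    """Operations per second, counting one MERGE per node and one per edge."""
    return (nodes + edges) / seconds


for config, (nodes, edges, seconds) in builds.items():
    ops_per_s = throughput(nodes, edges, seconds)
    print(f"{config:>10}: {ops_per_s:6.1f} ops/s over {seconds / 60:5.1f} min")
```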

BFS shortestPath Workaround

rows = db.execute(
    "MATCH path = (a:Entity {canonical: $src})-[*1..5]-(b:Entity {canonical: $dst}) "
    "RETURN length(path) AS hops "
    "ORDER BY hops ASC LIMIT 1",
    {"src": "alice", "dst": "decisionnerd"},
)

Config	max_hops	p50 ms	p95 ms	Found
1K sparse	3	2.3 ms	3.1 ms	not found (e0→e100 not connected)
1K sparse	5	2.2 ms	2.4 ms	not found
5K sparse	3	10.9 ms	11.1 ms	not found
5K sparse	5	15.9 ms	16.1 ms	not found

The query executes quickly when no path exists (early termination). The synthetic graph used for benchmarking uses uniform random wiring; e0 and e100 happen not to be connected at ≤5 hops in a sparse (avg degree 4) graph. In a real knowledge graph with hub entities (high-degree canonical entities), paths would be found and latency would reflect result materialisation rather than early exit.

Verdict: The BFS query itself adds ≤16 ms overhead at 5K nodes when no path exists. The practical concern is result set explosion, not raw latency: a query like [*1..5] on a high-degree graph can enumerate millions of paths. LIMIT 1 bounds the rows returned, but because it follows ORDER BY the candidate paths are still enumerated first, so use the query only on graphs where you control the maximum degree. For production path-finding, wait for #468 (native shortestPath).
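Where path enumeration is too costly, one alternative is to fetch the edge list once and run a true breadth-first search in Python, which visits each node at most once. This is a swapped-in technique, not what the Cypher workaround does and not a GraphForge feature. The sketch assumes the edges were already retrieved as (src, dst) pairs of canonical IDs; `build_adjacency` and `bfs_hops` are hypothetical names and the sample edges are made up.

```python
from collections import deque


def build_adjacency(edges):
    """Build an undirected adjacency map from (src, dst) canonical pairs."""
    adjacency = {}
    for src, dst in edges:
        adjacency.setdefault(src, set()).add(dst)
        adjacency.setdefault(dst, set()).add(src)
    return adjacency


def bfs_hops(adjacency, src, dst, max_hops=5):
    """Return the minimum hop count from src to dst, or None if out of range."""
    if src == dst:
        return 0
    visited = {src}
    frontier = deque([(src, 0)])
    while frontier:
        node, hops = frontier.popleft()
        if hops == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in adjacency.get(node, ()):
            if neighbour == dst:
                return hops + 1
            if neighbour not in visited:
                visited.add(neighbour)
                frontier.append((neighbour, hops + 1))
    return None


edges = [("alice-chen", "acme"), ("acme", "decisionnerd"), ("alice-chen", "bob")]
adjacency = build_adjacency(edges)
print(bfs_hops(adjacency, "alice-chen", "decisionnerd"))       # 2
print(bfs_hops(adjacency, "bob", "decisionnerd", max_hops=2))  # None
```

The trade-off is that the whole edge list has to be pulled into Python first, so this suits graphs that fit comfortably in memory, which an in-memory GraphForge instance already implies.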

6. Transaction and Bulk Ingestion Semantics

How Transactions Work

begin() takes a full deep copy of the entire in-memory graph state via copy.deepcopy() (storage/memory.py, snapshot() at line ~486). This copies all nodes, edges, adjacency lists, label/type/property indexes, and statistics counters.

Overhead measured with scripts/benchmark_neighbourhood.py (Phase 3 section):

Graph size	`begin()` ms (p50)	`commit()` ms (p50)	`rollback()` ms (p50)
0 nodes / 0 edges	0.00	0.00	0.00
100 nodes / 150 edges	3.83	0.07	0.07
1K nodes / 1.5K edges	41.83	1.02	0.94
5K nodes / 7.5K edges	265.34	7.82	8.33

begin() is O(n+e) — linear with graph size. At 5K nodes it takes 265 ms. commit() and rollback() are fast (sub-10 ms) because they only write/restore the snapshot, not re-derive it.
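The scaling can be sanity-checked from the table. The sketch below derives a per-node cost at each measured size and extrapolates linearly from the largest one. The extrapolation is an assumption, not a measurement, and since the per-node cost rises slightly with size it is probably on the optimistic side.

```python
# p50 begin() latency from the table above: node count -> milliseconds.
measured = {100: 3.83, 1_000: 41.83, 5_000: 265.34}

# Cost per node at each measured size, in milliseconds per node.
per_node = {nodes: ms / nodes for nodes, ms in measured.items()}


def estimate_begin_ms(nodes: int) -> float:
    """Linear extrapolation using the per-node cost at the largest measured size."""
    largest = max(measured)
    return nodes * per_node[largest]


for nodes, rate in per_node.items():
    print(f"{nodes:>6} nodes: {rate * 1000:.1f} us per node")

print(f"10K nodes: ~{estimate_begin_ms(10_000):.0f} ms per begin() (extrapolated)")
print(f"100 begin() calls at 5K nodes: ~{100 * measured[5_000] / 1000:.1f} s")
```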

Implication: begin() doubles the memory footprint of the graph. Extrapolating linearly from the table, a single begin() at 10K nodes costs roughly 0.5 s, so calling begin() per document in a 100-document ingestion loop adds on the order of 50 s from repeated deep copies alone.

Rollback Correctness

Verified: ingest 2 nodes inside a transaction, rollback, graph reverts to pre-transaction state. The snapshot is a complete copy — not a diff or WAL — so rollback is always total.

Ingestion Strategy Benchmarks (100 documents)

100 documents, each with 3–5 entity MERGEs and 2–4 relationship MERGEs:

Strategy	Time (ms)	Overhead vs (a)
(a) No transaction	637 ms	—
(b) One `begin()/commit()` per document	2158 ms	+239%
(c) One `begin()/commit()` for entire batch	618 ms	−3%

Findings:

- Per-document transactions add ~239% overhead due to repeated deep-copy snapshots (100 × begin() on a growing graph).
- A batch transaction (the entire ingestion wrapped in one begin()/commit()) is effectively free: it measured 3% faster than no transaction, because the single snapshot is taken once on an empty graph.
- Recommendation: use a single batch transaction per ingestion job, not per document. If atomicity per document is needed, consider checkpointing by document batch (every 10–20 documents).
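Checkpointing by document batch could be sketched as follows. It relies only on the `begin()` / `commit()` / `rollback()` calls used elsewhere in this document; `chunked`, `ingest_in_checkpoints`, and the `ingest_one` callable are hypothetical names, and the batch size of 20 is just the upper end of the suggested range.

```python
from itertools import islice


def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    iterator = iter(items)
    while batch := list(islice(iterator, size)):
        yield batch


def ingest_in_checkpoints(db, documents, ingest_one, batch_size=20):
    """Commit every `batch_size` documents; roll back only the failing batch."""
    committed = 0
    for batch in chunked(documents, batch_size):
        db.begin()  # one snapshot per batch instead of one per document
        try:
            for doc in batch:
                ingest_one(db, doc)  # hypothetical per-document ingestion callable
            db.commit()
        except Exception:
            db.rollback()
            raise
        committed += len(batch)
    return committed
```

With a batch size of 20, a 100-document job takes 5 snapshots instead of 100, and a failure discards at most the current batch while earlier batches stay committed.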

`bulk_ingest()` Assessment

bulk_ingest() (defined in api.py, lines 548–567) only sets _defer_stats = True, which skips per-edge avg_degree and unique_source tracking. On exit it calls _flush_deferred_stats() which rebuilds these in O(E) time.

It does not:

- Batch SQLite writes (no SQLite interaction during individual execute() calls)
- Defer label/property/type indexing (these update per-insert)
- Disable Pydantic validation on execute()-based ingestion

Measured speedup when wrapping execute(MERGE) calls: ~1.0× (no consistent benefit — two independent runs yielded 1.13× and 0.79×, within measurement noise)

The dominant cost is parse → plan → execute per Cypher string. Stats deferral is a tiny fraction of that. bulk_ingest() benefits only when using the lower-level create_node_bulk() / create_relationship_bulk() APIs, which skip Pydantic validation and avoid Cypher parsing entirely.

Naming issue: The name bulk_ingest() implies batch write optimization. The actual benefit is limited to statistics tracking deferral. Users combining it with execute(MERGE) calls will see no improvement and may be misled into assuming atomicity (it is not a transaction wrapper).

7. Integration Patterns Assessment

LangChain (S8)

The doc imports from langchain.schema import Document as LCDocument. This import path was deprecated in LangChain 0.1.x; the current path is from langchain_core.documents import Document. The Cypher query pattern is correct:

rows = db.execute(
    "MATCH (e:Entity {canonical: $canonical})-[r]->(other:Entity) "
    "RETURN e.name AS subject, type(r) AS predicate, other.name AS object",
    {"canonical": query_entity},
)

Replace .value access with to_dicts() for cleaner integration.

LlamaIndex (S9)

The GraphForgeRetriever pattern (inheriting BaseRetriever) is correct for llama_index.core >= 0.10. The Cypher query for entity lookup:

rows = self._db.execute(
    "MATCH (e:Entity) WHERE toLower(e.name) CONTAINS toLower($q) "
    "RETURN e.canonical AS canonical LIMIT 1",
    {"q": query_bundle.query_str},
)

This is a keyword match. For production, replace with embedding-based lookup (the doc notes this). Chaining the returned canonical into build_context_for_question() works.

Generic Pattern (S10)

The prompt construction pattern is framework-agnostic and correct:

context = build_context_for_question(db, entity_canonical)
messages = [
    {"role": "system",  "content": "Answer using only the facts provided."},
    {"role": "user",    "content": f"{context}\n\nQuestion: {question}"},
]

Once build_context_for_question() is fixed (Python-side sort), this pattern works with any LLM client (OpenAI, Anthropic, Cohere, Ollama, etc.).

Should `neighbourhood()` or `ContextBuilder` Ship?

Assessment: No — not as core API.

get_neighbourhood() and build_context_for_question() are schema-dependent: they assume Entity labels and a canonical property. Shipping them in the library would bake in a specific schema that many users will not use. The patterns are short enough (10–20 lines each) to copy from the doc.

Recommendation: Keep them as documented recipes in the use-case doc. Add a separate examples/llm_workflows/ notebook. If demand emerges, consider a graphforge.recipes module (not part of the core API surface).

8. Recommendations

#	Recommendation	Priority	Effort	Type
R-1	Fix `build_context_for_question()` — replace trailing `ORDER BY` with Python `sorted()`	High	Low	Doc fix
R-2	Fix `shortestPath()` example — replace with BFS workaround + caveat, reference #468	High	Low	Doc fix
R-3	Add `safe_label()` / `safe_identifier()` helper to all f-string interpolation examples	High	Low	Doc enhancement
R-4	Replace all `.value` access in the doc with `to_dicts()`	Medium	Low	Doc enhancement
R-5	Update `bulk_ingest()` docstring to clarify it only defers statistics	Medium	Low	API clarification
R-6	Add transaction guidance: recommend batch transaction (one per ingestion job) over per-document	Medium	Low	Doc enhancement
R-7	Update LangChain import from `langchain.schema` to `langchain_core.documents`	Low	Low	Doc fix
R-8	Implement `shortestPath()` in parser/planner/executor	Medium	High	Feature (#468)
R-9	Add post-UNION `ORDER BY` support (global sort after UNION)	Low	Medium	Feature

9. Issues Filed

Issue	Title	Priority
#488	docs: fix `build_context_for_question()` — UNION trailing ORDER BY sorts only second branch	High
#489	docs: replace `shortestPath()` in llm-workflows.md — parser raises SyntaxError	High
#490	docs: add `safe_identifier()` pattern for f-string label/rel_type injection	Medium
#491	api: clarify `bulk_ingest()` docstring — defers statistics only, not batch writes	Medium
#492	feat: `gf.neighbourhood(canonical, hops)` convenience method for n-hop expansion	Low

Appendix A: Corrected Snippets

A-1: `get_neighbourhood()` (works as documented, no changes needed)

def get_neighbourhood(db, canonical: str, hops: int = 2) -> list[dict]:
    """Return all entities reachable within `hops` steps from `canonical`."""
    return db.to_dicts(
        "MATCH (seed:Entity {canonical: $canonical})-[r*1.." + str(hops) + "]-(neighbour:Entity) "
        "WHERE neighbour.canonical <> $canonical "
        "RETURN DISTINCT "
        "    neighbour.name AS name, "
        "    neighbour.canonical AS canonical, "
        "    labels(neighbour) AS labels",
        {"canonical": canonical},
    )

A-2: `build_context_for_question()` with the UNION ORDER BY fix

def build_context_for_question(db, entity_canonical: str, max_facts: int = 30) -> str:
    """Build a compact context string from the graph for use in an LLM prompt."""
    # Fetch combined outbound + inbound edges (no ORDER BY — apply in Python)
    rows = db.to_dicts(
        "MATCH (src:Entity {canonical: $canonical})-[r]->(dst:Entity) "
        "RETURN src.name AS subject, type(r) AS predicate, dst.name AS object, "
        "       r.confidence AS confidence "
        "UNION "
        "MATCH (src:Entity)-[r]->(dst:Entity {canonical: $canonical}) "
        "RETURN src.name AS subject, type(r) AS predicate, dst.name AS object, "
        "       r.confidence AS confidence",
        {"canonical": entity_canonical},
    )

    if not rows:
        return "No relevant facts found in the knowledge graph."

    # Sort combined results globally, then limit
    rows = sorted(rows, key=lambda r: r["confidence"], reverse=True)[:max_facts]

    lines = ["Relevant facts from the knowledge graph:\n"]
    for row in rows:
        pred = row["predicate"].replace("_", " ").lower()
        conf = row["confidence"]
        lines.append(f"  - {row['subject']} {pred} {row['object']}  (confidence: {conf:.0%})")

    return "\n".join(lines)

A-3: `shortestPath()` BFS workaround

import re

_SAFE_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")

def safe_identifier(value: str) -> str:
    """Validate a label or relationship type before Cypher f-string interpolation."""
    # fullmatch, not match: "$" also matches just before a trailing newline.
    if not _SAFE_IDENTIFIER.fullmatch(value):
        raise ValueError(f"Unsafe Cypher identifier: {value!r}")
    return value


def find_shortest_path(db, src_canonical: str, dst_canonical: str, max_hops: int = 4):
    """
    BFS-based shortest path workaround for shortestPath() (#468).
    Only practical for small graphs (< 2K nodes) or tight hop limits (<= 3).
    """
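    # max_hops is interpolated into the query text, so coerce it to a plain int first.
    max_hops = int(max_hops)
    rows = db.to_dicts(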
        f"MATCH path = (a:Entity {{canonical: $src}})-[*1..{max_hops}]-(b:Entity {{canonical: $dst}}) "
        "RETURN length(path) AS hops "
        "ORDER BY hops ASC LIMIT 1",
        {"src": src_canonical, "dst": dst_canonical},
    )
    return rows[0]["hops"] if rows else None

A-4: Recommended ingestion pattern with transaction and injection safety

import re
from graphforge import GraphForge

_SAFE_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")

def safe_id(value: str) -> str:
    # fullmatch, not match: "$" also matches just before a trailing newline.
    if not _SAFE_IDENTIFIER.fullmatch(value):
        raise ValueError(f"Unsafe Cypher identifier: {value!r}")
    return value


def ingest_documents(db: GraphForge, documents: list[dict], model: str = "mock-v1"):
    """
    Process documents with a single batch transaction for atomicity.
    Validates label/rel_type identifiers before interpolation.
    """
    db.begin()
    try:
        for doc in documents:
            db.execute(
                "MERGE (d:Document {id: $id}) "
                "ON CREATE SET d.title = $title, d.ingested = datetime() "
                "ON MATCH  SET d.title = $title",
                {"id": doc["id"], "title": doc["title"]},
            )
            extraction = extract(doc["text"])  # your LLM call here

            for ent in extraction["entities"]:
                label = safe_id(ent["label"])   # validate before interpolation
                db.execute(
                    f"MERGE (e:Entity:{label} {{canonical: $canonical}}) "
                    "ON CREATE SET e.name = $name, e.created = datetime() "
                    "ON MATCH  SET e.name = $name",
                    {"canonical": ent["canonical"], "name": ent["name"]},
                )
                db.execute(
                    "MATCH (d:Document {id: $doc_id}), (e:Entity {canonical: $canonical}) "
                    "MERGE (d)-[:MENTIONS]->(e)",
                    {"doc_id": doc["id"], "canonical": ent["canonical"]},
                )

            for rel in extraction["relationships"]:
                rel_type = safe_id(rel["type"])  # validate before interpolation
                db.execute(
                    "MATCH (src:Entity {canonical: $src}), (dst:Entity {canonical: $dst}) "
                    f"MERGE (src)-[r:{rel_type}]->(dst) "
                    "ON CREATE SET r.confidence = $conf, r.model = $model, "
                    "             r.created = datetime() "
                    "ON MATCH  SET r.confidence = $conf, r.model = $model, "
                    "             r.updated = datetime()",
                    {
                        "src":   rel["src"],
                        "dst":   rel["dst"],
                        "conf":  rel["confidence"],
                        "model": model,
                    },
                )
        db.commit()
    except Exception:
        db.rollback()
        raise

Appendix B: Benchmark Methodology

Environment: macOS 15.4, Apple Silicon. All timings are wall-clock (single process, no parallelism). GraphForge is in-memory (GraphForge()) unless noted.

Neighbourhood benchmark: scripts/benchmark_neighbourhood.py. Graphs are random Erdős–Rényi with fixed seed=42. Latency is median (p50) over 10 iterations from the same seed node (e0).

Transaction benchmark: Inline script (not committed — see Phase 3 probe). 100 synthetic documents with 3–5 entity MERGEs and 2–4 relationship MERGEs per document. Entities drawn from a pool of 50 canonical IDs (many collisions, exercises ON MATCH path). Timing is total wall-clock for all 100 documents.

bulk_ingest() speedup: 500 nodes + 500 edges, execute(MERGE) pattern (not create_node_bulk). Measured with and without bulk_ingest() context manager.