Research: LLM-Powered Extraction-Storage-Retrieval Workflow Validation
Issue: #452
Date: 2026-05-05
Branch: docs/452-llm-workflows-research
Scope: Findings-only. No library code changes.
1. Executive Summary
Every code snippet in docs/use-cases/llm-workflows.md was run against GraphForge
v0.3.9 (main as of 2026-05-05). The core extract → store → query → synthesise
pipeline works correctly. Two bugs were confirmed and two friction points require doc
corrections.
Key findings:
- Core pipeline solid: MERGE with ON CREATE SET / ON MATCH SET, provenance tracking, variable-length neighbourhood queries, UNWIND, and iterative refinement all pass cleanly.
- shortestPath() is broken in v0.3.9: SyntaxError at parse time. The doc presents it as working. A BFS workaround exists but is impractical above ~2K nodes. Tracks #468. (Resolved in v0.3.10 by PR #497; see FP-1.)
- UNION trailing ORDER BY is a doc bug: ORDER BY after UNION applies only to the second branch, not the combined result. The build_context_for_question() function will silently return incorrectly ordered context for entities with both inbound and outbound edges.
- bulk_ingest() provides negligible speedup for the doc's MERGE-based ingestion pattern (~1.0×; two runs measured 1.13× and 0.79×). It only defers per-edge statistics tracking; the dominant cost is parse → plan → execute per MERGE call. The name is misleading. The lower-level create_node_bulk() / create_relationship_bulk() APIs are ~780× faster by skipping Cypher parsing entirely.
- begin() overhead scales linearly with graph size (deepcopy): 4 ms at 100 nodes, 42 ms at 1K, 265 ms at 5K. Per-document transactions add ~239% overhead vs no-transaction ingestion. Batch transactions (begin() once for the whole ingestion) are effectively free (−3% vs no-tx) and are the recommended pattern.
- f-string label/rel_type interpolation in S2, S6a, and S7 is a Cypher injection risk when values come from LLM extraction output. Malformed labels cause parse errors that crash the pipeline silently.
- to_dicts() is unused throughout: the doc uses .value on every column. to_dicts() is equivalent, already shipped, and reduces boilerplate.
- Neighbourhood queries are fast: 2–14 ms p50 for 3-hop on 1K–5K node graphs (Section 5).
5 issues recommended (2 doc bugs, 1 doc enhancement, 1 API naming, 1 feature).
2. Code Snippet Pass/Fail Matrix
All snippets were run with scripts/validate_llm_snippets.py against main.
| ID | Section | Description | Status | Root Cause |
|---|---|---|---|---|
| S1 | 1. Schema | MERGE multi-label, datetime(), MENTIONS edges | ✅ PASS | - |
| S2a | 2. Entity Dedup | store_entity: f-string label, ON CREATE/MATCH | ✅ PASS | Injection risk if label from LLM |
| S2b | 2. Entity Dedup | store_relationship: f-string rel_type, ON MATCH | ✅ PASS | Injection risk if rel_type from LLM |
| S2c | 2. Injection | Bad label via f-string → SyntaxError | ✅ PASS | Confirms injection risk is real |
| S3a | 3. Provenance | MERGE on rels with confidence/model/datetime() | ✅ PASS | - |
| S3b | 3. Provenance | type(r), WHERE conf>=0.85, ORDER BY DESC | ⚠️ PARTIAL | Functional; doc uses .value throughout, to_dicts() is simpler |
| S4a | 4. Neighbourhood | [r*1..N], RETURN DISTINCT, labels() | ✅ PASS | - |
| S4b | 4. shortestPath | shortestPath(...) | ✅ RESOLVED | Now raises NotImplementedError with BFS hint (PR #497); doc already uses BFS workaround |
| S5 | 5. Context Builder | UNION + trailing ORDER BY confidence DESC | ❌ FAIL | ORDER BY only applies to second branch (doc bug) |
| S6a | 6. Refinement | SET, f-string rel_type, datetime() | ✅ PASS | Injection risk on rel_type |
| S6b | 6. Filtered Query | WHERE conf>=0.8 AND verified=true, type(r) | ✅ PASS | - |
| S7 | 7. Mini-Pipeline | Full ingest + summarise, UNWIND labels(), idempotency | ✅ PASS | No tx boundary; f-string injection risk |
| S8 | 8. LangChain | Cypher query (framework not in env) | ⚠️ PARTIAL | Cypher PASS; langchain.schema may be deprecated |
| S9 | 8. LlamaIndex | toLower(), CONTAINS, LIMIT 1 | ⚠️ PARTIAL | Cypher PASS; framework not in env |
| S10 | 8. Generic | Pure Python prompt construction | ✅ PASS | - |
| ADD-1 | Safety | Label safety regex ^[A-Za-z_][A-Za-z0-9_]*$ | ✅ PASS | - |
| ADD-2 | Ergonomics | to_dicts() ≡ execute()+.value | ✅ PASS | Doc should prefer to_dicts() |
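Summary: 12 PASS / 3 PARTIAL / 2 FAIL as originally run. The two failures were S4b (since resolved by PR #497, shown as RESOLVED above) and S5. S8 and S9 are PARTIAL only because the frameworks were not installed (their Cypher is correct); S3b is PARTIAL on ergonomics.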
3. Root Cause Analysis
FP-1: shortestPath() SyntaxError (S4b) ✅ Resolved in v0.3.10 (PR #497)
Resolution: shortestPath() and allShortestPaths() now parse correctly and
raise NotImplementedError with a clear BFS workaround hint. llm-workflows.md
already uses the BFS pattern; the section heading now accurately says "not yet
supported".
Original problem: The parser had zero grammar rules for shortestPath. Any
query using it failed immediately at parse time with a confusing SyntaxError:
SyntaxError: Unexpected token Token('IDENTIFIER', 'shortestPath') at line 1, column 11.
Expected one of: ...
Previous doc claim (now corrected): docs/use-cases/llm-workflows.md Section 4
had presented this as a working pattern:
rows = db.execute("""
MATCH path = shortestPath(
(a:Entity {canonical: 'alice-chen'})-[*]-(b:Entity {canonical: 'decisionnerd'})
)
RETURN [n IN nodes(path) | n.name] AS chain
""")
The summary table also lists shortestPath(…) as a supported feature.
Workaround: BFS via variable-length path with ORDER BY length:
rows = db.execute("""
MATCH path = (a:Entity {canonical: $src})-[*1..5]-(b:Entity {canonical: $dst})
RETURN length(path) AS hops
ORDER BY hops ASC
LIMIT 1
""", {"src": "alice-chen", "dst": "decisionnerd"})
if rows:
    print(f"Shortest path: {rows[0]['hops'].value} hops")
Caveat: This workaround enumerates all paths up to the max-hop limit. On dense graphs it triggers combinatorial explosion — impractical above ~2K nodes with 3+ hops (see Section 5). Use with a tight hop limit (≤ 3) or on sparse graphs only.
Tracking: #468 (parser fix). The network-analysis research doc (line 465) already noted this and flagged the llm-workflows doc. Issue #477 was filed to fix the doc.
FP-2: UNION Trailing ORDER BY Only Sorts Second Branch (S5)
Problem: build_context_for_question() uses a UNION to combine outbound and
inbound edges for a given entity, then applies ORDER BY confidence DESC LIMIT $limit
after the second branch:
rows = db.execute(
"MATCH (src:Entity {canonical: $canonical})-[r]->(dst:Entity) "
"RETURN src.name AS subject, type(r) AS predicate, dst.name AS object, "
" r.confidence AS confidence "
"UNION "
"MATCH (src:Entity)-[r]->(dst:Entity {canonical: $canonical}) "
"RETURN src.name AS subject, type(r) AS predicate, dst.name AS object, "
" r.confidence AS confidence "
"ORDER BY confidence DESC "
"LIMIT $limit",
{"canonical": entity_canonical, "limit": max_facts},
)
Root cause: The grammar defines union_statement: query (union_clause query)+.
Each query is clause+. ORDER BY is a clause within the second query, not a
post-UNION operator. The _execute_union() method in executor.py (line ~4042)
concatenates branch results without post-sort. Results are:
[branch_1_results_unsorted] ++ [branch_2_results_sorted_by_confidence]
Confirmed by experiment:
# outbound from seed: conf 0.3, 0.9 (insertion order)
# inbound to seed: conf 0.6
# Expected with global sort: [0.9, 0.6, 0.3]
# Actual: [0.3, 0.9, 0.6] ← branch1 unordered ++ branch2 sorted
The bug is invisible only when the outbound rows happen to arrive in descending
order and all outrank the inbound rows; in that case the results look correct. It
surfaces as soon as an inbound edge has higher confidence than an outbound one, or
the outbound rows are stored out of order (as in the experiment above). The first
case is typical for high-confidence MAINTAINS or FOUNDED relationships held by hub
entities.
Fix — sort in Python:
rows = db.execute(
"MATCH (src:Entity {canonical: $canonical})-[r]->(dst:Entity) "
"RETURN src.name AS subject, type(r) AS predicate, dst.name AS object, "
" r.confidence AS confidence "
"UNION "
"MATCH (src:Entity)-[r]->(dst:Entity {canonical: $canonical}) "
"RETURN src.name AS subject, type(r) AS predicate, dst.name AS object, "
" r.confidence AS confidence",
{"canonical": entity_canonical},
)
# Sort combined results in Python, then limit
rows = sorted(rows, key=lambda r: r["confidence"].value, reverse=True)[:max_facts]
Or use to_dicts() and sort the plain dicts:
rows = db.to_dicts(
"MATCH (src:Entity {canonical: $canonical})-[r]->(dst:Entity) "
"RETURN src.name AS subject, type(r) AS predicate, dst.name AS object, "
" r.confidence AS confidence "
"UNION "
"MATCH (src:Entity)-[r]->(dst:Entity {canonical: $canonical}) "
"RETURN src.name AS subject, type(r) AS predicate, dst.name AS object, "
" r.confidence AS confidence",
{"canonical": entity_canonical},
)
rows = sorted(rows, key=lambda r: r["confidence"], reverse=True)[:max_facts]
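One detail the Python-side sort has to handle is a missing confidence. The sketch below assumes that some edges may have been stored without a confidence property and therefore come back as None; the source does not say whether this happens in practice. The helper name top_facts is hypothetical. It ranks such rows last instead of letting sorted() raise a TypeError when comparing None with a float.

```python
def top_facts(rows, max_facts=30):
    """Sort fact dicts by confidence, highest first, and keep the top `max_facts`.

    Rows whose confidence is missing or None are ranked last rather than
    raising TypeError during comparison (assumption: such rows can occur).
    """
    def sort_key(row):
        conf = row.get("confidence")
        # Tuple key: rows that have a confidence sort ahead of rows that do not.
        return (conf is not None, conf if conf is not None else 0.0)

    return sorted(rows, key=sort_key, reverse=True)[:max_facts]


# Plain dicts, shaped like the rows to_dicts() returns in the example above.
rows = [
    {"subject": "A", "predicate": "KNOWS", "object": "B", "confidence": 0.3},
    {"subject": "A", "predicate": "FOUNDED", "object": "C", "confidence": None},
    {"subject": "D", "predicate": "MAINTAINS", "object": "A", "confidence": 0.9},
]

for row in top_facts(rows, max_facts=2):
    print(row["subject"], row["predicate"], row["object"], row["confidence"])
```

Whether unscored facts should be ranked last or dropped entirely is a design choice for the pipeline; ranking them last keeps them available when the graph has few facts.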
FP-3: f-String Label Injection Risk (S2, S6a, S7)
Problem: The doc shows label and relationship type values interpolated directly into Cypher strings via f-strings:
# S2 — label from extraction result
db.execute(
f"MERGE (e:Entity:{label} {{canonical: $canonical}}) "
"ON CREATE SET e.name = $name ...",
...
)
# S6a — rel_type from extraction result
db.execute(
f"MATCH (src:Entity {{canonical: $src}})-[r:{rel_type}]->(dst:Entity {{canonical: $dst}}) "
"SET r.confidence = $conf ...",
...
)
In a real pipeline, label and rel_type come from LLM extraction output. LLMs
hallucinate arbitrary strings. A value like "Bad Label" (with a space) causes a
SyntaxError that crashes the pipeline silently — the document node was already created
but the entity MERGE fails, leaving an incomplete ingestion with no error surfaced to
the caller.
Labels cannot be parameterized in openCypher (it is a language-level limitation, not a GraphForge limitation). The fix is to validate before interpolation:
import re
_SAFE_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")
def safe_label(label: str) -> str:
    """Validate a label or relationship type before Cypher interpolation."""
    if not _SAFE_IDENTIFIER.match(label):
        raise ValueError(f"Unsafe Cypher identifier: {label!r}")
    return label
# Usage
db.execute(
    f"MERGE (e:Entity:{safe_label(label)} {{canonical: $canonical}}) ...",
    ...
)
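The validator can be exercised on its own, without a database. The sketch below is a slightly stricter variant and is an assumption, not part of the validated doc: it uses fullmatch() because in Python a pattern ending in $ combined with match() also accepts a string that ends in a newline. The names safe_label_strict and partition_labels are hypothetical. The second helper shows one way to collect rejected labels instead of aborting on the first bad value.

```python
import re

# Same character rules as the pattern above; anchors are unnecessary with fullmatch().
_SAFE_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def safe_label_strict(label: str) -> str:
    """Return `label` unchanged if it is a plain identifier, else raise ValueError."""
    # fullmatch() requires the entire string to match, so "Person\n" is rejected.
    if not isinstance(label, str) or not _SAFE_IDENTIFIER.fullmatch(label):
        raise ValueError(f"Unsafe Cypher identifier: {label!r}")
    return label


def partition_labels(candidates):
    """Split candidate labels into (accepted, rejected) lists, preserving order."""
    accepted, rejected = [], []
    for candidate in candidates:
        try:
            accepted.append(safe_label_strict(candidate))
        except ValueError:
            rejected.append(candidate)
    return accepted, rejected


candidates = ["Person", "Org", "MyEntity_2", "Bad Label", "DROP TABLE", "123Start", "Person\n"]
accepted, rejected = partition_labels(candidates)
print("accepted:", accepted)
print("rejected:", rejected)
```

In an ingestion loop, rejected labels could be logged and skipped, or mapped to a fallback label, so that one hallucinated value does not leave the rest of the document half-ingested.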
Tested: Labels with spaces ("Bad Label"), SQL-style keywords ("DROP TABLE"),
and numeric starts ("123Start") all fail the regex, while "Person", "Org",
"MyEntity_2" pass.
FP-4: to_dicts() Unused Throughout
Every snippet in the doc uses the verbose .value accessor pattern:
for row in rows:
    subj = row["subject"].value
    pred = row["predicate"].value.replace("_", " ").lower()
    obj = row["object"].value
    conf = row["confidence"].value
to_dicts() (shipped in v0.3.7) auto-unwraps all CypherValue wrappers and returns
plain Python dicts. It has ≤0.8% overhead vs manual .value (measured in the
agent-grounding research). All snippets could be simplified:
rows = db.to_dicts(
"MATCH (src:Entity)-[r]->(dst:Entity) WHERE r.confidence >= 0.85 "
"RETURN src.name AS subject, type(r) AS predicate, dst.name AS object, "
" r.confidence AS confidence ORDER BY r.confidence DESC LIMIT 20"
)
for row in rows:  # row is a plain dict, values are Python types
    print(f"{row['subject']:25s} --{row['predicate']}--> {row['object']:25s}")
4. UNION Semantics Deep-Dive
How GraphForge Processes UNION
The grammar defines:
union_statement: query (union_clause query)+
union_clause: "UNION"i "ALL"i?
query: clause+
Each query is a self-contained pipeline of clauses (MATCH, WHERE, RETURN, ORDER BY,
LIMIT, SKIP). An ORDER BY that appears after the second UNION branch is parsed as
a clause within that branch, not as a post-UNION operator.
_execute_union() in executor.py:
1. Plans each branch independently (each gets its own Sort operator if it has ORDER BY)
2. Executes each branch in order
3. Concatenates the result lists: results = branch1_results + branch2_results
4. If UNION (not UNION ALL), deduplicates the concatenated list
There is no mechanism to attach a sort to the combined output. The Union operator
class has no sort_items field.
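The four steps can be modelled in a few lines of plain Python. This is an illustrative model, not the actual _execute_union() code; the function name execute_union_model and the order-preserving deduplication are assumptions. It reuses the confidence values from the FP-2 experiment to show why the combined output is not globally sorted.

```python
def execute_union_model(branch_results, distinct=True):
    """Model of UNION: concatenate branch results in order, then optionally dedupe."""
    # Step 3: plain concatenation, branch 1 first, then branch 2.
    combined = [row for branch in branch_results for row in branch]
    if not distinct:
        return combined  # UNION ALL keeps duplicates
    # Step 4: drop duplicate rows, keeping the first occurrence (assumed behaviour).
    seen = set()
    unique = []
    for row in combined:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Branch 1 has no ORDER BY, so rows arrive in insertion order.
outbound = [
    {"object": "X", "confidence": 0.3},
    {"object": "Y", "confidence": 0.9},
]
# Branch 2 carries the trailing ORDER BY, so only it is sorted.
inbound = sorted(
    [{"object": "Z", "confidence": 0.6}],
    key=lambda r: r["confidence"],
    reverse=True,
)

combined = execute_union_model([outbound, inbound])
print([r["confidence"] for r in combined])

# A global ranking needs a sort over the combined list.
ranked = sorted(combined, key=lambda r: r["confidence"], reverse=True)
print([r["confidence"] for r in ranked])
```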
Impact on build_context_for_question()
The function retrieves both outgoing and incoming edges for an entity, intending to rank all facts by confidence for the LLM context window. The bug means:
- Branch 1 (outbound) results arrive in arbitrary insertion order
- Branch 2 (inbound) results arrive sorted by confidence DESC
- The LLM receives branch-1 facts first, regardless of their confidence values
For most test data the bug is invisible (branch-1 happens to be higher confidence than branch-2). In realistic knowledge graphs where hub entities accumulate many low-confidence outbound edges from early ingestion, the highest-confidence inbound edges (human-verified facts) will appear after the low-confidence outbound ones.
Alternatives Considered
| Approach | Supported | Global sort | Notes |
|---|---|---|---|
| ORDER BY in each branch | ✅ | ✗ (per-branch only) | Not useful for combined ranking |
| ORDER BY after UNION | ✅ (applies to 2nd branch) | ✗ | The current doc bug |
| Python sorted() post-query | ✅ | ✅ | Recommended |
| CALL { UNION } RETURN ... ORDER BY | ✗ not supported | ✅ | Would require subquery support |
Python-side sort is the correct fix. It adds negligible overhead (sorting 30 rows in Python is microseconds).
5. Neighbourhood Expansion Scalability
Query Pattern
The get_neighbourhood() function from the doc:
rows = db.execute(
"MATCH (seed:Entity {canonical: $canonical})-[r*1.." + str(hops) + "]-(neighbour:Entity) "
"WHERE neighbour.canonical <> $canonical "
"RETURN DISTINCT neighbour.canonical AS canonical, neighbour.name AS name, "
" labels(neighbour) AS labels",
{"canonical": canonical},
)
This is an undirected variable-length pattern — it traverses both directions at each
hop. The DISTINCT prevents duplicate nodes from multiple paths.
Latency Results
Benchmarked with scripts/benchmark_neighbourhood.py against synthetic random graphs
(uniform random edge assignment, avg degree ~4 for sparse and ~10 for medium). Seed
node is e0. Timings are p50 / p95 over 10 iterations (the neighbourhood tables below
report p50); n is the result-set size.
Sparse topology (avg degree ~4):
| Config | 1-hop p50 | 1-hop n | 2-hop p50 | 2-hop n | 3-hop p50 | 3-hop n |
|---|---|---|---|---|---|---|
| 1K / 2K edges | 2.0 ms | 0 | 1.9 ms | 0 | 2.1 ms | 0 |
| 5K / 10K edges | 9.3 ms | 5 | 9.4 ms | 20 | 9.5 ms | 69 |
| 10K / 20K edges | 20.1 ms | 4 | 19.2 ms | 15 | 20.2 ms | 53 |
Medium topology (avg degree ~10):
| Config | 1-hop p50 | 1-hop n | 2-hop p50 | 2-hop n | 3-hop p50 | 3-hop n |
|---|---|---|---|---|---|---|
| 1K / 5K edges | 2.1 ms | 6 | 2.2 ms | 68 | 4.5 ms | 496 |
| 5K / 25K edges | 9.5 ms | 12 | 10.0 ms | 125 | 14.4 ms | 1106 |
Observations:
- In the 1K / 2K sparse graph, the seed node e0 has zero reachable neighbours at 1–3 hops (the random wiring happens to isolate e0), so the latency measured there is scan overhead, not traversal. For hub nodes with real connections, expect similar latency until result sets grow large.
- Latency is dominated by graph size (scan cost), not result-set size, at sparse densities. At medium density (5K/25K edges, 3-hop n=1106) it rises to ~14 ms, still acceptable for a RAG context step.
- At 10K nodes (sparse), 3-hop latency is ~20 ms p50, still fast for a RAG step. The doc's claim of "10,000+ nodes without configuration changes" holds for neighbourhood queries.
- Dense graphs returning 1000+ neighbours in 3 hops start to stress DISTINCT (5K/25K, 3-hop, n=1106 → ~14 ms). Add a LIMIT clause to bound result size before it reaches tens of thousands of rows.
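One way to apply the LIMIT advice is to build the query string through a small helper that also validates the interpolated numbers. This is a sketch under assumptions: the helper name neighbourhood_query, the cap of 5 hops, and the cap of 10,000 rows are all hypothetical choices, not GraphForge defaults. Only hops and limit are interpolated; the seed still travels as the $canonical parameter.

```python
def neighbourhood_query(hops: int, limit: int) -> str:
    """Build the n-hop neighbourhood query with a bounded result size.

    `hops` and `limit` are interpolated into the Cypher text, so both are
    checked to be small positive integers first (bounds are assumptions).
    """
    for name, value, upper in (("hops", hops, 5), ("limit", limit, 10_000)):
        # bool is a subclass of int, so it is rejected explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            raise ValueError(f"{name} must be an int, got {value!r}")
        if not 1 <= value <= upper:
            raise ValueError(f"{name} must be between 1 and {upper}, got {value}")
    return (
        f"MATCH (seed:Entity {{canonical: $canonical}})-[r*1..{hops}]-(neighbour:Entity) "
        "WHERE neighbour.canonical <> $canonical "
        "RETURN DISTINCT neighbour.canonical AS canonical, neighbour.name AS name, "
        "       labels(neighbour) AS labels "
        f"LIMIT {limit}"
    )


query = neighbourhood_query(hops=3, limit=500)
print(query)
```

The resulting string would be passed to db.execute() or db.to_dicts() together with {"canonical": canonical}, as in the pattern above.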
Build Strategy Comparison (1K nodes / 2K edges)
| Strategy | Build time | Throughput | Notes |
|---|---|---|---|
| A: execute(MERGE) (doc pattern) | 7.5 s | ~360 ops/s | Parse+plan+exec per call |
| B: create_node_bulk + create_relationship_bulk in bulk_ingest() | 0.01 s | ~300,000 ops/s | Skips Cypher parse + Pydantic validation |
| C: execute(CREATE) (no dedup) | 7.4 s | ~370 ops/s | Similar to A; tiny MATCH scan saving |
Build time by scale (Strategy A, execute(MERGE)):
| Config | Build time | Throughput |
|---|---|---|
| 1K / 2K edges | 8 s | ~360 ops/s |
| 1K / 5K edges | 21 s | ~290 ops/s |
| 5K / 10K edges | 200 s | ~75 ops/s |
| 5K / 25K edges | 500 s | ~60 ops/s |
| 10K / 20K edges | 800 s | ~37 ops/s |
Critical finding: Build throughput degrades significantly with graph size due to
the MATCH-based deduplication in MERGE requiring a full label scan. At 10K nodes,
ingestion via execute(MERGE) takes ~13 minutes for 30K operations. Strategy B (bulk
API) is ~780× faster by skipping Cypher parsing and Pydantic validation entirely.
However, Strategy B requires pre-collected NodeRef objects (returned from
create_node_bulk()), making it unsuitable for incremental document-by-document
ingestion where entity references arrive piecemeal. Strategy A is the natural pattern
for LLM extraction workflows; Strategy B is appropriate for batch ETL (load a
pre-assembled dataset in one pass).
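The throughput column in the Strategy A scale table can be re-derived from the build times as a sanity check. The sketch assumes one operation per node and one per edge, which matches the "30K operations" figure quoted for the 10K / 20K configuration. Build times in the table are rounded, so the results differ slightly from the reported throughput.

```python
# (config, nodes, edges, build_seconds) copied from the Strategy A scale table.
BUILDS = [
    ("1K / 2K edges", 1_000, 2_000, 8),
    ("1K / 5K edges", 1_000, 5_000, 21),
    ("5K / 10K edges", 5_000, 10_000, 200),
    ("5K / 25K edges", 5_000, 25_000, 500),
    ("10K / 20K edges", 10_000, 20_000, 800),
]


def throughput(nodes: int, edges: int, seconds: float) -> float:
    """Operations per second, counting one operation per node and per edge."""
    return (nodes + edges) / seconds


for config, nodes, edges, seconds in BUILDS:
    print(f"{config:>16}: {throughput(nodes, edges, seconds):7.1f} ops/s")

# Ratio between the smallest and largest configuration.
slowdown = throughput(1_000, 2_000, 8) / throughput(10_000, 20_000, 800)
print(f"throughput drop from 1K to 10K nodes: {slowdown:.0f}x")
```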
BFS shortestPath Workaround
rows = db.execute(
"MATCH path = (a:Entity {canonical: $src})-[*1..5]-(b:Entity {canonical: $dst}) "
"RETURN length(path) AS hops "
"ORDER BY hops ASC LIMIT 1",
{"src": "alice", "dst": "decisionnerd"},
)
| Config | max_hops | p50 ms | p95 ms | Found |
|---|---|---|---|---|
| 1K sparse | 3 | 2.3 ms | 3.1 ms | not found (e0→e100 not connected) |
| 1K sparse | 5 | 2.2 ms | 2.4 ms | not found |
| 5K sparse | 3 | 10.9 ms | 11.1 ms | not found |
| 5K sparse | 5 | 15.9 ms | 16.1 ms | not found |
The query executes quickly when no path exists (early termination). The synthetic graph
used for benchmarking uses uniform random wiring; e0 and e100 happen not to be
connected at ≤5 hops in a sparse (avg degree 4) graph. In a real knowledge graph with
hub entities (high-degree canonical entities), paths would be found and latency would
reflect result materialisation rather than early exit.
Verdict: The BFS query itself adds ≤16 ms overhead at 5K nodes. The practical
concern is result set explosion, not raw latency — a query like [*1..5] on a
high-degree graph can enumerate millions of paths. Use with an explicit LIMIT 1 and
only on graphs where you control the maximum degree. For production path-finding, wait
for #468 (native shortestPath).
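Where path enumeration is a concern, the hop count can also be computed on the client with a true breadth-first search, which visits each node at most once and stops at the first hit. This is a different technique from the Cypher query above, sketched under the assumption that the relevant edges have already been fetched as (source, target) pairs of canonical IDs, for example with a simple MATCH that returns both endpoints. The function name bfs_hops is hypothetical.

```python
from collections import deque


def bfs_hops(edges, src, dst, max_hops=5):
    """Return the minimum hop count between src and dst, or None if not reachable.

    `edges` is an iterable of (a, b) pairs and is treated as undirected, to
    mirror the undirected pattern used in the Cypher workaround.
    """
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    if src == dst:
        return 0

    visited = {src}
    frontier = deque([(src, 0)])
    while frontier:
        node, hops = frontier.popleft()
        if hops == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in adjacency.get(node, ()):
            if neighbour in visited:
                continue
            if neighbour == dst:
                return hops + 1  # first hit in BFS order is the shortest
            visited.add(neighbour)
            frontier.append((neighbour, hops + 1))
    return None


edges = [
    ("alice-chen", "acme"),
    ("acme", "decisionnerd"),
    ("alice-chen", "bob"),
    ("bob", "carol"),
    ("carol", "decisionnerd"),
]
print(bfs_hops(edges, "alice-chen", "decisionnerd"))
```

The trade-off is that the edge list has to be materialised in Python first, so this suits graphs whose edge set fits comfortably in memory.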
6. Transaction and Bulk Ingestion Semantics
How Transactions Work
begin() takes a full deep copy of the entire in-memory graph state via
copy.deepcopy() (storage/memory.py, snapshot() at line ~486). This copies all
nodes, edges, adjacency lists, label/type/property indexes, and statistics counters.
Measured overhead (scripts/benchmark_neighbourhood.py, Phase 3 section):
| Graph size | begin() ms (p50) | commit() ms (p50) | rollback() ms (p50) |
|---|---|---|---|
| 0 nodes / 0 edges | 0.00 | 0.00 | 0.00 |
| 100 nodes / 150 edges | 3.83 | 0.07 | 0.07 |
| 1K nodes / 1.5K edges | 41.83 | 1.02 | 0.94 |
| 5K nodes / 7.5K edges | 265.34 | 7.82 | 8.33 |
begin() is O(n+e) — linear with graph size. At 5K nodes it takes 265 ms.
commit() and rollback() are fast (sub-10 ms) because they only write/restore the
snapshot, not re-derive it.
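Implication: begin() doubles the memory footprint of the graph while the snapshot is
held, and its cost compounds in loops. Extrapolating linearly from the table above
(265 ms at 5K nodes; an estimate, not a measurement), one begin() at 10K nodes costs
roughly 0.5 s, so calling begin() per document in a 100-document ingestion loop adds
on the order of 50 s of overhead from repeated deep copies alone.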
Rollback Correctness
Verified: ingest 2 nodes inside a transaction, rollback, graph reverts to pre-transaction state. The snapshot is a complete copy — not a diff or WAL — so rollback is always total.
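The snapshot semantics described here can be illustrated with a toy store. This is not GraphForge's implementation: the class name SnapshotStore and its single nodes dict are stand-ins chosen for the sketch. It shows why rollback is always total when begin() keeps a full deep copy, and why the cost of begin() tracks the size of the state being copied.

```python
import copy


class SnapshotStore:
    """Toy in-memory store with full-copy transactions (illustrative only)."""

    def __init__(self):
        self.nodes = {}
        self._snapshot = None

    def begin(self):
        # A full deep copy: cost grows with the amount of state held.
        self._snapshot = copy.deepcopy(self.nodes)

    def commit(self):
        # Keep the current state and drop the snapshot.
        self._snapshot = None

    def rollback(self):
        if self._snapshot is None:
            raise RuntimeError("rollback() called without begin()")
        # Restore the complete pre-transaction state, not a diff.
        self.nodes = self._snapshot
        self._snapshot = None


store = SnapshotStore()
store.nodes["alice"] = {"name": "Alice"}

store.begin()
store.nodes["bob"] = {"name": "Bob"}
store.nodes["carol"] = {"name": "Carol"}
store.nodes["alice"]["name"] = "changed"  # in-place edits are undone as well
store.rollback()

print(store.nodes)
```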
Ingestion Strategy Benchmarks (100 documents)
100 documents, each with 3–5 entity MERGEs and 2–4 relationship MERGEs:
| Strategy | Time (ms) | Overhead vs (a) |
|---|---|---|
| (a) No transaction | 637 ms | - |
| (b) One begin()/commit() per document | 2158 ms | +239% |
| (c) One begin()/commit() for entire batch | 618 ms | −3% |
Findings:
- Per-document transactions add ~239% overhead due to repeated deep-copy snapshots
(100 × begin() on a growing graph).
- A batch transaction (wrapping the entire ingestion in one begin()/commit()) measured
3% faster than no transaction, so it is effectively free: the single snapshot is
taken once on an empty graph.
- Recommendation: use a single batch transaction per ingestion job, not per
document. If atomicity per document is needed, consider checkpointing by document
batch (every 10–20 documents).
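The checkpointing suggestion could look like the sketch below. It relies only on the begin(), commit(), and rollback() calls used elsewhere in this document; the helper name ingest_in_checkpoints, the ingest_one callback, and the RecordingDb stub (included so the example runs without GraphForge) are hypothetical.

```python
def ingest_in_checkpoints(db, documents, ingest_one, batch_size=10):
    """Ingest documents in chunks, one transaction per chunk.

    A failure rolls back only the current chunk; earlier chunks stay committed.
    Returns the number of documents committed.
    """
    committed = 0
    for start in range(0, len(documents), batch_size):
        chunk = documents[start:start + batch_size]
        db.begin()  # one snapshot per chunk instead of one per document
        try:
            for doc in chunk:
                ingest_one(db, doc)
            db.commit()
        except Exception:
            db.rollback()
            raise
        committed += len(chunk)
    return committed


class RecordingDb:
    """Stand-in for a database handle; records which calls were made."""

    def __init__(self):
        self.calls = []

    def begin(self):
        self.calls.append("begin")

    def commit(self):
        self.calls.append("commit")

    def rollback(self):
        self.calls.append("rollback")


db = RecordingDb()
docs = [{"id": i} for i in range(25)]
count = ingest_in_checkpoints(db, docs, lambda d, doc: d.calls.append("doc"), batch_size=10)
print(count, db.calls.count("begin"), db.calls.count("commit"))
```

Because begin() cost grows with graph size, a larger batch_size means fewer snapshots but more work lost on a rollback; the 10 to 20 document range suggested above is one point on that trade-off.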
bulk_ingest() Assessment
bulk_ingest() (defined in api.py, lines 548–567) only sets _defer_stats = True,
which skips per-edge avg_degree and unique_source tracking. On exit it calls
_flush_deferred_stats() which rebuilds these in O(E) time.
It does not:
- Batch SQLite writes (no SQLite interaction during individual execute() calls)
- Defer label/property/type indexing (these update per-insert)
- Disable Pydantic validation on execute()-based ingestion
Measured speedup when wrapping execute(MERGE) calls: ~1.0× (no consistent
benefit — two independent runs yielded 1.13× and 0.79×, within measurement noise)
The dominant cost is parse → plan → execute per Cypher string. Stats deferral is a
tiny fraction of that. bulk_ingest() benefits only when using the lower-level
create_node_bulk() / create_relationship_bulk() APIs, which skip Pydantic
validation and avoid Cypher parsing entirely.
Naming issue: The name bulk_ingest() implies batch write optimization. The
actual benefit is limited to statistics tracking deferral. Users combining it with
execute(MERGE) calls will see no improvement and may be misled into assuming
atomicity (it is not a transaction wrapper).
7. Integration Patterns Assessment
LangChain (S8)
The doc imports from langchain.schema import Document as LCDocument. This import
path was deprecated in LangChain 0.1.x; the current path is
from langchain_core.documents import Document. The Cypher query pattern is correct:
rows = db.execute(
"MATCH (e:Entity {canonical: $canonical})-[r]->(other:Entity) "
"RETURN e.name AS subject, type(r) AS predicate, other.name AS object",
{"canonical": query_entity},
)
Replace .value access with to_dicts() for cleaner integration.
LlamaIndex (S9)
The GraphForgeRetriever pattern (inheriting BaseRetriever) is correct for
llama_index.core >= 0.10. The Cypher query for entity lookup:
rows = self._db.execute(
"MATCH (e:Entity) WHERE toLower(e.name) CONTAINS toLower($q) "
"RETURN e.canonical AS canonical LIMIT 1",
{"q": query_bundle.query_str},
)
This is a keyword match. For production, replace with embedding-based lookup (the doc
notes this). The canonical lookup into build_context_for_question() chain works.
Generic Pattern (S10)
The prompt construction pattern is framework-agnostic and correct:
context = build_context_for_question(db, entity_canonical)
messages = [
{"role": "system", "content": "Answer using only the facts provided."},
{"role": "user", "content": f"{context}\n\nQuestion: {question}"},
]
Once build_context_for_question() is fixed (Python-side sort), this pattern works
with any LLM client (OpenAI, Anthropic, Cohere, Ollama, etc.).
Should neighbourhood() or ContextBuilder Ship?
Assessment: No — not as core API.
get_neighbourhood() and build_context_for_question() are schema-dependent: they
assume Entity labels and a canonical property. Shipping them in the library would
bake in a specific schema that many users will not use. The patterns are short enough
(10–20 lines each) to copy from the doc.
Recommendation: Keep them as documented recipes in the use-case doc. Add a
separate examples/llm_workflows/ notebook. If demand emerges, consider a
graphforge.recipes module (not part of the core API surface).
8. Recommendations
| # | Recommendation | Priority | Effort | Type |
|---|---|---|---|---|
| R-1 | Fix build_context_for_question(): replace trailing ORDER BY with Python sorted() | High | Low | Doc fix |
| R-2 | Fix shortestPath() example: replace with BFS workaround + caveat, reference #468 | High | Low | Doc fix |
| R-3 | Add safe_label() / safe_identifier() helper to all f-string interpolation examples | High | Low | Doc enhancement |
| R-4 | Replace all .value access in the doc with to_dicts() | Medium | Low | Doc enhancement |
| R-5 | Update bulk_ingest() docstring to clarify it only defers statistics | Medium | Low | API clarification |
| R-6 | Add transaction guidance: recommend batch transaction (one per ingestion job) over per-document | Medium | Low | Doc enhancement |
| R-7 | Update LangChain import from langchain.schema to langchain_core.documents | Low | Low | Doc fix |
| R-8 | Implement shortestPath() in parser/planner/executor | Medium | High | Feature (#468) |
| R-9 | Add post-UNION ORDER BY support (global sort after UNION) | Low | Medium | Feature |
9. Issues Filed
| Issue | Title | Priority |
|---|---|---|
| #488 | docs: fix build_context_for_question(), UNION trailing ORDER BY sorts only second branch | High |
| #489 | docs: replace shortestPath() in llm-workflows.md, parser raises SyntaxError | High |
| #490 | docs: add safe_identifier() pattern for f-string label/rel_type injection | Medium |
| #491 | api: clarify bulk_ingest() docstring, defers statistics only, not batch writes | Medium |
| #492 | feat: gf.neighbourhood(canonical, hops) convenience method for n-hop expansion | Low |
Appendix A: Corrected Snippets
A-1: get_neighbourhood() (no changes needed, works as documented)
def get_neighbourhood(db, canonical: str, hops: int = 2) -> list[dict]:
    """Return all entities reachable within `hops` steps from `canonical`."""
    return db.to_dicts(
        "MATCH (seed:Entity {canonical: $canonical})-[r*1.." + str(hops) + "]-(neighbour:Entity) "
        "WHERE neighbour.canonical <> $canonical "
        "RETURN DISTINCT "
        "  neighbour.name AS name, "
        "  neighbour.canonical AS canonical, "
        "  labels(neighbour) AS labels",
        {"canonical": canonical},
    )
A-2: build_context_for_question() with the UNION ORDER BY fix
def build_context_for_question(db, entity_canonical: str, max_facts: int = 30) -> str:
    """Build a compact context string from the graph for use in an LLM prompt."""
    # Fetch combined outbound + inbound edges (no ORDER BY; sort in Python)
    rows = db.to_dicts(
        "MATCH (src:Entity {canonical: $canonical})-[r]->(dst:Entity) "
        "RETURN src.name AS subject, type(r) AS predicate, dst.name AS object, "
        "       r.confidence AS confidence "
        "UNION "
        "MATCH (src:Entity)-[r]->(dst:Entity {canonical: $canonical}) "
        "RETURN src.name AS subject, type(r) AS predicate, dst.name AS object, "
        "       r.confidence AS confidence",
        {"canonical": entity_canonical},
    )
    if not rows:
        return "No relevant facts found in the knowledge graph."
    # Sort combined results globally, then limit
    rows = sorted(rows, key=lambda r: r["confidence"], reverse=True)[:max_facts]
    lines = ["Relevant facts from the knowledge graph:\n"]
    for row in rows:
        pred = row["predicate"].replace("_", " ").lower()
        conf = row["confidence"]
        lines.append(f"  - {row['subject']} {pred} {row['object']} (confidence: {conf:.0%})")
    return "\n".join(lines)
A-3: shortestPath() BFS workaround
import re
_SAFE_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")
def safe_identifier(value: str) -> str:
    """Validate a label or relationship type before Cypher f-string interpolation."""
    if not _SAFE_IDENTIFIER.match(value):
        raise ValueError(f"Unsafe Cypher identifier: {value!r}")
    return value
def find_shortest_path(db, src_canonical: str, dst_canonical: str, max_hops: int = 4):
    """
    BFS-based shortest path workaround for shortestPath() (#468).
    Only practical for small graphs (< 2K nodes) or tight hop limits (<= 3).
    """
    rows = db.to_dicts(
        f"MATCH path = (a:Entity {{canonical: $src}})-[*1..{max_hops}]-(b:Entity {{canonical: $dst}}) "
        "RETURN length(path) AS hops "
        "ORDER BY hops ASC LIMIT 1",
        {"src": src_canonical, "dst": dst_canonical},
    )
    return rows[0]["hops"] if rows else None
A-4: Recommended ingestion pattern with transaction and injection safety
import re
from graphforge import GraphForge
_SAFE_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")
def safe_id(value: str) -> str:
    if not _SAFE_IDENTIFIER.match(value):
        raise ValueError(f"Unsafe Cypher identifier: {value!r}")
    return value
def ingest_documents(db: GraphForge, documents: list[dict], model: str = "mock-v1"):
    """
    Process documents with a single batch transaction for atomicity.
    Validates label/rel_type identifiers before interpolation.
    """
    db.begin()
    try:
        for doc in documents:
            db.execute(
                "MERGE (d:Document {id: $id}) "
                "ON CREATE SET d.title = $title, d.ingested = datetime() "
                "ON MATCH SET d.title = $title",
                {"id": doc["id"], "title": doc["title"]},
            )
            extraction = extract(doc["text"])  # your LLM call here
            for ent in extraction["entities"]:
                label = safe_id(ent["label"])  # validate before interpolation
                db.execute(
                    f"MERGE (e:Entity:{label} {{canonical: $canonical}}) "
                    "ON CREATE SET e.name = $name, e.created = datetime() "
                    "ON MATCH SET e.name = $name",
                    {"canonical": ent["canonical"], "name": ent["name"]},
                )
                db.execute(
                    "MATCH (d:Document {id: $doc_id}), (e:Entity {canonical: $canonical}) "
                    "MERGE (d)-[:MENTIONS]->(e)",
                    {"doc_id": doc["id"], "canonical": ent["canonical"]},
                )
            for rel in extraction["relationships"]:
                rel_type = safe_id(rel["type"])  # validate before interpolation
                db.execute(
                    "MATCH (src:Entity {canonical: $src}), (dst:Entity {canonical: $dst}) "
                    f"MERGE (src)-[r:{rel_type}]->(dst) "
                    "ON CREATE SET r.confidence = $conf, r.model = $model, "
                    "              r.created = datetime() "
                    "ON MATCH SET r.confidence = $conf, r.model = $model, "
                    "             r.updated = datetime()",
                    {
                        "src": rel["src"],
                        "dst": rel["dst"],
                        "conf": rel["confidence"],
                        "model": model,
                    },
                )
        db.commit()
    except Exception:
        db.rollback()
        raise
Appendix B: Benchmark Methodology
Environment: macOS 15.4, Apple Silicon. All timings are wall-clock (single
process, no parallelism). GraphForge is in-memory (GraphForge()) unless noted.
Neighbourhood benchmark: scripts/benchmark_neighbourhood.py. Graphs are random
Erdős–Rényi with fixed seed=42. Latency is median (p50) over 10 iterations from the
same seed node (e0).
Transaction benchmark: Inline script (not committed — see Phase 3 probe). 100 synthetic documents with 3–5 entity MERGEs and 2–4 relationship MERGEs per document. Entities drawn from a pool of 50 canonical IDs (many collisions, exercises ON MATCH path). Timing is total wall-clock for all 100 documents.
bulk_ingest() speedup: 500 nodes + 500 edges, execute(MERGE) pattern (not
create_node_bulk). Measured with and without bulk_ingest() context manager.