Research: Network Analysis Notebook Workflow and pandas/NetworkX Bridge Gaps¶

Issue: #450
Date: 2026-05-04
Branch: docs/450-network-analysis-research
Scope: Findings-only. No library code changes.

1. Executive Summary¶

docs/use-cases/network-analysis.md guides users through loading the SNAP ego-facebook dataset, running Cypher graph metrics, and bridging to pandas/NetworkX for visualization and algorithms. Since that doc was written, PR #465 shipped to_dicts(), to_dataframe(), to_networkx(), and to_igraph() — four first-class export methods that eliminate almost all of the manual boilerplate the doc shows.

Key findings:

Schema mismatch in two forms (node label and relationship type) makes ~10 of 22 doc snippets fail as written: the doc uses :Person/:FRIEND_OF but the snap-ego-facebook CSV loader produces :Node/:CONNECTED_TO. After correcting labels, the remaining failures are the S10 variable-reuse bug (which has a workaround), the S16 ORDER BY bug, and the S11 timeout.
Two distinct Cypher bugs found: (1) ORDER BY on an unaliased variable after RETURN DISTINCT raises UndefinedVariable; (2) re-using a variable name across a WITH boundary after it leaves scope raises KeyError. The original S10 KeyError('a') is an instance of bug (2), not a third bug.
to_networkx() and to_igraph() work well on the full 4 039-node / 88 234-edge graph. Export takes ~130 ms and ~120 ms respectively.
Node IDs in exported graphs are internal integers, not user id properties. Writing NX/igraph algorithm results back to GraphForge requires a manual lookup table or an UNWIND loop. Per-node UNWIND write-back of 4 039 PageRank values takes ~60–80 s — critically slow.
Triangle count (S11) on the full SNAP graph did not complete in the test window (>5 min). This is an expected scalability limit for the in-Python executor; users must delegate triangle counting to NetworkX or igraph.
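igraph is dramatically faster than NetworkX for comparable algorithms: exact betweenness (0.19 s vs ~45 s in NetworkX, or ~0.35 s sampled with k=200), Louvain (0.03 s vs 0.35 s), and clustering (0.006 s for igraph global transitivity vs 1.0 s for NetworkX average clustering, which are related but not identical metrics).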
Competitive gap: Neo4j GDS provides built-in PageRank, Louvain, and betweenness with in-memory projection and write-back; GraphForge requires export to NX/igraph. igraph standalone loads from files in one step. KùzuDB provides similar embedded-graph + Cypher workflow but without a curated dataset library.

7 friction points identified (FP-5 has since been resolved); 6 new issues filed (#477–#482).

2. Code Snippet Pass/Fail Matrix¶

All snippets from docs/use-cases/network-analysis.md run against main as of 2026-05-04.

Validation methodology: Each snippet was run with the doc's original labels, then re-run with corrected labels (Node/CONNECTED_TO) to distinguish "fails because wrong label" from "fails because feature doesn't work."
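The two-run methodology amounts to a small decision rule. The sketch below is illustrative only: `classify` and the status names are hypothetical helpers, not part of GraphForge or of the test harness actually used, and the three sample rows are taken from the matrix in this section.

```python
def classify(doc_labels_ok: bool, corrected_labels_ok: bool) -> str:
    """Classify one snippet from its two runs (hypothetical helper)."""
    if doc_labels_ok:
        return "PASS"
    if corrected_labels_ok:
        return "SCHEMA_MISMATCH"   # only the labels were wrong
    return "FEATURE_FAILURE"       # fails even with the correct schema


# (passes with doc labels, passes with corrected labels)
runs = {"S2": (True, True), "S5": (False, True), "S16": (False, False)}
statuses = {snippet: classify(*flags) for snippet, flags in runs.items()}
print(statuses)
```

Keeping the two runs separate is what lets a label typo be told apart from a genuine engine limitation.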

ID	Section	Snippet (abbreviated)	Status (doc labels)	Status (corrected)	Root Cause
S1	Loading	`load_dataset(db, "snap-ego-facebook")`	✅ PASS	—	—
S2	Loading	`count(n)` / `count(r)`	✅ PASS (4039/88234)	—	—
S3	Loading	`labels(n)` returning `['Person']`	❌ SCHEMA_MISMATCH	N/A — doc is wrong	Loader hardcodes `Node`
S4	Loading	`type(r)` returning `'FRIEND_OF'`	❌ SCHEMA_MISMATCH	N/A — doc is wrong	Loader hardcodes `CONNECTED_TO`
S5	Degree	Out-degree `:Person`-`[:FRIEND_OF]`	❌ 0 rows	✅ PASS	Schema mismatch only
S6	Degree	Top hubs by total degree	❌ 0 rows	✅ PASS (top: node 107, degree=1045)	Schema mismatch only
S7	Path	`(a:Person {id:'0'})-[*1..3]-(b:Person {id:'107'})`	❌ 0 rows	✅ PASS (28 paths)	Schema mismatch only
S8	Path	`length(p)` on `[:FRIEND_OF*1..4]` path	❌ 0 rows	✅ PASS (shortest=1 hop)	Schema mismatch only
S9	Path	`approximate_distance()` BFS	❌ wrong label in function	✅ PASS (distance(0,107)=1)	Schema mismatch only
S10	Community	Jaccard similarity (multi-MATCH)	❌ KeyError('a')	✅ PASS (jaccard=0.0077) with variable rebind	Parser bug — variable reuse across WITH
S11	Community	Triangle count on full SNAP graph	❌ timeout (>5 min)	⏱ TIMEOUT (still running)	Performance — O(n³) in Python executor
S12	Community	Bridge nodes (pattern comprehension)	Pending S11 result	Pending S11 result	—
S13	Citation	`CREATE (:Paper)` / `CREATE (:CITES)`	✅ PASS	—	—
S14	Citation	Most-cited papers	✅ PASS	—	—
S15	Citation	Co-citation pairs (`id(a) < id(b)`)	✅ PASS (2 pairs)	—	—
S16	Citation	`MATCH … RETURN DISTINCT … ORDER BY ancestor.year`	❌ UndefinedVariable	Same bug	Parser bug — ORDER BY on non-projected var
S17	Temporal	`CREATE (:User)` / `CREATE (:FOLLOWS)`	✅ PASS	—	—
S18	Temporal	Date filtering, `CASE` periods	✅ PASS	—	—
S19	Persist	`SET n.degree = degree`	❌ 0 rows (wrong label)	✅ PASS (4039 nodes updated)	Schema mismatch only
S20	Pandas	Manual `.value` unwrap helper	✅ PASS but OUTDATED	—	`gf.to_dataframe()` replaces this
S21	NX	Manual `nx.DiGraph()` loop	✅ PASS but OUTDATED	—	`gf.to_networkx()` replaces this
S22	Stats	`stDev(n.degree)`	✅ PASS (std=52.42)	—	—

Summary: 10 schema-mismatch failures (all corrected after label fix), 2 parser bugs, 1 performance issue, 2 outdated patterns.

S10 Variable-Reuse Bug (confirmed)¶

The doc's Jaccard query reuses variable names a and b after they leave scope through a WITH boundary:

# FAILS — 'a' and 'b' are out of scope after first WITH
db.execute("""
    MATCH (a:Node {id: '107'})-[:CONNECTED_TO]-(common)-[:CONNECTED_TO]-(b:Node {id: '1684'})
    WITH count(DISTINCT common) AS shared
    MATCH (a:Node {id: '107'})-[:CONNECTED_TO]-(na)   # 'a' reused after WITH
    ...
""")
# → KeyError: 'a'

Workaround: use fresh variable names in each MATCH after a WITH boundary:

db.execute("""
    MATCH (a1:Node {id: '107'})-[:CONNECTED_TO]-(common)-[:CONNECTED_TO]-(b1:Node {id: '1684'})
    WITH count(DISTINCT common) AS shared
    MATCH (a2:Node {id: '107'})-[:CONNECTED_TO]-(na)
    WITH shared, count(DISTINCT na) AS deg_a
    MATCH (b2:Node {id: '1684'})-[:CONNECTED_TO]-(nb)
    WITH shared, deg_a, count(DISTINCT nb) AS deg_b
    RETURN toFloat(shared) / (deg_a + deg_b - shared) AS jaccard
""")
# → jaccard = 0.0077
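For readers who want to sanity-check the Cypher result, the same formula (shared neighbors divided by the size of the neighbor union) can be written in a few lines of plain Python. This is a sketch on a toy adjacency dict, not on the SNAP data; the node names and the `jaccard` helper are made up for illustration.

```python
def jaccard(adj: dict[str, set[str]], a: str, b: str) -> float:
    """Jaccard similarity of two nodes' neighbor sets: |A & B| / |A | B|."""
    shared = len(adj[a] & adj[b])
    # Same denominator as the Cypher query: deg_a + deg_b - shared
    union = len(adj[a]) + len(adj[b]) - shared
    return shared / union if union else 0.0


# Toy neighbor sets (hypothetical data)
adj = {
    "x": {"p", "q", "r"},
    "y": {"q", "r", "s", "t"},
}
score = jaccard(adj, "x", "y")
print(score)  # 2 shared neighbors out of 5 distinct ones
```

With an exported graph, the neighbor sets could be built from the adjacency structure and the result compared against the Cypher value.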

S16 ORDER BY After RETURN DISTINCT Bug (confirmed)¶

ORDER BY cannot reference unaliased source variables after RETURN DISTINCT:

# FAILS — 'ancestor' not projected by RETURN DISTINCT
db.execute("""
    MATCH (root:Paper {id: 'p5'})-[:CITES*1..5]->(ancestor:Paper)
    RETURN DISTINCT ancestor.title AS title, ancestor.year AS year
    ORDER BY ancestor.year   # ← UndefinedVariable
""")

Workaround: use the alias:

db.execute("""
    MATCH (root:Paper {id: 'p5'})-[:CITES*1..5]->(ancestor:Paper)
    RETURN DISTINCT ancestor.title AS title, ancestor.year AS year
    ORDER BY year   # ← alias works
""")

3. Full Pipeline Test¶

Environment¶

GraphForge main branch (2026-05-04)
Python 3.13, NetworkX 3.x, igraph 0.11.x, scipy 1.17.1
Hardware: Apple Silicon (arm64), 16 GB RAM

Stage 1: Load¶

from graphforge import GraphForge
from graphforge.datasets import load_dataset

db = GraphForge()
load_dataset(db, "snap-ego-facebook")
# 4039 nodes, 88234 edges

Metric	Value
Load time	0.40 s
Node count	4 039
Edge count	88 234
Schema	`:Node`, `:CONNECTED_TO`

Stage 2: Cypher Analytics¶

# Top-10 hubs by undirected degree
hubs = db.execute("""
    MATCH (n:Node)
    OPTIONAL MATCH (n)-[:CONNECTED_TO]-(neighbor)
    WITH n, count(DISTINCT neighbor) AS degree
    RETURN n.id AS node_id, degree
    ORDER BY degree DESC LIMIT 10
""")

Query	Time
Count nodes/edges	0.11 s
Top-10 hubs	0.49 s
Set degree on all nodes	~1.5 s

Top hubs: node 107 (degree 1 045), 1684 (792), 1912 (755).

Stage 3: `to_dataframe()` Pipeline¶

df = db.to_dataframe("""
    MATCH (n:Node)
    OPTIONAL MATCH (n)-[:CONNECTED_TO]-(neighbor)
    WITH n, count(DISTINCT neighbor) AS degree
    RETURN n.id AS node_id, degree
    ORDER BY degree DESC
""")
# df.shape = (4039, 2)
# df["degree"].dtype = int64  ← no CypherValue leakage
# df["node_id"].dtype = str (object)

Metric	Value
Export time	0.79 s
Shape	(4 039, 2)
`degree` dtype	int64
`node_id` dtype	str
CypherValue leakage	None

to_dataframe() is fully matplotlib-ready. No .value unwrapping needed.

Stage 4: `to_networkx()` Pipeline¶

import networkx as nx

G = db.to_networkx()
# type: nx.DiGraph, 4039 nodes, 88234 edges, 0.134s

Node attributes: _labels, id
Edge attributes: _type

Algorithm	Time	Notes
`to_networkx()`	0.134 s	Full graph
`degree_centrality`	0.001 s	—
`pagerank` (scipy)	1.40 s	max_iter=100
`betweenness` (k=200)	0.35 s	Sampled; exact would take ~45 s
`average_clustering` (undirected)	1.01 s	Value = 0.6055
`louvain_communities`	0.35 s	16 communities
`connected_components`	0.003 s	1 component, all 4039 nodes

Note: nx.pagerank() requires scipy. NX will raise ModuleNotFoundError if scipy is not installed.

Community detection requires undirected graph:

G_und = G.to_undirected()
communities = nx.algorithms.community.louvain_communities(G_und, seed=42)
# 16 communities, largest: 548, 535, 435, 431, 423 nodes

Stage 5: `to_igraph()` Pipeline¶

import igraph as ig

G_ig = db.to_igraph()
# directed=True, 4039 vertices, 88234 edges, 0.122s

Vertex attributes: _gf_id, _labels, id
Edge attributes: _type

Algorithm	Time	Notes
`to_igraph()`	0.122 s	Full graph
`pagerank()`	0.001 s	—
`betweenness()`	0.194 s	Exact (vs NX ~45 s)
`community_multilevel()`	0.030 s	Louvain, undirected only
`transitivity_undirected()`	0.006 s	Value = 0.5192
`closeness()`	0.741 s	—
`clusters()`	< 0.001 s	1 component

Community detection requires undirected conversion:

G_und = G_ig.as_undirected()
communities = G_und.community_multilevel()
# 15 communities, largest: 554, 548, 448, 430, 423 nodes

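igraph is roughly 10x to more than 1000x faster than NetworkX on comparable algorithms at this scale (Louvain ~12x, exact betweenness ~230x, PageRank ~1400x, based on the timings above). Even exact igraph betweenness (0.19 s) beats NetworkX betweenness sampled with k=200 (0.35 s).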
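The speedup ratios implied by the Stage 4 and Stage 5 timing tables can be computed directly. The figures below are copied from those tables; the NetworkX exact-betweenness time is the report's ~45 s estimate, and the clustering row compares two related but different metrics, so treat these as order-of-magnitude indicators only.

```python
# (NetworkX seconds, igraph seconds), as reported in the tables above
timings = {
    "pagerank": (1.40, 0.001),
    "betweenness (exact)": (45.0, 0.194),
    "louvain": (0.35, 0.030),
    "clustering / transitivity": (1.01, 0.006),
}

speedups = {name: nx_s / ig_s for name, (nx_s, ig_s) in timings.items()}
for name, ratio in sorted(speedups.items(), key=lambda kv: kv[1]):
    print(f"{name}: {ratio:.0f}x")
```

Sub-millisecond timings such as the igraph PageRank figure are close to timer resolution, so the largest ratios carry the most uncertainty.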

Stage 6: Write-Back¶

Writing algorithm results (e.g., PageRank) back to GraphForge node properties:

import networkx as nx

G = db.to_networkx()
pr = nx.pagerank(G)

# Method 1: per-node loop
for nid, score in pr.items():
    db.execute("MATCH (n) WHERE id(n) = $nid SET n.pagerank = $pr",
               {"nid": int(nid), "pr": float(score)})
# Time: 58 s for 4039 nodes  ← extremely slow

# Method 2: UNWIND batch
items = [{"nid": int(k), "pr": float(v)} for k, v in pr.items()]
db.execute("""
    UNWIND $items AS item
    MATCH (n) WHERE id(n) = item.nid
    SET n.pagerank = item.pr
""", {"items": items})
# Time: 79 s for 4039 nodes  ← also slow (single transaction overhead)

Both methods are critically slow (~60–80 s for 4 039 updates). The bottleneck is the per-row Cypher execution overhead. There is no bulk property-set method in the current API.

ID mapping: to_networkx() exposes nodes as internal integer IDs. User id properties are available as node attributes but NX algorithm results are keyed by internal ID. The UNWIND pattern above works because id(n) = internal_id is an O(1) lookup via the property index. However, users who want to join NX results to their application data by user id must build a lookup table themselves.
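A minimal sketch of the lookup table described above, using plain dicts as stand-ins for `G.nodes(data=True)` and an algorithm result. The sample IDs and scores are invented for illustration; the uniqueness check reflects an assumption that the user `id` property is unique per node.

```python
# Stand-in for G.nodes(data=True): (internal_id, attributes) pairs
nodes = [(1, {"id": "0"}), (2, {"id": "1"}), (3, {"id": "2"})]
# Stand-in for an algorithm result keyed by internal ID (e.g. a PageRank dict)
scores = {1: 0.5, 2: 0.3, 3: 0.2}

internal_to_user = {nid: attrs["id"] for nid, attrs in nodes}
user_to_internal = {uid: nid for nid, uid in internal_to_user.items()}

# The inverse map is only safe if user ids are unique
if len(user_to_internal) != len(internal_to_user):
    raise ValueError("user id property is not unique; cannot invert the mapping")

# Re-key results by user id for joining with application data
scores_by_user = {internal_to_user[nid]: s for nid, s in scores.items()}
print(scores_by_user)
```

The forward map serves reporting, and the inverse map serves write-back when the application only knows user ids.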

Boilerplate Comparison¶

Old manual pattern (from doc, ~12 lines):

import pandas as pd

raw = db.execute("MATCH (n:Person) RETURN n.degree AS degree ORDER BY degree")

def to_dataframe(results):
    rows = []
    for row in results:
        rows.append({key: val.value for key, val in row.items()})
    return pd.DataFrame(rows)

df = to_dataframe(raw)

New API (1 line):

df = db.to_dataframe("MATCH (n:Node) RETURN n.degree AS degree ORDER BY degree")

Old NX construction (from doc, ~7 lines):

import networkx as nx

edges_raw = db.execute("MATCH (a:Person)-[:FRIEND_OF]->(b:Person) RETURN a.id AS src, b.id AS dst LIMIT 5000")
G = nx.DiGraph()
for row in edges_raw:
    G.add_edge(row['src'].value, row['dst'].value)

New API (1 line, full graph, with node attributes):

G = db.to_networkx()

4. Friction Points¶

FP-1: Dataset schema mismatch — doc uses wrong labels¶

Problem: docs/use-cases/network-analysis.md uses :Person / :FRIEND_OF throughout, but load_dataset(db, "snap-ego-facebook") produces :Node / :CONNECTED_TO. Every SNAP-specific snippet in the doc fails without any error message — queries simply return 0 rows.

Root cause: The CSV loader (src/graphforge/datasets/loaders/csv.py, line 142) hardcodes ["Node"] and "CONNECTED_TO". The dataset metadata in snap.json confirms these are the actual labels.

Impact: All ~10 SNAP-specific snippets in the doc fail silently on first run.

Runnable verification:

from graphforge import GraphForge
from graphforge.datasets import load_dataset

db = GraphForge()
load_dataset(db, "snap-ego-facebook")

r = db.execute("MATCH (n) RETURN DISTINCT labels(n) AS lbls LIMIT 1")
print(r[0]["lbls"].value)   # ['Node']  — NOT ['Person'] as the doc says

r = db.execute("MATCH ()-[r]->() RETURN DISTINCT type(r) AS t LIMIT 1")
print(r[0]["t"].value)       # 'CONNECTED_TO'  — NOT 'FRIEND_OF'

Decision needed (HITL Checkpoint 1): Fix the doc to use correct labels (Node/CONNECTED_TO), or update the CSV loader to rename labels to match the doc (Person/FRIEND_OF)?

FP-2: Doc shows obsolete manual export patterns¶

Problem: The pandas and NetworkX sections show manual .value unwrapping and nx.DiGraph() loop construction that to_dataframe() and to_networkx() (PR #465) already replace.

Impact: New users copy-paste 12–20 lines of boilerplate that is now unnecessary.

Old pattern:

def to_dataframe(results):
    rows = []
    for row in results:
        rows.append({key: val.value for key, val in row.items()})
    return pd.DataFrame(rows)

raw = db.execute("MATCH (n:Person) RETURN n.degree AS degree")
df = to_dataframe(raw)

New API:

df = db.to_dataframe("MATCH (n:Node) RETURN n.degree AS degree")

FP-3: `to_networkx()` and `to_igraph()` always return directed graphs¶

Problem: Both methods return a DiGraph / directed igraph regardless of the data. Most community detection algorithms (Louvain, label propagation, modularity) require an undirected graph. Users must call .to_undirected() / .as_undirected() themselves, and this is not documented.

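Impact: Passing the directed export straight to a community-detection routine either fails or gives a different partition than the undirected graph would. In testing, igraph rejected the directed graph with ValueError: input graph must be undirected. NetworkX's louvain_communities(G) accepts a DiGraph without any warning and optimizes directed modularity, which is usually not what a user analysing an undirected friendship network intends.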

Runnable verification:

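import networkx as nx
from graphforge import GraphForge
from graphforge.datasets import load_dataset

db = GraphForge()
load_dataset(db, "snap-ego-facebook")   # without this the export is an empty graph
G = db.to_networkx()                    # nx.DiGraph

# Running community detection on the directed export is the trap:
#   nx.algorithms.community.louvain_communities(G)   # runs, but on directed input
#   db.to_igraph().community_multilevel()            # ValueError: input graph must be undirected

# Required workaround: convert to undirected first
G_und = G.to_undirected()
communities = nx.algorithms.community.louvain_communities(G_und, seed=42)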

Method signature enhancement (R-3):

# Proposed
db.to_networkx(directed=False)  # returns nx.Graph
db.to_igraph(directed=False)    # returns undirected igraph
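To make the effect of the undirected conversion concrete, the sketch below collapses a toy directed edge list the way an undirected view does: reciprocal pairs such as (1, 2) and (2, 1) become a single edge. It is a plain-Python illustration under that assumption, not the implementation used by NetworkX, igraph, or GraphForge.

```python
# Toy directed edge list (hypothetical data); (1, 2) and (2, 1) are reciprocal
directed_edges = [(1, 2), (2, 1), (2, 3), (3, 1)]

# An unordered pair ignores direction, so reciprocal edges collapse into one
undirected_edges = {frozenset(edge) for edge in directed_edges}

as_sorted_pairs = sorted(tuple(sorted(edge)) for edge in undirected_edges)
print(len(directed_edges), "directed ->", len(as_sorted_pairs), "undirected")
print(as_sorted_pairs)
```

Edge counts can therefore drop after conversion, which matters when comparing density or degree figures between the directed and undirected views.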

FP-4: Write-back is critically slow — no bulk property-set API¶

Problem: Writing algorithm results (e.g., PageRank for 4 039 nodes) back to GraphForge requires a Cypher loop. Both per-node and UNWIND approaches take ~60–80 s. This makes the "compute in NX/igraph, persist back to GF" pattern impractical for graphs with more than a few hundred nodes.

Timing:

Method	Time (4039 nodes)
Per-node loop	58 s
UNWIND batch	79 s
Target (bulk API)	< 1 s

Runnable demonstration:

import networkx as nx, time
from graphforge import GraphForge
from graphforge.datasets import load_dataset

db = GraphForge()
load_dataset(db, "snap-ego-facebook")
G = db.to_networkx()
pr = nx.pagerank(G)

t = time.time()
for nid, score in pr.items():
    db.execute("MATCH (n) WHERE id(n) = $nid SET n.pagerank = $pr",
               {"nid": int(nid), "pr": float(score)})
print(f"Write-back: {time.time()-t:.1f}s")   # ~58s for 4039 nodes

Proposed API (R-5):

# Proposed (not yet implemented): batch property set keyed by internal ID
db.set_node_properties({nid: {"pagerank": score} for nid, score in pr.items()})

FP-5: `shortestPath()` raises SyntaxError ✅ Resolved in v0.3.10 (PR #497)¶

Resolution: shortestPath() and allShortestPaths() now parse correctly and raise NotImplementedError with a clear BFS workaround hint. The contradiction between network-analysis.md (which said "not built-in") and llm-workflows.md (which showed it in working code) is resolved — both docs now consistently document the NotImplementedError + BFS workaround.

Original problem: The parser had no grammar rule for shortestPath. Any query using it raised a SyntaxError immediately, before execution.

# Now raises NotImplementedError with BFS hint (not SyntaxError)
db.execute("MATCH p = shortestPath((a)-[:CONNECTED_TO*]-(b)) RETURN p")
# NotImplementedError: shortestPath() is not yet implemented.
# Use the BFS workaround: MATCH path = (a)-[*1..N]-(b)
#   RETURN length(path) ORDER BY length(path) LIMIT 1

Workaround: BFS approximation via approximate_distance() (shown in the doc).
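For reference, a breadth-first hop-distance search is short enough to sketch in plain Python. The function below is a generic illustration on an adjacency dict, not the doc's approximate_distance() helper, whose signature is not reproduced here; the `max_hops` parameter is an assumption modelled on the bounded `[*1..N]` pattern in the error hint.

```python
from collections import deque


def bfs_distance(adj, source, target, max_hops=None):
    """Return the hop count of a shortest path, or None if unreachable."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        if max_hops is not None and dist >= max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in adj.get(node, ()):
            if neighbor == target:
                return dist + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None


# Toy undirected path a-b-c-d plus an isolated node e (hypothetical data)
adj = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}, "e": set()}
print(bfs_distance(adj, "a", "d"))
print(bfs_distance(adj, "a", "d", max_hops=2))
```

Because BFS visits nodes in order of increasing distance, the first time the target is reached is guaranteed to be via a shortest path in an unweighted graph.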

FP-6: Triangle counting is impractical on the SNAP graph in Cypher¶

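Problem: The triangle count query in the doc (S11) did not complete on the full 88 234-edge SNAP graph after more than 5 minutes. The Python executor enumerates every closed three-hop pattern (a)-(b)-(c)-(a), which costs roughly O(d³) per node for average degree d (see the appendix). That is too many patterns for the executor at this scale.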

Impact: The doc presents triangle counting as a viable operation on the SNAP dataset without noting the performance cost.

Runnable verification:

import time
from graphforge import GraphForge
from graphforge.datasets import load_dataset

db = GraphForge()
load_dataset(db, "snap-ego-facebook")

t = time.time()
r = db.execute("""
    MATCH (a:Node)-[:CONNECTED_TO]-(b:Node)-[:CONNECTED_TO]-(c:Node)-[:CONNECTED_TO]-(a)
    WITH a, count(*) / 2 AS triangle_count
    RETURN a.id AS node_id, triangle_count
    ORDER BY triangle_count DESC LIMIT 5
""")
# Runs for > 5 minutes — do not run on the full SNAP graph
print(f"Elapsed: {time.time()-t:.1f}s")
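A back-of-envelope estimate shows why the query stalls. The sketch uses the per-node O(d³) model from the appendix with the node and edge counts reported in this document; it is likely an underestimate, because high-degree hubs such as node 107 contribute far more patterns than an average node.

```python
nodes, edges = 4_039, 88_234

# Each undirected edge contributes to the degree of two nodes
avg_degree = 2 * edges / nodes

# Naive three-hop enumeration: about d choices at each of the three hops
patterns_per_node = avg_degree ** 3
total_patterns = nodes * patterns_per_node

print(f"average degree ~ {avg_degree:.1f}")
print(f"patterns per node ~ {patterns_per_node:,.0f}")
print(f"total patterns ~ {total_patterns:.2e}")
```

Hundreds of millions of candidate patterns, each handled by interpreted Python code, is consistent with a run time of many minutes.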

Recommendation: Delegate triangle analysis to igraph. transitivity_local_undirected() returns local clustering coefficients (per-node triangle counts normalized by the number of neighbor pairs) in < 10 ms.

G_ig = db.to_igraph().as_undirected()
local_clust = G_ig.transitivity_local_undirected()   # < 10ms for full graph

FP-7: `to_networkx()` node IDs are internal integers, not user property values¶

Problem: The exported NetworkX graph uses GraphForge's internal integer node IDs (not the id property). Algorithm results (PageRank dict, betweenness dict) are keyed by these internal IDs.

Impact: Users cannot directly correlate algorithm results with their application data without a lookup table.

Runnable demonstration:

G = db.to_networkx()
sample = list(G.nodes(data=True))[:3]
for nid, attrs in sample:
    print(f"NX id={nid} (internal int), user id={attrs['id']}")
# NX id=1 (internal int), user id=0
# NX id=2 (internal int), user id=1
# NX id=3 (internal int), user id=2

Workaround (build lookup table):

# Map internal_id → user_id using node attributes
id_map = {nid: attrs["id"] for nid, attrs in G.nodes(data=True)}

# Now correlate PageRank with user IDs
pr = nx.pagerank(G)
top_by_user_id = sorted(
    [(id_map[nid], score) for nid, score in pr.items()],
    key=lambda x: -x[1]
)[:5]

Proposed parameter (R-4):

# Use 'id' property as NX node keys
G = db.to_networkx(node_id_property="id")
# Now G.nodes() keys are user id strings

5. Competitive Analysis¶

Neo4j GDS (Graph Data Science)¶

Setup: Project graph into GDS in-memory format → run algorithm → write back.

// Step 1: Project
CALL gds.graph.project('myGraph', 'Person', 'FRIEND_OF')

// Step 2: Run PageRank
CALL gds.pageRank.stream('myGraph')
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).name, score ORDER BY score DESC LIMIT 5

// Step 3: Write back
CALL gds.pageRank.write('myGraph', { writeProperty: 'pageRank' })

Algorithm catalog (partial): PageRank, Betweenness, Closeness, Louvain, Label Propagation, Node2Vec, FastRP, Triangle Count, Local Clustering Coefficient, K-means, Degree Centrality, Weakly/Strongly Connected Components, Shortest Paths (Dijkstra, A*).

LOC from data to PageRank: 3 lines Cypher.
Write-back: Built-in (.write mode). Zero Python overhead.
Performance: In-memory projection; GDS can handle graphs with hundreds of millions of edges.
Tradeoff: Requires running Neo4j server; not embeddable in a notebook without Docker/cloud.

igraph (standalone Python)¶

import igraph as ig
G = ig.Graph.Read_Edgelist("facebook.txt", directed=True)
pr = G.pagerank()                           # < 1ms for 88K edges
communities = G.as_undirected().community_multilevel()  # 30ms

LOC from data to PageRank: 2 lines.
Write-back: Modify vertex attributes in-memory. Zero Cypher overhead.
Performance: Compiled C core. Exact betweenness on 88K edges takes 0.19 s, versus 0.35 s for NetworkX sampled betweenness (k=200).
Tradeoff: No Cypher query language; no persistent storage (must serialize to file); no dataset library.

graph-tool (standalone Python)¶

from graph_tool import load_graph
from graph_tool.centrality import pagerank
g = load_graph("facebook.gt")
pr = pagerank(g)

LOC from data to PageRank: 3 lines.
Performance: Compiled C++ with OpenMP parallelism; fastest available for large graphs (>1M edges). Statistical inference models (SBM) not available elsewhere.
Tradeoff: Notoriously difficult to install (no pip wheel); requires manual compilation or conda-forge. Not embeddable in standard notebooks.

NetworkX (standalone Python)¶

import networkx as nx
G = nx.read_edgelist("facebook.txt", create_using=nx.DiGraph)
pr = nx.pagerank(G)

LOC from data to PageRank: 2 lines.
Performance: Pure Python; PageRank 1.4 s, betweenness exact ~45 s on 88K edges.
Ecosystem: Richest algorithm library (300+ algorithms), best documented, pip-installable, widely used in notebooks.
Tradeoff: No persistence; pure in-memory; no query language.

KùzuDB¶

Setup: Embedded graph database with Cypher-compatible query language.

import kuzu
db = kuzu.Database("analysis.db")
conn = kuzu.Connection(db)
# Schema and data setup...
result = conn.execute("MATCH (n:Person) RETURN n")

Algorithm support: KùzuDB does not ship built-in graph algorithms. Export to NetworkX/igraph required, same as GraphForge.
Cypher coverage: Subset of openCypher; lacks some aggregation functions.
Write-back: No built-in bulk property set.
Dataset library: None.
Persistence: Native columnar storage; faster than SQLite for analytical queries on large graphs.
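Tradeoff: More focused on OLAP workloads; no curated dataset library; its Python API exports query results to NetworkX (QueryResult.get_as_networkx()) rather than offering a whole-graph to_networkx()/to_igraph() method.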

Summary Comparison¶

Feature	GraphForge	Neo4j GDS	igraph	NetworkX	KùzuDB
Embedded (no server)	✅	❌ (server)	✅	✅	✅
Cypher query language	✅	✅	❌	❌	✅ (subset)
Built-in PageRank	❌	✅	✅	✅ (scipy)	❌
Built-in Louvain	❌	✅	✅	✅ (NX 3.x)	❌
Built-in betweenness	❌	✅	✅ (fast)	✅ (slow)	❌
Dataset library	✅	❌	❌	❌	❌
`to_networkx()`	✅	❌	N/A	N/A	❌
`to_igraph()`	✅	❌	N/A	N/A	❌
Built-in write-back	❌	✅	✅ (in-memory)	✅ (in-memory)	❌
Persistent storage	✅ (SQLite)	✅	❌	❌	✅ (columnar)
LOC: data → PageRank	~5 (via NX)	3 (Cypher)	2	2	~5 (via NX)

GraphForge's differentiator: The only embedded graph database with both a Cypher query surface and first-class to_networkx()/to_igraph() export. Neo4j GDS wins on algorithm breadth; igraph wins on algorithm speed; NetworkX wins on ecosystem breadth. GraphForge's dataset library (load_dataset()) is unique.

6. Recommendations¶

R-1: Fix doc labels — use `Node`/`CONNECTED_TO` throughout (Critical, Low effort)¶

Update all SNAP-specific snippets in docs/use-cases/network-analysis.md to use the actual loader schema: :Node and :CONNECTED_TO. This fixes ~10 failing snippets with no library changes.

Decision: This assumes we keep the loader's hardcoded labels. If the team prefers :Person/:FRIEND_OF (more semantically meaningful for a social network), the fix goes in the CSV loader instead.

Issue: #477 (doc update)

R-2: Rewrite pandas/NX integration section to use new API (High, Low effort)¶

Replace the manual .value unwrap helper and nx.DiGraph() loop with to_dataframe() and to_networkx(). Show igraph as the performance-first alternative for algorithms.

Issue: #477 (same doc update issue)

R-3: Add `directed=False` parameter to `to_networkx()` and `to_igraph()` (Medium, Low effort)¶

# Proposed signatures
def to_networkx(
    self,
    query: str | None = None,
    node_label: str | None = None,
    rel_type: str | None = None,
    directed: bool = True,          # NEW
) -> nx.DiGraph | nx.Graph: ...

def to_igraph(
    self,
    query: str | None = None,
    node_label: str | None = None,
    rel_type: str | None = None,
    directed: bool = True,          # NEW
) -> ig.Graph: ...

Issue: #478

R-4: Add `node_id_property` parameter to `to_networkx()` and `to_igraph()` (Medium, Low effort)¶

# Proposed
G = db.to_networkx(node_id_property="id")
# NX node keys are now user id strings, not internal ints

G_ig = db.to_igraph(node_id_property="id")
# igraph vertex names are now user id strings

This eliminates the lookup-table workaround for write-back by making NX/igraph result dicts keyed by user property values.

Issue: #479

R-5: Bulk property write-back helper (Medium, Medium effort)¶

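# Proposed API (not yet implemented).
# pr and community_of are dicts keyed by internal node ID.
db.set_node_properties(
    {node_id: {"pagerank": pr[node_id], "community": community_of[node_id]} for node_id in pr}
)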

The method should accept a dict[int, dict[str, Any]] (keyed by internal ID, as returned by to_networkx()) and write all updates in a single transaction. Target: < 1 s for 4 039 nodes.
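The intended semantics (validate the whole payload, then apply every update together) can be sketched against an in-memory dict. Everything here is hypothetical: `apply_node_properties`, the `store` layout, and the error types are illustrative stand-ins, not GraphForge internals or the final API.

```python
from typing import Any


def apply_node_properties(store: dict[int, dict[str, Any]],
                          updates: dict[int, dict[str, Any]]) -> int:
    """Apply all updates or none; returns the number of nodes updated."""
    # Pass 1: validate everything before touching the store
    for node_id, props in updates.items():
        if isinstance(node_id, bool) or not isinstance(node_id, int):
            raise TypeError(f"node id must be int, got {node_id!r}")
        if node_id not in store:
            raise KeyError(f"unknown node id {node_id}")
        if not all(isinstance(key, str) for key in props):
            raise TypeError(f"property names must be str for node {node_id}")
    # Pass 2: apply; nothing can fail here, so the update is all-or-nothing
    for node_id, props in updates.items():
        store[node_id].update(props)
    return len(updates)


# Toy node store keyed by internal ID (hypothetical data)
store = {1: {"id": "0"}, 2: {"id": "1"}}
updated = apply_node_properties(store, {
    1: {"pagerank": 0.023, "community": 3},
    2: {"pagerank": 0.011, "community": 1},
})
print(updated, store[1])
```

Validating first keeps a bad row from leaving the graph half-updated, which is the main behavioural difference from the per-node loop.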

Issue: #480

R-6: Document igraph workflow as performance-first path (Medium, Low effort)¶

Add an "igraph for performance-critical analysis" section to network-analysis.md. Key points:

Use to_igraph() when: graph has > 10 K edges, need exact betweenness, or running Louvain.
Convert to undirected for community algorithms: G.as_undirected().
igraph betweenness (exact, 0.19 s) is faster than NetworkX sampled betweenness (k=200, 0.35 s).
igraph Louvain (0.03 s) vs NetworkX Louvain (0.35 s).

Issue: #477 (same doc update)

R-7: Fix `shortestPath()` example in `llm-workflows.md` (High, Low effort)¶

docs/use-cases/llm-workflows.md (lines 207–216) shows shortestPath() in a code example that implies it works. It does not: at the time of testing it raised SyntaxError, and since v0.3.10 (PR #497) it raises NotImplementedError with a BFS hint (see FP-5). Replace the example with the approximate_distance() workaround from network-analysis.md, or add a comment noting that shortestPath() is not yet implemented (tracking #468).

Issue: #477 (same doc update)

7. New Bugs Found¶

Bug: ORDER BY on non-projected variable after RETURN DISTINCT¶

ORDER BY rejects original variable names after RETURN DISTINCT even when those variables were in scope before the RETURN clause. The error is UndefinedVariable: Variable 'X' is not projected by RETURN DISTINCT and cannot be used in ORDER BY. Using the alias works as a workaround.

File: src/graphforge/planner/planner.py (ORDER BY validation logic)
Issue: #481

Bug: Variable reuse across WITH boundary raises KeyError¶

Re-using a variable name in a MATCH clause after that variable left scope through a WITH boundary raises KeyError('variable_name') instead of a proper UndefinedVariable error. The query parses successfully but fails during planning or execution.

File: Likely src/graphforge/planner/planner.py or src/graphforge/executor/executor.py
Issue: #482

8. Issues Filed¶

Issue	Title	Priority
#468	`shortestPath()` SyntaxError — parser has no grammar rule (existing)	High
#476	research: SQLite-native FTS + embedding index architecture for hybrid search	Future
#477	docs: fix network-analysis.md — wrong SNAP labels, obsolete export patterns, shortestPath in llm-workflows	High
#478	feat: `directed=False` parameter for `to_networkx()` and `to_igraph()`	Medium
#479	feat: `node_id_property` parameter for `to_networkx()` and `to_igraph()`	Medium
#480	feat: `set_node_properties()` bulk write-back helper	Medium
#481	fix: ORDER BY on non-projected variable after RETURN DISTINCT raises UndefinedVariable	High
#482	fix: variable reuse across WITH boundary raises KeyError instead of UndefinedVariable	High

Appendix: Triangle Count Performance Note¶

The triangle count Cypher query (MATCH (a:Node)-(b:Node)-(c:Node)-(a)) did not complete on the full SNAP ego-facebook graph (4 039 nodes, 88 234 edges) after > 5 minutes in the test environment. This is expected: the Python executor enumerates all three-hop patterns in O(d³) per node where d is the average degree (~44). At this scale, triangle counting must be delegated to igraph:

G_ig = db.to_igraph().as_undirected()
local_clust = G_ig.transitivity_local_undirected()  # < 10ms; triangle count normalized by neighbor pairs

If triangle count per node is required, use NetworkX. (igraph's motifs_randesu(size=3) returns graph-wide motif counts per isomorphism class, not per-node triangle counts.)

import networkx as nx
G_und = db.to_networkx().to_undirected()
triangles = nx.triangles(G_und)           # ~0.5s for 88K edges
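The local clustering coefficient and the per-node triangle count are linked by C_v = 2·T_v / (d_v·(d_v − 1)), so either can be recovered from the other when the degree d_v is known. The plain-Python sketch below checks this on a toy graph; the helper names are illustrative, and nodes with degree below 2 are assigned 0.0 here by assumption (libraries differ on this convention).

```python
from itertools import combinations


def triangles_per_node(adj):
    """Count triangles through each node: neighbor pairs that are themselves linked."""
    return {
        node: sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
        for node, nbrs in adj.items()
    }


def local_clustering(adj):
    """C_v = 2 * T_v / (d_v * (d_v - 1)); 0.0 when d_v < 2 (assumed convention)."""
    tri = triangles_per_node(adj)
    return {
        node: 2 * tri[node] / (len(nbrs) * (len(nbrs) - 1)) if len(nbrs) > 1 else 0.0
        for node, nbrs in adj.items()
    }


# Toy undirected graph: triangle a-b-c with a pendant node d attached to c
adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
tri = triangles_per_node(adj)
clust = local_clustering(adj)

# Invert the formula: T_v = C_v * d_v * (d_v - 1) / 2
recovered = {n: round(clust[n] * len(adj[n]) * (len(adj[n]) - 1) / 2) for n in adj}
print(tri, recovered)
```

Applied to an exported graph, the same inversion turns a local clustering vector and a degree vector into triangle counts.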
