Network Analysis in Notebooks¶
Overview¶
GraphForge is a natural fit for network analysis workflows in Jupyter notebooks. It combines the expressiveness of openCypher graph queries with embedded persistence, so you can load a dataset once, run exploratory queries interactively, persist computed metrics, and share a single .db file as a reproducible artefact.
Why GraphForge Instead of NetworkX?¶
| Feature | GraphForge | NetworkX |
|---|---|---|
| Query language | Declarative openCypher — patterns, aggregations, paths | Imperative Python — write loops manually |
| Persistence | SQLite backend, single file, zero config | pickle / GraphML / custom serializers |
| Dataset library | load_dataset() for real-world graphs | Manual download and parsing |
| Heterogeneous graphs | Multiple labels, arbitrary property maps | Single node/edge attribute dicts |
| Sharing | Pass one .db file | Serialize and hope the pickle version matches |
| Scale | Up to ~10 M nodes comfortably | In-memory only, can exhaust RAM |
The two tools are complementary. GraphForge excels at storage, querying, and sharing; NetworkX excels at algorithmic graph theory. The pandas integration section shows how to bridge the two.
Installation¶
pip install graphforge
For persistent notebooks, use a file-backed database so results survive kernel restarts:
from graphforge import GraphForge
# In-memory (lost when kernel restarts)
db = GraphForge()
# Persistent (survives restarts, shareable)
db = GraphForge("analysis.db")
Loading a Real Dataset¶
GraphForge ships with a curated dataset library. The snap-ego-facebook dataset is a real-world social network from Stanford SNAP: ~4 000 nodes, ~88 000 edges.
from graphforge import GraphForge
from graphforge.datasets import load_dataset, list_datasets
# See all available datasets
print(list_datasets())
# ['snap-ego-facebook', 'snap-ca-grqc', ...]
db = GraphForge("facebook.db")
load_dataset(db, "snap-ego-facebook")
Verify the load:
result = db.execute("MATCH (n) RETURN count(n) AS nodes")
print(result[0]['nodes'].value) # 4039
result = db.execute("MATCH ()-[r]->() RETURN count(r) AS edges")
print(result[0]['edges'].value) # 88234
Quick schema check — what labels and relationship types exist?
labels = db.execute("MATCH (n) RETURN DISTINCT labels(n) AS lbls LIMIT 10")
for row in labels:
    print(row['lbls'].value)
# ['Node']
rel_types = db.execute("MATCH ()-[r]->() RETURN DISTINCT type(r) AS t")
for row in rel_types:
    print(row['t'].value)
# CONNECTED_TO
Degree Distribution¶
Out-degree and In-degree¶
# Out-degree distribution (how many connections each node has)
out_deg = db.execute("""
MATCH (n:Node)
OPTIONAL MATCH (n)-[r:CONNECTED_TO]->()
WITH n, count(r) AS out_deg
RETURN out_deg, count(n) AS freq
ORDER BY out_deg
""")
for row in out_deg[:5]:
    print(f"out_deg={row['out_deg'].value} freq={row['freq'].value}")
# out_deg=0 freq=12
# out_deg=1 freq=243
# out_deg=2 freq=301
# ...
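As a cross-check outside the database, the same distribution can be computed from a plain edge list. The following is a minimal sketch on a made-up four-node graph; the node names and the helper `out_degree_distribution` are illustrative and not part of GraphForge. Every node starts at zero, so nodes without outgoing edges still land in the degree-0 bucket.

```python
from collections import Counter


def out_degree_distribution(nodes, edges):
    """Map each out-degree to the number of nodes that have it."""
    # Start every node at 0 so nodes with no outgoing edge are still counted.
    out_deg = {n: 0 for n in nodes}
    for src, _dst in edges:
        out_deg[src] += 1
    # Histogram of degree values, sorted by degree.
    return dict(sorted(Counter(out_deg.values()).items()))


# Hypothetical toy graph: "c" and "d" have no outgoing edges.
nodes = ["a", "b", "c", "d"]
edges = [("a", "b"), ("a", "c"), ("b", "c")]

print(out_degree_distribution(nodes, edges))  # {0: 2, 1: 1, 2: 1}
```

In the Cypher version, the equivalent care is to count a bound relationship variable rather than rows, because an OPTIONAL MATCH that finds nothing still produces one row per node.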
# In-degree distribution
in_deg = db.execute("""
MATCH (n:Node)
OPTIONAL MATCH ()-[r:CONNECTED_TO]->(n)
WITH n, count(r) AS in_deg
RETURN in_deg, count(n) AS freq
ORDER BY in_deg
""")
Top-N Hubs¶
Find the most connected nodes by total degree:
hubs = db.execute("""
MATCH (n:Node)
OPTIONAL MATCH (n)-[:CONNECTED_TO]-(neighbor)
WITH n, count(DISTINCT neighbor) AS degree
RETURN n.id AS node_id, degree
ORDER BY degree DESC
LIMIT 10
""")
for row in hubs:
    print(f"node {row['node_id'].value} degree={row['degree'].value}")
# node 107 degree=1045
# node 1684 degree=792
# node 3437 degree=547
# ...
Path Analysis¶
Reachability with Variable-Length Paths¶
Check whether two nodes are connected within a certain hop radius:
# Is node 107 reachable from node 0 within 3 hops?
result = db.execute("""
MATCH (a:Node {id: '0'})-[*1..3]-(b:Node {id: '107'})
RETURN count(*) AS path_count
""")
print(result[0]['path_count'].value) # 1 (or more, depending on paths found)
Finding All Short Paths Between Two Nodes¶
Variable-length patterns return every qualifying path, and the number of paths grows combinatorially with the hop bound. Keep the upper bound small and use LIMIT to cap the rows returned:
paths = db.execute("""
MATCH p = (a:Node {id: '0'})-[:CONNECTED_TO*1..4]-(b:Node {id: '42'})
RETURN length(p) AS hops
ORDER BY hops
LIMIT 20
""")
for row in paths:
    print(f"path of length {row['hops'].value}")
# path of length 2
# path of length 2
# path of length 3
# ...
Breadth-First Distance Approximation¶
shortestPath() is not yet implemented — calling it raises NotImplementedError with a workaround hint. The equivalent BFS pattern queries increasing hop counts:
def approximate_distance(db, src_id: str, dst_id: str, max_hops: int = 6) -> int | None:
    """Return the smallest hop count between two nodes, or None if unreachable."""
    for hops in range(1, max_hops + 1):
        result = db.execute(
            f"""
            MATCH (a:Node {{id: $src}})-[:CONNECTED_TO*{hops}]-(b:Node {{id: $dst}})
            RETURN count(*) AS found
            """,
            {"src": src_id, "dst": dst_id},
        )
        if result[0]['found'].value > 0:
            return hops
    return None
dist = approximate_distance(db, "0", "107")
print(dist) # 2
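For comparison, here is what the same search looks like as a classic breadth-first search over an in-memory adjacency map. This is a sketch in plain Python, not GraphForge code: `bfs_distance` and the toy `adj` dictionary are assumptions made for illustration. Unlike the query loop, BFS visits each node at most once, which is why an exported graph is often the better place to compute many distances.

```python
from collections import deque


def bfs_distance(adj, src, dst, max_hops=6):
    """Smallest hop count from src to dst, or None if not found within max_hops."""
    if src == dst:
        return 0  # design choice: identical endpoints are 0 hops apart
    seen = {src}
    queue = deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        if dist == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in adj.get(node, ()):
            if neighbor in seen:
                continue
            if neighbor == dst:
                return dist + 1
            seen.add(neighbor)
            queue.append((neighbor, dist + 1))
    return None


# Hypothetical undirected toy graph stored as symmetric adjacency lists.
adj = {
    "0": ["1", "2"],
    "1": ["0", "107"],
    "2": ["0"],
    "107": ["1"],
    "9": [],
}

print(bfs_distance(adj, "0", "107"))  # 2
print(bfs_distance(adj, "0", "9"))    # None
```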
Community Detection Proxy¶
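Full community detection algorithms (Louvain, label propagation) cannot be expressed as plain Cypher pattern queries; they need an algorithm library such as NetworkX or the db.gds module covered later. Plain Cypher can still surface natural cluster structure through neighbor-overlap queries, with GraphForge acting as the storage and retrieval layer.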
Jaccard Similarity of Neighborhoods¶
Two nodes with heavily overlapping neighbor sets are likely in the same community:
# Jaccard similarity between two specific nodes
jaccard = db.execute("""
MATCH (a:Node {id: '107'})-[:CONNECTED_TO]-(common)-[:CONNECTED_TO]-(b:Node {id: '1684'})
WITH count(DISTINCT common) AS shared
MATCH (a:Node {id: '107'})-[:CONNECTED_TO]-(na)
WITH shared, count(DISTINCT na) AS deg_a
MATCH (b:Node {id: '1684'})-[:CONNECTED_TO]-(nb)
WITH shared, deg_a, count(DISTINCT nb) AS deg_b
RETURN toFloat(shared) / (deg_a + deg_b - shared) AS jaccard
""")
print(jaccard[0]['jaccard'].value) # e.g. 0.312
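The arithmetic in the final RETURN is the standard Jaccard index, |A ∩ B| / |A ∪ B|, with the union size obtained as |A| + |B| - |A ∩ B|. A minimal sketch in plain Python makes the formula easy to check by hand; the helper name `jaccard_index` and the neighbor sets are made up for illustration. The guard for an empty union is an addition that the query above does not have.

```python
def jaccard_index(neighbors_a, neighbors_b):
    """Jaccard similarity of two neighbor collections."""
    a, b = set(neighbors_a), set(neighbors_b)
    shared = len(a & b)
    # Inclusion-exclusion: |A ∪ B| = |A| + |B| - |A ∩ B|
    union = len(a) + len(b) - shared
    return shared / union if union else 0.0


# Hypothetical neighbor sets: two shared neighbors out of five distinct ones.
print(jaccard_index({1, 2, 3, 4}, {3, 4, 5}))  # 0.4
```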
Finding Dense Triangles (Friend-of-Friend Clusters)¶
Triangles are the building blocks of tightly-knit communities:
# Count the triangles each node participates in (top 20 nodes only)
triangles = db.execute("""
MATCH (a:Node)-[:CONNECTED_TO]-(b:Node)-[:CONNECTED_TO]-(c:Node)-[:CONNECTED_TO]-(a)
WITH a, count(*) / 2 AS triangle_count
RETURN a.id AS node_id, triangle_count
ORDER BY triangle_count DESC
LIMIT 20
""")
for row in triangles:
    print(f"node {row['node_id'].value} triangles={row['triangle_count'].value}")
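The division by 2 in the query is there because the undirected pattern matches each triangle through a node twice, once as (b, c) and once as (c, b). The sketch below counts the same thing in plain Python on a made-up edge list (the function name and data are illustrative assumptions). It iterates over unordered neighbor pairs, so no halving is needed, and it assumes at most one edge per node pair.

```python
from itertools import combinations


def triangles_per_node(edges):
    """Number of triangles each node participates in (undirected, simple graph)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    counts = {}
    for node, neighbors in adj.items():
        # An unordered pair of neighbors that are themselves connected
        # closes exactly one triangle through `node`.
        counts[node] = sum(1 for b, c in combinations(neighbors, 2) if c in adj[b])
    return counts


# Hypothetical graph: one triangle a-b-c plus a pendant node d attached to c.
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
print(triangles_per_node(edges))  # {'a': 1, 'b': 1, 'c': 1, 'd': 0}
```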
Identifying Bridge Nodes¶
Nodes that connect otherwise-separate clusters have few common neighbors relative to their degree:
# Low neighbor overlap = structural bridge candidate
bridges = db.execute("""
MATCH (a:Node)-[:CONNECTED_TO]-(b:Node)
WITH a, b,
[(a)-[:CONNECTED_TO]-(x)-[:CONNECTED_TO]-(b) | x] AS common_neighbors
WITH a, b, size(common_neighbors) AS overlap
WHERE overlap = 0
WITH a, count(b) AS isolated_connections
RETURN a.id AS node_id, isolated_connections
ORDER BY isolated_connections DESC
LIMIT 10
""")
Citation Network Analysis¶
Building a Small Citation Network¶
db_cite = GraphForge("citations.db")
# Create papers
papers = [
("p1", "Attention Is All You Need", 2017),
("p2", "BERT: Pre-training of Deep Bidirectional Transformers", 2018),
("p3", "GPT-3: Language Models are Few-Shot Learners", 2020),
("p4", "Scaling Laws for Neural Language Models", 2020),
("p5", "Chain-of-Thought Prompting Elicits Reasoning", 2022),
("p6", "Emergent Abilities of Large Language Models", 2022),
]
for pid, title, year in papers:
    db_cite.execute(
        "CREATE (:Paper {id: $id, title: $title, year: $year})",
        {"id": pid, "title": title, "year": year},
    )
# Create citation edges (citing -> cited)
citations = [
("p2", "p1"), ("p3", "p1"), ("p3", "p2"),
("p4", "p1"), ("p4", "p3"), ("p5", "p1"),
("p5", "p2"), ("p5", "p3"), ("p6", "p3"),
("p6", "p4"), ("p6", "p5"),
]
for citing, cited in citations:
    db_cite.execute(
        """
        MATCH (a:Paper {id: $citing}), (b:Paper {id: $cited})
        CREATE (a)-[:CITES]->(b)
        """,
        {"citing": citing, "cited": cited},
    )
Most-Cited Papers¶
most_cited = db_cite.execute("""
MATCH (cited:Paper)<-[:CITES]-(citing:Paper)
WITH cited, count(citing) AS citation_count
RETURN cited.title AS title, cited.year AS year, citation_count
ORDER BY citation_count DESC
""")
for row in most_cited:
    print(f"{row['citation_count'].value}x {row['title'].value} ({row['year'].value})")
# 4x Attention Is All You Need (2017)
# 3x GPT-3: Language Models are Few-Shot Learners (2020)
# 2x BERT: Pre-training of Deep Bidirectional Transformers (2018)
# ... followed by the two papers with 1 citation each (p4 and p5, in either order)
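The counts are easy to verify by hand from the edge list, since a paper's citation count is simply the number of times it appears as the cited endpoint. A small sketch in plain Python, reusing the (citing, cited) pairs from the example network:

```python
from collections import Counter

# (citing, cited) pairs copied from the example citation network.
citations = [
    ("p2", "p1"), ("p3", "p1"), ("p3", "p2"),
    ("p4", "p1"), ("p4", "p3"), ("p5", "p1"),
    ("p5", "p2"), ("p5", "p3"), ("p6", "p3"),
    ("p6", "p4"), ("p6", "p5"),
]

# In-degree of each paper = how often it is the target of a CITES edge.
citation_counts = Counter(cited for _citing, cited in citations)
top_three = citation_counts.most_common(3)
print(top_three)  # [('p1', 4), ('p3', 3), ('p2', 2)]
```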
Co-Citation Pairs¶
Two papers that are frequently cited together are likely to be topically related:
co_citations = db_cite.execute("""
MATCH (a:Paper)<-[:CITES]-(bridge:Paper)-[:CITES]->(b:Paper)
WHERE id(a) < id(b)
WITH a, b, count(bridge) AS co_cite_count
WHERE co_cite_count >= 2
RETURN a.title AS paper_a, b.title AS paper_b, co_cite_count
ORDER BY co_cite_count DESC
""")
for row in co_citations:
    print(
        f"co-cited {row['co_cite_count'].value}x: "
        f"{row['paper_a'].value[:30]}... <-> {row['paper_b'].value[:30]}..."
    )
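To see which pairs the query should return, the co-citation counts can be reproduced in plain Python. This is a sketch under the assumption that the graph contains exactly the eleven citation edges created earlier. Sorting each reference list before pairing plays the role of `id(a) < id(b)`: it guarantees every unordered pair is counted once.

```python
from collections import Counter, defaultdict
from itertools import combinations

# (citing, cited) pairs copied from the example citation network.
citations = [
    ("p2", "p1"), ("p3", "p1"), ("p3", "p2"),
    ("p4", "p1"), ("p4", "p3"), ("p5", "p1"),
    ("p5", "p2"), ("p5", "p3"), ("p6", "p3"),
    ("p6", "p4"), ("p6", "p5"),
]

# Reference list of each citing paper.
references = defaultdict(set)
for citing, cited in citations:
    references[citing].add(cited)

# Every pair that appears together in one reference list is co-cited once.
co_cited = Counter()
for refs in references.values():
    for a, b in combinations(sorted(refs), 2):
        co_cited[(a, b)] += 1

frequent = {pair: n for pair, n in co_cited.items() if n >= 2}
print(frequent)  # {('p1', 'p2'): 2, ('p1', 'p3'): 2}
```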
Citation Chains (Intellectual Lineage)¶
# What is the full citation ancestry of the Chain-of-Thought paper?
ancestry = db_cite.execute("""
MATCH (root:Paper {id: 'p5'})-[:CITES*1..5]->(ancestor:Paper)
RETURN DISTINCT ancestor.title AS title, ancestor.year AS year
ORDER BY ancestor.year
""")
for row in ancestry:
    print(f"{row['year'].value} {row['title'].value}")
# 2017 Attention Is All You Need
# 2018 BERT: Pre-training of Deep Bidirectional Transformers
# 2020 GPT-3: Language Models are Few-Shot Learners
Temporal Network Analysis¶
Track when relationships were created and analyze network evolution over time.
Creating a Temporal Network¶
db_temp = GraphForge("temporal_social.db")
# Create users
for uid, name in [("u1", "Alice"), ("u2", "Bob"), ("u3", "Carol"), ("u4", "Dave")]:
    db_temp.execute(
        "CREATE (:User {id: $id, name: $name})",
        {"id": uid, "name": name},
    )
# Create FOLLOWS edges with a `since` date string (ISO 8601)
follows = [
("u1", "u2", "2023-01-15"),
("u1", "u3", "2023-03-22"),
("u2", "u3", "2023-06-01"),
("u3", "u4", "2024-01-10"),
("u2", "u4", "2024-02-28"),
("u4", "u1", "2024-11-05"),
]
for src, dst, since in follows:
    db_temp.execute(
        """
        MATCH (a:User {id: $src}), (b:User {id: $dst})
        CREATE (a)-[:FOLLOWS {since: $since}]->(b)
        """,
        {"src": src, "dst": dst, "since": since},
    )
Querying by Date¶
# All connections formed on or after 2024-01-01
new_connections = db_temp.execute("""
MATCH (a:User)-[r:FOLLOWS]->(b:User)
WHERE r.since >= '2024-01-01'
RETURN a.name AS follower, b.name AS followee, r.since AS since
ORDER BY r.since
""")
for row in new_connections:
    print(f"{row['follower'].value} -> {row['followee'].value} ({row['since'].value})")
# Carol -> Dave (2024-01-10)
# Bob -> Dave (2024-02-28)
# Dave -> Alice (2024-11-05)
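The date filter relies on a property of zero-padded ISO 8601 dates: comparing them as strings gives the same order as comparing them as dates. The short check below uses only the Python standard library and sample values taken from the edges above. It also shows how the guarantee breaks once the zero padding is dropped.

```python
from datetime import date

stamps = ["2023-01-15", "2024-11-05", "2023-06-01", "2024-01-10"]

# Lexicographic order and chronological order agree for YYYY-MM-DD strings.
by_string = sorted(stamps)
by_date = sorted(stamps, key=date.fromisoformat)
print(by_string == by_date)  # True

# Without zero padding the comparison is wrong: February is before November,
# yet "2" sorts after "1" character by character.
unpadded_is_earlier = "2024-2-28" < "2024-11-05"
print(unpadded_is_earlier)  # False
```

If dates arrive in another format, normalizing them to YYYY-MM-DD before loading keeps the string comparisons in the queries valid.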
Network Growth Over Time¶
# Count new edges per period (string comparison on zero-padded ISO dates matches chronological order)
growth = db_temp.execute("""
MATCH ()-[r:FOLLOWS]->()
WITH
CASE
WHEN r.since < '2023-04-01' THEN 'Q1 2023'
WHEN r.since < '2023-07-01' THEN 'Q2 2023'
WHEN r.since < '2024-01-01' THEN 'Q3-Q4 2023'
ELSE '2024+'
END AS period,
count(*) AS new_edges
RETURN period, new_edges
ORDER BY period
""")
for row in growth:
    print(f"{row['period'].value}: {row['new_edges'].value} new connections")
# 2024+: 3 new connections
# Q1 2023: 2 new connections
# Q2 2023: 1 new connections
# (rows sort alphabetically by label; Q3-Q4 2023 has no edges, so no row is returned for it)
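The CASE expression hard-codes its buckets and returns nothing for empty periods. When a complete or cumulative series is needed, one option is to fetch the raw `since` values and bucket them in Python. The sketch below works on the six dates from the example; the `quarter` helper and the `YYYY-Qn` label format are assumptions chosen so that labels sort chronologically.

```python
from collections import Counter
from itertools import accumulate

# `since` values of the six FOLLOWS edges created above.
since_dates = [
    "2023-01-15", "2023-03-22", "2023-06-01",
    "2024-01-10", "2024-02-28", "2024-11-05",
]


def quarter(iso_date):
    """Label such as '2023-Q1' for a YYYY-MM-DD string."""
    year, month = int(iso_date[:4]), int(iso_date[5:7])
    return f"{year}-Q{(month - 1) // 3 + 1}"


per_quarter = Counter(quarter(d) for d in since_dates)

# Enumerate every period explicitly so empty quarters appear with a zero.
periods = [f"{y}-Q{q}" for y in (2023, 2024) for q in (1, 2, 3, 4)]
new_edges = [per_quarter.get(p, 0) for p in periods]
cumulative = list(accumulate(new_edges))

for period, added, total in zip(periods, new_edges, cumulative):
    print(f"{period}: +{added} (total {total})")
```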
Nodes Added Before a Cutoff with OPTIONAL MATCH¶
Use OPTIONAL MATCH to find users with no recent connections:
inactive = db_temp.execute("""
MATCH (u:User)
OPTIONAL MATCH (u)-[r:FOLLOWS]->()
WHERE r.since >= '2024-01-01'
WITH u, count(r) AS recent_out
WHERE recent_out = 0
RETURN u.name AS name
""")
for row in inactive:
    print(row['name'].value)
# Alice (Bob, Carol, and Dave each have an outgoing FOLLOWS edge dated 2024)
Persisting Analysis Results¶
Computed metrics can be stored back as node properties using SET. This makes them available for future sessions and avoids recomputation.
Writing Degree Back to Nodes¶
# Compute and persist degree centrality for the Facebook graph
db.execute("""
MATCH (n:Node)
OPTIONAL MATCH (n)-[:CONNECTED_TO]-(neighbor)
WITH n, count(DISTINCT neighbor) AS degree
SET n.degree = degree
""")
Verify the write:
sample = db.execute("""
MATCH (n:Node)
WHERE EXISTS { MATCH (n)-[:CONNECTED_TO]-() }
RETURN n.id AS id, n.degree AS degree
ORDER BY degree DESC
LIMIT 5
""")
for row in sample:
    print(f"node {row['id'].value} degree={row['degree'].value}")
Persisting Triangle Counts¶
db.execute("""
MATCH (a:Node)-[:CONNECTED_TO]-(b:Node)-[:CONNECTED_TO]-(c:Node)-[:CONNECTED_TO]-(a)
WITH a, count(*) / 2 AS triangles
SET a.triangle_count = triangles
""")
Reloading Across Sessions¶
Because db = GraphForge("facebook.db") persists to disk, any subsequent session can access the pre-computed properties immediately:
# New notebook session — no recomputation needed
db = GraphForge("facebook.db")
top_nodes = db.execute("""
MATCH (n:Node)
WHERE n.degree IS NOT NULL
RETURN n.id AS id, n.degree AS degree, n.triangle_count AS triangles
ORDER BY degree DESC
LIMIT 10
""")
Integration with pandas¶
db.to_dataframe(query) runs a query and returns a pandas DataFrame with all CypherValues automatically unwrapped. No helper function needed.
Degree Distribution Plot¶
import matplotlib.pyplot as plt
df = db.to_dataframe("""
MATCH (n:Node)
RETURN n.degree AS degree
ORDER BY degree
""")
df['degree'].value_counts().sort_index().plot(
kind='bar', figsize=(12, 4),
title='Degree Distribution — ego-Facebook',
xlabel='Degree', ylabel='Number of nodes',
logy=True,
)
plt.tight_layout()
plt.savefig("degree_distribution.png", dpi=150)
Exporting to NetworkX¶
db.to_networkx() exports the full graph (or a filtered subset) as a NetworkX graph. Node attributes include all properties plus _labels; edge attributes include all properties plus _type.
import networkx as nx
# Export as directed DiGraph (default)
G = db.to_networkx()
# Export as undirected Graph — required for community detection algorithms
G_undirected = db.to_networkx(directed=False)
# Use node property as graph key instead of internal IDs
G = db.to_networkx(node_id_property="id")
pr = nx.pagerank(G)
# pr = {"107": 0.0094, "1684": 0.0012, ...} ← user id values, not internal ints
# Filter to a specific label / relationship type
G_sub = db.to_networkx(node_label="Node", rel_type="CONNECTED_TO", directed=False)
# Community detection (requires undirected graph)
communities = nx.algorithms.community.louvain_communities(G_sub, seed=42)
print(nx.density(G_sub))
print(nx.average_clustering(G_sub))
Writing Algorithm Results Back¶
set_node_properties() bulk-writes algorithm results directly to the storage layer in a single transaction — avoiding the ~79s UNWIND overhead on 4 000-node graphs.
import networkx as nx
# Compute centrality and write back with internal IDs
G = db.to_networkx()
dc = nx.degree_centrality(G)
db.set_node_properties({nid: {"degree_centrality": score} for nid, score in dc.items()})
# Or use a user-facing property as the key (matches node_id_property on export)
G = db.to_networkx(node_id_property="id")
dc = nx.degree_centrality(G)
db.set_node_properties(
{uid: {"degree_centrality": score} for uid, score in dc.items()},
node_id_property="id",
)
# Results are immediately queryable in Cypher
top = db.execute("""
MATCH (n:Node)
RETURN n.id AS id, n.degree_centrality AS dc
ORDER BY dc DESC LIMIT 10
""")
Aggregated Statistics Table¶
stats = db.to_dataframe("""
MATCH (n:Node)
WHERE n.degree IS NOT NULL
RETURN
count(n) AS total_nodes,
avg(n.degree) AS avg_degree,
min(n.degree) AS min_degree,
max(n.degree) AS max_degree,
stDev(n.degree) AS std_degree
""")
print(stats.to_string(index=False))
# total_nodes avg_degree min_degree max_degree std_degree
# 4039 43.69 1 1045 52.41
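The average can be sanity-checked without running the query. In an undirected graph every edge contributes to the degree of two nodes, so the mean degree is 2E / N. The arithmetic below uses the node and edge counts verified earlier and assumes each friendship is stored as a single CONNECTED_TO edge, which is consistent with the edge count shown when the dataset was loaded.

```python
nodes = 4039    # count(n) from the load check
edges = 88234   # count(r) from the load check

# Handshake lemma: the sum of all degrees equals twice the number of edges.
avg_degree = 2 * edges / nodes
print(round(avg_degree, 2))  # 43.69

# Density: fraction of all possible node pairs that are actually connected.
density = 2 * edges / (nodes * (nodes - 1))
print(round(density, 4))  # 0.0108
```

A mismatch between this figure and avg(n.degree) would suggest duplicate or reciprocal edges in the stored graph.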
Graph Algorithms with db.gds¶
db.gds provides compiled graph algorithms that run against the igraph or NetworkX backend. For structural computations they are orders of magnitude faster than the equivalent Cypher pattern enumeration.
Centrality¶
from graphforge import GraphForge
from graphforge.datasets import load_dataset
db = GraphForge("analysis.db")
load_dataset(db, "snap-ego-facebook")
# PageRank — write scores back to nodes, then query via Cypher
db.gds.pagerank(write_property="pagerank")
top = db.execute("""
MATCH (n)
RETURN n.id AS node, n.pagerank AS score
ORDER BY n.pagerank DESC
LIMIT 10
""")
# Betweenness centrality — stream mode (no mutation)
bc_scores = db.gds.betweenness_centrality() # dict[node_id, float]
top_id = max(bc_scores, key=bc_scores.get)
Community Detection¶
# Louvain community detection — write community ID to each node
db.gds.louvain(write_property="community")
# Count nodes per community via Cypher
communities = db.execute("""
MATCH (n)
RETURN n.community AS community, count(*) AS size
ORDER BY size DESC
LIMIT 10
""")
All 8 Available Algorithms¶
| Method | Category | Returns |
|---|---|---|
| db.gds.pagerank(write_property=) | Centrality | Influence score |
| db.gds.betweenness_centrality() | Centrality | Bridge importance |
| db.gds.closeness_centrality() | Centrality | Reachability |
| db.gds.degree_centrality() | Centrality | Connectivity |
| db.gds.louvain(write_property=) | Community | Community ID |
| db.gds.connected_components() | Community | Component ID |
| db.gds.clustering_coefficient() | Structural | Local density |
| db.gds.triangle_count() | Structural | Triangle count |
All methods accept optional node_label and rel_type to restrict the subgraph, and directed=True/False.
Omit write_property for stream mode (returns dict[node_id, float] without mutating nodes).
Summary¶
| Task | GraphForge |
|---|---|
| Count nodes / edges | MATCH (n) RETURN count(n) |
| Degree of a node | MATCH (n)-[r]-() RETURN count(r) |
| Top-N hubs | ORDER BY degree DESC LIMIT N |
| Variable-length paths | (a)-[:CONNECTED_TO*1..4]-(b) |
| Triangle count | db.gds.triangle_count() (fast) or Cypher (small graphs) |
| Temporal filter | WHERE r.since >= '2024-01-01' |
| Persist metric | db.gds.pagerank(write_property="pr") |
| Load built-in dataset | load_dataset(db, "snap-ego-facebook") |
Next Steps¶
- openCypher Reference — full clause and function coverage
- Cypher Query Guide — patterns, aggregations, and list comprehensions
- Dataset Catalogue — all available built-in datasets
- AI Agent Grounding — using GraphForge as an ontology backend