Network Analysis in Notebooks

Overview

GraphForge is a natural fit for network analysis workflows in Jupyter notebooks. It combines the expressiveness of openCypher graph queries with embedded persistence, so you can load a dataset once, run exploratory queries interactively, persist computed metrics, and share a single .db file as a reproducible artefact.

Why GraphForge Instead of NetworkX?

Feature | GraphForge | NetworkX
Query language | Declarative openCypher: patterns, aggregations, paths | Imperative Python: write loops manually
Persistence | SQLite backend, single file, zero config | pickle / GraphML / custom serializers
Dataset library | load_dataset() for real-world graphs | Manual download and parsing
Heterogeneous graphs | Multiple labels, arbitrary property maps | Single node/edge attribute dicts
Sharing | Pass one .db file | Serialize and hope the pickle versions match
Scale | Up to ~10 M nodes comfortably | In-memory only, can exhaust RAM

The two tools are complementary. GraphForge excels at storage, querying, and sharing; NetworkX excels at algorithmic graph theory. The pandas integration section shows how to bridge the two.

Installation

pip install graphforge

For persistent notebooks, use a file-backed database so results survive kernel restarts:

from graphforge import GraphForge

# In-memory (lost when kernel restarts)
db = GraphForge()

# Persistent (survives restarts, shareable)
db = GraphForge("analysis.db")

Loading a Real Dataset

GraphForge ships with a curated dataset library. The snap-ego-facebook dataset is a real-world social network from Stanford SNAP: ~4 000 nodes, ~88 000 edges.

from graphforge import GraphForge
from graphforge.datasets import load_dataset, list_datasets

# See all available datasets
print(list_datasets())
# ['snap-ego-facebook', 'snap-ca-grqc', ...]

db = GraphForge("facebook.db")
load_dataset(db, "snap-ego-facebook")

Verify the load:

result = db.execute("MATCH (n) RETURN count(n) AS nodes")
print(result[0]['nodes'].value)   # 4039

result = db.execute("MATCH ()-[r]->() RETURN count(r) AS edges")
print(result[0]['edges'].value)   # 88234

Quick schema check — what labels and relationship types exist?

labels = db.execute("MATCH (n) RETURN DISTINCT labels(n) AS lbls LIMIT 10")
for row in labels:
    print(row['lbls'].value)
# ['Node']

rel_types = db.execute("MATCH ()-[r]->() RETURN DISTINCT type(r) AS t")
for row in rel_types:
    print(row['t'].value)
# 'CONNECTED_TO'

Degree Distribution

Out-degree and In-degree

# Out-degree distribution (how many outgoing connections each node has).
# count(r) ignores the null row that OPTIONAL MATCH emits for nodes with no
# outgoing edges, so those nodes are reported with out_deg=0.
out_deg = db.execute("""
    MATCH (n:Node)
    OPTIONAL MATCH (n)-[r:CONNECTED_TO]->()
    WITH n, count(r) AS out_deg
    RETURN out_deg, count(n) AS freq
    ORDER BY out_deg
""")

for row in out_deg[:5]:
    print(f"out_deg={row['out_deg'].value}  freq={row['freq'].value}")
# out_deg=0   freq=12
# out_deg=1   freq=243
# out_deg=2   freq=301
# ...

# In-degree distribution
in_deg = db.execute("""
    MATCH (n:Node)
    OPTIONAL MATCH ()-[r:CONNECTED_TO]->(n)
    WITH n, count(r) AS in_deg
    RETURN in_deg, count(n) AS freq
    ORDER BY in_deg
""")

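To make the two-step aggregation concrete (a per-node count, followed by a frequency count of those counts), here is a minimal plain-Python sketch. It runs on a made-up toy edge list rather than the Facebook data, and `degree_histogram` is a hypothetical helper name, not part of GraphForge.

```python
from collections import Counter

def degree_histogram(nodes, edges, direction="out"):
    """Map each degree value to the number of nodes that have it.

    nodes: iterable of node ids; edges: iterable of (src, dst) pairs.
    """
    per_node = Counter({n: 0 for n in nodes})  # every node starts at degree 0
    for src, dst in edges:
        per_node[src if direction == "out" else dst] += 1
    # Second aggregation: how many nodes share each degree value.
    return dict(sorted(Counter(per_node.values()).items()))

# Toy graph (illustrative only)
nodes = ["a", "b", "c", "d"]
edges = [("a", "b"), ("a", "c"), ("b", "c")]

out_hist = degree_histogram(nodes, edges, "out")
in_hist = degree_histogram(nodes, edges, "in")
print(out_hist)  # {0: 2, 1: 1, 2: 1}
print(in_hist)   # {0: 2, 1: 1, 2: 1}
```

Seeding every node with 0 plays the role of OPTIONAL MATCH: nodes without matching edges still appear in the histogram.
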
Top-N Hubs

Find the most connected nodes by total degree:

hubs = db.execute("""
    MATCH (n:Node)
    OPTIONAL MATCH (n)-[:CONNECTED_TO]-(neighbor)
    WITH n, count(DISTINCT neighbor) AS degree
    RETURN n.id AS node_id, degree
    ORDER BY degree DESC
    LIMIT 10
""")

for row in hubs:
    print(f"node {row['node_id'].value}  degree={row['degree'].value}")
# node 107   degree=1045
# node 1684  degree=792
# node 3437  degree=547
# ...

Path Analysis

Reachability with Variable-Length Paths

Check whether two nodes are connected within a certain hop radius:

# Is node 107 reachable from node 0 within 3 hops?
result = db.execute("""
    MATCH (a:Node {id: '0'})-[*1..3]-(b:Node {id: '107'})
    RETURN count(*) AS path_count
""")
print(result[0]['path_count'].value)   # 1 (or more, depending on paths found)

Finding All Short Paths Between Two Nodes

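Variable-length patterns return every qualifying path, and the number of paths grows combinatorially with the hop bound. Keep the upper bound small and cap the output with LIMIT. With ORDER BY, the engine generally has to enumerate all matches before it can sort them, so LIMIT alone does not bound the work: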

paths = db.execute("""
    MATCH p = (a:Node {id: '0'})-[:CONNECTED_TO*1..4]-(b:Node {id: '42'})
    RETURN length(p) AS hops
    ORDER BY hops
    LIMIT 20
""")

for row in paths:
    print(f"path of length {row['hops'].value}")
# path of length 2
# path of length 2
# path of length 3
# ...

Breadth-First Distance Approximation

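shortestPath() is not yet implemented; calling it raises NotImplementedError with a workaround hint. The workaround below is an iterative-deepening search: it queries fixed hop counts in increasing order and returns the first one that matches. The result is the true hop distance whenever it is at most max_hops; beyond that cutoff the function returns None.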

def approximate_distance(db, src_id: str, dst_id: str, max_hops: int = 6) -> int | None:
    """Return the smallest hop count between two nodes, or None if unreachable."""
    for hops in range(1, max_hops + 1):
        result = db.execute(
            f"""
            MATCH (a:Node {{id: $src}})-[:CONNECTED_TO*{hops}]-(b:Node {{id: $dst}})
            RETURN count(*) AS found
            """,
            {"src": src_id, "dst": dst_id},
        )
        if result[0]['found'].value > 0:
            return hops
    return None

dist = approximate_distance(db, "0", "107")
print(dist)   # 2
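
For cross-checking on a small subgraph, a conventional breadth-first search over an in-memory adjacency map finds the same hop count in a single traversal. This is a minimal sketch on a toy adjacency dict; `bfs_distance` is a hypothetical helper and does not call GraphForge.

```python
from collections import deque

def bfs_distance(adj, src, dst, max_hops=6):
    """Smallest hop count from src to dst in an undirected adjacency map, or None."""
    if src == dst:
        return 0  # unlike the Cypher loop above, which starts at 1 hop
    seen = {src}
    queue = deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        if dist == max_hops:
            continue  # do not expand beyond the hop budget
        for nxt in adj.get(node, ()):
            if nxt == dst:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None

# Toy graph (illustrative only): a chain 0-1-3-4, a leaf 2, and an isolated node 9
adj = {"0": {"1", "2"}, "1": {"0", "3"}, "2": {"0"}, "3": {"1", "4"}, "4": {"3"}, "9": set()}
print(bfs_distance(adj, "0", "4"))  # 3
print(bfs_distance(adj, "0", "9"))  # None
```

The `seen` set means each node is expanded at most once, whereas the fixed-length pattern queries may revisit the same nodes along many different paths.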

Community Detection Proxy

Full community detection algorithms (Louvain, label propagation) require algorithmic libraries. GraphForge can act as the storage and retrieval layer while surfacing natural cluster structure through neighbor overlap queries.

Jaccard Similarity of Neighborhoods

Two nodes with heavily overlapping neighbor sets are likely in the same community:

# Jaccard similarity between two specific nodes
jaccard = db.execute("""
    MATCH (a:Node {id: '107'})-[:CONNECTED_TO]-(common)-[:CONNECTED_TO]-(b:Node {id: '1684'})
    WITH count(DISTINCT common) AS shared
    MATCH (a:Node {id: '107'})-[:CONNECTED_TO]-(na)
    WITH shared, count(DISTINCT na) AS deg_a
    MATCH (b:Node {id: '1684'})-[:CONNECTED_TO]-(nb)
    WITH shared, deg_a, count(DISTINCT nb) AS deg_b
    RETURN toFloat(shared) / (deg_a + deg_b - shared) AS jaccard
""")
print(jaccard[0]['jaccard'].value)   # e.g. 0.312
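
The query computes |N(a) ∩ N(b)| / |N(a) ∪ N(b)|, using inclusion-exclusion (deg_a + deg_b - shared) for the size of the union. The same formula on plain Python sets, as a sketch with made-up neighbor ids:

```python
def jaccard(neigh_a, neigh_b):
    """Jaccard similarity of two neighbor sets: |A ∩ B| / |A ∪ B|."""
    shared = len(neigh_a & neigh_b)
    union = len(neigh_a) + len(neigh_b) - shared  # inclusion-exclusion, as in the query
    # Returning 0.0 for two empty sets is a choice made for this sketch.
    return shared / union if union else 0.0

a = {"1", "2", "3", "4"}   # hypothetical neighbors of node a
b = {"3", "4", "5"}        # hypothetical neighbors of node b
score = jaccard(a, b)
print(score)  # 0.4
```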

Finding Dense Triangles (Friend-of-Friend Clusters)

Triangles are the building blocks of tightly-knit communities:

# Count the triangles each node participates in (limited to the top 20)
triangles = db.execute("""
    MATCH (a:Node)-[:CONNECTED_TO]-(b:Node)-[:CONNECTED_TO]-(c:Node)-[:CONNECTED_TO]-(a)
    WITH a, count(*) / 2 AS triangle_count
    RETURN a.id AS node_id, triangle_count
    ORDER BY triangle_count DESC
    LIMIT 20
""")

for row in triangles:
    print(f"node {row['node_id'].value}  triangles={row['triangle_count'].value}")
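
The division by 2 is there because, for a fixed node a, the pattern binds b and c separately, so every triangle is matched twice: once as (b, c) and once as (c, b). The sketch below shows this on a toy adjacency map, assuming each friendship is stored as a single relationship; `triangle_counts` is a hypothetical helper.

```python
from itertools import combinations, permutations

def triangle_counts(adj):
    """Return {node: (ordered_matches, triangles)} for an undirected adjacency map."""
    result = {}
    for a, nbrs in adj.items():
        # Ordered (b, c) pairs mimic how the pattern binds b and c separately.
        ordered = sum(1 for b, c in permutations(nbrs, 2) if c in adj[b])
        # Unordered pairs count each triangle through `a` exactly once.
        unordered = sum(1 for b, c in combinations(sorted(nbrs), 2) if c in adj[b])
        result[a] = (ordered, unordered)
    return result

# Toy graph: triangle 1-2-3 plus a pendant node 4 attached to 3
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
counts = triangle_counts(adj)
print(counts)  # {1: (2, 1), 2: (2, 1), 3: (2, 1), 4: (0, 0)}
```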

Identifying Bridge Nodes

Nodes that connect otherwise-separate clusters have few common neighbors relative to their degree:

# Low neighbor overlap = structural bridge candidate
bridges = db.execute("""
    MATCH (a:Node)-[:CONNECTED_TO]-(b:Node)
    WITH a, b,
         [(a)-[:CONNECTED_TO]-(x)-[:CONNECTED_TO]-(b) | x] AS common_neighbors
    WITH a, b, size(common_neighbors) AS overlap
    WHERE overlap = 0
    WITH a, count(b) AS isolated_connections
    RETURN a.id AS node_id, isolated_connections
    ORDER BY isolated_connections DESC
    LIMIT 10
""")

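The bridge heuristic can be checked by hand on a graph small enough to read. The sketch below uses a toy adjacency map of two triangles joined by one edge; `isolated_connection_counts` is a hypothetical helper that mirrors the query's logic in plain Python.

```python
def isolated_connection_counts(adj):
    """For each node, count the neighbors that share no common neighbor with it."""
    return {
        a: sum(1 for b in nbrs if not (nbrs & adj[b]))
        for a, nbrs in adj.items()
    }

# Two triangles (1-2-3 and 4-5-6) joined by the single edge 3-4
adj = {
    1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4},
    4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5},
}
scores = isolated_connection_counts(adj)
print(scores)  # {1: 0, 2: 0, 3: 1, 4: 1, 5: 0, 6: 0}
```

Only nodes 3 and 4, the endpoints of the joining edge, get a non-zero score. The Cypher query would return just those two rows, since nodes with no zero-overlap neighbor are filtered out by WHERE overlap = 0.
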
Citation Network Analysis

Building a Small Citation Network

db_cite = GraphForge("citations.db")

# Create papers
papers = [
    ("p1", "Attention Is All You Need", 2017),
    ("p2", "BERT: Pre-training of Deep Bidirectional Transformers", 2018),
    ("p3", "GPT-3: Language Models are Few-Shot Learners", 2020),
    ("p4", "Scaling Laws for Neural Language Models", 2020),
    ("p5", "Chain-of-Thought Prompting Elicits Reasoning", 2022),
    ("p6", "Emergent Abilities of Large Language Models", 2022),
]

for pid, title, year in papers:
    db_cite.execute(
        "CREATE (:Paper {id: $id, title: $title, year: $year})",
        {"id": pid, "title": title, "year": year},
    )

# Create citation edges (citing -> cited)
citations = [
    ("p2", "p1"), ("p3", "p1"), ("p3", "p2"),
    ("p4", "p1"), ("p4", "p3"), ("p5", "p1"),
    ("p5", "p2"), ("p5", "p3"), ("p6", "p3"),
    ("p6", "p4"), ("p6", "p5"),
]

for citing, cited in citations:
    db_cite.execute(
        """
        MATCH (a:Paper {id: $citing}), (b:Paper {id: $cited})
        CREATE (a)-[:CITES]->(b)
        """,
        {"citing": citing, "cited": cited},
    )

Most-Cited Papers

most_cited = db_cite.execute("""
    MATCH (cited:Paper)<-[:CITES]-(citing:Paper)
    WITH cited, count(citing) AS citation_count
    RETURN cited.title AS title, cited.year AS year, citation_count
    ORDER BY citation_count DESC
""")

for row in most_cited:
    print(f"{row['citation_count'].value}x  {row['title'].value} ({row['year'].value})")
# 4x  Attention Is All You Need (2017)
# 3x  GPT-3: Language Models are Few-Shot Learners (2020)
# 2x  BERT: Pre-training of Deep Bidirectional Transformers (2018)
# ...
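
Because the citation list is tiny, the counts can be recomputed directly from it as a cross-check. A paper's citation count is simply how often it appears as the cited end of an edge:

```python
from collections import Counter

# Same (citing, cited) pairs as in the citation network built above
citations = [
    ("p2", "p1"), ("p3", "p1"), ("p3", "p2"),
    ("p4", "p1"), ("p4", "p3"), ("p5", "p1"),
    ("p5", "p2"), ("p5", "p3"), ("p6", "p3"),
    ("p6", "p4"), ("p6", "p5"),
]

# In-degree of each paper = number of times it appears as the cited end
citation_counts = Counter(cited for _citing, cited in citations)
top3 = citation_counts.most_common(3)
print(top3)  # [('p1', 4), ('p3', 3), ('p2', 2)]
```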

Co-Citation Pairs

Two papers that are frequently cited together are likely to be topically related:

co_citations = db_cite.execute("""
    MATCH (a:Paper)<-[:CITES]-(bridge:Paper)-[:CITES]->(b:Paper)
    WHERE id(a) < id(b)
    WITH a, b, count(bridge) AS co_cite_count
    WHERE co_cite_count >= 2
    RETURN a.title AS paper_a, b.title AS paper_b, co_cite_count
    ORDER BY co_cite_count DESC
""")

for row in co_citations:
    print(
        f"co-cited {row['co_cite_count'].value}x: "
        f"{row['paper_a'].value[:30]}...  <->  {row['paper_b'].value[:30]}..."
    )
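
The co-citation count for a pair is the number of papers whose reference lists contain both. As a sketch, the same result can be derived in plain Python from the citation pairs used above:

```python
from collections import Counter, defaultdict
from itertools import combinations

citations = [
    ("p2", "p1"), ("p3", "p1"), ("p3", "p2"),
    ("p4", "p1"), ("p4", "p3"), ("p5", "p1"),
    ("p5", "p2"), ("p5", "p3"), ("p6", "p3"),
    ("p6", "p4"), ("p6", "p5"),
]

# Group references by citing paper
refs = defaultdict(set)
for citing, cited in citations:
    refs[citing].add(cited)

# Every unordered pair within one reference list is co-cited once by that paper
co_cited = Counter()
for cited_set in refs.values():
    co_cited.update(combinations(sorted(cited_set), 2))

frequent = {pair: n for pair, n in co_cited.items() if n >= 2}
print(frequent)  # {('p1', 'p2'): 2, ('p1', 'p3'): 2}
```

Sorting each reference list before pairing serves the same purpose as WHERE id(a) < id(b) in the query: each pair is counted in one orientation only.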

Citation Chains (Intellectual Lineage)

# What is the full citation ancestry of the Chain-of-Thought paper?
ancestry = db_cite.execute("""
    MATCH (root:Paper {id: 'p5'})-[:CITES*1..5]->(ancestor:Paper)
    RETURN DISTINCT ancestor.title AS title, ancestor.year AS year
    ORDER BY year
""")

for row in ancestry:
    print(f"{row['year'].value}  {row['title'].value}")
# 2017  Attention Is All You Need
# 2018  BERT: Pre-training of Deep Bidirectional Transformers
# 2020  GPT-3: Language Models are Few-Shot Learners
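
The variable-length pattern amounts to a bounded reachability search along CITES edges. A minimal level-by-level sketch over the same citation pairs, with `ancestors` as a hypothetical helper name:

```python
def ancestors(cites, root, max_depth=5):
    """All papers reachable from root by following cites edges, up to max_depth hops."""
    found = set()
    frontier = {root}
    for _ in range(max_depth):
        # Expand one hop and drop papers that were already discovered.
        frontier = {c for p in frontier for c in cites.get(p, ())} - found
        if not frontier:
            break
        found |= frontier
    return found

# citing paper -> papers it cites (same edges as the network above)
cites = {
    "p2": ["p1"], "p3": ["p1", "p2"], "p4": ["p1", "p3"],
    "p5": ["p1", "p2", "p3"], "p6": ["p3", "p4", "p5"],
}
p5_ancestors = sorted(ancestors(cites, "p5"))
p6_ancestors = sorted(ancestors(cites, "p6"))
print(p5_ancestors)  # ['p1', 'p2', 'p3']
print(p6_ancestors)  # ['p1', 'p2', 'p3', 'p4', 'p5']
```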

Temporal Network Analysis

Track when relationships were created and analyze network evolution over time.

Creating a Temporal Network

db_temp = GraphForge("temporal_social.db")

# Create users
for uid, name in [("u1", "Alice"), ("u2", "Bob"), ("u3", "Carol"), ("u4", "Dave")]:
    db_temp.execute(
        "CREATE (:User {id: $id, name: $name})",
        {"id": uid, "name": name},
    )

# Create FOLLOWS edges with a `since` date string (ISO 8601)
follows = [
    ("u1", "u2", "2023-01-15"),
    ("u1", "u3", "2023-03-22"),
    ("u2", "u3", "2023-06-01"),
    ("u3", "u4", "2024-01-10"),
    ("u2", "u4", "2024-02-28"),
    ("u4", "u1", "2024-11-05"),
]

for src, dst, since in follows:
    db_temp.execute(
        """
        MATCH (a:User {id: $src}), (b:User {id: $dst})
        CREATE (a)-[:FOLLOWS {since: $since}]->(b)
        """,
        {"src": src, "dst": dst, "since": since},
    )

Querying by Date

# All connections formed on or after 2024-01-01
new_connections = db_temp.execute("""
    MATCH (a:User)-[r:FOLLOWS]->(b:User)
    WHERE r.since >= '2024-01-01'
    RETURN a.name AS follower, b.name AS followee, r.since AS since
    ORDER BY r.since
""")

for row in new_connections:
    print(f"{row['follower'].value} -> {row['followee'].value}  ({row['since'].value})")
# Carol -> Dave  (2024-01-10)
# Bob   -> Dave  (2024-02-28)
# Dave  -> Alice (2024-11-05)

Network Growth Over Time

# Count new edges per period. ISO 8601 date strings compare correctly as plain strings.
# Period labels lead with the year so that ORDER BY period sorts chronologically.
growth = db_temp.execute("""
    MATCH ()-[r:FOLLOWS]->()
    WITH
        CASE
            WHEN r.since < '2023-04-01' THEN '2023 Q1'
            WHEN r.since < '2023-07-01' THEN '2023 Q2'
            WHEN r.since < '2024-01-01' THEN '2023 Q3-Q4'
            ELSE '2024+'
        END AS period,
        count(*) AS new_edges
    RETURN period, new_edges
    ORDER BY period
""")

for row in growth:
    print(f"{row['period'].value}: {row['new_edges'].value} new connections")
# 2023 Q1: 2 new connections
# 2023 Q2: 1 new connections
# 2024+: 3 new connections
# (no row is returned for 2023 Q3-Q4, which has no edges)
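
String comparison on dates is only safe because the values are zero-padded ISO 8601 (YYYY-MM-DD), where lexicographic order and chronological order coincide. A short standard-library check of that property, using the `since` values from above:

```python
from datetime import date

stamps = ["2024-02-28", "2023-01-15", "2024-11-05", "2023-06-01", "2023-03-22", "2024-01-10"]

# Sorting the raw strings and sorting by parsed dates give the same order
by_string = sorted(stamps)
by_date = sorted(stamps, key=date.fromisoformat)
same_order = by_string == by_date
print(same_order)  # True

# The guarantee breaks without zero-padding: 1 Feb is before 5 Nov,
# yet the unpadded string compares as greater.
unpadded_is_earlier = "2024-2-1" < "2024-11-05"
print(unpadded_is_earlier)  # False
```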

Users with No Recent Outgoing Connections (OPTIONAL MATCH)

Use OPTIONAL MATCH to find users with no recent connections:

inactive = db_temp.execute("""
    MATCH (u:User)
    OPTIONAL MATCH (u)-[r:FOLLOWS]->()
    WHERE r.since >= '2024-01-01'
    WITH u, count(r) AS recent_out
    WHERE recent_out = 0
    RETURN u.name AS name
""")

for row in inactive:
    print(row['name'].value)
# Alice  (Alice's only 2024 edge, from Dave, is incoming, not outgoing)
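
The expected result can be confirmed by hand from the `follows` list. The plain-Python sketch below applies the same rule as the query: a user is inactive when none of their outgoing edges is dated on or after the cutoff.

```python
users = {"u1": "Alice", "u2": "Bob", "u3": "Carol", "u4": "Dave"}
follows = [
    ("u1", "u2", "2023-01-15"),
    ("u1", "u3", "2023-03-22"),
    ("u2", "u3", "2023-06-01"),
    ("u3", "u4", "2024-01-10"),
    ("u2", "u4", "2024-02-28"),
    ("u4", "u1", "2024-11-05"),
]
cutoff = "2024-01-01"

# Users with at least one outgoing edge on or after the cutoff
recently_active = {src for src, _dst, since in follows if since >= cutoff}
inactive = [name for uid, name in users.items() if uid not in recently_active]
print(inactive)  # ['Alice']
```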

Persisting Analysis Results

Computed metrics can be stored back as node properties using SET. This makes them available for future sessions and avoids recomputation.

Writing Degree Back to Nodes

# Compute and persist degree centrality for the Facebook graph
db.execute("""
    MATCH (n:Node)
    OPTIONAL MATCH (n)-[:CONNECTED_TO]-(neighbor)
    WITH n, count(DISTINCT neighbor) AS degree
    SET n.degree = degree
""")

Verify the write:

sample = db.execute("""
    MATCH (n:Node)
    WHERE EXISTS { MATCH (n)-[:CONNECTED_TO]-() }
    RETURN n.id AS id, n.degree AS degree
    ORDER BY degree DESC
    LIMIT 5
""")

for row in sample:
    print(f"node {row['id'].value}  degree={row['degree'].value}")

Persisting Triangle Counts

db.execute("""
    MATCH (a:Node)-[:CONNECTED_TO]-(b:Node)-[:CONNECTED_TO]-(c:Node)-[:CONNECTED_TO]-(a)
    WITH a, count(*) / 2 AS triangles
    SET a.triangle_count = triangles
""")

Reloading Across Sessions

Because db = GraphForge("facebook.db") persists to disk, any subsequent session can access the pre-computed properties immediately:

# New notebook session — no recomputation needed
db = GraphForge("facebook.db")

top_nodes = db.execute("""
    MATCH (n:Node)
    WHERE n.degree IS NOT NULL
    RETURN n.id AS id, n.degree AS degree, n.triangle_count AS triangles
    ORDER BY degree DESC
    LIMIT 10
""")

Integration with pandas

db.to_dataframe(query) runs a query and returns a pandas DataFrame with all CypherValues automatically unwrapped. No helper function needed.

Degree Distribution Plot

import matplotlib.pyplot as plt

df = db.to_dataframe("""
    MATCH (n:Node)
    RETURN n.degree AS degree
    ORDER BY degree
""")

df['degree'].value_counts().sort_index().plot(
    kind='bar', figsize=(12, 4),
    title='Degree Distribution — ego-Facebook',
    xlabel='Degree', ylabel='Number of nodes',
    logy=True,
)
plt.tight_layout()
plt.savefig("degree_distribution.png", dpi=150)

Exporting to NetworkX

db.to_networkx() exports the full graph (or a filtered subset) as a NetworkX graph. Node attributes include all properties plus _labels; edge attributes include all properties plus _type.

import networkx as nx

# Export as directed DiGraph (default)
G = db.to_networkx()

# Export as undirected Graph — required for community detection algorithms
G_undirected = db.to_networkx(directed=False)

# Use node property as graph key instead of internal IDs
G = db.to_networkx(node_id_property="id")
pr = nx.pagerank(G)
# pr = {"107": 0.0094, "1684": 0.0012, ...}  ← user id values, not internal ints

# Filter to a specific label / relationship type
G_sub = db.to_networkx(node_label="Node", rel_type="CONNECTED_TO", directed=False)

# Community detection (requires undirected graph)
communities = nx.algorithms.community.louvain_communities(G_sub, seed=42)

print(nx.density(G_sub))
print(nx.average_clustering(G_sub))

Writing Algorithm Results Back

set_node_properties() bulk-writes algorithm results directly to the storage layer in a single transaction — avoiding the ~79s UNWIND overhead on 4 000-node graphs.

import networkx as nx

# Compute centrality and write back with internal IDs
G = db.to_networkx()
dc = nx.degree_centrality(G)
db.set_node_properties({nid: {"degree_centrality": score} for nid, score in dc.items()})

# Or use a user-facing property as the key (matches node_id_property on export)
G = db.to_networkx(node_id_property="id")
dc = nx.degree_centrality(G)
db.set_node_properties(
    {uid: {"degree_centrality": score} for uid, score in dc.items()},
    node_id_property="id",
)

# Results are immediately queryable in Cypher
top = db.execute("""
    MATCH (n:Node)
    RETURN n.id AS id, n.degree_centrality AS dc
    ORDER BY dc DESC LIMIT 10
""")

Aggregated Statistics Table

stats = db.to_dataframe("""
    MATCH (n:Node)
    WHERE n.degree IS NOT NULL
    RETURN
        count(n)              AS total_nodes,
        avg(n.degree)         AS avg_degree,
        min(n.degree)         AS min_degree,
        max(n.degree)         AS max_degree,
        stDev(n.degree)       AS std_degree
""")

print(stats.to_string(index=False))
#  total_nodes  avg_degree  min_degree  max_degree  std_degree
#         4039       43.69           1        1045       52.41
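
The avg_degree figure can be sanity-checked with the handshake lemma: in a simple undirected graph every edge adds one to the degree of each of its two endpoints, so the mean degree is 2E / N. The sketch assumes the node and edge counts from the load check, no self-loops or duplicate edges, and that every node carries a degree property.

```python
nodes = 4039   # node count reported by the load check
edges = 88234  # relationship count reported by the load check

# Handshake lemma: every edge contributes to the degree of two nodes
avg_degree = 2 * edges / nodes
print(round(avg_degree, 2))  # 43.69
```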

Graph Algorithms with db.gds

db.gds provides compiled graph algorithms that run against the igraph or NetworkX backend — orders of magnitude faster than equivalent Cypher pattern enumeration for structural computations.

Centrality

from graphforge import GraphForge
from graphforge.datasets import load_dataset

db = GraphForge("analysis.db")
load_dataset(db, "snap-ego-facebook")

# PageRank — write scores back to nodes, then query via Cypher
db.gds.pagerank(write_property="pagerank")
top = db.execute("""
    MATCH (n)
    RETURN n.id AS node, n.pagerank AS score
    ORDER BY n.pagerank DESC
    LIMIT 10
""")

# Betweenness centrality — stream mode (no mutation)
bc_scores = db.gds.betweenness_centrality()   # dict[node_id, float]
top_id = max(bc_scores, key=bc_scores.get)

Community Detection

# Louvain community detection — write community ID to each node
db.gds.louvain(write_property="community")

# Count nodes per community via Cypher
communities = db.execute("""
    MATCH (n)
    RETURN n.community AS community, count(*) AS size
    ORDER BY size DESC
    LIMIT 10
""")

All 8 Available Algorithms

Method | Category | Returns
db.gds.pagerank(write_property=) | Centrality | Influence score
db.gds.betweenness_centrality() | Centrality | Bridge importance
db.gds.closeness_centrality() | Centrality | Reachability
db.gds.degree_centrality() | Centrality | Connectivity
db.gds.louvain(write_property=) | Community | Community ID
db.gds.connected_components() | Community | Component ID
db.gds.clustering_coefficient() | Structural | Local density
db.gds.triangle_count() | Structural | Triangle count

All methods accept optional node_label and rel_type to restrict the subgraph, and directed=True/False. Omit write_property for stream mode (returns dict[node_id, float] without mutating nodes).


Summary

Task | GraphForge
Count nodes / edges | MATCH (n) RETURN count(n)
Degree of a node | MATCH (n)-[r]-() RETURN count(r)
Top-N hubs | ORDER BY degree DESC LIMIT N
Variable-length paths | (a)-[:CONNECTED_TO*1..4]-(b)
Triangle count | db.gds.triangle_count() (fast) or Cypher (small graphs)
Temporal filter | WHERE r.since >= '2024-01-01'
Persist metric | db.gds.pagerank(write_property="pr")
Load built-in dataset | load_dataset(db, "snap-ego-facebook")

Next Steps