SNAP Datasets¶

Status: ✅ Available in v0.2.1 (5 datasets)

Support for SNAP (Stanford Network Analysis Project) datasets for network analysis and research.

Overview¶

SNAP provides an extensive collection of real-world network datasets from Stanford University, widely used in research and education. GraphForge currently supports 5 curated datasets with more coming soon.

Dataset Source¶

URL: https://snap.stanford.edu/data/
License: Public Domain (for datasets we support)
Category: Research, Academic
Format: Edge-list (CSV/TSV), automatically downloaded and cached

Available Datasets (v0.2.1)¶

Dataset	Nodes	Edges	Size	Category	Description
`snap-ego-facebook`	4,039	88,234	0.5 MB	Social	Facebook social circles (ego networks)
`snap-email-enron`	36,692	183,831	2.5 MB	Communication	Enron email communication network
`snap-ca-astroph`	18,772	198,110	1.8 MB	Collaboration	Astrophysics collaboration network
`snap-web-google`	875,713	5,105,039	75 MB	Web	Google web graph from 2002
`snap-twitter-combined`	81,306	1,768,149	25 MB	Social	Twitter social circles (combined ego networks)

Coming Soon¶

Additional SNAP datasets are planned for future releases. See Issue #70 for the roadmap to support 100+ SNAP datasets.

Quick Start¶

Load a Dataset¶

from graphforge import GraphForge
from graphforge.datasets import load_dataset

# Method 1: Load during initialization
gf = GraphForge.from_dataset("snap-ego-facebook")

# Method 2: Load into existing instance
gf = GraphForge()
load_dataset(gf, "snap-ego-facebook")

# Start querying immediately
results = gf.execute("""
    MATCH (n)-[r]->(m)
    RETURN count(DISTINCT n) AS users, count(r) AS connections
""")

print(f"Users: {results[0]['users'].value:,}")
print(f"Connections: {results[0]['connections'].value:,}")

Browse Available Datasets¶

from graphforge.datasets import list_datasets

# List all SNAP datasets
snap_datasets = list_datasets(source="snap")

for ds in snap_datasets:
    print(f"{ds.name}")
    print(f"  {ds.description}")
    print(f"  Nodes: {ds.nodes:,}, Edges: {ds.edges:,}")
    print(f"  Category: {ds.category}, Size: {ds.size_mb} MB")
    print()

Filter by Category¶

from graphforge.datasets import list_datasets

# Get only social network datasets
social_nets = list_datasets(source="snap", category="social")

# Get small datasets (< 10 MB)
small_datasets = list_datasets(source="snap", max_size_mb=10.0)

Example Queries¶

from graphforge import GraphForge

gf = GraphForge.from_dataset("snap-ego-facebook")

# Find most connected users
results = gf.execute("""
    MATCH (n)-[r]->()
    RETURN n.id AS user, count(r) AS connections
    ORDER BY connections DESC
    LIMIT 10
""")

for row in results:
    print(f"User {row['user'].value}: {row['connections'].value} connections")
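
To make clear what the query above computes, the same ranking can be sketched in plain Python on a toy edge list. This is only an illustration: the `edges` list and its user ids are made up, and it does not reflect how GraphForge evaluates the query internally.

```python
from collections import Counter

# Toy directed edge list of (source, target) pairs; ids are hypothetical.
edges = [(1, 2), (1, 3), (1, 4), (2, 3), (3, 1), (3, 4)]

# Count outgoing relationships per source node, like count(r) grouped by n.
connections = Counter(source for source, _ in edges)

# Equivalent of ORDER BY connections DESC LIMIT 10.
top_users = connections.most_common(10)

for user, count in top_users:
    print(f"User {user}: {count} connections")
```

Because the pattern `(n)-[r]->()` is directed, only outgoing relationships are counted; a node that appears only as a target would not show up in the ranking.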

Network Statistics¶

gf = GraphForge.from_dataset("snap-email-enron")

# Calculate basic network metrics
results = gf.execute("""
    MATCH (n)-[r]-(m)
    WITH count(DISTINCT n) AS nodes, count(r)/2 AS edges
    RETURN
        nodes,
        edges,
        edges * 1.0 / nodes AS avgDegree
""")

print(f"Nodes: {results[0]['nodes'].value:,}")
print(f"Edges: {results[0]['edges'].value:,}")
print(f"Average Degree: {results[0]['avgDegree'].value:.2f}")
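
As a rough sanity check, the same ratio can be worked out by hand from the node and edge counts in the dataset table. The sketch below assumes those published counts; the numbers the query returns may differ depending on how the loader stores undirected edges. It also shows the conventional mean degree of an undirected graph, which is twice the edges-per-node ratio because each edge touches two nodes.

```python
# Node and edge counts for snap-email-enron, copied from the dataset table.
nodes = 36_692
edges = 183_831

# The ratio the query reports as avgDegree: edges per node.
edges_per_node = edges / nodes

# Mean degree when every edge is treated as undirected (two endpoints per edge).
mean_undirected_degree = 2 * edges / nodes

print(f"Edges per node: {edges_per_node:.2f}")
print(f"Mean undirected degree: {mean_undirected_degree:.2f}")
```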

Collaboration Patterns¶

gf = GraphForge.from_dataset("snap-ca-astroph")

# Find researchers with many collaborators
results = gf.execute("""
    MATCH (researcher)-[:CONNECTED_TO]->(colleague)
    WITH researcher, count(DISTINCT colleague) AS collaborators
    WHERE collaborators > 10
    RETURN researcher.id AS researcher, collaborators
    ORDER BY collaborators DESC
    LIMIT 20
""")

Web Graph Analysis¶

gf = GraphForge.from_dataset("snap-web-google")

# Find most linked-to pages
results = gf.execute("""
    MATCH ()-[r]->(page)
    RETURN page.id AS page, count(r) AS inlinks
    ORDER BY inlinks DESC
    LIMIT 10
""")

Dataset Details¶

Edge List Format¶

SNAP datasets use a simple edge-list format:

# Comment lines (ignored)
source_id target_id [weight]

The CSV loader automatically:

- Detects delimiter (tab, comma, or space)
- Handles gzip compression
- Skips comment lines
- Creates nodes with Node label and id property
- Creates CONNECTED_TO relationships
- Supports optional weights
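
The parsing steps listed above can be sketched with the standard library alone. This is a minimal illustration, not GraphForge's actual loader: the function names `sniff_delimiter` and `read_edge_list` are hypothetical, and gzip detection by the `.gz` suffix is an assumption.

```python
import gzip
from pathlib import Path


def sniff_delimiter(line):
    """Return tab or comma if the line contains one, else None (any whitespace)."""
    for candidate in ("\t", ","):
        if candidate in line:
            return candidate
    return None


def read_edge_list(path):
    """Yield (source, target, weight) tuples from a plain or gzipped edge list."""
    path = Path(path)
    # Assumption: compressed files are recognised by their .gz suffix.
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", encoding="utf-8") as handle:
        for raw in handle:
            line = raw.strip()
            if not line or line.startswith("#"):
                continue  # skip blank lines and comment lines
            fields = line.split(sniff_delimiter(line))
            source, target = fields[0].strip(), fields[1].strip()
            # The third column is optional and treated as a numeric weight.
            weight = float(fields[2]) if len(fields) > 2 else None
            yield source, target, weight
```

Each yielded tuple would correspond to one `CONNECTED_TO` relationship between two nodes identified by their `id` values. The generator reads line by line, so a large file is never held in memory all at once.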

Download and Caching¶

Datasets are automatically downloaded and cached:

First load: Downloads from SNAP website (~5-30 seconds depending on size)
Subsequent loads: Uses the cached file (no download needed)
Cache location: ~/.graphforge/datasets/
Cache expiry: 30 days (refreshes automatically)

Clear cache manually:

from graphforge.datasets import clear_cache

clear_cache("snap-ego-facebook")  # Clear specific dataset
clear_cache()                       # Clear all cached datasets
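
The 30-day expiry described above amounts to a file-age check. The sketch below shows one way such a check could work; it is an assumption about the mechanism, not GraphForge's implementation, and the helper name `is_cache_fresh` is hypothetical.

```python
import time
from pathlib import Path

# Cache location and expiry as stated in this section.
CACHE_DIR = Path.home() / ".graphforge" / "datasets"
MAX_AGE_SECONDS = 30 * 24 * 60 * 60  # 30 days


def is_cache_fresh(path, now=None, max_age=MAX_AGE_SECONDS):
    """Return True if the file exists and was modified less than max_age seconds ago."""
    path = Path(path)
    if not path.is_file():
        return False  # nothing cached yet, so a download is needed
    now = time.time() if now is None else now
    return (now - path.stat().st_mtime) < max_age
```

A loader built this way would download the dataset again whenever `is_cache_fresh` returns `False`, which covers both a missing file and an expired one.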

Performance Tips¶

Large Datasets¶

For large datasets like snap-web-google:

from graphforge import GraphForge
from graphforge.datasets import load_dataset

# Use persistent storage to avoid memory issues
gf = GraphForge("analysis.db")
load_dataset(gf, "snap-web-google")

# Queries will use SQLite backend for efficiency
results = gf.execute("MATCH (n) RETURN count(n) AS count")

Incremental Analysis¶

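from graphforge import GraphForge
from graphforge.datasets import load_dataset

# Load the dataset only if the database is still empty
gf = GraphForge("facebook.db")
if not gf.execute("MATCH (n) RETURN count(n) AS c")[0]['c'].value:
    load_dataset(gf, "snap-ego-facebook")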

# Run multiple analyses without reloading
analysis1 = gf.execute("MATCH (n) RETURN count(n)")
analysis2 = gf.execute("MATCH ()-[r]->() RETURN count(r)")

References¶

SNAP Website
SNAP Data Description
GraphForge Issue #70 - Expand SNAP support

Dataset Overview - Complete dataset guide
Loading Datasets - Quick start guide
API Reference - Dataset API documentation