SNAP Datasets¶
Status: ✅ Available in v0.2.1 (5 datasets)
Support for SNAP (Stanford Network Analysis Project) datasets for network analysis and research.
Overview¶
SNAP provides an extensive collection of real-world network datasets from Stanford University, widely used in research and education. GraphForge currently supports 5 curated datasets with more coming soon.
Dataset Source¶
- URL: https://snap.stanford.edu/data/
- License: Public Domain (for datasets we support)
- Category: Research, Academic
- Format: Edge-list (CSV/TSV), automatically downloaded and cached
Available Datasets (v0.2.1)¶
| Dataset | Nodes | Edges | Size | Category | Description |
|---|---|---|---|---|---|
| snap-ego-facebook | 4,039 | 88,234 | 0.5 MB | Social | Facebook social circles (ego networks) |
| snap-email-enron | 36,692 | 183,831 | 2.5 MB | Communication | Enron email communication network |
| snap-ca-astroph | 18,772 | 198,110 | 1.8 MB | Collaboration | Astrophysics collaboration network |
| snap-web-google | 875,713 | 5,105,039 | 75 MB | Web | Google web graph from 2002 |
| snap-twitter-combined | 81,306 | 1,768,149 | 25 MB | Social | Twitter social circles (combined ego networks) |
Coming Soon¶
Additional SNAP datasets are planned for future releases. See Issue #70 for the roadmap to support 100+ SNAP datasets.
Quick Start¶
Load a Dataset¶
from graphforge import GraphForge
# Method 1: Load during initialization
gf = GraphForge.from_dataset("snap-ego-facebook")
# Method 2: Load into existing instance
gf = GraphForge()
from graphforge.datasets import load_dataset
load_dataset(gf, "snap-ego-facebook")
# Start querying immediately
results = gf.execute("""
MATCH (n)-[r]->(m)
RETURN count(DISTINCT n) AS users, count(r) AS connections
""")
print(f"Users: {results[0]['users'].value:,}")
print(f"Connections: {results[0]['connections'].value:,}")
Browse Available Datasets¶
from graphforge.datasets import list_datasets
# List all SNAP datasets
snap_datasets = list_datasets(source="snap")
for ds in snap_datasets:
    print(f"{ds.name}")
    print(f"  {ds.description}")
    print(f"  Nodes: {ds.nodes:,}, Edges: {ds.edges:,}")
    print(f"  Category: {ds.category}, Size: {ds.size_mb} MB")
    print()
Filter by Category¶
from graphforge.datasets import list_datasets
# Get only social network datasets
social_nets = list_datasets(source="snap", category="social")
# Get small datasets (< 10 MB)
small_datasets = list_datasets(source="snap", max_size_mb=10.0)
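To make the expected results concrete, here is a minimal sketch that applies the same two filters to the five entries in the table above. It does not call GraphForge: `DatasetInfo`, `CATALOG`, and `filter_catalog` are hypothetical stand-ins, and treating the size limit as inclusive and the category match as case-insensitive are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetInfo:
    # Hypothetical stand-in for the metadata objects returned by list_datasets()
    name: str
    category: str
    size_mb: float


# Values copied from the "Available Datasets" table
CATALOG = [
    DatasetInfo("snap-ego-facebook", "social", 0.5),
    DatasetInfo("snap-email-enron", "communication", 2.5),
    DatasetInfo("snap-ca-astroph", "collaboration", 1.8),
    DatasetInfo("snap-web-google", "web", 75.0),
    DatasetInfo("snap-twitter-combined", "social", 25.0),
]


def filter_catalog(catalog, category=None, max_size_mb=None):
    """Return the entries that satisfy every filter that is not None."""
    selected = []
    for ds in catalog:
        if category is not None and ds.category != category.lower():
            continue
        if max_size_mb is not None and ds.size_mb > max_size_mb:
            continue
        selected.append(ds)
    return selected


social = [ds.name for ds in filter_catalog(CATALOG, category="social")]
small = [ds.name for ds in filter_catalog(CATALOG, max_size_mb=10.0)]
print(social)  # ['snap-ego-facebook', 'snap-twitter-combined']
print(small)   # ['snap-ego-facebook', 'snap-email-enron', 'snap-ca-astroph']
```

If the real filters behave the same way, only snap-web-google and snap-twitter-combined fall outside the 10 MB limit.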
Example Queries¶
Social Network Analysis¶
from graphforge import GraphForge
gf = GraphForge.from_dataset("snap-ego-facebook")
# Find most connected users
results = gf.execute("""
MATCH (n)-[r]->()
RETURN n.id AS user, count(r) AS connections
ORDER BY connections DESC
LIMIT 10
""")
for row in results:
    print(f"User {row['user'].value}: {row['connections'].value} connections")
Network Statistics¶
from graphforge import GraphForge
gf = GraphForge.from_dataset("snap-email-enron")
# Calculate basic network metrics
results = gf.execute("""
MATCH (n)-[r]-(m)
WITH count(DISTINCT n) AS nodes, count(r)/2 AS edges
RETURN
nodes,
edges,
edges * 1.0 / nodes AS avgDegree
""")
print(f"Nodes: {results[0]['nodes'].value:,}")
print(f"Edges: {results[0]['edges'].value:,}")
print(f"Average Degree: {results[0]['avgDegree'].value:.2f}")
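As a sanity check on the arithmetic, the sketch below reproduces the ratio from the published sizes in the table rather than from a loaded graph. The query above reports edges per node. For an undirected graph, the conventional mean degree is twice that, because every edge touches two endpoints. Whether the loaded graph stores each undirected edge once or twice is not stated here, so the numbers GraphForge returns may differ.

```python
# Published sizes for snap-email-enron, taken from the table above
nodes = 36_692
edges = 183_831

# What `edges * 1.0 / nodes` computes in the query
edges_per_node = edges / nodes

# Mean degree of an undirected graph: each edge contributes to two nodes
mean_undirected_degree = 2 * edges / nodes

print(f"{edges_per_node:.2f}")          # 5.01
print(f"{mean_undirected_degree:.2f}")  # 10.02
```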
Collaboration Patterns¶
from graphforge import GraphForge
gf = GraphForge.from_dataset("snap-ca-astroph")
# Find researchers with many collaborators
results = gf.execute("""
MATCH (researcher)-[:CONNECTED_TO]->(colleague)
WITH researcher, count(DISTINCT colleague) AS collaborators
WHERE collaborators > 10
RETURN researcher.id AS researcher, collaborators
ORDER BY collaborators DESC
LIMIT 20
""")
Web Graph Analysis¶
from graphforge import GraphForge
gf = GraphForge.from_dataset("snap-web-google")
# Find most linked-to pages
results = gf.execute("""
MATCH ()-[r]->(page)
RETURN page.id AS page, count(r) AS inlinks
ORDER BY inlinks DESC
LIMIT 10
""")
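To cross-check an in-link ranking on a small sample, the same aggregation can be done in plain Python. This is a sketch on made-up edges, not on the real web graph, and it does not use GraphForge.

```python
from collections import Counter

# Hypothetical directed edges (source page -> target page)
edges = [(1, 3), (2, 3), (4, 3), (1, 2), (4, 2), (3, 1)]

# Count how many edges point at each target, like count(r) grouped by page
inlinks = Counter(target for _, target in edges)

# Equivalent of ORDER BY inlinks DESC LIMIT 2
top = inlinks.most_common(2)
print(top)  # [(3, 3), (2, 2)]
```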
Dataset Details¶
Edge List Format¶
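SNAP datasets use a simple edge-list format: each data line holds one edge as a source node ID and a target node ID separated by a tab or space, and comment lines start with #.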
The CSV loader automatically:
- Detects delimiter (tab, comma, or space)
- Handles gzip compression
- Skips comment lines
- Creates nodes with Node label and id property
- Creates CONNECTED_TO relationships
- Supports optional weights
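The steps in this list can be sketched with the standard library alone. This is an illustration of the general technique, not GraphForge's loader: `read_edge_list` is a hypothetical helper, the sample data is made up, and keeping node IDs as strings is an assumption.

```python
import gzip


def read_edge_list(raw: bytes):
    """Parse an edge list into (source, target, weight) tuples."""
    # gzip streams start with the magic bytes 0x1f 0x8b
    if raw[:2] == b"\x1f\x8b":
        raw = gzip.decompress(raw)
    text = raw.decode("utf-8")

    # Drop blank lines and comment lines (SNAP files use a leading '#')
    lines = [
        ln for ln in text.splitlines()
        if ln.strip() and not ln.lstrip().startswith("#")
    ]
    if not lines:
        return []

    # Guess the delimiter from the first data line: tab, then comma,
    # otherwise fall back to any run of whitespace (delimiter=None)
    first = lines[0]
    delimiter = "\t" if "\t" in first else "," if "," in first else None

    edges = []
    for ln in lines:
        fields = [f.strip() for f in ln.split(delimiter)]
        source, target = fields[0], fields[1]
        # A third column, when present, is read as an edge weight
        weight = float(fields[2]) if len(fields) > 2 else None
        edges.append((source, target, weight))
    return edges


# Made-up sample in the style of a SNAP file
sample = b"# Directed graph (sample)\n# FromNodeId\tToNodeId\n0\t1\n0\t2\n1\t2\n"
plain = read_edge_list(sample)
compressed = read_edge_list(gzip.compress(sample))
print(plain)
```

Each resulting tuple would then map to two nodes and one relationship between them.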
Download and Caching¶
Datasets are automatically downloaded and cached:
- First load: Downloads from SNAP website (~5-30 seconds depending on size)
- Subsequent loads: Uses cached file (instant)
- Cache location: ~/.graphforge/datasets/
- Cache expiry: 30 days (refreshes automatically)
Clear cache manually:
from graphforge.datasets import clear_cache
clear_cache("snap-ego-facebook") # Clear specific dataset
clear_cache() # Clear all cached datasets
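One common way to implement a 30-day expiry like the one described above is to compare a cached file's modification time against the current time. The sketch below shows that idea only. `is_cache_fresh` and the file name are hypothetical, and GraphForge may track expiry differently.

```python
import tempfile
import time
from pathlib import Path

CACHE_TTL_SECONDS = 30 * 24 * 60 * 60  # 30 days, matching the stated expiry


def is_cache_fresh(path, now=None, ttl=CACHE_TTL_SECONDS):
    """Return True if `path` exists and was modified less than `ttl` seconds ago."""
    path = Path(path)
    if not path.exists():
        return False
    now = time.time() if now is None else now
    return (now - path.stat().st_mtime) < ttl


with tempfile.TemporaryDirectory() as tmp:
    cached = Path(tmp) / "snap-ego-facebook.txt.gz"  # hypothetical file name
    cached.write_bytes(b"")

    fresh_now = is_cache_fresh(cached)
    # Simulate a check made 31 days from now
    fresh_later = is_cache_fresh(cached, now=time.time() + 31 * 24 * 60 * 60)
    missing = is_cache_fresh(Path(tmp) / "absent.txt")

print(fresh_now, fresh_later, missing)  # True False False
```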
Performance Tips¶
Large Datasets¶
For large datasets like snap-web-google:
from graphforge import GraphForge
from graphforge.datasets import load_dataset
# Use persistent storage to avoid memory issues
gf = GraphForge("analysis.db")
load_dataset(gf, "snap-web-google")
# Queries will use SQLite backend for efficiency
results = gf.execute("MATCH (n) RETURN count(n) AS count")
Incremental Analysis¶
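from graphforge import GraphForge
from graphforge.datasets import load_dataset
# Load the dataset only if the database is still empty
gf = GraphForge("facebook.db")
if not gf.execute("MATCH (n) RETURN count(n) AS c")[0]['c'].value:
    load_dataset(gf, "snap-ego-facebook")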
# Run multiple analyses without reloading
analysis1 = gf.execute("MATCH (n) RETURN count(n)")
analysis2 = gf.execute("MATCH ()-[r]->() RETURN count(r)")
References¶
- SNAP Website
- SNAP Data Description
- GraphForge Issue #70 - Expand SNAP support
Related¶
- Dataset Overview - Complete dataset guide
- Loading Datasets - Quick start guide
- API Reference - Dataset API documentation