Dataset Integration¶

GraphForge provides built-in support for loading popular graph datasets from public repositories, so you can experiment, benchmark, and learn with real-world data.

Overview¶

The dataset system allows you to:

- Load pre-configured graph datasets with a single command
- Explore standard benchmark datasets
- Test queries on realistic data
- Compare performance with other graph databases
- Learn Cypher with meaningful examples

Supported Sources¶

✅ Available Now (v0.2.2)¶

SNAP (Stanford Network Analysis Project)¶

Real-world network datasets from Stanford, covering social networks, web graphs, citation networks, collaboration networks, and communication networks.

Status: 5 datasets available
Use cases: Research, network analysis, graph algorithm development, academic projects

Available datasets:

- snap-ego-facebook - Facebook social circles (4K nodes, 88K edges)
- snap-email-enron - Enron email network (37K nodes, 184K edges)
- snap-ca-astroph - Astrophysics collaboration (19K nodes, 198K edges)
- snap-web-google - Google web graph (876K nodes, 5.1M edges)
- snap-twitter-combined - Twitter social circles (81K nodes, 1.8M edges)
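
To compare the listed datasets at a glance, the rounded counts above can be turned into an average number of edges per node. The following is a minimal sketch that uses only the figures quoted in the list; `SNAP_SIZES` and `edges_per_node` are illustrative names and are not part of GraphForge.

```python
# Rounded (nodes, edges) figures copied from the list above.
# SNAP_SIZES is an illustrative name, not a GraphForge object.
SNAP_SIZES = {
    "snap-ego-facebook": (4_000, 88_000),
    "snap-email-enron": (37_000, 184_000),
    "snap-ca-astroph": (19_000, 198_000),
    "snap-web-google": (876_000, 5_100_000),
    "snap-twitter-combined": (81_000, 1_800_000),
}


def edges_per_node(nodes: int, edges: int) -> float:
    """Return the average number of edges per node."""
    return edges / nodes


# Sort dataset names from sparsest to densest.
ranked = sorted(SNAP_SIZES, key=lambda name: edges_per_node(*SNAP_SIZES[name]))

for name in ranked:
    nodes, edges = SNAP_SIZES[name]
    print(f"{name}: {edges_per_node(nodes, edges):.1f} edges per node")
```

A denser graph usually means more relationships to traverse per query, so the sparser datasets may be a gentler starting point for experiments.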

📅 Coming Soon¶

LDBC (Linked Data Benchmark Council)¶

Standard benchmark datasets for graph database performance testing.

Status: Planned for v0.3.0
Use cases: Performance benchmarking, complex query testing

NetworkRepository¶

Large collection of diverse network datasets.

Status: Planned for a future release
Use cases: Network analysis, algorithm testing, research

Quick Start¶

Loading a Dataset¶

from graphforge import GraphForge
from graphforge.datasets import load_dataset

# Create a GraphForge instance
gf = GraphForge()

# Load a dataset by name
load_dataset(gf, "snap-ego-facebook")

# Query the loaded data
results = gf.execute("MATCH (n)-[r]->(m) RETURN n, r, m LIMIT 10")

Using the Convenience Method¶

from graphforge import GraphForge

# Load dataset during initialization
gf = GraphForge.from_dataset("snap-ego-facebook")

# Start querying immediately
results = gf.execute("MATCH (n) RETURN count(n) AS node_count")

Listing Available Datasets¶

from graphforge.datasets import list_datasets

# Get all available datasets
datasets = list_datasets()

for ds in datasets:
    print(f"{ds.name}: {ds.description}")
    print(f"  Source: {ds.source}")
    print(f"  Nodes: {ds.nodes:,}, Edges: {ds.edges:,}")
    print(f"  Size: {ds.size_mb:.1f} MB")
    print()

Filtering by Source¶

from graphforge.datasets import list_datasets

# Get only SNAP datasets
snap_datasets = list_datasets(source="snap")

# Get only small datasets (< 10 MB)
small_datasets = [ds for ds in list_datasets() if ds.size_mb < 10]
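
The same filtering idea can be tried without downloading anything. The sketch below assumes a stand-in record type, `DatasetRecord`, that mirrors a few of the attributes used above (`name`, `source`, `nodes`, `edges`); it is hypothetical and is not the class GraphForge returns. The catalog values are the rounded figures from the SNAP list in this document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetRecord:
    """Hypothetical stand-in for the objects returned by list_datasets()."""
    name: str
    source: str
    nodes: int
    edges: int


# Rounded figures taken from the SNAP list in this document.
CATALOG = [
    DatasetRecord("snap-ego-facebook", "snap", 4_000, 88_000),
    DatasetRecord("snap-email-enron", "snap", 37_000, 184_000),
    DatasetRecord("snap-web-google", "snap", 876_000, 5_100_000),
]


def filter_datasets(records, source=None, max_nodes=None):
    """Keep records that match `source` and have at most `max_nodes` nodes."""
    selected = []
    for rec in records:
        if source is not None and rec.source != source:
            continue
        if max_nodes is not None and rec.nodes > max_nodes:
            continue
        selected.append(rec)
    return selected


small_snap = filter_datasets(CATALOG, source="snap", max_nodes=50_000)
print([rec.name for rec in small_snap])
```

Filtering on node count rather than download size is a choice made for this sketch, because node counts are the only size figures quoted in this document.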

CLI Usage¶

GraphForge provides command-line tools for working with datasets:

# List all available datasets
graphforge list-datasets

# Show detailed information about a dataset
graphforge dataset-info snap-ego-facebook

# Load a dataset
graphforge load-dataset snap-ego-facebook

# List datasets by source
graphforge list-datasets --source snap

# Clear dataset cache
graphforge clear-cache

Dataset Caching¶

Datasets are automatically cached locally after the first download to improve load times:

Cache location: ~/.graphforge/datasets/
Cache behavior: Downloaded once, reused on subsequent loads
Cache management: Use graphforge clear-cache to remove cached datasets
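
The download-once behavior described above can be sketched in a few lines. This is an illustration of the general pattern, not GraphForge's actual implementation: `cached_fetch`, `clear_cache`, the `.txt` file suffix, and the fake downloader are all assumptions made for the example, and the demo writes to a temporary directory rather than ~/.graphforge/datasets/.

```python
import shutil
import tempfile
from pathlib import Path


def cached_fetch(name: str, cache_dir: Path, download) -> Path:
    """Return the cached file for `name`, downloading it only if missing."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / f"{name}.txt"  # file suffix is an assumption
    if not target.exists():
        # First load: fetch the bytes and store them in the cache.
        target.write_bytes(download(name))
    return target


def clear_cache(cache_dir: Path) -> None:
    """Delete the cache directory and everything in it."""
    shutil.rmtree(cache_dir, ignore_errors=True)


# Demonstration with a fake downloader, so nothing touches the network.
calls = []


def fake_download(name: str) -> bytes:
    calls.append(name)
    return b"0 1\n1 2\n"


demo_dir = Path(tempfile.mkdtemp()) / "datasets"
first = cached_fetch("snap-ego-facebook", demo_dir, fake_download)
second = cached_fetch("snap-ego-facebook", demo_dir, fake_download)
print(len(calls))  # the downloader ran only for the first call

clear_cache(demo_dir)
```

After `clear_cache` runs, the next `cached_fetch` call would download again, which corresponds to the behavior described for graphforge clear-cache.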

Dataset Metadata¶

Each dataset exposes metadata covering its origin, size, schema, and license:

from graphforge.datasets import get_dataset_info

info = get_dataset_info("snap-ego-facebook")

print(f"Name: {info.name}")
print(f"Description: {info.description}")
print(f"Source: {info.source}")
print(f"URL: {info.url}")
print(f"Nodes: {info.nodes:,}")
print(f"Edges: {info.edges:,}")
print(f"Labels: {', '.join(info.labels)}")
print(f"Relationship Types: {', '.join(info.relationship_types)}")
print(f"Size: {info.size_mb:.1f} MB")
print(f"License: {info.license}")
print(f"Category: {info.category}")

Jupyter Notebook Integration¶

The same API works in Jupyter notebooks:

# In a Jupyter notebook cell
from graphforge import GraphForge
from graphforge.datasets import load_dataset

gf = GraphForge()
load_dataset(gf, "snap-ego-facebook")

# Explore the data
gf.execute("MATCH (n) RETURN count(n) AS node_count")

Contributing Datasets¶

To add a new dataset source or specific dataset:

1. Create a loader in src/graphforge/datasets/
2. Register the dataset in the registry
3. Add tests in tests/integration/test_datasets.py
4. Update documentation

See the development guide for details.
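
As a rough picture of what "create a loader" and "register the dataset" can mean, here is a generic registry pattern. Everything in it is hypothetical: `DatasetEntry`, `REGISTRY`, `register_dataset`, and the `example-triangle` dataset are names invented for this sketch, and GraphForge's real registry may look quite different, so treat the development guide as the authority.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Tuple

Edge = Tuple[str, str]


@dataclass(frozen=True)
class DatasetEntry:
    """Hypothetical registry record: metadata plus a loader callable."""
    name: str
    source: str
    loader: Callable[[], Iterable[Edge]]


# Hypothetical registry; GraphForge's real registry may differ.
REGISTRY: Dict[str, DatasetEntry] = {}


def register_dataset(name: str, source: str):
    """Decorator that records a loader function under `name`."""
    def decorator(loader: Callable[[], Iterable[Edge]]):
        if name in REGISTRY:
            raise ValueError(f"dataset already registered: {name}")
        REGISTRY[name] = DatasetEntry(name, source, loader)
        return loader
    return decorator


@register_dataset("example-triangle", source="example")
def load_triangle() -> Iterable[Edge]:
    # A tiny in-memory graph standing in for a downloaded dataset.
    return [("a", "b"), ("b", "c"), ("c", "a")]


entry = REGISTRY["example-triangle"]
edges = list(entry.loader())
print(entry.source, len(edges))
```

Rejecting duplicate names at registration time keeps two loaders from silently competing for the same dataset name, which is also an easy property to cover in an integration test.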

Troubleshooting¶

Download Failures¶

If a dataset download fails:

- Check your internet connection
- Try clearing the cache: graphforge clear-cache
- Manually download from the source URL

Memory Issues¶

Large datasets may require significant memory:

- Start with a smaller dataset or, once LDBC datasets are available, a smaller scale factor (e.g., LDBC SF0.001)
- Use a machine with sufficient RAM
- Consider using SQLite persistence for large datasets

Import Errors¶

If dataset import fails:

- Check the dataset format compatibility
- Verify the source URL is accessible
- Report issues on GitHub
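
One way to check format compatibility for a manually downloaded file is to scan it before importing. Many SNAP datasets are distributed as plain-text edge lists, with one whitespace-separated pair of node IDs per line and `#` comment lines as headers. The sketch below assumes that layout; `summarize_edge_list` is a hypothetical helper, not a GraphForge function, and files with extra columns (weights, timestamps) would be reported as malformed by it.

```python
import io


def summarize_edge_list(lines):
    """Count nodes and edges in a whitespace-separated edge list.

    Blank lines and lines starting with '#' are skipped. Returns
    (node_count, edge_count, bad), where `bad` lists the 1-based
    numbers of lines that do not contain exactly two fields.
    """
    nodes = set()
    edge_count = 0
    bad = []
    for lineno, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        if len(parts) != 2:
            bad.append(lineno)
            continue
        source, target = parts
        nodes.update((source, target))
        edge_count += 1
    return len(nodes), edge_count, bad


# Tiny in-memory sample; with a real file, pass open(path) instead.
sample = io.StringIO("# FromNodeId\tToNodeId\n0\t1\n0\t2\n1 2\nbroken-line\n")
print(summarize_edge_list(sample))
```

If the counts differ widely from the figures in the dataset metadata, or many lines are reported as malformed, the file is probably truncated or in a different format than the loader expects.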

Related documentation:

- Cypher Script Loading - Load .cypher and .cql script files
- SNAP Datasets - Available now
- LDBC Datasets - Planned
- NetworkRepository Datasets - Planned
- API Reference

License Information¶

Each dataset has its own license. Always check the dataset metadata for licensing information before using it in production or research.

Next Steps¶

- Explore SNAP for research and network analysis datasets
- Try LDBC for benchmarking (coming soon)
- Check NetworkRepository for diverse networks (coming soon)