
Dataset Integration

GraphForge provides built-in support for loading popular graph datasets from various public repositories, making it easy to experiment, benchmark, and learn with real-world data.

Overview

The dataset system allows you to:

  • Load pre-configured graph datasets with a single command
  • Explore standard benchmark datasets
  • Test queries on realistic data
  • Compare performance with other graph databases
  • Learn Cypher with meaningful examples

Supported Sources

✅ Available Now (v0.2.2)

SNAP (Stanford Network Analysis Project)

Real-world network datasets from Stanford, covering social networks, web graphs, citation networks, collaboration networks, and communication networks.

Status: 5 datasets available

Use cases: Research, network analysis, graph algorithm development, academic projects

Available datasets:

  • snap-ego-facebook - Facebook social circles (4K nodes, 88K edges)
  • snap-email-enron - Enron email network (37K nodes, 184K edges)
  • snap-ca-astroph - Astrophysics collaboration (19K nodes, 198K edges)
  • snap-web-google - Google web graph (876K nodes, 5.1M edges)
  • snap-twitter-combined - Twitter social circles (81K nodes, 1.8M edges)
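To get a feel for how these datasets differ before loading one, the rounded sizes listed above can be compared directly. The following is a minimal sketch in plain Python that does not call GraphForge; the dictionary and both helper functions are illustrative names, and the figures are the approximate counts from the list, not exact values.

```python
# Approximate sizes copied from the list above (rounded, not exact counts).
SNAP_DATASETS = {
    "snap-ego-facebook": (4_000, 88_000),
    "snap-email-enron": (37_000, 184_000),
    "snap-ca-astroph": (19_000, 198_000),
    "snap-web-google": (876_000, 5_100_000),
    "snap-twitter-combined": (81_000, 1_800_000),
}


def edges_per_node(nodes: int, edges: int) -> float:
    """Return the average number of edges per node."""
    return edges / nodes


def rank_by_edges(datasets: dict) -> list:
    """Return dataset names ordered from fewest to most edges."""
    return sorted(datasets, key=lambda name: datasets[name][1])


for name in rank_by_edges(SNAP_DATASETS):
    nodes, edges = SNAP_DATASETS[name]
    print(f"{name}: {edges_per_node(nodes, edges):.1f} edges per node")
```

Ordering by edge count is one reasonable way to pick a first dataset: the smallest entry downloads and loads fastest, so it is a sensible starting point for trying out queries.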

📅 Coming Soon

LDBC (Linked Data Benchmark Council)

Standard benchmark datasets for graph database performance testing.

Status: Planned for v0.3.0

Use cases: Performance benchmarking, complex query testing

NetworkRepository

Large collection of diverse network datasets.

Status: Planned for a future release

Use cases: Network analysis, algorithm testing, research

Quick Start

Loading a Dataset

from graphforge import GraphForge
from graphforge.datasets import load_dataset

# Create a GraphForge instance
gf = GraphForge()

# Load a dataset by name
load_dataset(gf, "snap-ego-facebook")

# Query the loaded data
results = gf.execute("MATCH (n)-[r]->(m) RETURN n, r, m LIMIT 10")

Using the Convenience Method

from graphforge import GraphForge

# Load dataset during initialization
gf = GraphForge.from_dataset("snap-ego-facebook")

# Start querying immediately
results = gf.execute("MATCH (n) RETURN count(n) AS node_count")

Listing Available Datasets

from graphforge.datasets import list_datasets

# Get all available datasets
datasets = list_datasets()

for ds in datasets:
    print(f"{ds.name}: {ds.description}")
    print(f"  Source: {ds.source}")
    print(f"  Nodes: {ds.nodes:,}, Edges: {ds.edges:,}")
    print(f"  Size: {ds.size_mb:.1f} MB")
    print()

Filtering by Source

from graphforge.datasets import list_datasets

# Get only SNAP datasets
snap_datasets = list_datasets(source="snap")

# Get only small datasets (< 10 MB)
small_datasets = [ds for ds in list_datasets() if ds.size_mb < 10]
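The same filtering idea can be tried without GraphForge installed. The sketch below uses a hypothetical `DatasetStub` dataclass as a stand-in for the objects returned by `list_datasets()`; it carries only the `name`, `source`, `nodes`, and `edges` fields, filled with the rounded counts from the Supported Sources section. The `select` helper and its `max_edges` parameter are assumptions made for illustration, not part of the GraphForge API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class DatasetStub:
    """Hypothetical stand-in for a dataset metadata object."""
    name: str
    source: str
    nodes: int
    edges: int


# Rounded counts taken from the Supported Sources section.
CATALOG = [
    DatasetStub("snap-ego-facebook", "snap", 4_000, 88_000),
    DatasetStub("snap-email-enron", "snap", 37_000, 184_000),
    DatasetStub("snap-ca-astroph", "snap", 19_000, 198_000),
    DatasetStub("snap-web-google", "snap", 876_000, 5_100_000),
    DatasetStub("snap-twitter-combined", "snap", 81_000, 1_800_000),
]


def select(catalog: List[DatasetStub],
           source: Optional[str] = None,
           max_edges: Optional[int] = None) -> List[DatasetStub]:
    """Keep entries that match the source and stay under the edge limit."""
    picked = []
    for ds in catalog:
        if source is not None and ds.source != source:
            continue
        if max_edges is not None and ds.edges > max_edges:
            continue
        picked.append(ds)
    return picked


small = select(CATALOG, source="snap", max_edges=200_000)
print([ds.name for ds in small])
```

Filtering on edge count rather than file size is a choice made here because the edge counts are documented above, while the per-dataset sizes in MB are not.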

CLI Usage

GraphForge provides command-line tools for working with datasets:

# List all available datasets
graphforge list-datasets

# Show detailed information about a dataset
graphforge dataset-info snap-ego-facebook

# Load a dataset
graphforge load-dataset snap-ego-facebook

# List datasets by source
graphforge list-datasets --source snap

# Clear dataset cache
graphforge clear-cache

Dataset Caching

Datasets are automatically cached locally after the first download to improve load times:

  • Cache location: ~/.graphforge/datasets/
  • Cache behavior: Downloaded once, reused on subsequent loads
  • Cache management: Use graphforge clear-cache to remove cached datasets
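The download-once, reuse-afterwards behavior can be illustrated with a few lines of standard-library Python. This is a sketch of the general pattern only, not GraphForge's actual cache implementation: `cached_fetch`, `fake_download`, and the `.txt` file naming are all hypothetical, and a temporary directory stands in for ~/.graphforge/datasets/.

```python
import tempfile
from pathlib import Path


def cached_fetch(name, cache_dir, download):
    """Return the cached file for `name`, calling `download` only on a miss."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / f"{name}.txt"
    if not target.exists():
        # Cache miss: fetch the bytes once and store them on disk.
        target.write_bytes(download(name))
    return target


calls = []


def fake_download(name):
    """Hypothetical downloader that records calls instead of using the network."""
    calls.append(name)
    return b"0 1\n0 2\n"


with tempfile.TemporaryDirectory() as tmp:
    first = cached_fetch("snap-ego-facebook", tmp, fake_download)
    second = cached_fetch("snap-ego-facebook", tmp, fake_download)
    print(first == second, len(calls))
```

The second call finds the file already on disk and skips the download, which is why deleting the cached files forces a fresh download on the next load.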

Dataset Metadata

Each dataset includes comprehensive metadata:

from graphforge.datasets import get_dataset_info

info = get_dataset_info("snap-ego-facebook")

print(f"Name: {info.name}")
print(f"Description: {info.description}")
print(f"Source: {info.source}")
print(f"URL: {info.url}")
print(f"Nodes: {info.nodes:,}")
print(f"Edges: {info.edges:,}")
print(f"Labels: {', '.join(info.labels)}")
print(f"Relationship Types: {', '.join(info.relationship_types)}")
print(f"Size: {info.size_mb:.1f} MB")
print(f"License: {info.license}")
print(f"Category: {info.category}")

Jupyter Notebook Integration

Datasets load the same way in a Jupyter notebook as in a script:

# In a Jupyter notebook cell
from graphforge import GraphForge
from graphforge.datasets import load_dataset

gf = GraphForge()
load_dataset(gf, "snap-ego-facebook")

# Explore the data
gf.execute("MATCH (n) RETURN count(n) AS node_count")

Contributing Datasets

To add a new dataset source or a specific dataset:

  1. Create a loader in src/graphforge/datasets/
  2. Register the dataset in the registry
  3. Add tests in tests/integration/test_datasets.py
  4. Update documentation

See the development guide for details.

Troubleshooting

Download Failures

If a dataset download fails:

  • Check your internet connection
  • Try clearing the cache: graphforge clear-cache
  • Download the file manually from the source URL listed in the dataset metadata

Memory Issues

Large datasets may require significant memory:

  • Start with a smaller dataset such as snap-ego-facebook or, once LDBC support is available, a small scale factor (e.g., LDBC SF0.001)
  • Use a machine with sufficient RAM
  • Consider using SQLite persistence for large datasets

Import Errors

If dataset import fails:

  • Check the dataset format compatibility
  • Verify the source URL is accessible
  • Report issues on GitHub
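When checking the format of a manually downloaded file, it can help to parse it independently of GraphForge. The sketch below assumes the file is a plain-text edge list of the kind SNAP commonly distributes: one whitespace-separated pair of node ids per line, with optional header lines starting with `#`. That layout is an assumption about the downloaded file, and `read_edge_list` is a hypothetical helper, not the loader GraphForge uses.

```python
import io


def read_edge_list(lines):
    """Parse whitespace-separated edge pairs, skipping blanks and '#' comments."""
    edges = []
    for line_no, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        if len(parts) < 2:
            raise ValueError(f"line {line_no}: expected two node ids, got {raw!r}")
        edges.append((parts[0], parts[1]))
    return edges


# In-memory sample standing in for a downloaded file.
sample = io.StringIO("# FromNodeId\tToNodeId\n0\t1\n0\t2\n\n1\t2\n")
edges = read_edge_list(sample)
nodes = {node for edge in edges for node in edge}
print(len(nodes), len(edges))
```

Comparing the resulting node and edge counts against the dataset metadata is a quick way to tell a truncated or malformed download from a genuine incompatibility. For a real file, pass an open file object instead of the `io.StringIO` sample.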

License Information

Each dataset has its own license. Always check the license field in the dataset metadata before using a dataset in production or research.

Next Steps

  • Explore SNAP for research and network analysis datasets
  • Try LDBC for benchmarking (coming soon)
  • Check NetworkRepository for diverse networks (coming soon)