Dataset Integration¶
GraphForge provides built-in support for loading popular graph datasets from various public repositories, making it easy to experiment, benchmark, and learn with real-world data.
Overview¶
The dataset system allows you to:
- Load pre-configured graph datasets with a single command
- Explore standard benchmark datasets
- Test queries on realistic data
- Compare performance with other graph databases
- Learn Cypher with meaningful examples
Supported Sources¶
✅ Available Now (v0.2.2)¶
SNAP (Stanford Network Analysis Project)¶
Real-world network datasets from Stanford, covering social networks, web graphs, citation networks, collaboration networks, and communication networks.
Status: 5 datasets available
Use cases: Research, network analysis, graph algorithm development, academic projects
Available datasets:
- snap-ego-facebook - Facebook social circles (4K nodes, 88K edges)
- snap-email-enron - Enron email network (37K nodes, 184K edges)
- snap-ca-astroph - Astrophysics collaboration (19K nodes, 198K edges)
- snap-web-google - Google web graph (876K nodes, 5.1M edges)
- snap-twitter-combined - Twitter social circles (81K nodes, 1.8M edges)
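SNAP distributes most of these graphs as plain-text edge lists: one whitespace-separated pair of node IDs per line, sometimes preceded by comment lines that start with #. The sketch below parses that format using only the standard library, assuming integer node IDs. parse_edge_list is a hypothetical helper for illustration, not part of GraphForge's API, and GraphForge's own loader may work differently.

```python
from typing import Iterable, Iterator, Tuple


def parse_edge_list(lines: Iterable[str]) -> Iterator[Tuple[int, int]]:
    """Yield (source, target) node-ID pairs from SNAP-style edge-list lines.

    Hypothetical helper for illustration; not a GraphForge function.
    """
    for line in lines:
        line = line.strip()
        # Skip blank lines and '#' comment/header lines.
        if not line or line.startswith("#"):
            continue
        # Columns may be separated by tabs or spaces.
        source, target = line.split()[:2]
        yield int(source), int(target)


# A tiny in-memory sample standing in for a downloaded file.
sample = [
    "# FromNodeId\tToNodeId",
    "0\t1",
    "0\t2",
    "",
    "1 2",
]
edges = list(parse_edge_list(sample))
nodes = {node for edge in edges for node in edge}
print(f"{len(nodes)} nodes, {len(edges)} edges")
```

With a real file, an open file handle can be passed as lines, since the function only iterates over it.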
📅 Coming Soon¶
LDBC (Linked Data Benchmark Council)¶
Standard benchmark datasets for graph database performance testing.
Status: Planned for v0.3.0
Use cases: Performance benchmarking, complex query testing
NetworkRepository¶
Large collection of diverse network datasets.
Status: Planned for a future release
Use cases: Network analysis, algorithm testing, research
Quick Start¶
Loading a Dataset¶
from graphforge import GraphForge
from graphforge.datasets import load_dataset
# Create a GraphForge instance
gf = GraphForge()
# Load a dataset by name
load_dataset(gf, "snap-ego-facebook")
# Query the loaded data
results = gf.execute("MATCH (n)-[r]->(m) RETURN n, r, m LIMIT 10")
Using the Convenience Method¶
from graphforge import GraphForge
# Load dataset during initialization
gf = GraphForge.from_dataset("snap-ego-facebook")
# Start querying immediately
results = gf.execute("MATCH (n) RETURN count(n) as node_count")
Listing Available Datasets¶
from graphforge.datasets import list_datasets
# Get all available datasets
datasets = list_datasets()
for ds in datasets:
    print(f"{ds.name}: {ds.description}")
    print(f"  Source: {ds.source}")
    print(f"  Nodes: {ds.nodes:,}, Edges: {ds.edges:,}")
    print(f"  Size: {ds.size_mb:.1f} MB")
    print()
Filtering by Source¶
from graphforge.datasets import list_datasets
# Get only SNAP datasets
snap_datasets = list_datasets(source="snap")
# Get only small datasets (< 10 MB)
small_datasets = [ds for ds in list_datasets() if ds.size_mb < 10]
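The same list-comprehension pattern applies to any metadata field. The sketch below is self-contained so it runs without GraphForge installed: DatasetSummary is a hypothetical stand-in for the entries returned by list_datasets(), carrying only the name, source, nodes, and edges attributes used earlier, and the counts are the rounded figures from the SNAP list in this guide rather than exact values.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class DatasetSummary:
    """Hypothetical stand-in for one list_datasets() entry."""
    name: str
    source: str
    nodes: int
    edges: int


# Rounded counts copied from the SNAP dataset list in this guide.
CATALOG = [
    DatasetSummary("snap-ego-facebook", "snap", 4_000, 88_000),
    DatasetSummary("snap-email-enron", "snap", 37_000, 184_000),
    DatasetSummary("snap-ca-astroph", "snap", 19_000, 198_000),
    DatasetSummary("snap-web-google", "snap", 876_000, 5_100_000),
    DatasetSummary("snap-twitter-combined", "snap", 81_000, 1_800_000),
]


def smaller_than(datasets: Iterable[DatasetSummary], max_edges: int) -> List[DatasetSummary]:
    """Return datasets with fewer than max_edges edges, smallest first."""
    matches = [ds for ds in datasets if ds.edges < max_edges]
    return sorted(matches, key=lambda ds: ds.edges)


for ds in smaller_than(CATALOG, max_edges=200_000):
    print(f"{ds.name}: {ds.nodes:,} nodes, {ds.edges:,} edges")
```

Filtering on edge count rather than download size is a design choice for this sketch: edge count is the figure listed for every dataset in this guide.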
CLI Usage¶
GraphForge provides command-line tools for working with datasets:
# List all available datasets
graphforge list-datasets
# Show detailed information about a dataset
graphforge dataset-info snap-ego-facebook
# Load a dataset
graphforge load-dataset snap-ego-facebook
# List datasets by source
graphforge list-datasets --source snap
# Clear dataset cache
graphforge clear-cache
Dataset Caching¶
Datasets are automatically cached locally after the first download to improve load times:
- Cache location: ~/.graphforge/datasets/
- Cache behavior: Downloaded once, reused on subsequent loads
- Cache management: Use graphforge clear-cache to remove cached datasets
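The download-once behavior follows a common pattern: look for the file in the cache directory and fetch it only when it is missing. The sketch below illustrates that pattern under stated assumptions. cached_path, the fetch callback, and the one-file-per-dataset layout are hypothetical and do not describe GraphForge's internal implementation.

```python
import tempfile
from pathlib import Path
from typing import Callable, List


def cached_path(name: str, cache_dir: Path, fetch: Callable[[str], bytes]) -> Path:
    """Return the cached file for `name`, calling `fetch` only on a cache miss.

    Hypothetical helper; the file naming scheme is an assumption.
    """
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / f"{name}.txt"
    if not target.exists():
        # Cache miss: fetch once and store the bytes locally.
        target.write_bytes(fetch(name))
    return target


calls: List[str] = []


def fake_fetch(name: str) -> bytes:
    """Stand-in for a real download; records each call it receives."""
    calls.append(name)
    return b"0 1\n0 2\n"


# A temporary directory stands in for the real cache location.
with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "datasets"
    first = cached_path("snap-ego-facebook", cache, fake_fetch)
    second = cached_path("snap-ego-facebook", cache, fake_fetch)
    same_file = first == second
    print(f"fetch calls: {len(calls)}")
```

Deleting the cached file forces the next call to fetch again, which is the effect clearing the cache has in this pattern.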
Dataset Metadata¶
Each dataset includes comprehensive metadata:
from graphforge.datasets import get_dataset_info
info = get_dataset_info("snap-ego-facebook")
print(f"Name: {info.name}")
print(f"Description: {info.description}")
print(f"Source: {info.source}")
print(f"URL: {info.url}")
print(f"Nodes: {info.nodes:,}")
print(f"Edges: {info.edges:,}")
print(f"Labels: {', '.join(info.labels)}")
print(f"Relationship Types: {', '.join(info.relationship_types)}")
print(f"Size: {info.size_mb:.1f} MB")
print(f"License: {info.license}")
print(f"Category: {info.category}")
Jupyter Notebook Integration¶
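Datasets load in Jupyter notebooks through the same API. Because a cell displays the value of its last expression, the query result below appears without an explicit print: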
# In a Jupyter notebook cell
from graphforge import GraphForge
from graphforge.datasets import load_dataset
gf = GraphForge()
load_dataset(gf, "snap-ego-facebook")
# Explore the data
gf.execute("MATCH (n) RETURN count(n) AS node_count")
Contributing Datasets¶
To add a new dataset source or specific dataset:
- Create a loader in src/graphforge/datasets/
- Register the dataset in the registry
- Add tests in tests/integration/test_datasets.py
- Update documentation
See the development guide for details.
Troubleshooting¶
Download Failures¶
If a dataset download fails:
- Check your internet connection
- Try clearing the cache: graphforge clear-cache
- Manually download from the source URL
Memory Issues¶
Large datasets may require significant memory:
- Start with a smaller dataset (e.g., snap-ego-facebook) or, once LDBC support is available, a small scale factor (e.g., LDBC SF0.001)
- Use a machine with sufficient RAM
- Consider using SQLite persistence for large datasets
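One way to start small, independent of any GraphForge feature, is to experiment on a prefix of an edge list before loading the full graph. The sketch below streams pairs lazily and stops after a fixed number, so the rest of the input is never read. iter_edges, take_edges, and the limit parameter are hypothetical names used only for illustration.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple


def iter_edges(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Lazily yield (source, target) pairs, skipping blanks and '#' comments."""
    for line in lines:
        parts = line.split()
        if len(parts) >= 2 and not parts[0].startswith("#"):
            yield parts[0], parts[1]


def take_edges(lines: Iterable[str], limit: int) -> List[Tuple[str, str]]:
    """Return at most `limit` edges without consuming the rest of the input."""
    return list(islice(iter_edges(lines), limit))


def endless_lines() -> Iterator[str]:
    """Unbounded input, used to show that only a prefix is ever read."""
    n = 0
    while True:
        yield f"{n} {n + 1}"
        n += 1


subset = take_edges(endless_lines(), limit=3)
print(subset)
```

With a real file, an open file handle can be passed as lines; memory use then depends on limit rather than on the file size.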
Import Errors¶
If dataset import fails:
- Check the dataset format compatibility
- Verify the source URL is accessible
- Report issues on GitHub
Related Documentation¶
- Cypher Script Loading - Load .cypher and .cql script files
- SNAP Datasets - Available now
- LDBC Datasets - Planned
- NetworkRepository Datasets - Planned
- API Reference
License Information¶
Each dataset has its own license. Always check the license field in the dataset metadata before using a dataset in production or research.
Next Steps¶
- Explore SNAP for research and network analysis datasets
- Try LDBC for benchmarking (coming soon)
- Check NetworkRepository for diverse networks (coming soon)