NetworkRepository Datasets¶
GraphForge provides seamless integration with NetworkRepository, a comprehensive collection of network datasets widely used in network science research.
Overview¶
NetworkRepository offers 1,000+ network datasets covering diverse domains including social networks, biological networks, infrastructure graphs, collaboration networks, and more. GraphForge includes 10 carefully selected datasets that are:
- Small size (< 1 MB) - Fast downloads and quick loading
- Diverse categories - Social, biological, infrastructure, collaboration, communication
- Well-documented - Clear metadata and provenance
- Classic datasets - Widely cited in research literature
Available Datasets¶
All datasets below have been validated for the v0.3.0 release (100% success rate).
Social Networks¶
✅ Zachary's Karate Club (netrepo-karate)¶
Classic social network showing the fission of a university karate club into two groups.
- Nodes: 34
- Edges: 78
- Category: Social
- Use case: Community detection, network visualization
from graphforge import GraphForge
from graphforge.datasets import load_dataset
gf = GraphForge()
load_dataset(gf, "netrepo-karate")
# Find nodes with highest degree (most connections)
results = gf.execute("""
MATCH (n)-[r]-()
RETURN n, count(r) AS degree
ORDER BY degree DESC
LIMIT 5
""")
✅ Political Books Network (netrepo-polbooks)¶
Co-purchasing network of political books on Amazon, with books as nodes and co-purchases as edges.
- Nodes: 105
- Edges: 441
- Category: Social
- Use case: Community structure, political alignment clustering
✅ College Football Network (netrepo-football)¶
Network of Division IA college football games in Fall 2000. Nodes are teams, edges are games played.
- Nodes: 115
- Edges: 612
- Category: Social
- Use case: Community detection (conferences), network structure
✅ Les Miserables Network (netrepo-lesmis)¶
Character co-appearance network from Victor Hugo's Les Misérables. Characters appear as nodes, co-appearances in chapters as edges.
- Nodes: 77
- Edges: 254
- Category: Social
- Use case: Character importance analysis, story structure
Biological Networks¶
✅ Dolphin Social Network (netrepo-dolphins)¶
Social relationships between 62 dolphins in a community living off Doubtful Sound, New Zealand.
- Nodes: 62
- Edges: 159
- Category: Biological
- Use case: Animal social networks, community structure
# Find most connected dolphins
gf = GraphForge()
load_dataset(gf, "netrepo-dolphins")
results = gf.execute("""
MATCH (d)-[r]-()
RETURN d, count(DISTINCT r) AS connections
ORDER BY connections DESC
LIMIT 10
""")
✅ C. elegans Neural Network (netrepo-celegans)¶
Neural network of the nematode worm Caenorhabditis elegans. Neurons are nodes, synaptic connections are edges.
- Nodes: 453
- Edges: 2,025
- Category: Biological
- Use case: Neural network analysis, biological networks, connectomics
Collaboration Networks¶
✅ Network Science Coauthorship (netrepo-netscience)¶
Collaboration network of scientists working on network science. Nodes are authors, edges are coauthorships.
- Nodes: 1,589
- Edges: 2,742
- Category: Collaboration
- Use case: Scientific collaboration patterns, community detection
# Find most prolific collaborators
gf = GraphForge()
load_dataset(gf, "netrepo-netscience")
results = gf.execute("""
MATCH (author)-[:CONNECTED_TO]-(coauthor)
RETURN author, count(DISTINCT coauthor) AS collaborators
ORDER BY collaborators DESC
LIMIT 10
""")
✅ Jazz Musicians Collaboration (netrepo-jazz)¶
Collaboration network between jazz musicians. Nodes are musicians, edges are collaborations on albums.
- Nodes: 198
- Edges: 2,742
- Category: Collaboration
- Use case: Music collaboration networks, artist communities
Infrastructure Networks¶
✅ US Western Power Grid (netrepo-power)¶
Topology of the Western States Power Grid of the United States. Nodes are generators/transformers/substations, edges are transmission lines.
- Nodes: 4,941
- Edges: 6,594
- Category: Infrastructure
- Use case: Network resilience, infrastructure analysis, graph topology
Communication Networks¶
✅ European Email Network (netrepo-email-eu)¶
Email communication network from a large European research institution. Nodes are email addresses (anonymized), edges are email exchanges.
- Nodes: 32,430
- Edges: 54,397
- Category: Communication
- Use case: Communication patterns, information flow, organizational structure
Basic Usage¶
Loading a Dataset¶
from graphforge import GraphForge
from graphforge.datasets import load_dataset
# Create a GraphForge instance
gf = GraphForge()
# Load a NetworkRepository dataset
load_dataset(gf, "netrepo-karate")
# Query the data
results = gf.execute("MATCH (n) RETURN count(n) AS node_count")
print(f"Loaded {results[0]['node_count'].value} nodes")
Discovering Datasets¶
from graphforge.datasets import list_datasets, get_dataset_info
# List all NetworkRepository datasets
all_datasets = list_datasets()
netrepo_datasets = [d for d in all_datasets if d.startswith("netrepo-")]
print(f"Found {len(netrepo_datasets)} NetworkRepository datasets")
# Get metadata for a specific dataset
info = get_dataset_info("netrepo-karate")
print(f"Dataset: {info.name}")
print(f"Description: {info.description}")
print(f"Nodes: {info.nodes}, Edges: {info.edges}")
print(f"Category: {info.category}")
print(f"Size: {info.size_mb} MB")
Filtering by Category¶
from graphforge.datasets import list_datasets, get_dataset_info
# Get all biological network datasets
all_datasets = list_datasets()
biological = [
    d for d in all_datasets
    if d.startswith("netrepo-") and get_dataset_info(d).category == "biological"
]
for dataset_name in biological:
    info = get_dataset_info(dataset_name)
    print(f"- {info.name}: {info.description}")
Advanced Usage¶
Analyzing Network Structure¶
from graphforge import GraphForge
from graphforge.datasets import load_dataset
gf = GraphForge()
load_dataset(gf, "netrepo-karate")
# Calculate degree distribution
degree_dist = gf.execute("""
MATCH (n)
OPTIONAL MATCH (n)-[r]-()
WITH n, count(r) AS degree
RETURN degree, count(n) AS frequency
ORDER BY degree
""")
for row in degree_dist:
    print(f"Degree {row['degree'].value}: {row['frequency'].value} nodes")
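To show what the degree-distribution query computes, the same calculation can be sketched in plain Python. The five-node graph below is a made-up toy example, not one of the datasets, and the helper name `degree_distribution` is hypothetical rather than part of GraphForge.

```python
from collections import Counter

def degree_distribution(nodes, edges):
    """Map each degree to the number of nodes having that degree.

    `edges` is a list of undirected (u, v) pairs. Isolated nodes are kept
    with degree 0, which is what OPTIONAL MATCH achieves in the query.
    """
    degree = {n: 0 for n in nodes}
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    # Count how many nodes share each degree, sorted by degree.
    return dict(sorted(Counter(degree.values()).items()))

# Toy graph (illustrative only): triangle a-b-c, pendant node d, isolated node e.
toy_nodes = ["a", "b", "c", "d", "e"]
toy_edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]

dist = degree_distribution(toy_nodes, toy_edges)
for deg, freq in dist.items():
    print(f"Degree {deg}: {freq} nodes")
```

Without the zero-initialised dictionary, the isolated node would be missing from the result, just as it would be with a plain `MATCH` instead of `OPTIONAL MATCH`.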
Finding Communities¶
# Find triangles in the network (3-node cliques)
triangles = gf.execute("""
MATCH (a)-[:CONNECTED_TO]-(b)-[:CONNECTED_TO]-(c)-[:CONNECTED_TO]-(a)
WHERE id(a) < id(b) AND id(b) < id(c)
RETURN count(*) AS triangle_count
""")
print(f"Found {triangles[0]['triangle_count'].value} triangles")
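The ordering constraint in the `WHERE` clause is what keeps each triangle from being counted six times (once per permutation of its three nodes). A minimal plain-Python sketch of the same idea follows; the edge list is a made-up toy graph and `count_triangles` is a hypothetical helper, not a GraphForge function.

```python
from itertools import combinations

def count_triangles(edges):
    """Count 3-node cliques in an undirected graph given as (u, v) pairs."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    count = 0
    # combinations() over the sorted node list yields every triple a < b < c
    # exactly once, playing the role of id(a) < id(b) AND id(b) < id(c).
    for a, b, c in combinations(sorted(neighbors), 3):
        if b in neighbors[a] and c in neighbors[b] and a in neighbors[c]:
            count += 1
    return count

# Toy graph (illustrative only): two triangles sharing node 3.
toy_edges = [(1, 2), (2, 3), (1, 3), (3, 4), (4, 5), (3, 5)]
triangle_count = count_triangles(toy_edges)
print(f"Found {triangle_count} triangles")
```

This brute-force version checks every node triple, so it is only suitable for small graphs like the ones on this page.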
Comparing Datasets¶
from graphforge import GraphForge
from graphforge.datasets import load_dataset
def get_network_stats(dataset_name):
    gf = GraphForge()
    load_dataset(gf, dataset_name)
    # Count nodes and edges
    nodes = gf.execute("MATCH (n) RETURN count(n) AS count")[0]["count"].value
    edges = gf.execute("MATCH ()-[r]->() RETURN count(r) AS count")[0]["count"].value
    # Calculate average degree
    avg_degree = (2 * edges) / nodes if nodes > 0 else 0
    return {
        "dataset": dataset_name,
        "nodes": nodes,
        "edges": edges,
        "avg_degree": round(avg_degree, 2),
    }
# Compare several datasets
datasets = ["netrepo-karate", "netrepo-dolphins", "netrepo-football"]
for ds in datasets:
    stats = get_network_stats(ds)
    print(f"{stats['dataset']}: {stats['nodes']} nodes, {stats['edges']} edges, "
          f"avg degree {stats['avg_degree']}")
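The average degree used here is 2E/N, because every undirected edge contributes to the degree of both of its endpoints. As a quick arithmetic check that needs no download, the sketch below applies the formula to the node and edge counts listed on this page; `average_degree` is a hypothetical helper name.

```python
# Node and edge counts as listed on this page: name -> (nodes, edges).
listed = {
    "netrepo-karate": (34, 78),
    "netrepo-dolphins": (62, 159),
    "netrepo-football": (115, 612),
}

def average_degree(nodes, edges):
    """Mean degree of an undirected graph: each edge touches two endpoints."""
    return (2 * edges) / nodes if nodes > 0 else 0.0

expected = {name: round(average_degree(n, e), 2) for name, (n, e) in listed.items()}
for name, value in expected.items():
    print(f"{name}: avg degree {value}")
```

If the values reported by a live run differ from these, the likely cause is that edges are stored or counted differently from the listed totals (for example, once per direction).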
Data Format¶
NetworkRepository datasets are provided in GraphML format, an XML-based standard for graph data. GraphForge's GraphML loader automatically:
- Parses node and edge attributes with type preservation
- Converts GraphML types (int, double, boolean, string) to Cypher types
- Handles compressed .graphml.gz files
- Supports multi-label nodes
- Preserves all property metadata
Example GraphML structure:
<graphml>
<key id="d0" for="node" attr.name="label" attr.type="string"/>
<key id="d1" for="node" attr.name="name" attr.type="string"/>
<graph edgedefault="directed">
<node id="n0">
<data key="d0">Node</data>
<data key="d1">Alice</data>
</node>
<node id="n1">
<data key="d0">Node</data>
<data key="d1">Bob</data>
</node>
<edge source="n0" target="n1"/>
</graph>
</graphml>
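To make the structure concrete, here is a minimal sketch of reading such a document with Python's standard library. It is not GraphForge's loader: the `CASTS` mapping and the `parse_graphml` helper are assumptions for illustration, edge attributes are ignored, and `<key>` declarations are assumed to appear before the nodes that use them.

```python
import xml.etree.ElementTree as ET

# Assumed mapping from GraphML attr.type values to Python types (sketch only).
CASTS = {
    "int": int,
    "long": int,
    "float": float,
    "double": float,
    "boolean": lambda s: s.strip().lower() == "true",
    "string": str,
}

SAMPLE = """<graphml>
  <key id="d0" for="node" attr.name="label" attr.type="string"/>
  <key id="d1" for="node" attr.name="name" attr.type="string"/>
  <graph edgedefault="directed">
    <node id="n0"><data key="d0">Node</data><data key="d1">Alice</data></node>
    <node id="n1"><data key="d0">Node</data><data key="d1">Bob</data></node>
    <edge source="n0" target="n1"/>
  </graph>
</graphml>"""

def local(tag):
    """Strip an XML namespace prefix such as '{http://...}node'."""
    return tag.rsplit("}", 1)[-1]

def parse_graphml(text):
    """Return (nodes, edges): node id -> typed properties, and (source, target) pairs."""
    root = ET.fromstring(text)
    keys, nodes, edges = {}, {}, []
    for el in root.iter():  # document order, so keys are seen before nodes here
        name = local(el.tag)
        if name == "key":
            keys[el.get("id")] = (el.get("attr.name"), CASTS.get(el.get("attr.type"), str))
        elif name == "node":
            props = {}
            for data in el:
                if local(data.tag) == "data" and data.get("key") in keys:
                    attr, cast = keys[data.get("key")]
                    props[attr] = cast(data.text or "")
            nodes[el.get("id")] = props
        elif name == "edge":
            edges.append((el.get("source"), el.get("target")))
    return nodes, edges

nodes, edges = parse_graphml(SAMPLE)
print(nodes)
print(edges)
```

For a compressed file, the text could be obtained with `gzip.open(path, "rt", encoding="utf-8").read()` before being passed to the same function.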
Caching¶
Downloaded datasets are automatically cached in ~/.cache/graphforge/datasets/ to avoid redundant downloads:
from graphforge import GraphForge
from graphforge.datasets import load_dataset
gf = GraphForge()
# First load downloads the file
load_dataset(gf, "netrepo-karate") # Downloads ~10 KB
# Second load uses cached file (instant)
load_dataset(gf, "netrepo-karate") # Loads from cache
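The download-once behaviour can be illustrated with a small standard-library sketch. This is a generic cache pattern under stated assumptions, not GraphForge's implementation: `cached_fetch`, the `download` callback, and the `<name>.graphml` file naming are all hypothetical.

```python
import tempfile
from pathlib import Path

def cached_fetch(name, cache_dir, download):
    """Return the cached file for `name`, calling `download(name)` only on a miss.

    `download` is a caller-supplied function returning bytes; it stands in
    for whatever network fetch a real loader would perform.
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / f"{name}.graphml"  # file naming is an assumption
    if not target.exists():
        target.write_bytes(download(name))
    return target

# Demonstration with a fake downloader and a temporary cache directory.
calls = []

def fake_download(name):
    calls.append(name)
    return b"<graphml/>"

with tempfile.TemporaryDirectory() as tmp:
    first = cached_fetch("netrepo-karate", tmp, fake_download)   # cache miss
    second = cached_fetch("netrepo-karate", tmp, fake_download)  # cache hit
    same_path = first == second
    cached_bytes = second.read_bytes()

print(f"downloads performed: {len(calls)}")
```

Deleting the cached file forces the next call to download again, which is the usual way to refresh a stale copy in this kind of scheme.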
Performance¶
All NetworkRepository datasets in GraphForge are optimized for quick loading. All datasets were validated for the v0.3.0 release with a 100% success rate.
| Dataset | Nodes | Edges | First Load | Cached Load | Validation |
|---|---|---|---|---|---|
| karate | 34 | 78 | 0.00s | 0.00s | ✅ Passed |
| dolphins | 62 | 159 | 0.00s | 0.00s | ✅ Passed |
| polbooks | 105 | 441 | 0.00s | 0.00s | ✅ Passed |
| football | 115 | 612 | 0.00s | 0.00s | ✅ Passed |
| lesmis | 77 | 254 | 0.00s | 0.00s | ✅ Passed |
| celegans | 453 | 2,025 | 0.01s | 0.00s | ✅ Passed |
| netscience | 1,589 | 2,742 | 0.02s | 0.01s | ✅ Passed |
| jazz | 198 | 2,742 | 0.01s | 0.01s | ✅ Passed |
| power | 4,941 | 6,594 | 0.09s | 0.08s | ✅ Passed |
| email-eu | 32,430 | 54,397 | 0.18s | 0.17s | ✅ Passed |
Validation includes:
- ✅ Download and caching functionality
- ✅ Node and edge count accuracy
- ✅ Sample query execution (MATCH, aggregations)
- ✅ Performance benchmarks
Load times were measured on a modern laptop; actual times will vary with hardware and network speed.
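This page does not state how the timings were taken. One plausible way to collect comparable numbers on your own machine is a best-of-N wall-clock measurement, sketched below; `time_call` is a hypothetical helper and the workload is a stand-in so the example runs without GraphForge installed.

```python
import time

def time_call(fn, *args, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` calls."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best

# Stand-in workload that builds a ring graph as an adjacency dict. With
# GraphForge installed, the timed callable could instead wrap load_dataset.
def build_toy_graph(n):
    return {i: {(i + 1) % n} for i in range(n)}

elapsed = time_call(build_toy_graph, 10_000)
print(f"best of 3: {elapsed:.4f}s")
```

Taking the minimum over several runs reduces noise from other processes. To measure a true first load, the cache would need to be cleared before each run.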
Persistence¶
Datasets can be loaded into persistent GraphForge instances:
from graphforge import GraphForge
from graphforge.datasets import load_dataset
# Load into persistent database
gf = GraphForge("karate.db")
load_dataset(gf, "netrepo-karate")
gf.close()
# Later, reopen and query (data persisted)
gf2 = GraphForge("karate.db")
results = gf2.execute("MATCH (n) RETURN count(n) AS count")
print(f"Loaded {results[0]['count'].value} nodes from persistent storage")
gf2.close()
Citation¶
If you use NetworkRepository datasets in your research, please cite:
Ryan A. Rossi and Nesreen K. Ahmed. (2015). The Network Data Repository with Interactive Graph Analytics and Visualization. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence. http://networkrepository.com/
Individual datasets may have their own citations. Check the NetworkRepository website for dataset-specific attribution requirements.