NetworkRepository Datasets

GraphForge provides seamless integration with NetworkRepository, a comprehensive collection of network datasets widely used in network science research.

Overview

NetworkRepository offers 1,000+ network datasets covering diverse domains including social networks, biological networks, infrastructure graphs, collaboration networks, and more. GraphForge includes 10 carefully selected datasets that are:

  • Small size (< 1 MB) - Fast downloads and quick loading
  • Diverse categories - Social, biological, infrastructure, collaboration, communication
  • Well-documented - Clear metadata and provenance
  • Classic datasets - Widely cited in research literature

Available Datasets

All datasets below have been validated for the v0.3.0 release (100% success rate).

Social Networks

✅ Zachary's Karate Club (netrepo-karate)

Classic social network showing the fission of a university karate club into two groups.

  • Nodes: 34
  • Edges: 78
  • Category: Social
  • Use case: Community detection, network visualization
from graphforge import GraphForge
from graphforge.datasets import load_dataset

gf = GraphForge()
load_dataset(gf, "netrepo-karate")

# Find nodes with highest degree (most connections)
results = gf.execute("""
    MATCH (n)-[r]-()
    RETURN n, count(r) AS degree
    ORDER BY degree DESC
    LIMIT 5
""")

✅ Political Books Network (netrepo-polbooks)

Co-purchasing network of political books on Amazon, with books as nodes and co-purchases as edges.

  • Nodes: 105
  • Edges: 441
  • Category: Social
  • Use case: Community structure, political alignment clustering

✅ College Football Network (netrepo-football)

Network of Division I-A college football games in Fall 2000. Nodes are teams, edges are games played.

  • Nodes: 115
  • Edges: 612
  • Category: Social
  • Use case: Community detection (conferences), network structure

✅ Les Miserables Network (netrepo-lesmis)

Character co-appearance network from Victor Hugo's Les Misérables. Characters appear as nodes, co-appearances in chapters as edges.

  • Nodes: 77
  • Edges: 254
  • Category: Social
  • Use case: Character importance analysis, story structure

Biological Networks

✅ Dolphin Social Network (netrepo-dolphins)

Social relationships between 62 dolphins in a community living off Doubtful Sound, New Zealand.

  • Nodes: 62
  • Edges: 159
  • Category: Biological
  • Use case: Animal social networks, community structure
# Find most connected dolphins
gf = GraphForge()
load_dataset(gf, "netrepo-dolphins")

results = gf.execute("""
    MATCH (d)-[r]-()
    RETURN d, count(DISTINCT r) AS connections
    ORDER BY connections DESC
    LIMIT 10
""")

✅ C. elegans Neural Network (netrepo-celegans)

Neural network of the nematode worm Caenorhabditis elegans. Neurons are nodes, synaptic connections are edges.

  • Nodes: 453
  • Edges: 2,025
  • Category: Biological
  • Use case: Neural network analysis, biological networks, connectomics

Collaboration Networks

✅ Network Science Coauthorship (netrepo-netscience)

Collaboration network of scientists working on network science. Nodes are authors, edges are coauthorships.

  • Nodes: 1,589
  • Edges: 2,742
  • Category: Collaboration
  • Use case: Scientific collaboration patterns, community detection
# Find most prolific collaborators
gf = GraphForge()
load_dataset(gf, "netrepo-netscience")

results = gf.execute("""
    MATCH (author)-[:CONNECTED_TO]-(coauthor)
    RETURN author, count(DISTINCT coauthor) AS collaborators
    ORDER BY collaborators DESC
    LIMIT 10
""")

✅ Jazz Musicians Collaboration (netrepo-jazz)

Collaboration network between jazz musicians. Nodes are musicians, edges are collaborations on albums.

  • Nodes: 198
  • Edges: 2,742
  • Category: Collaboration
  • Use case: Music collaboration networks, artist communities

Infrastructure Networks

✅ US Western Power Grid (netrepo-power)

Topology of the Western States Power Grid of the United States. Nodes are generators/transformers/substations, edges are transmission lines.

  • Nodes: 4,941
  • Edges: 6,594
  • Category: Infrastructure
  • Use case: Network resilience, infrastructure analysis, graph topology

Communication Networks

✅ European Email Network (netrepo-email-eu)

Email communication network from a large European research institution. Nodes are email addresses (anonymized), edges are email exchanges.

  • Nodes: 32,430
  • Edges: 54,397
  • Category: Communication
  • Use case: Communication patterns, information flow, organizational structure

Basic Usage

Loading a Dataset

from graphforge import GraphForge
from graphforge.datasets import load_dataset

# Create a GraphForge instance
gf = GraphForge()

# Load a NetworkRepository dataset
load_dataset(gf, "netrepo-karate")

# Query the data
results = gf.execute("MATCH (n) RETURN count(n) AS node_count")
print(f"Loaded {results[0]['node_count'].value} nodes")

Discovering Datasets

from graphforge.datasets import list_datasets, get_dataset_info

# List all NetworkRepository datasets
all_datasets = list_datasets()
netrepo_datasets = [d for d in all_datasets if d.startswith("netrepo-")]
print(f"Found {len(netrepo_datasets)} NetworkRepository datasets")

# Get metadata for a specific dataset
info = get_dataset_info("netrepo-karate")
print(f"Dataset: {info.name}")
print(f"Description: {info.description}")
print(f"Nodes: {info.nodes}, Edges: {info.edges}")
print(f"Category: {info.category}")
print(f"Size: {info.size_mb} MB")

Filtering by Category

from graphforge.datasets import list_datasets, get_dataset_info

# Get all biological network datasets
all_datasets = list_datasets()
biological = [
    d for d in all_datasets
    if d.startswith("netrepo-") and get_dataset_info(d).category == "biological"
]

for dataset_name in biological:
    info = get_dataset_info(dataset_name)
    print(f"- {info.name}: {info.description}")
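
The filter above compares against the lowercase string "biological", while the dataset listing on this page shows capitalized categories such as "Biological". Which casing `get_dataset_info` returns is not stated here, so a case-insensitive comparison may be the safer choice. The sketch below illustrates the idea offline with a plain dictionary copied from the listing on this page; the `filter_by_category` helper and the `DOCUMENTED_CATEGORIES` name are hypothetical and not part of GraphForge.

```python
# Categories as documented on this page (not fetched from GraphForge).
DOCUMENTED_CATEGORIES = {
    "netrepo-karate": "Social",
    "netrepo-polbooks": "Social",
    "netrepo-football": "Social",
    "netrepo-lesmis": "Social",
    "netrepo-dolphins": "Biological",
    "netrepo-celegans": "Biological",
    "netrepo-netscience": "Collaboration",
    "netrepo-jazz": "Collaboration",
    "netrepo-power": "Infrastructure",
    "netrepo-email-eu": "Communication",
}


def filter_by_category(categories, wanted):
    """Return dataset names whose category matches `wanted`, ignoring case."""
    wanted = wanted.casefold()
    return sorted(
        name for name, category in categories.items()
        if category.casefold() == wanted
    )


print(filter_by_category(DOCUMENTED_CATEGORIES, "biological"))
# ['netrepo-celegans', 'netrepo-dolphins']
```

The same `casefold()` comparison could be applied to `get_dataset_info(d).category` in the list comprehension above.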

Advanced Usage

Analyzing Network Structure

from graphforge import GraphForge
from graphforge.datasets import load_dataset

gf = GraphForge()
load_dataset(gf, "netrepo-karate")

# Calculate degree distribution
degree_dist = gf.execute("""
    MATCH (n)
    OPTIONAL MATCH (n)-[r]-()
    WITH n, count(r) AS degree
    RETURN degree, count(n) AS frequency
    ORDER BY degree
""")

for row in degree_dist:
    print(f"Degree {row['degree'].value}: {row['frequency'].value} nodes")
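
To sanity-check what the query computes, the same degree distribution can be reproduced in plain Python on a small edge list. This is a minimal sketch with a hypothetical `degree_distribution` helper and toy data; it does not use GraphForge. Passing the node list separately plays the role of `OPTIONAL MATCH`, which keeps isolated nodes in the result with degree 0.

```python
from collections import Counter


def degree_distribution(edges, nodes=None):
    """Return {degree: number_of_nodes} for an undirected edge list."""
    degree = Counter()
    # Register every known node so isolated nodes show up with degree 0.
    for node in nodes or ():
        degree[node] += 0
    # Each undirected edge adds one to the degree of both endpoints.
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    # Count how many nodes share each degree value, sorted by degree.
    return dict(sorted(Counter(degree.values()).items()))


# Toy graph: a triangle (1, 2, 3), a pendant node 4, and an isolated node 5.
toy_edges = [(1, 2), (2, 3), (1, 3), (3, 4)]
toy_nodes = [1, 2, 3, 4, 5]
print(degree_distribution(toy_edges, toy_nodes))
# {0: 1, 1: 1, 2: 2, 3: 1}
```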

Finding Communities

# Continues from the previous example: gf already holds netrepo-karate
# Find triangles in the network (3-node cliques)
triangles = gf.execute("""
    MATCH (a)-[:CONNECTED_TO]-(b)-[:CONNECTED_TO]-(c)-[:CONNECTED_TO]-(a)
    WHERE id(a) < id(b) AND id(b) < id(c)
    RETURN count(*) AS triangle_count
""")

print(f"Found {triangles[0]['triangle_count'].value} triangles")
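
The `id(a) < id(b) AND id(b) < id(c)` condition matters because an undirected triangle otherwise matches six times, once per ordering of its three nodes. The plain-Python sketch below applies the same ordering trick to a toy edge list; `count_triangles` is a hypothetical helper written for illustration, not a GraphForge function.

```python
from itertools import combinations


def count_triangles(edges):
    """Count triangles in an undirected graph, each exactly once."""
    adjacency = {}
    for u, v in edges:
        if u == v:
            continue  # self-loops cannot be part of a triangle
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)

    triangles = 0
    for a in adjacency:
        # Only look at neighbors larger than a, so that a < b < c holds.
        higher = sorted(n for n in adjacency[a] if n > a)
        for b, c in combinations(higher, 2):
            if c in adjacency[b]:
                triangles += 1
    return triangles


# Toy graph with two triangles: (1, 2, 3) and (2, 3, 4).
toy_edges = [(1, 2), (2, 3), (1, 3), (3, 4), (2, 4)]
print(count_triangles(toy_edges))  # 2
```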

Comparing Datasets

from graphforge import GraphForge
from graphforge.datasets import load_dataset

def get_network_stats(dataset_name):
    gf = GraphForge()
    load_dataset(gf, dataset_name)

    # Count nodes and edges
    nodes = gf.execute("MATCH (n) RETURN count(n) AS count")[0]["count"].value
    edges = gf.execute("MATCH ()-[r]->() RETURN count(r) AS count")[0]["count"].value

    # Average degree: every edge adds to the degree of two nodes, hence 2 * edges
    avg_degree = (2 * edges) / nodes if nodes > 0 else 0

    return {
        "dataset": dataset_name,
        "nodes": nodes,
        "edges": edges,
        "avg_degree": round(avg_degree, 2),
    }

# Compare several datasets
datasets = ["netrepo-karate", "netrepo-dolphins", "netrepo-football"]
for ds in datasets:
    stats = get_network_stats(ds)
    print(f"{stats['dataset']}: {stats['nodes']} nodes, {stats['edges']} edges, "
          f"avg degree {stats['avg_degree']}")
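
The average-degree formula can be checked offline against the node and edge counts documented on this page. The sketch below uses a hypothetical `average_degree` helper and assumes each dataset is an undirected graph whose edges are stored once, which is what the `2 * edges` factor implies.

```python
def average_degree(nodes, edges):
    """Average degree of an undirected graph: 2 * |E| / |V|."""
    return (2 * edges) / nodes if nodes > 0 else 0.0


# (nodes, edges) as listed in the dataset descriptions on this page.
documented_sizes = {
    "netrepo-karate": (34, 78),
    "netrepo-dolphins": (62, 159),
    "netrepo-football": (115, 612),
}

for name, (n, m) in documented_sizes.items():
    print(f"{name}: avg degree {average_degree(n, m):.2f}")
# netrepo-karate: avg degree 4.59
# netrepo-dolphins: avg degree 5.13
# netrepo-football: avg degree 10.64
```

If `get_network_stats` reports noticeably different values, the loader may be storing each edge in both directions, in which case the factor of 2 would double-count.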

Data Format

NetworkRepository datasets are provided in GraphML format, an XML-based standard for graph data. GraphForge's GraphML loader automatically:

  • Parses node and edge attributes with type preservation
  • Converts GraphML types (int, double, boolean, string) to Cypher types
  • Handles compressed .graphml.gz files
  • Supports multi-label nodes
  • Preserves all property metadata

Example GraphML structure:

<graphml>
  <key id="d0" for="node" attr.name="label" attr.type="string"/>
  <key id="d1" for="node" attr.name="name" attr.type="string"/>

  <graph edgedefault="directed">
    <node id="n0">
      <data key="d0">Node</data>
      <data key="d1">Alice</data>
    </node>
    <node id="n1">
      <data key="d0">Node</data>
      <data key="d1">Bob</data>
    </node>
    <edge source="n0" target="n1"/>
  </graph>
</graphml>
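
To make the key-to-attribute mapping and type conversion concrete, here is a minimal sketch that parses a structure like the one above with Python's standard `xml.etree.ElementTree`. It is not GraphForge's loader. The `parse_graphml` function, the `CONVERTERS` table, and the extra `age` key (added only to show an `int` conversion) are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Same shape as the example above, plus a hypothetical int-typed "age" key.
GRAPHML = """<graphml>
  <key id="d0" for="node" attr.name="label" attr.type="string"/>
  <key id="d1" for="node" attr.name="name" attr.type="string"/>
  <key id="d2" for="node" attr.name="age" attr.type="int"/>
  <graph edgedefault="directed">
    <node id="n0">
      <data key="d0">Node</data>
      <data key="d1">Alice</data>
      <data key="d2">30</data>
    </node>
    <node id="n1">
      <data key="d0">Node</data>
      <data key="d1">Bob</data>
      <data key="d2">25</data>
    </node>
    <edge source="n0" target="n1"/>
  </graph>
</graphml>"""

# GraphML attr.type values mapped to Python constructors.
CONVERTERS = {
    "int": int,
    "long": int,
    "float": float,
    "double": float,
    "boolean": lambda text: text.strip().lower() == "true",
    "string": str,
}


def parse_graphml(text):
    """Return (nodes, edges): typed node properties and (source, target) pairs."""
    root = ET.fromstring(text)
    # Map each key id to its attribute name and converter (default: str).
    keys = {
        key.get("id"): (key.get("attr.name"), CONVERTERS.get(key.get("attr.type"), str))
        for key in root.iter("key")
    }
    nodes = {}
    for node in root.iter("node"):
        props = {}
        for data in node.findall("data"):
            name, convert = keys[data.get("key")]
            props[name] = convert(data.text or "")
        nodes[node.get("id")] = props
    edges = [(edge.get("source"), edge.get("target")) for edge in root.iter("edge")]
    return nodes, edges


nodes, edges = parse_graphml(GRAPHML)
print(nodes["n0"])  # {'label': 'Node', 'name': 'Alice', 'age': 30}
print(edges)        # [('n0', 'n1')]
```

Real GraphML files usually declare the namespace `http://graphml.graphdrawing.org/xmlns` on the root element; with ElementTree, tag names must then be qualified with that namespace, which this sketch omits for brevity.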

Caching

Downloaded datasets are automatically cached in ~/.cache/graphforge/datasets/ to avoid redundant downloads:

from graphforge import GraphForge
from graphforge.datasets import load_dataset

gf = GraphForge()

# First load downloads the file
load_dataset(gf, "netrepo-karate")  # Downloads ~10 KB

# Second load uses cached file (instant)
load_dataset(gf, "netrepo-karate")  # Loads from cache
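
The download-once behavior described above follows a common check-then-fetch pattern. The sketch below shows that pattern in isolation, using a temporary directory and a fake downloader so it runs offline. The `fetch_with_cache` function, the `.graphml` file naming, and the `fake_download` stub are hypothetical; GraphForge's actual cache layout and internals may differ.

```python
import tempfile
from pathlib import Path


def fetch_with_cache(name, cache_dir, download):
    """Return (path, cache_hit). Calls download(name) only on a cache miss."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / f"{name}.graphml"  # hypothetical naming scheme
    if target.exists():
        return target, True
    target.write_bytes(download(name))
    return target, False


calls = []


def fake_download(name):
    """Stand-in for a network request; records each call."""
    calls.append(name)
    return b"<graphml/>"


with tempfile.TemporaryDirectory() as tmp:
    _, hit_first = fetch_with_cache("netrepo-karate", tmp, fake_download)
    _, hit_second = fetch_with_cache("netrepo-karate", tmp, fake_download)

print(hit_first, hit_second, len(calls))  # False True 1
```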

Performance

All NetworkRepository datasets in GraphForge are small enough to load quickly. Every dataset was validated for the v0.3.0 release with a 100% success rate.

| Dataset | Nodes | Edges | First Load | Cached Load | Validation |
|---|---|---|---|---|---|
| karate | 34 | 78 | 0.00s | 0.00s | ✅ Passed |
| dolphins | 62 | 159 | 0.00s | 0.00s | ✅ Passed |
| polbooks | 105 | 441 | 0.00s | 0.00s | ✅ Passed |
| football | 115 | 612 | 0.00s | 0.00s | ✅ Passed |
| lesmis | 77 | 254 | 0.00s | 0.00s | ✅ Passed |
| celegans | 453 | 2,025 | 0.01s | 0.00s | ✅ Passed |
| netscience | 1,589 | 2,742 | 0.02s | 0.01s | ✅ Passed |
| jazz | 198 | 2,742 | 0.01s | 0.01s | ✅ Passed |
| power | 4,941 | 6,594 | 0.09s | 0.08s | ✅ Passed |
| email-eu | 32,430 | 54,397 | 0.18s | 0.17s | ✅ Passed |

Validation includes:

  • ✅ Download and caching functionality
  • ✅ Node and edge count accuracy
  • ✅ Sample query execution (MATCH, aggregations)
  • ✅ Performance benchmarks

Load times were measured on a modern laptop; actual times may vary.

Persistence

Datasets can be loaded into persistent GraphForge instances:

from graphforge import GraphForge
from graphforge.datasets import load_dataset

# Load into persistent database
gf = GraphForge("karate.db")
load_dataset(gf, "netrepo-karate")
gf.close()

# Later, reopen and query (data persisted)
gf2 = GraphForge("karate.db")
results = gf2.execute("MATCH (n) RETURN count(n) AS count")
print(f"Loaded {results[0]['count'].value} nodes from persistent storage")
gf2.close()

Citation

If you use NetworkRepository datasets in your research, please cite:

Ryan A. Rossi and Nesreen K. Ahmed. (2015). The Network Data Repository with Interactive Graph Analytics and Visualization. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence. http://networkrepository.com/

Individual datasets may have their own citations. Check the NetworkRepository website for dataset-specific attribution requirements.

See Also