Skip to content

LDBC Datasets

Status: Planned for v0.3.0 (Issue #51)

Support for LDBC (Linked Data Benchmark Council) benchmark datasets.

Overview

The LDBC Social Network Benchmark (SNB) is a comprehensive benchmark for graph databases, featuring realistic social network data with complex queries.

Dataset Source

  • URL: https://ldbcouncil.org/resources/datasets/
  • License: Varies by dataset
  • Category: Benchmark, Social Network

Available Datasets

LDBC SNB (Social Network Benchmark)

Realistic social network data with various scale factors:

Scale Factor Nodes Edges Size Description
SF0.001 ~3K ~17K ~2 MB Tiny (testing)
SF0.003 ~9K ~52K ~5 MB Small (development)
SF0.01 ~30K ~177K ~17 MB Medium (testing)
SF0.1 ~328K ~1.8M ~180 MB Standard (benchmarking)
SF1 ~3.2M ~18M ~1.8 GB Large (performance testing)

Schema

The LDBC SNB schema includes:

Node Labels: - Person - Social network users - Post - User posts - Comment - Comments on posts - Forum - Discussion forums - Organisation - Companies and universities - Place - Cities and countries - Tag - Content tags - TagClass - Tag categories

Relationship Types: - KNOWS - Person to person connections - LIKES - Likes on posts/comments - HAS_CREATOR - Content authorship - REPLY_OF - Comment replies - HAS_TAG - Content tagging - IS_LOCATED_IN - Location relationships - And more...

Usage

from graphforge import GraphForge
from graphforge.datasets import load_dataset

# Load a small LDBC dataset
gf = GraphForge()
load_dataset(gf, "ldbc-snb-sf001")

# Explore the social network
results = gf.execute("""
    MATCH (p:Person)-[:KNOWS]->(friend:Person)
    RETURN p.firstName, p.lastName, count(friend) AS friendCount
    ORDER BY friendCount DESC
    LIMIT 10
""")

Example Queries

MATCH (post:Post)<-[:LIKES]-(liker:Person)
RETURN post.content, count(liker) AS likes
ORDER BY likes DESC
LIMIT 10

Community Detection

MATCH (p:Person)-[:KNOWS*2]-(friend:Person)
WHERE p <> friend
RETURN p.firstName, p.lastName, count(DISTINCT friend) AS secondDegreeConnections
ORDER BY secondDegreeConnections DESC
LIMIT 20

Content Analysis

MATCH (tag:Tag)<-[:HAS_TAG]-(post:Post)
RETURN tag.name, count(post) AS postCount
ORDER BY postCount DESC
LIMIT 15

Benchmarking

LDBC SNB is designed for benchmarking with standardized queries:

from graphforge import GraphForge
from graphforge.datasets import load_dataset
import time

gf = GraphForge()
load_dataset(gf, "ldbc-snb-sf01")

# Benchmark query execution
start = time.time()
results = gf.execute("""
    MATCH (person:Person)-[:KNOWS*1..2]-(friend:Person)
    WHERE person.id = 123
    RETURN DISTINCT friend.firstName, friend.lastName
""")
elapsed = time.time() - start

print(f"Query executed in {elapsed:.3f} seconds")
print(f"Results: {len(results)} friends found")

CLI Usage

# List available LDBC datasets
graphforge list-datasets --source ldbc

# Load a specific scale factor
graphforge load-dataset ldbc-snb-sf001

# Show dataset information
graphforge dataset-info ldbc-snb-sf01

Performance Notes

  • SF0.001-0.01: Suitable for development and testing
  • SF0.1: Standard benchmarking scale
  • SF1+: Requires significant RAM (8GB+ recommended)
  • Use SQLite persistence for large scale factors

References