LDBC Datasets

Status: Planned for v0.3.0 (Issue #51)

Support for LDBC (Linked Data Benchmark Council) benchmark datasets.

Overview

The LDBC Social Network Benchmark (SNB) is a comprehensive benchmark for graph databases, featuring realistic social network data with complex queries.

Dataset Source

URL: https://ldbcouncil.org/resources/datasets/
License: Varies by dataset
Category: Benchmark, Social Network

Available Datasets

Realistic social network data with various scale factors:

Scale Factor	Nodes	Edges	Size	Description
SF0.001	~3K	~17K	~2 MB	Tiny (testing)
SF0.003	~9K	~52K	~5 MB	Small (development)
SF0.01	~30K	~177K	~17 MB	Medium (testing)
SF0.1	~328K	~1.8M	~180 MB	Standard (benchmarking)
SF1	~3.2M	~18M	~1.8 GB	Large (performance testing)
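
The table above can be turned into a small lookup for choosing a scale factor that fits a size budget. The sketch below is illustrative only: `SCALE_FACTORS`, `pick_scale_factor`, and `edges_per_node` are hypothetical names, not part of GraphForge, and the figures are the approximate values from the table.

```python
# Approximate figures copied from the table above: (nodes, edges, size in MB).
# These names are hypothetical helpers, not a GraphForge API.
SCALE_FACTORS = {
    "SF0.001": (3_000, 17_000, 2),
    "SF0.003": (9_000, 52_000, 5),
    "SF0.01": (30_000, 177_000, 17),
    "SF0.1": (328_000, 1_800_000, 180),
    "SF1": (3_200_000, 18_000_000, 1_800),
}


def pick_scale_factor(max_size_mb):
    """Return the largest scale factor whose approximate size fits the budget.

    Returns None if even the smallest dataset exceeds the budget.
    """
    fitting = [
        (size_mb, name)
        for name, (_nodes, _edges, size_mb) in SCALE_FACTORS.items()
        if size_mb <= max_size_mb
    ]
    return max(fitting)[1] if fitting else None


def edges_per_node(name):
    """Average number of edges per node for a scale factor."""
    nodes, edges, _size_mb = SCALE_FACTORS[name]
    return edges / nodes


print(pick_scale_factor(50))
print(round(edges_per_node("SF0.1"), 1))
```

The edge-to-node ratio stays between roughly 5 and 6 across all listed scale factors, so query cost should grow mainly with dataset size rather than with changing density.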

Schema

The LDBC SNB schema includes:

Node Labels:
- Person - Social network users
- Post - User posts
- Comment - Comments on posts
- Forum - Discussion forums
- Organisation - Companies and universities
- Place - Cities, countries, and continents
- Tag - Content tags
- TagClass - Tag categories

Relationship Types:
- KNOWS - Person to person connections
- LIKES - Likes on posts/comments
- HAS_CREATOR - Content authorship
- REPLY_OF - Comment replies
- HAS_TAG - Content tagging
- IS_LOCATED_IN - Location relationships
- And more...
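
One way to use the schema listing is as a quick check for typos in labels and relationship types before running a query. The following is a rough sketch, not a Cypher parser: `unknown_names` and the two regular expressions are hypothetical, and because the relationship list above ends with "And more...", an unlisted relationship type is a hint to double-check rather than proof of an error.

```python
import re

# Labels and relationship types named in the schema list above.
# The source list is open-ended ("And more..."), so RELATIONSHIP_TYPES is incomplete.
NODE_LABELS = {
    "Person", "Post", "Comment", "Forum",
    "Organisation", "Place", "Tag", "TagClass",
}
RELATIONSHIP_TYPES = {
    "KNOWS", "LIKES", "HAS_CREATOR", "REPLY_OF", "HAS_TAG", "IS_LOCATED_IN",
}

# Match (var:Label) and [var:TYPE]; the variable name is optional.
_NODE_RE = re.compile(r"\(\s*\w*\s*:\s*(\w+)")
_REL_RE = re.compile(r"\[\s*\w*\s*:\s*(\w+)")


def unknown_names(query):
    """Return (labels, relationship types) used in the query but not listed above."""
    labels = set(_NODE_RE.findall(query)) - NODE_LABELS
    rel_types = set(_REL_RE.findall(query)) - RELATIONSHIP_TYPES
    return labels, rel_types


print(unknown_names("MATCH (p:Persn)-[:KNOWS*2]-(f:Person) RETURN count(f)"))
```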

Usage

from graphforge import GraphForge
from graphforge.datasets import load_dataset

# Load a small LDBC dataset
gf = GraphForge()
load_dataset(gf, "ldbc-snb-sf001")

# Explore the social network
results = gf.execute("""
    MATCH (p:Person)-[:KNOWS]->(friend:Person)
    RETURN p.firstName, p.lastName, count(friend) AS friendCount
    ORDER BY friendCount DESC
    LIMIT 10
""")

Example Queries

MATCH (post:Post)<-[:LIKES]-(liker:Person)
RETURN post.content, count(liker) AS likes
ORDER BY likes DESC
LIMIT 10

Community Detection

MATCH (p:Person)-[:KNOWS*2]-(friend:Person)
WHERE p <> friend
RETURN p.firstName, p.lastName, count(DISTINCT friend) AS secondDegreeConnections
ORDER BY secondDegreeConnections DESC
LIMIT 20
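
The two-hop pattern above can be sketched in plain Python to show what is being counted. The data is a hypothetical toy graph with at most one KNOWS relationship per pair of people, and the tie-break by name is an addition of the sketch (the query leaves tie order unspecified).

```python
from collections import defaultdict

# Hypothetical toy data: each pair is one undirected KNOWS relationship.
knows = [("Ann", "Ben"), ("Ben", "Cy"), ("Cy", "Dee"), ("Ann", "Cy")]

adjacency = defaultdict(set)
for a, b in knows:
    adjacency[a].add(b)
    adjacency[b].add(a)


def second_degree(person):
    """People reachable in exactly two KNOWS hops, excluding the person."""
    reached = set()
    for neighbour in adjacency[person]:
        reached |= adjacency[neighbour]
    reached.discard(person)  # mirrors WHERE p <> friend
    return reached


counts = {person: len(second_degree(person)) for person in list(adjacency)}
# Sort by count descending, then by name to make ties deterministic.
ranking = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
print(ranking)
```

Note that the result for Ann includes Ben and Cy, who are also her direct friends: they are reachable in two hops through each other. The `*2` pattern counts everyone at the end of a two-hop path, not only people who are exclusively second-degree.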

Content Analysis

MATCH (tag:Tag)<-[:HAS_TAG]-(post:Post)
RETURN tag.name, count(post) AS postCount
ORDER BY postCount DESC
LIMIT 15

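Benchmarking

LDBC SNB defines standardized query workloads (Interactive and Business Intelligence) for benchmarking. The example below times a single ad hoc two-hop query rather than an official benchmark query: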

from graphforge import GraphForge
from graphforge.datasets import load_dataset
import time

gf = GraphForge()
load_dataset(gf, "ldbc-snb-sf01")

# Benchmark query execution
# perf_counter() is a monotonic, high-resolution clock, better suited to timing than time.time()
start = time.perf_counter()
results = gf.execute("""
    MATCH (person:Person)-[:KNOWS*1..2]-(friend:Person)
    WHERE person.id = 123
    RETURN DISTINCT friend.firstName, friend.lastName
""")
elapsed = time.perf_counter() - start

print(f"Query executed in {elapsed:.3f} seconds")
print(f"Results: {len(results)} friends found")

CLI Usage

# List available LDBC datasets
graphforge list-datasets --source ldbc

# Load a specific scale factor
graphforge load-dataset ldbc-snb-sf001

# Show dataset information
graphforge dataset-info ldbc-snb-sf01

Performance Notes

SF0.001-0.01: Suitable for development and testing
SF0.1: Standard benchmarking scale
SF1+: Requires significant RAM (8 GB+ recommended)
Use SQLite persistence for large scale factors

References

Dataset Overview

Issue #51