GraphForge Tutorial¶

A step-by-step guide to building, querying, and analyzing graphs with GraphForge.

Table of Contents¶

Installation
Your First Graph
Querying with Cypher
Working with Persistence
Advanced Queries
Real-World Example: Citation Network
Best Practices
Ranking and Clustering with forge.rank() and forge.cluster()
Finding Relevant Content with forge.find()
Next Steps

Installation¶

Install GraphForge using uv (recommended) or pip:

# Using uv
uv add graphforge

# Using pip
pip install graphforge

Verify the installation:

from graphforge import GraphForge
print("GraphForge installed successfully!")
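To confirm which version was installed, the standard library can report it. This is a minimal sketch, assuming the distribution is registered under the name graphforge (the name used in the install commands above). installed_version is a hypothetical helper, not part of GraphForge.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str = "graphforge") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


found = installed_version()
print(f"graphforge version: {found}" if found else "graphforge is not installed")
```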

Your First Graph¶

Let's build a simple social network.

Step 1: Create an In-Memory Graph¶

from graphforge import GraphForge

# Create an in-memory graph (data is not persisted between sessions)
forge = GraphForge()

Step 2: Add Nodes¶

Nodes represent entities in your graph. Use the Python API to create nodes:

# add_node returns a NodeHandle — prints as Person(id=1, name='Alice', age=30)
alice = forge.add_node("Person", name="Alice", age=30, city="NYC")
bob   = forge.add_node("Person", name="Bob",   age=25, city="NYC")
charlie = forge.add_node("Person", name="Charlie", age=35, city="LA")

print(f"Created node with ID: {alice.id}")  # Auto-generated ID

Step 3: Add Relationships¶

Relationships connect nodes:

# add_edge connects two NodeHandles with a named relationship type
forge.add_edge(alice, "KNOWS", bob,     since=2015, strength="strong")
forge.add_edge(alice, "KNOWS", charlie, since=2018, strength="medium")

Step 4: Query the Graph¶

Use Cypher queries to explore your graph. execute() returns an Apache Arrow Table:

# Find all people Alice knows
table = forge.execute("""
    MATCH (a:Person)-[r:KNOWS]->(friend:Person)
    WHERE a.name = 'Alice'
    RETURN friend.name AS name, r.since AS since
    ORDER BY r.since
""")

print("Alice knows:")
for row in table.to_pylist():
    print(f"  - {row['name']} (since {row['since']})")

Output:

Alice knows:
  - Bob (since 2015)
  - Charlie (since 2018)

Querying with Cypher¶

GraphForge uses openCypher for declarative graph queries; the Cypher Guide lists the supported subset. Every execute() call returns a PyArrow Table, which you can convert with table.to_pandas(), pl.from_arrow(table), or table.to_pylist().

Basic Pattern Matching¶

# Match all nodes
table = forge.execute("MATCH (n) RETURN n")
print(f"Total nodes: {table.num_rows}")

# Match nodes by label
table = forge.execute("""
    MATCH (p:Person)
    RETURN p.name AS name, p.age AS age
""")

for row in table.to_pylist():
    print(f"{row['name']} is {row['age']} years old")

Filtering with WHERE¶

# Find people over 25
table = forge.execute("""
    MATCH (p:Person)
    WHERE p.age > 25
    RETURN p.name AS name
""")

# Multiple conditions
table = forge.execute("""
    MATCH (p:Person)
    WHERE p.age > 25 AND p.city = 'NYC'
    RETURN p.name AS name
""")

Traversing Relationships¶

# Find who knows whom
table = forge.execute("""
    MATCH (a:Person)-[r:KNOWS]->(b:Person)
    RETURN a.name AS from, b.name AS to, r.since AS since
""")

# Two-hop traversal
table = forge.execute("""
    MATCH (a:Person)-[:KNOWS]->(:Person)-[:KNOWS]->(c:Person)
    WHERE a.name = 'Alice'
    RETURN c.name AS friend_of_friend
""")

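With only the two KNOWS edges created in Your First Graph, the two-hop query returns no rows, because neither Bob nor Charlie knows anyone. The plain-Python sketch below mirrors the pattern on an adjacency dict to show why. The dict and the friends_of_friends helper are illustrative stand-ins, not GraphForge API.

```python
# Directed KNOWS edges from the tutorial graph: Alice -> Bob, Alice -> Charlie
knows = {"Alice": ["Bob", "Charlie"], "Bob": [], "Charlie": []}


def friends_of_friends(graph: dict, start: str) -> list:
    """Follow exactly two outgoing edges from start, like (a)-[:KNOWS]->()-[:KNOWS]->(c)."""
    return [c for b in graph.get(start, []) for c in graph.get(b, [])]


print(friends_of_friends(knows, "Alice"))  # [] because no second hop exists yet

# After adding Bob -> Charlie, one two-hop path appears
knows["Bob"].append("Charlie")
print(friends_of_friends(knows, "Alice"))  # ['Charlie']
```
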
Aggregations¶

# Count nodes
table = forge.execute("""
    MATCH (p:Person)
    RETURN count(*) AS total
""")
print(f"Total people: {table.column('total')[0].as_py()}")

# Group and aggregate
table = forge.execute("""
    MATCH (p:Person)
    RETURN p.city AS city, count(*) AS population
    ORDER BY population DESC
""")

for row in table.to_pylist():
    print(f"{row['city']}: {row['population']} people")

# Multiple aggregations
table = forge.execute("""
    MATCH (p:Person)
    RETURN
        count(*) AS total,
        avg(p.age) AS avg_age,
        min(p.age) AS youngest,
        max(p.age) AS oldest
""")

row = table.to_pylist()[0]
print(f"Total: {row['total']}")
print(f"Average age: {row['avg_age']:.1f}")
print(f"Age range: {row['youngest']}-{row['oldest']}")

Working with Persistence¶

So far, we've used in-memory graphs that disappear when the program exits. Pass a directory path to persist data to Parquet files on disk.

Creating a Persistent Graph¶

# Specify a directory path for Parquet-backed storage
forge = GraphForge("my-research-graph/")

# Add data (works the same as in-memory)
forge.execute("CREATE (p:Person {name: 'Alice', age: 30})")
forge.execute("CREATE (p:Person {name: 'Bob', age: 25})")

# Flush and close
forge.close()

Loading an Existing Graph¶

# Later, in a new session...
forge = GraphForge("my-research-graph/")

# Data is still there
table = forge.execute("MATCH (p:Person) RETURN p.name AS name")
for row in table.to_pylist():
    print(f"Found: {row['name']}")

forge.close()

Incremental Updates¶

# Session 1: Initial data
forge = GraphForge("knowledge-base/")
forge.execute("CREATE (:Concept {name: 'Graph Databases'})")
forge.close()

# Session 2: Add more data
forge = GraphForge("knowledge-base/")
forge.execute("CREATE (:Concept {name: 'SQL Databases'})")
forge.execute("""
    MATCH (gdb:Concept {name: 'Graph Databases'})
    MATCH (sql:Concept {name: 'SQL Databases'})
    CREATE (gdb)-[:DIFFERENT_FROM]->(sql)
""")
forge.close()

# Session 3: Query accumulated data
forge = GraphForge("knowledge-base/")
table = forge.execute("MATCH (c:Concept) RETURN count(*) AS count")
print(f"Total concepts: {table.column('count')[0].as_py()}")  # 2
forge.close()

Advanced Queries¶

CREATE: Building Graphs with Cypher¶

You can use Cypher CREATE instead of the Python API:

# Create nodes
forge.execute("CREATE (p:Person {name: 'Alice', age: 30})")

# Create nodes with relationships in one statement
forge.execute("""
    CREATE (a:Person {name: 'Alice'})-[:KNOWS {since: 2020}]->(b:Person {name: 'Bob'})
""")

# Create with RETURN
table = forge.execute("""
    CREATE (p:Person {name: 'Charlie', age: 35})
    RETURN p.name AS name, p.age AS age
""")
print(f"Created: {table.column('name')[0].as_py()}")

SET: Updating Properties¶

# Update single property
forge.execute("""
    MATCH (p:Person {name: 'Alice'})
    SET p.age = 31
""")

# Update multiple properties
forge.execute("""
    MATCH (p:Person {name: 'Alice'})
    SET p.age = 31, p.city = 'Boston', p.active = true
""")

# Update relationship properties
forge.execute("""
    MATCH (a:Person {name: 'Alice'})-[r:KNOWS]->(b:Person {name: 'Bob'})
    SET r.strength = 'strong'
""")

DELETE: Removing Data¶

# Delete a node together with its relationships (plain DELETE fails while relationships remain)
forge.execute("""
    MATCH (p:Person {name: 'Charlie'})
    DETACH DELETE p
""")

# Delete only a relationship
forge.execute("""
    MATCH (a:Person {name: 'Alice'})-[r:KNOWS]->(b:Person {name: 'Bob'})
    DELETE r
""")

# Delete multiple elements (plain DELETE requires that these nodes have no relationships)
forge.execute("""
    MATCH (p:Person)
    WHERE p.age < 25
    DELETE p
""")

MERGE: Idempotent Creation¶

MERGE creates nodes if they don't exist, or matches them if they do:

# Safe to run multiple times
forge.execute("MERGE (p:Person {name: 'Alice'})")
forge.execute("MERGE (p:Person {name: 'Alice'})")  # Matches existing node

# Check result (assuming the graph held no Alice node before the first MERGE)
table = forge.execute("MATCH (p:Person {name: 'Alice'}) RETURN count(*) AS count")
print(table.column("count")[0].as_py())  # 1, not 2!

# Useful for ETL pipelines
forge.execute("MERGE (p:Person {name: 'Bob', email: 'bob@example.com'})")
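MERGE matches on the whole pattern, so every property listed in the braces takes part in the match. The plain-Python sketch below models that get-or-create rule over a list of dicts. merge_node is a hypothetical stand-in written for illustration, not how GraphForge implements MERGE.

```python
def merge_node(nodes: list, label: str, **props) -> dict:
    """Return the first node with this label and all given properties, else create one."""
    for node in nodes:
        if node["label"] == label and all(node.get(k) == v for k, v in props.items()):
            return node
    node = {"label": label, **props}
    nodes.append(node)
    return node


nodes = []
merge_node(nodes, "Person", name="Alice")
merge_node(nodes, "Person", name="Alice")  # matches the existing node
print(len(nodes))  # 1

merge_node(nodes, "Person", name="Bob")
merge_node(nodes, "Person", name="Bob", email="bob@example.com")  # no node has this email
print(len(nodes))  # 3, because a second Bob was created
```

To update an existing node rather than risk a near-duplicate, merge on the identifying property alone and assign the remaining properties with SET.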

Real-World Example: Citation Network¶

Let's build a realistic citation network graph.

Setup¶

from graphforge import GraphForge

forge = GraphForge("citation-network/")

Load Papers¶

papers = [
    {"id": "P1", "title": "Graph Neural Networks", "year": 2021, "citations": 150},
    {"id": "P2", "title": "Deep Learning Fundamentals", "year": 2019, "citations": 500},
    {"id": "P3", "title": "GNN Applications in NLP", "year": 2022, "citations": 80},
    {"id": "P4", "title": "Attention Is All You Need", "year": 2017, "citations": 2000},
]

forge.add_nodes("Paper", papers)
print(f"Loaded {len(papers)} papers")

Add Authors¶

authors = [
    {"name": "Alice Smith", "affiliation": "MIT"},
    {"name": "Bob Jones", "affiliation": "Stanford"},
    {"name": "Charlie Brown", "affiliation": "MIT"},
]

for author in authors:
    # Pass values as query parameters instead of interpolating them into the Cypher text
    forge.execute("""
        MERGE (a:Author {name: $name})
        SET a.affiliation = $affiliation
    """, author)

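Values that come from data are safer passed as query parameters than pasted into the Cypher text. The snippet below only builds strings (it does not call GraphForge) to show what goes wrong when a name containing an apostrophe is interpolated with an f-string. The author name is made up for illustration, and the parameter dict follows the execute(query, params) form used in Using Results in Cypher.

```python
author = {"name": "Pat O'Neil", "affiliation": "MIT"}  # hypothetical author

# Interpolated: the apostrophe ends the Cypher string literal early
interpolated = f"MERGE (a:Author {{name: '{author['name']}'}})"
print(interpolated)
print(interpolated.count("'") % 2 == 1)  # True, the quotes are unbalanced

# Parameterized: the query text is constant and the values travel separately
query = "MERGE (a:Author {name: $name}) SET a.affiliation = $affiliation"
params = {"name": author["name"], "affiliation": author["affiliation"]}
print(query, params)
```
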
Link Authors to Papers¶

authorships = [
    ("Alice Smith", "P1"),
    ("Alice Smith", "P3"),
    ("Bob Jones", "P2"),
    ("Charlie Brown", "P1"),
    ("Charlie Brown", "P4"),
]

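for author_name, paper_id in authorships:
    # Parameters keep the query text constant and avoid quoting problems
    forge.execute("""
        MATCH (a:Author {name: $author_name})
        MATCH (p:Paper {id: $paper_id})
        CREATE (a)-[:AUTHORED]->(p)
    """, {"author_name": author_name, "paper_id": paper_id})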

Add Citation Links¶

citations = [
    ("P1", "P2"),  # P1 cites P2
    ("P1", "P4"),  # P1 cites P4
    ("P3", "P1"),  # P3 cites P1
    ("P3", "P2"),  # P3 cites P2
]

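for citing_id, cited_id in citations:
    # Parameters keep the query text constant and avoid quoting problems
    forge.execute("""
        MATCH (citing:Paper {id: $citing_id})
        MATCH (cited:Paper {id: $cited_id})
        CREATE (citing)-[:CITES]->(cited)
    """, {"citing_id": citing_id, "cited_id": cited_id})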

Analysis 1: Most Prolific Authors¶

table = forge.execute("""
    MATCH (a:Author)-[:AUTHORED]->(p:Paper)
    RETURN a.name AS author, count(p) AS paper_count
    ORDER BY paper_count DESC, author
""")

print("Most prolific authors:")
for row in table.to_pylist():
    print(f"  {row['author']}: {row['paper_count']} papers")

Output:

Most prolific authors:
  Alice Smith: 2 papers
  Charlie Brown: 2 papers
  Bob Jones: 1 papers

Analysis 2: Most Cited Papers¶

table = forge.execute("""
    MATCH (p:Paper)<-[:CITES]-(citing:Paper)
    RETURN p.title AS paper, count(citing) AS citation_count
    ORDER BY citation_count DESC
""")

print("\nMost cited papers (in-network):")
for row in table.to_pylist():
    print(f"  {row['paper']}: {row['citation_count']} citations")

Analysis 3: Collaboration Network¶

table = forge.execute("""
    MATCH (a1:Author)-[:AUTHORED]->(p:Paper)<-[:AUTHORED]-(a2:Author)
    WHERE a1.name < a2.name
    RETURN a1.name AS author1, a2.name AS author2, count(p) AS papers
    ORDER BY papers DESC
""")

print("\nAuthor collaborations:")
for row in table.to_pylist():
    print(f"  {row['author1']} & {row['author2']}: {row['papers']} papers")
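The WHERE a1.name < a2.name condition keeps each unordered pair once. Without it, the pattern would return (Alice, Charlie) and (Charlie, Alice) as separate rows. The plain-Python cross-check below applies the same ordering rule to the authorships list by running itertools.combinations over sorted names. It is an illustration only and does not call GraphForge.

```python
from collections import Counter, defaultdict
from itertools import combinations

authorships = [
    ("Alice Smith", "P1"), ("Alice Smith", "P3"), ("Bob Jones", "P2"),
    ("Charlie Brown", "P1"), ("Charlie Brown", "P4"),
]

# Group authors by paper
authors_by_paper = defaultdict(set)
for author, paper in authorships:
    authors_by_paper[paper].add(author)

# combinations() over sorted names yields each pair once, with author1 < author2
pair_counts = Counter(
    pair
    for authors in authors_by_paper.values()
    for pair in combinations(sorted(authors), 2)
)
print(dict(pair_counts))  # {('Alice Smith', 'Charlie Brown'): 1}
```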

Analysis 4: Papers by MIT Authors¶

table = forge.execute("""
    MATCH (a:Author)-[:AUTHORED]->(p:Paper)
    WHERE a.affiliation = 'MIT'
    RETURN p.title AS paper, a.name AS author
""")

print("\nMIT papers:")
for row in table.to_pylist():
    print(f"  {row['paper']} by {row['author']}")

Cleanup¶

forge.close()

Best Practices¶

1. Choose Storage Mode Appropriately¶

Use in-memory graphs for:
- Quick exploration and prototyping
- Throwaway analyses
- Testing

Use persistent graphs for:
- Long-running analyses
- Incremental graph construction
- Shared datasets
- Production workflows

# Exploration
forge = GraphForge()

# Persistent
forge = GraphForge("production-graph/")

2. Always Close Persistent Graphs¶

# Good: Using try-finally
forge = GraphForge("my-graph/")
try:
    # ... work with graph ...
    pass
finally:
    forge.close()
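Because the pattern relies only on close(), contextlib.closing from the standard library can express the same guarantee as a with block. The sketch uses a hypothetical StandInGraph class so that it runs without a graph on disk. Whether GraphForge also supports the with statement directly is not covered here.

```python
from contextlib import closing


class StandInGraph:
    """Hypothetical stand-in exposing only close(), like the handle used above."""

    def __init__(self) -> None:
        self.closed = False

    def close(self) -> None:
        self.closed = True


graph = StandInGraph()
try:
    with closing(graph):
        raise RuntimeError("simulated failure mid-analysis")
except RuntimeError:
    pass

print(graph.closed)  # True, close() ran even though the block raised
```

With a real graph, the with line would wrap the GraphForge("my-graph/") call instead of the stand-in.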

3. Use MERGE for Idempotent Operations¶

# Safe to run multiple times
forge.execute("MERGE (p:Person {email: 'alice@example.com'})")
forge.execute("MERGE (p:Person {email: 'alice@example.com'})")  # No duplicates

# Avoid this pattern
forge.execute("CREATE (p:Person {email: 'alice@example.com'})")
forge.execute("CREATE (p:Person {email: 'alice@example.com'})")  # Creates duplicate!

4. Work with Arrow Results Directly¶

table = forge.execute("MATCH (p:Person) RETURN p.name AS name, p.age AS age")

# pandas
df = table.to_pandas()

# Polars
import polars as pl
df = pl.from_arrow(table)

# Plain Python iteration
for row in table.to_pylist():
    name = row["name"]
    age  = row["age"]

5. Use WHERE for Complex Filtering¶

# Put multi-condition filters in a WHERE clause
forge.execute("""
    MATCH (p:Person)
    WHERE p.age > 25 AND p.city = 'NYC'
    RETURN p
""")

6. Order Results Before Using LIMIT¶

# Always order when using LIMIT
forge.execute("""
    MATCH (p:Person)
    RETURN p.name AS name, p.age AS age
    ORDER BY p.age DESC
    LIMIT 10
""")

# Without ORDER BY, results are non-deterministic

Ranking and Clustering with forge.rank() and forge.cluster()¶

forge.rank() and forge.cluster() run compiled graph algorithms directly — no Cypher needed, no exporting the graph first. Both return Arrow Tables and are read-only by default.

from graphforge import GraphForge
from graphforge.datasets import load_dataset

forge = GraphForge()
load_dataset(forge, "snap-ego-facebook")

# Score every node by PageRank — returns Arrow Table with a 'score' column
table = forge.rank("Node", by="pagerank")
df = table.to_pandas().sort_values("score", ascending=False)
print(df[["id", "score"]].head(10))

# Restrict to a specific relationship type and direction
table = forge.rank("Node", by="betweenness", via="KNOWS", directed=True)

# Opt-in write-back — stores the score as a node property for Cypher queries
forge.rank("Node", by="pagerank", write_property="pagerank")
top_nodes = forge.execute("""
    MATCH (n)
    RETURN n.id AS user, n.pagerank AS score
    ORDER BY n.pagerank DESC LIMIT 10
""")
print(top_nodes.to_pandas())

# Community detection — returns Arrow Table with a 'community_id' column
table = forge.cluster("Node", by="louvain")
df = table.to_pandas()
print(df.groupby("community_id").size().sort_values(ascending=False).head(5))

# Write communities back and query the structure
forge.cluster("Node", by="louvain", write_property="community")
communities = forge.execute("""
    MATCH (n)
    RETURN n.community AS community, count(*) AS size
    ORDER BY size DESC LIMIT 5
""")
print(communities.to_pandas())

Available algorithms:

Method	`by=` value	Category
`forge.rank()`	`pagerank`	Centrality
`forge.rank()`	`betweenness`	Centrality
`forge.rank()`	`closeness`	Centrality
`forge.rank()`	`degree`	Centrality
`forge.rank()`	`clustering_coefficient`	Structural
`forge.rank()`	`triangles`	Structural
`forge.cluster()`	`louvain`	Community
`forge.cluster()`	`components`	Community

All methods accept optional via (relationship type), directed, and write_property parameters. Without write_property, the graph is never mutated.
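To give a feel for what by="pagerank" computes, here is a plain-Python power-iteration sketch run on the four-paper citation graph from the earlier example. It is not GraphForge's implementation. The damping factor of 0.85, the fixed iteration count, and the uniform handling of papers with no outgoing citations are conventional choices assumed for illustration.

```python
def pagerank(edges, nodes, damping=0.85, iterations=50):
    """Power-iteration PageRank; dangling nodes spread their score uniformly."""
    n = len(nodes)
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    score = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Score held by nodes without outgoing edges is shared among all nodes
        dangling = sum(score[v] for v in nodes if not out_links[v])
        new = {v: (1 - damping) / n + damping * dangling / n for v in nodes}
        for src, targets in out_links.items():
            for dst in targets:
                new[dst] += damping * score[src] / len(targets)
        score = new
    return score


papers = ["P1", "P2", "P3", "P4"]
cites = [("P1", "P2"), ("P1", "P4"), ("P3", "P1"), ("P3", "P2")]

scores = pagerank(cites, papers)
for paper, s in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{paper}: {s:.3f}")
```

P2 (Deep Learning Fundamentals) comes out on top because both P1 and P3 cite it, while P3 ranks last because nothing cites it.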

Finding Relevant Content with forge.find()¶

forge.find() provides full-text search, vector similarity, and hybrid search over node properties. It returns an Arrow Table with node properties plus score and matched_on columns. The index is built automatically on the first call.

Text Search¶

forge = GraphForge("citations/")

# Index built lazily on first call — no setup required
table = forge.find("graph neural networks", label="Paper", limit=10)
df = table.to_pandas()
print(df[["title", "score", "matched_on"]])
#                         title     score matched_on
# 0       Graph Neural Networks  0.924       text
# 1   GNN Applications in NLP   0.781       text

Vector Search (bring your own embeddings)¶

GraphForge stores and queries vectors but does not generate them. Use any embedding model:

import openai
client = openai.OpenAI()

def embed(text: str) -> list[float]:
    return client.embeddings.create(
        input=text, model="text-embedding-3-small"
    ).data[0].embedding

# Store embeddings for each paper
rows = forge.execute("MATCH (n:Paper) RETURN id(n) AS nid, n.abstract AS abstract")
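for row in rows.to_pylist():
    if not row["abstract"]:
        continue  # skip papers without an abstract: the embeddings API rejects empty input
    vec = embed(row["abstract"])
    forge.index("Paper", node_id=row["nid"], vector=vec, space="text-embedding-3-small")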

# Query by vector similarity
query_vec = embed("scalable graph representation learning")
table = forge.find(vector=query_vec, label="Paper", limit=10)
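The similarity measure GraphForge uses for vector search is not stated in this tutorial. Cosine similarity is a common choice for text embeddings, and the sketch below shows what ranking by it looks like with toy three-dimensional vectors. The vectors and the cosine helper are made up for illustration.

```python
import math


def cosine(a: list, b: list) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


# Toy embeddings keyed by paper title (real ones have hundreds of dimensions)
stored = {
    "Graph Neural Networks": [0.9, 0.1, 0.0],
    "Deep Learning Fundamentals": [0.5, 0.5, 0.1],
    "Attention Is All You Need": [0.0, 0.2, 0.9],
}
toy_query = [1.0, 0.0, 0.0]

ranked = sorted(stored, key=lambda title: cosine(stored[title], toy_query), reverse=True)
print(ranked[0])  # Graph Neural Networks
```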

Hybrid Search¶

Combine text and vector signals in a single call:

table = forge.find("scalable graph learning", label="Paper", vector=query_vec, limit=10)

for row in table.to_pylist():
    print(
        row["title"],
        f"score={row['score']:.3f}",
        f"via={row['matched_on']}",  # "text", "vector", or "text+vector"
    )
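How GraphForge merges the text and vector signals into one score is not described here. One widely used fusion technique is reciprocal rank fusion (RRF), sketched below in plain Python. The two ranked lists, the constant k=60, and the rrf helper are assumptions for illustration, and the matched_on labels simply mirror the three values listed above.

```python
def rrf(text_hits: list, vector_hits: list, k: int = 60) -> list:
    """Fuse two ranked ID lists: each list adds 1 / (k + rank) to an item's score."""
    sources = (("text", text_hits), ("vector", vector_hits))

    scores = {}
    for _, hits in sources:
        for rank, item in enumerate(hits, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)

    def matched_on(item: str) -> str:
        # Name every list the item appeared in, joined with "+"
        return "+".join(name for name, hits in sources if item in hits)

    ordered = sorted(scores, key=scores.get, reverse=True)
    return [(item, scores[item], matched_on(item)) for item in ordered]


fused = rrf(text_hits=["P1", "P3"], vector_hits=["P3", "P2"])
for item, score, via in fused:
    print(item, f"score={score:.4f}", f"via={via}")
```

P3 appears in both lists, so it collects two contributions and moves ahead of P1 even though P1 is the top text hit.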

Using Results in Cypher¶

Every row in the result table has an id column — pass it to execute() for follow-up graph traversals:

table = forge.find("graph neural networks", label="Paper", limit=5)
top_id = table.column("id")[0].as_py()

neighbours = forge.execute("""
    MATCH (p:Paper)-[:CITES]->(cited:Paper)
    WHERE id(p) = $nid
    RETURN cited.title AS title, cited.year AS year
    ORDER BY cited.year DESC
""", {"nid": top_id})
print(neighbours.to_pandas())

Next Steps¶

Congratulations! You've learned the fundamentals of GraphForge.

Learn More¶

API Reference — Complete Python API documentation
Cypher Guide — Full openCypher subset reference
Analytics Integration — Arrow, pandas, Polars, rank, cluster, find
Architecture Overview — System design and internals

Try These Exercises¶

Social Network: Build a graph of friends and their relationships. Use forge.rank() to find the most influential people.
Knowledge Graph: Extract entities from a document and link them with relationships. Use forge.find() to retrieve relevant context.
Citation Analysis: Load a set of papers and citations. Rank papers by betweenness centrality to find bridging works.
Recommendation System: Build a user-item graph and find similar users or items using forge.cluster() to identify taste communities.
Data Lineage: Track transformations in a data pipeline and query dependencies.

Join the Community¶

Report issues on GitHub
Read the Requirements Document for design rationale
Explore example notebooks (coming soon!)