GraphForge Tutorial¶
A step-by-step guide to building, querying, and analyzing graphs with GraphForge.
Table of Contents¶
- Installation
- Your First Graph
- Querying with Cypher
- Working with Persistence
- Advanced Queries
- Real-World Example: Citation Network
- Best Practices
- Ranking and Clustering with forge.rank() and forge.cluster()
- Finding Relevant Content with forge.find()
- Next Steps
Installation¶
Install GraphForge using uv (recommended) or pip:
# Using uv
uv add graphforge
# Using pip
pip install graphforge
Verify the installation:
from graphforge import GraphForge
print("GraphForge installed successfully!")
Your First Graph¶
Let's build a simple social network.
Step 1: Create an In-Memory Graph¶
from graphforge import GraphForge
# Create an in-memory graph (data is not persisted between sessions)
forge = GraphForge()
Step 2: Add Nodes¶
Nodes represent entities in your graph. Use the Python API to create nodes:
# add_node returns a NodeHandle — prints as Person(id=1, name='Alice', age=30)
alice = forge.add_node("Person", name="Alice", age=30, city="NYC")
bob = forge.add_node("Person", name="Bob", age=25, city="NYC")
charlie = forge.add_node("Person", name="Charlie", age=35, city="LA")
print(f"Created node with ID: {alice.id}") # Auto-generated ID
Step 3: Add Relationships¶
Relationships connect nodes:
# add_edge connects two NodeHandles with a named relationship type
forge.add_edge(alice, "KNOWS", bob, since=2015, strength="strong")
forge.add_edge(alice, "KNOWS", charlie, since=2018, strength="medium")
Step 4: Query the Graph¶
Use Cypher queries to explore your graph. execute() returns an Apache Arrow Table:
# Find all people Alice knows
table = forge.execute("""
MATCH (a:Person)-[r:KNOWS]->(friend:Person)
WHERE a.name = 'Alice'
RETURN friend.name AS name, r.since AS since
ORDER BY r.since
""")
print("Alice knows:")
for row in table.to_pylist():
    print(f" - {row['name']} (since {row['since']})")
Output:
Alice knows:
- Bob (since 2015)
- Charlie (since 2018)
Querying with Cypher¶
GraphForge supports a subset of the openCypher language for declarative graph queries (the Cypher Guide listed under Learn More documents what is covered).
Every execute() call returns a PyArrow Table — use table.to_pandas(),
pl.from_arrow(table), or table.to_pylist() to work with the results.
Basic Pattern Matching¶
# Match all nodes
table = forge.execute("MATCH (n) RETURN n")
print(f"Total nodes: {table.num_rows}")
# Match nodes by label
table = forge.execute("""
MATCH (p:Person)
RETURN p.name AS name, p.age AS age
""")
for row in table.to_pylist():
    print(f"{row['name']} is {row['age']} years old")
Filtering with WHERE¶
# Find people over 25
table = forge.execute("""
MATCH (p:Person)
WHERE p.age > 25
RETURN p.name AS name
""")
# Multiple conditions
table = forge.execute("""
MATCH (p:Person)
WHERE p.age > 25 AND p.city = 'NYC'
RETURN p.name AS name
""")
Traversing Relationships¶
# Find who knows whom
table = forge.execute("""
MATCH (a:Person)-[r:KNOWS]->(b:Person)
RETURN a.name AS from, b.name AS to, r.since AS since
""")
# Two-hop traversal
table = forge.execute("""
MATCH (a:Person)-[:KNOWS]->(:Person)-[:KNOWS]->(c:Person)
WHERE a.name = 'Alice'
RETURN c.name AS friend_of_friend
""")
Aggregations¶
# Count nodes
table = forge.execute("""
MATCH (p:Person)
RETURN count(*) AS total
""")
print(f"Total people: {table.column('total')[0].as_py()}")
# Group and aggregate
table = forge.execute("""
MATCH (p:Person)
RETURN p.city AS city, count(*) AS population
ORDER BY population DESC
""")
for row in table.to_pylist():
    print(f"{row['city']}: {row['population']} people")
# Multiple aggregations
table = forge.execute("""
MATCH (p:Person)
RETURN
count(*) AS total,
avg(p.age) AS avg_age,
min(p.age) AS youngest,
max(p.age) AS oldest
""")
row = table.to_pylist()[0]
print(f"Total: {row['total']}")
print(f"Average age: {row['avg_age']:.1f}")
print(f"Age range: {row['youngest']}-{row['oldest']}")
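To make the aggregation semantics concrete, here is a plain-Python sketch that computes the same figures over the three Person nodes from Your First Graph. It does not use GraphForge; the people list and the variable names are illustrative only.

```python
from collections import Counter

# The three Person nodes created in "Your First Graph".
people = [
    {"name": "Alice", "age": 30, "city": "NYC"},
    {"name": "Bob", "age": 25, "city": "NYC"},
    {"name": "Charlie", "age": 35, "city": "LA"},
]

# RETURN p.city AS city, count(*) AS population ORDER BY population DESC
population = Counter(p["city"] for p in people).most_common()

# RETURN count(*), avg(p.age), min(p.age), max(p.age)
ages = [p["age"] for p in people]
summary = {
    "total": len(ages),
    "avg_age": sum(ages) / len(ages),
    "youngest": min(ages),
    "oldest": max(ages),
}

print(population)  # [('NYC', 2), ('LA', 1)]
print(summary)     # {'total': 3, 'avg_age': 30.0, 'youngest': 25, 'oldest': 35}
```

Grouping in Cypher is implicit: every non-aggregated expression in RETURN (here p.city) acts as a grouping key, which is what the Counter keyed on city mirrors.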
Working with Persistence¶
So far, we've used in-memory graphs that disappear when the program exits. Pass a directory path to persist data to Parquet files on disk.
Creating a Persistent Graph¶
# Specify a directory path for Parquet-backed storage
forge = GraphForge("my-research-graph/")
# Add data (works the same as in-memory)
forge.execute("CREATE (p:Person {name: 'Alice', age: 30})")
forge.execute("CREATE (p:Person {name: 'Bob', age: 25})")
# Flush and close
forge.close()
Loading an Existing Graph¶
# Later, in a new session...
forge = GraphForge("my-research-graph/")
# Data is still there
table = forge.execute("MATCH (p:Person) RETURN p.name AS name")
for row in table.to_pylist():
    print(f"Found: {row['name']}")
forge.close()
Incremental Updates¶
# Session 1: Initial data
forge = GraphForge("knowledge-base/")
forge.execute("CREATE (:Concept {name: 'Graph Databases'})")
forge.close()
# Session 2: Add more data
forge = GraphForge("knowledge-base/")
forge.execute("CREATE (:Concept {name: 'SQL Databases'})")
forge.execute("""
MATCH (gdb:Concept {name: 'Graph Databases'})
MATCH (sql:Concept {name: 'SQL Databases'})
CREATE (gdb)-[:DIFFERENT_FROM]->(sql)
""")
forge.close()
# Session 3: Query accumulated data
forge = GraphForge("knowledge-base/")
table = forge.execute("MATCH (c:Concept) RETURN count(*) AS count")
print(f"Total concepts: {table.column('count')[0].as_py()}") # 2
forge.close()
Advanced Queries¶
CREATE: Building Graphs with Cypher¶
You can use Cypher CREATE instead of the Python API:
# Create nodes
forge.execute("CREATE (p:Person {name: 'Alice', age: 30})")
# Create nodes with relationships in one statement
forge.execute("""
CREATE (a:Person {name: 'Alice'})-[:KNOWS {since: 2020}]->(b:Person {name: 'Bob'})
""")
# Create with RETURN
table = forge.execute("""
CREATE (p:Person {name: 'Charlie', age: 35})
RETURN p.name AS name, p.age AS age
""")
print(f"Created: {table.column('name')[0].as_py()}")
SET: Updating Properties¶
# Update single property
forge.execute("""
MATCH (p:Person {name: 'Alice'})
SET p.age = 31
""")
# Update multiple properties
forge.execute("""
MATCH (p:Person {name: 'Alice'})
SET p.age = 31, p.city = 'Boston', p.active = true
""")
# Update relationship properties
forge.execute("""
MATCH (a:Person {name: 'Alice'})-[r:KNOWS]->(b:Person {name: 'Bob'})
SET r.strength = 'strong'
""")
DELETE: Removing Data¶
# Delete a node together with its relationships (plain DELETE fails while the node still has relationships)
forge.execute("""
MATCH (p:Person {name: 'Charlie'})
DETACH DELETE p
""")
# Delete only a relationship
forge.execute("""
MATCH (a:Person {name: 'Alice'})-[r:KNOWS]->(b:Person {name: 'Bob'})
DELETE r
""")
# Delete multiple nodes (use DETACH DELETE if any of them still has relationships)
forge.execute("""
MATCH (p:Person)
WHERE p.age < 25
DELETE p
""")
MERGE: Idempotent Creation¶
MERGE creates nodes if they don't exist, or matches them if they do:
# Safe to run multiple times
forge.execute("MERGE (p:Person {name: 'Alice'})")
forge.execute("MERGE (p:Person {name: 'Alice'})") # Matches existing node
# Check result
table = forge.execute("MATCH (p:Person {name: 'Alice'}) RETURN count(*) AS count")
print(table.column("count")[0].as_py()) # 1, not 2!
# Useful for ETL pipelines
forge.execute("MERGE (p:Person {name: 'Bob', email: 'bob@example.com'})")
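One detail worth spelling out: in openCypher, MERGE matches on the whole pattern, including every property listed in it. The sketch below models that rule in plain Python, assuming GraphForge follows the standard behaviour. The nodes list and merge_node helper are illustrative and not part of GraphForge.

```python
# Plain-Python model of MERGE on a single node pattern (illustrative only).
nodes: list[dict] = []

def merge_node(label: str, **props) -> dict:
    """Return the first node matching the label and ALL given props, else create it."""
    for node in nodes:
        if node["label"] == label and all(
            node["props"].get(key) == value for key, value in props.items()
        ):
            return node  # pattern matched: nothing is created
    node = {"label": label, "props": dict(props)}
    nodes.append(node)
    return node

merge_node("Person", name="Alice")
merge_node("Person", name="Alice")  # matches the existing node
merge_node("Person", name="Bob")
merge_node("Person", name="Bob", email="bob@example.com")  # no full match: a second Bob

print(len(nodes))  # 3
```

If a Bob without an email already exists, merging on both name and email creates a second node. Merging on a stable key and then assigning the other properties with SET, as the citation example does for author affiliations, avoids that.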
Real-World Example: Citation Network¶
Let's build a realistic citation network graph.
Setup¶
from graphforge import GraphForge
forge = GraphForge("citation-network/")
Load Papers¶
papers = [
{"id": "P1", "title": "Graph Neural Networks", "year": 2021, "citations": 150},
{"id": "P2", "title": "Deep Learning Fundamentals", "year": 2019, "citations": 500},
{"id": "P3", "title": "GNN Applications in NLP", "year": 2022, "citations": 80},
{"id": "P4", "title": "Attention Is All You Need", "year": 2017, "citations": 2000},
]
forge.add_nodes("Paper", papers)
print(f"Loaded {len(papers)} papers")
Add Authors¶
authors = [
{"name": "Alice Smith", "affiliation": "MIT"},
{"name": "Bob Jones", "affiliation": "Stanford"},
{"name": "Charlie Brown", "affiliation": "MIT"},
]
for author in authors:
    forge.execute(f"""
        MERGE (a:Author {{name: '{author['name']}'}})
        SET a.affiliation = '{author['affiliation']}'
    """)
Link Authors to Papers¶
authorships = [
("Alice Smith", "P1"),
("Alice Smith", "P3"),
("Bob Jones", "P2"),
("Charlie Brown", "P1"),
("Charlie Brown", "P4"),
]
for author_name, paper_id in authorships:
    forge.execute(f"""
        MATCH (a:Author {{name: '{author_name}'}})
        MATCH (p:Paper {{id: '{paper_id}'}})
        CREATE (a)-[:AUTHORED]->(p)
    """)
Add Citation Links¶
citations = [
("P1", "P2"), # P1 cites P2
("P1", "P4"), # P1 cites P4
("P3", "P1"), # P3 cites P1
("P3", "P2"), # P3 cites P2
]
for citing_id, cited_id in citations:
    forge.execute(f"""
        MATCH (citing:Paper {{id: '{citing_id}'}})
        MATCH (cited:Paper {{id: '{cited_id}'}})
        CREATE (citing)-[:CITES]->(cited)
    """)
Analysis 1: Most Prolific Authors¶
table = forge.execute("""
MATCH (a:Author)-[:AUTHORED]->(p:Paper)
RETURN a.name AS author, count(p) AS paper_count
ORDER BY paper_count DESC
""")
print("Most prolific authors:")
for row in table.to_pylist():
    print(f" {row['author']}: {row['paper_count']} papers")
Output:
Most prolific authors:
Alice Smith: 2 papers
Charlie Brown: 2 papers
Bob Jones: 1 papers
Analysis 2: Most Cited Papers¶
table = forge.execute("""
MATCH (p:Paper)<-[:CITES]-(citing:Paper)
RETURN p.title AS paper, count(citing) AS citation_count
ORDER BY citation_count DESC
""")
print("\nMost cited papers (in-network):")
for row in table.to_pylist():
    print(f" {row['paper']}: {row['citation_count']} citations")
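As a cross-check on what this query should return, the same counts can be derived from the citations list in plain Python. This is a sketch independent of GraphForge; the titles mapping simply restates the paper data loaded earlier.

```python
from collections import Counter

titles = {
    "P1": "Graph Neural Networks",
    "P2": "Deep Learning Fundamentals",
    "P3": "GNN Applications in NLP",
    "P4": "Attention Is All You Need",
}
citations = [("P1", "P2"), ("P1", "P4"), ("P3", "P1"), ("P3", "P2")]  # (citing, cited)

# Count incoming CITES edges per cited paper, like count(citing) grouped by p.
in_network = Counter(titles[cited] for _, cited in citations)

for title, count in in_network.most_common():
    print(f" {title}: {count} citations")
```

Deep Learning Fundamentals has two in-network citations, and Graph Neural Networks and Attention Is All You Need have one each. GNN Applications in NLP is never cited, so it is absent from the result, just as the MATCH pattern only returns papers with at least one incoming CITES relationship.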
Analysis 3: Collaboration Network¶
table = forge.execute("""
MATCH (a1:Author)-[:AUTHORED]->(p:Paper)<-[:AUTHORED]-(a2:Author)
WHERE a1.name < a2.name
RETURN a1.name AS author1, a2.name AS author2, count(p) AS papers
ORDER BY papers DESC
""")
print("\nAuthor collaborations:")
for row in table.to_pylist():
    print(f" {row['author1']} & {row['author2']}: {row['papers']} papers")
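The condition a1.name < a2.name is what keeps each pair from being reported twice (once in each order) and stops an author from being paired with themselves. A plain-Python sketch of the same computation over the authorships list, independent of GraphForge:

```python
from collections import Counter, defaultdict
from itertools import combinations

authorships = [
    ("Alice Smith", "P1"),
    ("Alice Smith", "P3"),
    ("Bob Jones", "P2"),
    ("Charlie Brown", "P1"),
    ("Charlie Brown", "P4"),
]

# Group authors by paper.
authors_by_paper = defaultdict(set)
for author, paper in authorships:
    authors_by_paper[paper].add(author)

# combinations() over a sorted list yields each pair once with author1 < author2,
# which is the role of WHERE a1.name < a2.name in the Cypher query.
pairs = Counter()
for authors in authors_by_paper.values():
    for author1, author2 in combinations(sorted(authors), 2):
        pairs[(author1, author2)] += 1

print(dict(pairs))  # {('Alice Smith', 'Charlie Brown'): 1}
```

With the sample data, only P1 has more than one author, so the query yields a single collaboration.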
Analysis 4: Papers by MIT Authors¶
table = forge.execute("""
MATCH (a:Author)-[:AUTHORED]->(p:Paper)
WHERE a.affiliation = 'MIT'
RETURN p.title AS paper, a.name AS author
""")
print("\nMIT papers:")
for row in table.to_pylist():
    print(f" {row['paper']} by {row['author']}")
Cleanup¶
forge.close()
Best Practices¶
1. Choose Storage Mode Appropriately¶
Use in-memory graphs for:
- Quick exploration and prototyping
- Throwaway analyses
- Testing
Use persistent graphs for:
- Long-running analyses
- Incremental graph construction
- Shared datasets
- Production workflows
# Exploration
forge = GraphForge()
# Persistent
forge = GraphForge("production-graph/")
2. Always Close Persistent Graphs¶
# Good: Using try-finally
forge = GraphForge("my-graph/")
try:
    # ... work with graph ...
    pass
finally:
    forge.close()
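An equivalent, more compact form uses contextlib.closing from the standard library, which calls close() on exit for any object that has that method. The sketch below uses a stand-in class (FakeGraph, hypothetical) so it runs without GraphForge. Whether GraphForge also supports the with statement directly is not covered in this tutorial.

```python
from contextlib import closing

class FakeGraph:
    """Stand-in with a close() method so the sketch runs without GraphForge."""

    def __init__(self) -> None:
        self.closed = False

    def close(self) -> None:
        self.closed = True

graph = FakeGraph()
try:
    with closing(graph) as forge:
        # close() is still called even though this block raises.
        raise RuntimeError("simulated failure while working with the graph")
except RuntimeError:
    pass

print(graph.closed)  # True
```

With a real graph, the same pattern would read: with closing(GraphForge("my-graph/")) as forge: followed by the indented work.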
3. Use MERGE for Idempotent Operations¶
# Safe to run multiple times
forge.execute("MERGE (p:Person {email: 'alice@example.com'})")
forge.execute("MERGE (p:Person {email: 'alice@example.com'})") # No duplicates
# Avoid this pattern
forge.execute("CREATE (p:Person {email: 'alice@example.com'})")
forge.execute("CREATE (p:Person {email: 'alice@example.com'})") # Creates duplicate!
4. Work with Arrow Results Directly¶
table = forge.execute("MATCH (p:Person) RETURN p.name AS name, p.age AS age")
# pandas
df = table.to_pandas()
# Polars
import polars as pl
df = pl.from_arrow(table)
# Plain Python iteration
for row in table.to_pylist():
    name = row["name"]
    age = row["age"]
5. Use WHERE for Complex Filtering¶
# Combine conditions with AND / OR in a single WHERE clause
forge.execute("""
MATCH (p:Person)
WHERE p.age > 25 AND p.city = 'NYC'
RETURN p
""")
6. Order Results Before Using LIMIT¶
# Always order when using LIMIT
forge.execute("""
MATCH (p:Person)
RETURN p.name AS name, p.age AS age
ORDER BY p.age DESC
LIMIT 10
""")
# Without ORDER BY, results are non-deterministic
Ranking and Clustering with forge.rank() and forge.cluster()¶
forge.rank() and forge.cluster() run compiled graph algorithms directly — no Cypher
needed, no exporting the graph first. Both return Arrow Tables and are read-only by default.
from graphforge import GraphForge
from graphforge.datasets import load_dataset
forge = GraphForge()
load_dataset(forge, "snap-ego-facebook")
# Score every node by PageRank — returns Arrow Table with a 'score' column
table = forge.rank("Node", by="pagerank")
df = table.to_pandas().sort_values("score", ascending=False)
print(df[["id", "score"]].head(10))
# Restrict to a specific relationship type and direction
table = forge.rank("Node", by="betweenness", via="KNOWS", directed=True)
# Opt-in write-back — stores the score as a node property for Cypher queries
forge.rank("Node", by="pagerank", write_property="pagerank")
top_nodes = forge.execute("""
MATCH (n)
RETURN n.id AS user, n.pagerank AS score
ORDER BY n.pagerank DESC LIMIT 10
""")
print(top_nodes.to_pandas())
# Community detection — returns Arrow Table with a 'community_id' column
table = forge.cluster("Node", by="louvain")
df = table.to_pandas()
print(df.groupby("community_id").size().sort_values(ascending=False).head(5))
# Write communities back and query the structure
forge.cluster("Node", by="louvain", write_property="community")
communities = forge.execute("""
MATCH (n)
RETURN n.community AS community, count(*) AS size
ORDER BY size DESC LIMIT 5
""")
print(communities.to_pandas())
Available algorithms:
| Method | by= value | Category |
|---|---|---|
| forge.rank() | pagerank | Centrality |
| forge.rank() | betweenness | Centrality |
| forge.rank() | closeness | Centrality |
| forge.rank() | degree | Centrality |
| forge.rank() | clustering_coefficient | Structural |
| forge.rank() | triangles | Structural |
| forge.cluster() | louvain | Community |
| forge.cluster() | components | Community |
All methods accept optional via (relationship type), directed, and write_property
parameters. Without write_property, the graph is never mutated.
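For intuition about what a pagerank score represents, here is a minimal power-iteration sketch in plain Python. It is not GraphForge's implementation. The function name, the damping factor of 0.85 (a conventional default), and the fixed iteration count are assumptions made for illustration. The input is the CITES edge list from the citation example.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over directed (source, target) pairs."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for source, target in edges:
        out_links[source].append(target)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is spread evenly over all nodes.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        new_rank = {
            node: (1.0 - damping) / n + damping * dangling / n for node in nodes
        }
        # Each node passes its rank on in equal shares along its outgoing edges.
        for node in nodes:
            for target in out_links[node]:
                new_rank[target] += damping * rank[node] / len(out_links[node])
        rank = new_rank
    return rank

cites = [("P1", "P2"), ("P1", "P4"), ("P3", "P1"), ("P3", "P2")]
scores = pagerank(cites)
for paper in sorted(scores, key=scores.get, reverse=True):
    print(paper, round(scores[paper], 3))
```

The scores sum to 1, and P2 ranks first because it is cited by both P1 and P3. P3 ranks last because nothing cites it.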
Finding Relevant Content with forge.find()¶
forge.find() provides full-text search, vector similarity, and hybrid search over node
properties. It returns an Arrow Table with node properties plus score and matched_on
columns. The index is built automatically on the first call.
Text Search¶
forge = GraphForge("citations/")
# Index built lazily on first call — no setup required
table = forge.find("graph neural networks", label="Paper", limit=10)
df = table.to_pandas()
print(df[["title", "score", "matched_on"]])
# title score matched_on
# 0 Graph Neural Networks 0.924 text
# 1 GNN Applications in NLP 0.781 text
Vector Search (bring your own embeddings)¶
GraphForge stores and queries vectors but does not generate them. Use any embedding model:
import openai
client = openai.OpenAI()
def embed(text: str) -> list[float]:
    return client.embeddings.create(
        input=text, model="text-embedding-3-small"
    ).data[0].embedding
# Store embeddings for each paper
rows = forge.execute("MATCH (n:Paper) RETURN id(n) AS nid, n.abstract AS abstract")
for row in rows.to_pylist():
    if not row["abstract"]:
        continue  # nothing to embed; the embeddings API rejects empty input
    vec = embed(row["abstract"])
    forge.index("Paper", node_id=row["nid"], vector=vec, space="text-embedding-3-small")
# Query by vector similarity
query_vec = embed("scalable graph representation learning")
table = forge.find(vector=query_vec, label="Paper", limit=10)
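Vector search ranks stored embeddings by how close they are to the query vector. This tutorial does not say which distance metric GraphForge uses. Cosine similarity is a common choice for text embeddings, and the sketch below shows it in plain Python with made-up three-dimensional vectors.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

# Toy "embeddings" with invented values; real ones have hundreds of dimensions.
query = [1.0, 0.0, 1.0]
papers = {"close": [0.9, 0.1, 0.8], "far": [0.0, 1.0, 0.0]}
ranked = sorted(papers, key=lambda k: cosine_similarity(query, papers[k]), reverse=True)
print(ranked)  # ['close', 'far']
```

The length check matters in practice: query and stored vectors must come from the same embedding model, which is presumably what the space argument to forge.index() records.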
Hybrid Search¶
Combine text and vector signals in a single call:
table = forge.find("scalable graph learning", label="Paper", vector=query_vec, limit=10)
for row in table.to_pylist():
    print(
        row["title"],
        f"score={row['score']:.3f}",
        f"via={row['matched_on']}",  # "text", "vector", or "text+vector"
    )
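How GraphForge combines the two signals internally is not described here. One widely used approach is reciprocal rank fusion (RRF), sketched below in plain Python as an illustration of the idea rather than the library's actual method. The function name, the constant k=60, and the sample rankings are assumptions.

```python
def reciprocal_rank_fusion(text_ids, vector_ids, k=60):
    """Fuse two ranked ID lists; returns (id, score, matched_on) sorted by score."""
    scores = {}
    sources = {}
    for source, ranking in (("text", text_ids), ("vector", vector_ids)):
        for rank, node_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every ID it contains.
            scores[node_id] = scores.get(node_id, 0.0) + 1.0 / (k + rank)
            sources.setdefault(node_id, []).append(source)
    fused = [
        (node_id, score, "+".join(sources[node_id]))
        for node_id, score in scores.items()
    ]
    return sorted(fused, key=lambda item: item[1], reverse=True)

text_hits = ["P1", "P3"]    # hypothetical text-search ranking
vector_hits = ["P3", "P2"]  # hypothetical vector-search ranking
results = reciprocal_rank_fusion(text_hits, vector_hits)
for node_id, score, matched_on in results:
    print(node_id, f"score={score:.4f}", f"via={matched_on}")
```

P3 comes out on top because it appears in both rankings, and its label is text+vector, matching the matched_on values shown above.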
Using Results in Cypher¶
Every row in the result table has an id column — pass it to execute() for follow-up
graph traversals:
table = forge.find("graph neural networks", label="Paper", limit=5)
top_id = table.column("id")[0].as_py()
neighbours = forge.execute("""
MATCH (p:Paper)-[:CITES]->(cited:Paper)
WHERE id(p) = $nid
RETURN cited.title AS title, cited.year AS year
ORDER BY cited.year DESC
""", {"nid": top_id})
print(neighbours.to_pandas())
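The $nid parameter above also suggests a safer way to write the loading loops from the citation example, which build their queries with f-strings. Assuming execute() accepts parameters in write clauses such as MERGE and SET the same way it does in WHERE (this tutorial does not confirm it), the author loop could be sketched as follows. MERGE_AUTHOR and run_for_each are illustrative names, and a recording lambda stands in for forge.execute so the block runs on its own.

```python
from typing import Any, Callable, Iterable, Mapping

# Parameterised version of the author MERGE; $name and $affiliation are
# filled from the dictionary passed alongside the query.
MERGE_AUTHOR = """
MERGE (a:Author {name: $name})
SET a.affiliation = $affiliation
"""

def run_for_each(
    execute: Callable[[str, dict], Any],
    query: str,
    rows: Iterable[Mapping[str, Any]],
) -> int:
    """Call execute(query, params) once per row and return the number of calls."""
    count = 0
    for row in rows:
        execute(query, dict(row))
        count += 1
    return count

authors = [
    {"name": "Alice Smith", "affiliation": "MIT"},
    {"name": "Bob Jones", "affiliation": "Stanford"},
    {"name": "Charlie Brown", "affiliation": "MIT"},
]

# Stand-in for forge.execute so the sketch runs without a graph;
# in the tutorial you would pass forge.execute instead.
calls = []
run_for_each(lambda query, params: calls.append((query, params)), MERGE_AUTHOR, authors)
print(len(calls))  # 3
```

Besides keeping the query text constant, parameters avoid quoting problems: a name such as O'Brien would terminate the string literal early in the f-string version.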
Next Steps¶
Congratulations! You've learned the fundamentals of GraphForge.
Learn More¶
- API Reference — Complete Python API documentation
- Cypher Guide — Full openCypher subset reference
- Analytics Integration — Arrow, pandas, Polars, rank, cluster, find
- Architecture Overview — System design and internals
Try These Exercises¶
- Social Network: Build a graph of friends and their relationships. Use forge.rank() to find the most influential people.
- Knowledge Graph: Extract entities from a document and link them with relationships. Use forge.find() to retrieve relevant context.
- Citation Analysis: Load a set of papers and citations. Rank papers by betweenness centrality to find bridging works.
- Recommendation System: Build a user-item graph and find similar users or items using forge.cluster() to identify taste communities.
- Data Lineage: Track transformations in a data pipeline and query dependencies.
Join the Community¶
- Report issues on GitHub
- Read the Requirements Document for design rationale
- Explore example notebooks (coming soon!)