Cypher Script Loading¶

GraphForge can load .cypher and .cql script files commonly used for Neo4j example datasets and data imports. The CypherLoader automatically handles schema-related statements that aren't needed for embedded database use.

Overview¶

The CypherLoader is designed to:

- Parse multi-statement Cypher scripts
- Execute data manipulation statements (CREATE, MERGE, SET, etc.)
- Automatically skip schema operations that aren't required for embedded use
- Provide transparent logging of skipped statements

Basic Usage¶

Loading a Script File¶

from graphforge import GraphForge
from graphforge.datasets.loaders import CypherLoader
from pathlib import Path

# Create a GraphForge instance
gf = GraphForge()

# Create a loader and load the script
loader = CypherLoader()
loader.load(gf, Path("data/movies.cypher"))

# Query the loaded data
results = gf.execute("MATCH (m:Movie) RETURN m.title")

Example Script¶

// Create constraint (automatically skipped)
CREATE CONSTRAINT movie_title IF NOT EXISTS
FOR (m:Movie) REQUIRE m.title IS UNIQUE;

// Create index (automatically skipped)
CREATE INDEX person_name IF NOT EXISTS FOR (p:Person) ON (p.name);

// Create data (executed)
CREATE (TheMatrix:Movie {title:'The Matrix', released:1999});
CREATE (Keanu:Person {name:'Keanu Reeves', born:1964});

// Create relationship (executed)
MATCH (keanu:Person {name:'Keanu Reeves'}), (matrix:Movie {title:'The Matrix'})
CREATE (keanu)-[:ACTED_IN]->(matrix);

Supported Statements¶

✅ Fully Supported (Executed)¶

The following standard Cypher clauses are executed:

CREATE - Node and relationship creation
MERGE - Create or match patterns with ON CREATE/MATCH SET
MATCH - Pattern matching
SET - Property updates
DELETE / DETACH DELETE - Node and relationship deletion
REMOVE - Property and label removal
WITH - Query chaining
WHERE - Filtering
RETURN - Result projection
ORDER BY, LIMIT, SKIP - Result ordering and pagination
UNWIND - List expansion

⏭️ Automatically Skipped (Schema Operations)¶

The following statements are automatically skipped because they're not needed for embedded database use:

CREATE CONSTRAINT / DROP CONSTRAINT¶

CREATE CONSTRAINT movie_title IF NOT EXISTS
FOR (m:Movie) REQUIRE m.title IS UNIQUE;

DROP CONSTRAINT movie_title IF EXISTS;

Why skipped: GraphForge is an embedded database designed for analysis and prototyping, not production use. Schema validation can be done in Python application code if needed.

Supported constraint types (all skipped):

- UNIQUE constraints: uniqueness validation
- EXISTS / IS NOT NULL constraints: required properties
- NODE KEY constraints: composite keys

Impact: None. Data loads successfully without constraints. If you need validation, implement it in your Python code.
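
If uniqueness matters for your analysis, a check along these lines could stand in for a skipped UNIQUE constraint. This is a minimal sketch: find_duplicates is a hypothetical helper, not part of GraphForge, and it works on plain Python values that you would extract from query results yourself.

```python
from collections import Counter
from typing import Hashable, Iterable


def find_duplicates(values: Iterable[Hashable]) -> list:
    """Return the values that occur more than once, in first-seen order."""
    counts = Counter(values)
    return [value for value, count in counts.items() if count > 1]


# Hypothetical input: titles collected from a query such as
# MATCH (m:Movie) RETURN m.title
titles = ["The Matrix", "The Matrix Reloaded", "The Matrix"]
duplicates = find_duplicates(titles)
if duplicates:
    print(f"Duplicate Movie.title values: {duplicates}")
```

Running a check like this right after loading keeps the validation close to the import step, which is where a constraint violation would have surfaced in Neo4j.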

Related: Issue #59

CREATE INDEX / DROP INDEX¶

CREATE INDEX person_name IF NOT EXISTS FOR (p:Person) ON (p.name);
CREATE INDEX movie_released FOR (m:Movie) ON (m.released);

DROP INDEX person_name IF EXISTS;

Why skipped: SQLite (GraphForge's storage backend) automatically creates indexes as needed. Explicit index creation is not exposed in the API.

Supported index types (all skipped):

- Single property indexes
- Composite indexes
- Full-text indexes
- Spatial indexes

Impact: Minor. Queries work but may be slower on very large datasets (>1M nodes). For small to medium datasets (<1M nodes), automatic SQLite indexing is sufficient.

Related: Issue #62

CALL Procedures¶

CALL db.labels() YIELD label;
CALL apoc.periodic.iterate(...);
CALL db.index.fulltext.createIndex(...);

Why skipped: GraphForge doesn't have a stored procedure system. Most operations that would use procedures can be done more naturally in Python.

Common procedure patterns:

- CALL db.labels() → Use Python: gf.get_labels()
- CALL db.relationshipTypes() → Use Python: gf.get_relationship_types()
- CALL apoc.* → Implement in Python or use NetworkX

Impact: Low. Procedures are rarely used in data loading scripts. Most CALL statements are for database administration or advanced operations that aren't needed for data imports.

Related: Issue #63

Script Format¶

Statement Separation¶

Statements are separated by semicolons (;):

CREATE (a:Person {name: 'Alice'});
CREATE (b:Person {name: 'Bob'});
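// Variables do not carry over between statements, so match the nodes again
MATCH (a:Person {name: 'Alice'}), (b:Person {name: 'Bob'})
CREATE (a)-[:KNOWS]->(b);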
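
To make the splitting rule concrete, here is a minimal sketch of how a script could be divided on semicolons while leaving semicolons inside string literals alone. It is an illustration only, not GraphForge's actual parser: split_statements is a hypothetical name, and the sketch ignores comments and backtick-quoted identifiers.

```python
def split_statements(script: str) -> list[str]:
    """Split a Cypher script on semicolons that are outside string literals."""
    statements = []
    current = []
    quote = None  # quote character of the string we are inside, if any
    i = 0
    while i < len(script):
        ch = script[i]
        if quote:
            current.append(ch)
            if ch == "\\" and i + 1 < len(script):
                # Keep an escaped character (for example \') with its backslash
                current.append(script[i + 1])
                i += 1
            elif ch == quote:
                quote = None
        elif ch in ("'", '"'):
            quote = ch
            current.append(ch)
        elif ch == ";":
            statement = "".join(current).strip()
            if statement:  # drop empty statements such as ";;"
                statements.append(statement)
            current = []
        else:
            current.append(ch)
        i += 1
    tail = "".join(current).strip()
    if tail:
        statements.append(tail)
    return statements
```

Because the sketch tracks quote state character by character, a property value such as 'a;b' stays inside one statement, and runs of repeated semicolons produce no empty entries.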

Comments¶

Single-line comments are supported:

// This is a comment
CREATE (n:Node {id: 1}); // Inline comment

URLs are preserved: The loader correctly handles URLs with // in property values:

CREATE (w:Website {url: 'https://example.com'}); // This comment is removed
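
One way to get this behavior is to look for // only while outside a quoted string. The sketch below illustrates the idea on a single line; strip_line_comment is a hypothetical helper and may differ from what the loader does internally.

```python
def strip_line_comment(line: str) -> str:
    """Remove a trailing // comment unless the // sits inside a string literal."""
    quote = None  # quote character of the string we are inside, if any
    i = 0
    while i < len(line):
        ch = line[i]
        if quote:
            if ch == "\\":
                i += 1  # skip the escaped character
            elif ch == quote:
                quote = None
        elif ch in ("'", '"'):
            quote = ch
        elif line.startswith("//", i):
            # First // outside a string starts the comment
            return line[:i].rstrip()
        i += 1
    return line.rstrip()
```

Applied to the Website example, the // inside 'https://example.com' is skipped because the scanner is still inside the string, while the trailing comment is cut.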

Multi-line Statements¶

Statements can span multiple lines:

CREATE (matrix:Movie {
    title: 'The Matrix',
    released: 1999,
    tagline: 'Welcome to the Real World'
});

Empty Statements¶

Multiple semicolons and empty statements are handled gracefully:

CREATE (a:A);;;
CREATE (b:B);;
// Result: Two nodes created, empty statements ignored

Logging¶

The CypherLoader provides transparent logging at different levels:

DEBUG Level¶

View skipped statements:

import logging
logging.basicConfig(level=logging.DEBUG)

loader = CypherLoader()
loader.load(gf, Path("movies.cypher"))

Output:

DEBUG:graphforge.datasets.loaders.cypher:Skipping unsupported statement: CREATE CONSTRAINT movie_title IF NOT EXISTS...
DEBUG:graphforge.datasets.loaders.cypher:Skipping unsupported statement: CREATE INDEX person_name IF NOT EXISTS...

INFO Level¶

View execution summary:

import logging
logging.basicConfig(level=logging.INFO)

loader = CypherLoader()
loader.load(gf, Path("movies.cypher"))

Output:

INFO:graphforge.datasets.loaders.cypher:Loading Cypher script: movies.cypher
INFO:graphforge.datasets.loaders.cypher:Loaded movies.cypher: 15 statements executed, 3 skipped
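
If calling logging.basicConfig at DEBUG level produces too much output from other libraries, the loader's logger can be configured on its own. The sketch below uses only the standard logging module; the logger name is taken from the sample output, and the handler setup is one possible arrangement rather than a GraphForge requirement.

```python
import logging

# Logger name as it appears in the loader's log lines
loader_logger = logging.getLogger("graphforge.datasets.loaders.cypher")
loader_logger.setLevel(logging.DEBUG)

# Send loader messages to stderr in the same LEVEL:name:message layout
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(levelname)s:%(name)s:%(message)s"))
loader_logger.addHandler(handler)

# Avoid duplicate lines if the root logger also has a handler
loader_logger.propagate = False
```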

Error Handling¶

File Not Found¶

loader.load(gf, Path("nonexistent.cypher"))
# Raises: FileNotFoundError: Cypher script not found: nonexistent.cypher

Syntax Errors¶

If a statement has invalid Cypher syntax, loading stops with a detailed error:

loader.load(gf, Path("invalid.cypher"))
# Raises: ValueError: Cypher execution error: No terminal matches 'X' in the current parser context

The error message includes:

- The line and column where the error occurred
- The problematic statement (first 100 characters)
- The expected tokens
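
When loading several scripts in a row, it can be convenient to report a failure and carry on instead of letting the exception propagate. The following sketch wraps the two documented exceptions; try_load is a hypothetical helper, and its loader and gf arguments are left untyped so the example does not depend on GraphForge being importable.

```python
from pathlib import Path


def try_load(loader, gf, path: Path) -> bool:
    """Load a script and report failures instead of raising.

    `loader` is expected to be a CypherLoader and `gf` a GraphForge instance.
    Returns True on success and False if the load failed.
    """
    try:
        loader.load(gf, path)
    except FileNotFoundError as exc:
        print(f"Script missing: {exc}")
        return False
    except ValueError as exc:
        # Raised for invalid Cypher; loading stops at this point
        print(f"Script failed: {exc}")
        return False
    return True
```

A caller could then loop over a list of paths and collect the ones for which try_load returned False.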

Best Practices¶

1. Use Comments Liberally¶

Document your scripts for maintainability:

// ============================================
// Movie Graph Dataset
// Source: Neo4j Examples
// Last Updated: 2024-01-15
// ============================================

// Create movies
CREATE (TheMatrix:Movie {title:'The Matrix', released:1999});
CREATE (TheMatrixReloaded:Movie {title:'The Matrix Reloaded', released:2003});

2. Organize Scripts by Section¶

Group statements into logical sections:

// ======================
// Schema (auto-skipped)
// ======================
CREATE CONSTRAINT movie_title FOR (m:Movie) REQUIRE m.title IS UNIQUE;

// ======================
// Nodes
// ======================
CREATE (m1:Movie {title:'The Matrix'});
CREATE (m2:Movie {title:'The Matrix Reloaded'});

// ======================
// Relationships
// ======================
MATCH (m1:Movie {title:'The Matrix'}), (m2:Movie {title:'The Matrix Reloaded'})
CREATE (m1)-[:SEQUEL]->(m2);

3. Use Variables for Complex Patterns¶

Make scripts more readable:

// Create nodes
CREATE (neo:Character {name:'Neo'});
CREATE (morpheus:Character {name:'Morpheus'});
CREATE (matrix:Movie {title:'The Matrix'});

// Create relationships using variables
MATCH (neo:Character {name:'Neo'}),
      (morpheus:Character {name:'Morpheus'}),
      (matrix:Movie {title:'The Matrix'})
CREATE (neo)-[:APPEARS_IN]->(matrix),
       (morpheus)-[:APPEARS_IN]->(matrix),
       (morpheus)-[:MENTORS]->(neo);

4. Check Logs for Skipped Statements¶

Enable logging to understand what's being skipped:

import logging
logging.basicConfig(
    level=logging.DEBUG,
    format='%(levelname)s: %(message)s'
)

loader = CypherLoader()
loader.load(gf, Path("script.cypher"))
# Review output to confirm expected behavior

Performance Tips¶

Large Scripts¶

For scripts with thousands of statements:

Use persistence: Store to SQLite instead of in-memory

gf = GraphForge("database.db")  # Use SQLite backend
loader.load(gf, Path("large_script.cypher"))

Batch related operations: Group CREATE statements together

// Good: Batch creates
CREATE (a:Person {name: 'Alice'}),
       (b:Person {name: 'Bob'}),
       (c:Person {name: 'Charlie'});

// Less efficient: Individual creates
CREATE (a:Person {name: 'Alice'});
CREATE (b:Person {name: 'Bob'});
CREATE (c:Person {name: 'Charlie'});
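
When the data starts out in Python, a batched statement can be generated rather than written by hand. This is a rough sketch under stated assumptions: cypher_literal and batched_create are hypothetical helpers, only str, int, float, and bool property values are handled, and backslash escaping of quotes follows standard Cypher string syntax, which has not been checked against GraphForge's parser.

```python
def cypher_literal(value) -> str:
    """Render a str, int, float, or bool as a Cypher literal."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return repr(value)
    if isinstance(value, str):
        escaped = value.replace("\\", "\\\\").replace("'", "\\'")
        return f"'{escaped}'"
    raise TypeError(f"Unsupported property type: {type(value).__name__}")


def batched_create(label: str, rows: list[dict]) -> str:
    """Build one CREATE statement containing a node pattern per row."""
    if not rows:
        raise ValueError("rows must not be empty")
    patterns = []
    for row in rows:
        props = ", ".join(f"{key}: {cypher_literal(val)}" for key, val in row.items())
        patterns.append(f"(:{label} {{{props}}})")
    return "CREATE " + ",\n       ".join(patterns) + ";"


print(batched_create("Person", [{"name": "Alice"}, {"name": "Bob"}]))
```

The generated text can be written to a .cypher file and loaded like any other script.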

Use MERGE efficiently: MERGE can be slower than CREATE because it must first search for an existing match

// Use CREATE when you know nodes don't exist
CREATE (n:Person {id: 123, name: 'Alice'});

// Use MERGE when you need idempotency
MERGE (n:Person {id: 123})
ON CREATE SET n.name = 'Alice', n.created = timestamp()
ON MATCH SET n.accessed = timestamp();

Comparison with Neo4j¶

Feature	GraphForge CypherLoader	Neo4j Cypher Shell
Data manipulation	✅ Full support	✅ Full support
CONSTRAINTS	⏭️ Skipped	✅ Enforced
INDEXES	⏭️ Skipped	✅ Created
CALL procedures	⏭️ Skipped	✅ Executed
Multi-statement	✅ Supported	✅ Supported
Comments	✅ Supported	✅ Supported
File size limit	✅ No limit	⚠️ Memory dependent
Error recovery	❌ Stop on error	⚠️ Configurable

Troubleshooting¶

Problem: Script loads but data is missing¶

Possible causes:

1. Statements were skipped (check DEBUG logs)
2. Pattern matching failed (node variables not bound)
3. Constraints in the original script prevented duplicates in Neo4j; they are skipped here, so results can differ

Solution:

import logging
logging.basicConfig(level=logging.DEBUG)

loader = CypherLoader()
loader.load(gf, Path("script.cypher"))

# Verify loaded data
movie_count = gf.execute("MATCH (m:Movie) RETURN count(m) AS count")
person_count = gf.execute("MATCH (p:Person) RETURN count(p) AS count")
print(f"Movies: {movie_count[0]['count'].value}")
print(f"People: {person_count[0]['count'].value}")

Problem: Syntax errors for valid Neo4j scripts¶

Possible causes:

1. GraphForge doesn't support advanced syntax (e.g., label expressions)
2. Script uses Neo4j-specific extensions

Solution: Check OpenCypher Compatibility for supported features. Common patterns to avoid:

- Label expressions: :A|B (union), :!A (negation)
- Map projections: node {.*} or node {.prop1, .prop2}
- Advanced CALL syntax

Problem: Performance is slow¶

Possible causes:

1. Using in-memory storage for large datasets
2. Many individual CREATE statements instead of batched creates
3. Inefficient MERGE patterns

Solutions:

1. Use SQLite persistence: GraphForge("database.db")
2. Batch CREATE statements: CREATE (a), (b), (c)
3. Optimize MERGE: Use CREATE when appropriate

API Reference¶

CypherLoader¶

class CypherLoader(DatasetLoader):
    """Loader for Cypher script files (.cypher, .cql)."""

    def load(self, gf: GraphForge, path: Path) -> None:
        """Load Cypher script file into GraphForge instance.

        Args:
            gf: GraphForge instance to load data into
            path: Path to .cypher or .cql file

        Raises:
            FileNotFoundError: If script file doesn't exist
            ValueError: If script contains syntax errors
        """

    def get_format(self) -> str:
        """Return the format name this loader handles.

        Returns:
            "cypher"
        """

Constants¶

CypherLoader.SKIP_PREFIXES: ClassVar[list[str]] = [
    "CREATE CONSTRAINT",
    "DROP CONSTRAINT",
    "CREATE INDEX",
    "DROP INDEX",
    "CALL",
]
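
A prefix list like this suggests a simple check against the start of each statement. The sketch below shows one plausible reading: should_skip is a hypothetical function, and the whitespace and case normalization is an assumption rather than documented loader behavior.

```python
SKIP_PREFIXES = [
    "CREATE CONSTRAINT",
    "DROP CONSTRAINT",
    "CREATE INDEX",
    "DROP INDEX",
    "CALL",
]


def should_skip(statement: str) -> bool:
    """Return True if the statement starts with one of the skip prefixes."""
    # Collapse whitespace and upper-case so "create   index" also matches
    normalized = " ".join(statement.split()).upper()
    return normalized.startswith(tuple(SKIP_PREFIXES))
```

In this sketch a statement such as CREATE FULLTEXT INDEX ... does not match any prefix; whether the real loader treats it differently is not covered here.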

Related Documentation¶

Dataset Overview - Complete dataset loading guide
Neo4j Examples - Neo4j example datasets
OpenCypher Compatibility - Supported Cypher features
API Reference - Complete API documentation

Examples¶

See the examples/ directory for complete examples:

- examples/05_load_datasets.py: Dataset loading examples
- examples/06_cypher_scripts.py (coming soon): Cypher script loading examples