Skip to content

Cypher Script Loading

GraphForge can load .cypher and .cql script files commonly used for Neo4j example datasets and data imports. The CypherLoader automatically handles schema-related statements that aren't needed for embedded database use.

Overview

The CypherLoader is designed to: - Parse multi-statement Cypher scripts - Execute data manipulation statements (CREATE, MERGE, SET, etc.) - Automatically skip schema operations that aren't required for embedded use - Provide transparent logging of skipped statements

Basic Usage

Loading a Script File

from graphforge import GraphForge
from graphforge.datasets.loaders import CypherLoader
from pathlib import Path

# Create a GraphForge instance
gf = GraphForge()

# Create a loader and load the script
loader = CypherLoader()
loader.load(gf, Path("data/movies.cypher"))

# Query the loaded data
results = gf.execute("MATCH (m:Movie) RETURN m.title")

Example Script

// Create constraint (automatically skipped)
CREATE CONSTRAINT movie_title IF NOT EXISTS
FOR (m:Movie) REQUIRE m.title IS UNIQUE;

// Create index (automatically skipped)
CREATE INDEX person_name IF NOT EXISTS FOR (p:Person) ON (p.name);

// Create data (executed)
CREATE (TheMatrix:Movie {title:'The Matrix', released:1999});
CREATE (Keanu:Person {name:'Keanu Reeves', born:1964});

// Create relationship (executed)
MATCH (keanu:Person {name:'Keanu Reeves'}), (matrix:Movie {title:'The Matrix'})
CREATE (keanu)-[:ACTED_IN]->(matrix);

Supported Statements

✅ Fully Supported (Executed)

All standard Cypher query statements are executed:

  • CREATE - Node and relationship creation
  • MERGE - Create or match patterns with ON CREATE/MATCH SET
  • MATCH - Pattern matching
  • SET - Property updates
  • DELETE / DETACH DELETE - Node and relationship deletion
  • REMOVE - Property and label removal
  • WITH - Query chaining
  • WHERE - Filtering
  • RETURN - Result projection
  • ORDER BY, LIMIT, SKIP - Result ordering and pagination
  • UNWIND - List expansion

⏭️ Automatically Skipped (Schema Operations)

The following statements are automatically skipped because they're not needed for embedded database use:

CREATE CONSTRAINT / DROP CONSTRAINT

CREATE CONSTRAINT movie_title IF NOT EXISTS
FOR (m:Movie) REQUIRE m.title IS UNIQUE;

DROP CONSTRAINT movie_title IF EXISTS;

Why skipped: GraphForge is an embedded database designed for analysis and prototyping, not production use. Schema validation can be done in Python application code if needed.

Supported constraint types (all skipped): - UNIQUE constraints - Uniqueness validation - EXISTS / IS NOT NULL constraints - Required properties - NODE KEY constraints - Composite keys

Impact: None. Data loads successfully without constraints. If you need validation, implement it in your Python code.

Related: Issue #59

CREATE INDEX / DROP INDEX

CREATE INDEX person_name IF NOT EXISTS FOR (p:Person) ON (p.name);
CREATE INDEX movie_released FOR (m:Movie) ON (m.released);

DROP INDEX person_name IF EXISTS;

Why skipped: SQLite (GraphForge's storage backend) automatically creates indexes as needed. Explicit index creation is not exposed in the API.

Supported index types (all skipped): - Single property indexes - Composite indexes - Full-text indexes - Spatial indexes

Impact: Minor. Queries work but may be slower on very large datasets (>1M nodes). For small to medium datasets (<1M nodes), automatic SQLite indexing is sufficient.

Related: Issue #62

CALL Procedures

CALL db.labels() YIELD label;
CALL apoc.periodic.iterate(...);
CALL db.index.fulltext.createIndex(...);

Why skipped: GraphForge doesn't have a stored procedure system. Most operations that would use procedures can be done more naturally in Python.

Common procedure patterns: - CALL db.labels() → Use Python: gf.get_labels() - CALL db.relationshipTypes() → Use Python: gf.get_relationship_types() - CALL apoc.* → Implement in Python or use NetworkX

Impact: Low. Procedures are rarely used in data loading scripts. Most CALL statements are for database administration or advanced operations that aren't needed for data imports.

Related: Issue #63

Script Format

Statement Separation

Statements are separated by semicolons (;):

CREATE (a:Person {name: 'Alice'});
CREATE (b:Person {name: 'Bob'});
CREATE (a)-[:KNOWS]->(b);

Comments

Single-line comments are supported:

// This is a comment
CREATE (n:Node {id: 1}); // Inline comment

URLs are preserved: The loader correctly handles URLs with // in property values:

CREATE (w:Website {url: 'https://example.com'}); // This comment is removed

Multi-line Statements

Statements can span multiple lines:

CREATE (matrix:Movie {
    title: 'The Matrix',
    released: 1999,
    tagline: 'Welcome to the Real World'
});

Empty Statements

Multiple semicolons and empty statements are handled gracefully:

CREATE (a:A);;;
CREATE (b:B);;
// Result: Two nodes created, empty statements ignored

Logging

The CypherLoader provides transparent logging at different levels:

DEBUG Level

View skipped statements:

import logging
logging.basicConfig(level=logging.DEBUG)

loader = CypherLoader()
loader.load(gf, Path("movies.cypher"))

Output:

DEBUG:graphforge.datasets.loaders.cypher:Skipping unsupported statement: CREATE CONSTRAINT movie_title IF NOT EXISTS...
DEBUG:graphforge.datasets.loaders.cypher:Skipping unsupported statement: CREATE INDEX person_name IF NOT EXISTS...

INFO Level

View execution summary:

import logging
logging.basicConfig(level=logging.INFO)

loader = CypherLoader()
loader.load(gf, Path("movies.cypher"))

Output:

INFO:graphforge.datasets.loaders.cypher:Loading Cypher script: movies.cypher
INFO:graphforge.datasets.loaders.cypher:Loaded movies.cypher: 15 statements executed, 3 skipped

Error Handling

File Not Found

loader.load(gf, Path("nonexistent.cypher"))
# Raises: FileNotFoundError: Cypher script not found: nonexistent.cypher

Syntax Errors

If a statement has invalid Cypher syntax, loading stops with a detailed error:

loader.load(gf, Path("invalid.cypher"))
# Raises: ValueError: Cypher execution error: No terminal matches 'X' in the current parser context

The error message includes: - The line and column where the error occurred - The problematic statement (first 100 characters) - The expected tokens

Best Practices

1. Use Comments Liberally

Document your scripts for maintainability:

// ============================================
// Movie Graph Dataset
// Source: Neo4j Examples
// Last Updated: 2024-01-15
// ============================================

// Create movies
CREATE (TheMatrix:Movie {title:'The Matrix', released:1999});
CREATE (TheMatrixReloaded:Movie {title:'The Matrix Reloaded', released:2003});

Organize scripts by logical sections:

// ======================
// Schema (auto-skipped)
// ======================
CREATE CONSTRAINT movie_title FOR (m:Movie) REQUIRE m.title IS UNIQUE;

// ======================
// Nodes
// ======================
CREATE (m1:Movie {title:'The Matrix'});
CREATE (m2:Movie {title:'The Matrix Reloaded'});

// ======================
// Relationships
// ======================
MATCH (m1:Movie {title:'The Matrix'}), (m2:Movie {title:'The Matrix Reloaded'})
CREATE (m1)-[:SEQUEL]->(m2);

3. Use Variables for Complex Patterns

Make scripts more readable:

// Create nodes
CREATE (neo:Character {name:'Neo'});
CREATE (morpheus:Character {name:'Morpheus'});
CREATE (matrix:Movie {title:'The Matrix'});

// Create relationships using variables
MATCH (neo:Character {name:'Neo'}),
      (morpheus:Character {name:'Morpheus'}),
      (matrix:Movie {title:'The Matrix'})
CREATE (neo)-[:APPEARS_IN]->(matrix),
       (morpheus)-[:APPEARS_IN]->(matrix),
       (morpheus)-[:MENTORS]->(neo);

4. Check Logs for Skipped Statements

Enable logging to understand what's being skipped:

import logging
logging.basicConfig(
    level=logging.DEBUG,
    format='%(levelname)s: %(message)s'
)

loader = CypherLoader()
loader.load(gf, Path("script.cypher"))
# Review output to confirm expected behavior

Performance Tips

Large Scripts

For scripts with thousands of statements:

  1. Use persistence: Store to SQLite instead of in-memory
gf = GraphForge("database.db")  # Use SQLite backend
loader.load(gf, Path("large_script.cypher"))
  1. Batch related operations: Group CREATE statements together
// Good: Batch creates
CREATE (a:Person {name: 'Alice'}),
       (b:Person {name: 'Bob'}),
       (c:Person {name: 'Charlie'});

// Less efficient: Individual creates
CREATE (a:Person {name: 'Alice'});
CREATE (b:Person {name: 'Bob'});
CREATE (c:Person {name: 'Charlie'});
  1. Use MERGE efficiently: MERGE can be slower than CREATE
// Use CREATE when you know nodes don't exist
CREATE (n:Person {id: 123, name: 'Alice'});

// Use MERGE when you need idempotency
MERGE (n:Person {id: 123})
ON CREATE SET n.name = 'Alice', n.created = timestamp()
ON MATCH SET n.accessed = timestamp();

Comparison with Neo4j

Feature GraphForge CypherLoader Neo4j Cypher Shell
Data manipulation ✅ Full support ✅ Full support
CONSTRAINTS ⏭️ Skipped ✅ Enforced
INDEXES ⏭️ Skipped ✅ Created
CALL procedures ⏭️ Skipped ✅ Executed
Multi-statement ✅ Supported ✅ Supported
Comments ✅ Supported ✅ Supported
File size limit ✅ No limit ⚠️ Memory dependent
Error recovery ❌ Stop on error ⚠️ Configurable

Troubleshooting

Problem: Script loads but data is missing

Possible causes: 1. Statements were skipped (check DEBUG logs) 2. Pattern matching failed (node variables not bound) 3. Constraints in original script prevented duplicates

Solution:

import logging
logging.basicConfig(level=logging.DEBUG)

loader = CypherLoader()
loader.load(gf, Path("script.cypher"))

# Verify loaded data
movie_count = gf.execute("MATCH (m:Movie) RETURN count(m) AS count")
person_count = gf.execute("MATCH (p:Person) RETURN count(p) AS count")
print(f"Movies: {movie_count[0]['count'].value}")
print(f"People: {person_count[0]['count'].value}")

Problem: Syntax errors for valid Neo4j scripts

Possible causes: 1. GraphForge doesn't support advanced syntax (e.g., label expressions) 2. Script uses Neo4j-specific extensions

Solution: Check OpenCypher Compatibility for supported features. Common patterns to avoid: - Label expressions: :A|B (union), !:A (negation) - Map projections: node {.*} or node {.prop1, .prop2} - Advanced CALL syntax

Problem: Performance is slow

Possible causes: 1. Using in-memory storage for large datasets 2. Many individual CREATE statements instead of batched creates 3. Inefficient MERGE patterns

Solutions: 1. Use SQLite persistence: GraphForge("database.db") 2. Batch CREATE statements: CREATE (a), (b), (c) 3. Optimize MERGE: Use CREATE when appropriate

API Reference

CypherLoader

class CypherLoader(DatasetLoader):
    """Loader for Cypher script files (.cypher, .cql)."""

    def load(self, gf: GraphForge, path: Path) -> None:
        """Load Cypher script file into GraphForge instance.

        Args:
            gf: GraphForge instance to load data into
            path: Path to .cypher or .cql file

        Raises:
            FileNotFoundError: If script file doesn't exist
            ValueError: If script contains syntax errors
        """

    def get_format(self) -> str:
        """Return the format name this loader handles.

        Returns:
            "cypher"
        """

Constants

CypherLoader.SKIP_PREFIXES: ClassVar[list[str]] = [
    "CREATE CONSTRAINT",
    "DROP CONSTRAINT",
    "CREATE INDEX",
    "DROP INDEX",
    "CALL",
]

GitHub Issues

Examples

See examples/ directory for complete examples: - examples/05_load_datasets.py - Dataset loading examples - Coming soon: examples/06_cypher_scripts.py - Cypher script loading examples