Lightweight openCypher-Compatible Graph Engine¶
Requirements Document (Draft)¶
1. Purpose¶
This document defines the functional and non-functional requirements for a lightweight, embedded, openCypher-compatible graph engine designed specifically for research, investigative, and analytical workflows in Python-centric data science and machine learning environments.
This project implements a declared subset of the openCypher specification, validated via the openCypher Technology Compatibility Kit (TCK), rather than claiming full language coverage.
The system is intentionally scoped to support graph materialization and graph analytics as intermediate analytical steps, not as a long-lived production database. It provides a standardized, portable, and semantically correct way to work with graphs during information extraction, investigation, and exploratory analysis.
2. Standards & Compatibility¶
2.1 openCypher Alignment¶
The engine MUST:

- Parse and validate queries using the openCypher grammar
- Follow openCypher semantic rules for pattern matching, filtering, and projection
- Maintain compatibility with the openCypher Technology Compatibility Kit (TCK) for supported features
The engine MUST NOT:

- Introduce proprietary Cypher extensions in the core language
- Silently accept unsupported syntax or semantics
2.2 TCK Compliance Model¶
- The project SHALL define a clear TCK feature coverage matrix
- Each openCypher feature SHALL be explicitly categorized as:
  - Supported
  - Unsupported (with defined failure behavior)
- Unsupported features MUST:
  - Fail deterministically
  - Produce clear, descriptive, spec-aligned errors
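A coverage matrix entry can be very small. The Python sketch below shows one possible shape; `FeatureStatus`, `CoverageEntry`, and the feature strings are illustrative names, not part of any defined API or the actual TCK naming.

```python
# Illustrative sketch of a TCK coverage matrix entry; not a defined API.
from dataclasses import dataclass
from enum import Enum

class FeatureStatus(Enum):
    SUPPORTED = "supported"
    UNSUPPORTED = "unsupported"  # must fail deterministically

@dataclass(frozen=True)
class CoverageEntry:
    feature: str                         # e.g. a TCK feature identifier
    status: FeatureStatus
    failure_behavior: str | None = None  # required when UNSUPPORTED

COVERAGE_MATRIX = [
    CoverageEntry("match-basic", FeatureStatus.SUPPORTED),
    CoverageEntry(
        "merge-basic",
        FeatureStatus.UNSUPPORTED,
        failure_behavior="raises a validation error naming the construct",
    ),
]
```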
3. Design Principles¶
- Embedded-first (no server or daemon)
- Local-first (single-node execution)
- Graph-native execution (no relational joins)
- Spec-driven correctness over performance
- Deterministic and reproducible results
- Inspectable storage and execution behavior
- Python-first developer experience
The design philosophy mirrors SQLite: minimal operational overhead, stable APIs, and replaceable internals.
4. Intended Usage & Scope¶
4.1 What This Project Is¶
This system provides a graph workbench for:

- Materializing extracted entities and relationships into a property graph
- Iteratively refining and revising that graph
- Querying and analyzing graph structure using openCypher
It is designed to live inside Python workflows such as:

- Notebooks
- Research scripts
- Agentic or LLM-driven pipelines
4.2 What This Project Is Not¶
This system is explicitly NOT:

- A data ingestion platform
- An information extraction system
- A production graph database
- A graph-serving backend
- A distributed or multi-tenant service
5. Canonical Workflow Pattern (Scoped)¶
This project operates exclusively as an intermediate analytical layer.
Upstream Context (Out of Scope)¶
The following steps are assumed to occur outside this system:
1. Data ingestion from structured or unstructured sources
2. Entity and relationship extraction (including probabilistic or noisy outputs)
No assumptions are made about how entities or relationships are produced.
Core Responsibility (In Scope)¶
3. Graph Materialization¶
The system MUST support:

- Creation of nodes and relationships from extracted data
- Iterative updates, corrections, and revisions
- Durable but disposable graph persistence
- Multiple experimental or competing graph states
Graphs at this stage may be incomplete, inconsistent, or exploratory.
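For illustration, materialization through a hypothetical embedded Python API might look like the following; `GraphDB`, `create_node`, and `create_relationship` are placeholder names, not a committed interface.

```python
# Placeholder API names throughout (GraphDB, create_node, create_relationship).
from graph_engine import GraphDB  # hypothetical package

db = GraphDB("investigation.db")  # durable but disposable: one file on disk

# Materialize extracted entities as nodes.
alice = db.create_node(labels=["Person"], properties={"name": "Alice"})
acme = db.create_node(labels=["Company"], properties={"name": "Acme"})

# Materialize an extracted relationship; later revisions may correct it.
db.create_relationship(alice, acme, type="WORKS_AT",
                       properties={"confidence": 0.72})
db.commit()
```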
4. Graph Exploration & Analytics¶
The system MUST support:

- Pattern matching using openCypher
- Structural exploration of graphs
- Identification of:
  - Variations
  - Outliers
  - Structural anomalies
  - Unexpected relationships
This phase is explicitly analytical and investigative.
Downstream Context (Out of Scope)¶
The system does NOT handle:

- Final data curation or validation
- Long-term systems of record
- Production database serving
- Feature stores or ML model hosting
6. Data Model Requirements¶
6.1 Nodes¶
- Each node MUST have:
  - A unique internal identifier
  - Zero or more labels
  - Zero or more properties
- Node identity MUST be stable within a transaction
6.2 Relationships¶
- Each relationship MUST have:
  - A unique internal identifier
  - A source node
  - A destination node
  - Exactly one relationship type
  - Directionality
  - Zero or more properties
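As a non-normative illustration, these requirements map naturally onto plain Python types; the field names below are illustrative, not a storage format.

```python
# Illustrative mapping of the data model onto Python types.
from dataclasses import dataclass, field
from typing import Any

@dataclass
class Node:
    id: int                               # unique internal identifier
    labels: frozenset[str] = frozenset()  # zero or more labels
    properties: dict[str, Any] = field(default_factory=dict)

@dataclass
class Relationship:
    id: int        # unique internal identifier
    source: int    # source node id (directionality: source -> target)
    target: int    # destination node id
    type: str      # exactly one relationship type
    properties: dict[str, Any] = field(default_factory=dict)
```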
6.3 Properties¶
Properties MUST support openCypher value types:

- Integer
- Float
- Boolean
- String
- Null
- List
- Map
Null propagation and comparison semantics MUST follow the openCypher specification.
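Concretely, openCypher uses three-valued logic: a comparison involving `null` evaluates to `null`, and `WHERE` keeps a row only when its predicate evaluates to `true`. The snippet below illustrates the expected behavior against a hypothetical `db` handle.

```python
# `db` is a hypothetical handle; the point is the expected null semantics.
rows = db.execute("""
    MATCH (p:Person)
    WHERE p.age > 30   // evaluates to null when p.age is null, so the row is dropped
    RETURN p.name
""")
```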
7. Data Models, Schemas, and Ontologies¶
7.1 Purpose¶
The system MUST support optional data models (also referred to as ontologies or schemas) that provide semantic structure over nodes and relationships without imposing rigid database-style schemas.
These models are intended to:

- Standardize meaning across investigative and extraction workflows
- Improve consistency in graph materialization
- Enable validation and tooling in Python-based environments
- Remain flexible enough for exploratory and probabilistic data
7.2 Compatibility Requirements¶
Data models MUST be expressible in formats compatible with:

- Pydantic models
- JSON Schema (draft-agnostic, best-effort)
This ensures interoperability with:

- Python data validation tooling
- LLM extraction pipelines
- External schema and ontology tooling
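For example, a node type could be declared as an ordinary Pydantic model and exported to JSON Schema for external tooling; the `Person` model is illustrative, and no particular model registry is implied.

```python
# Illustrative node-type model expressed with Pydantic.
from pydantic import BaseModel

class Person(BaseModel):
    name: str               # required property
    age: int | None = None  # optional property

# Pydantic v2 emits JSON Schema, enabling external schema/ontology tooling.
schema = Person.model_json_schema()
```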
7.3 Scope of Enforcement¶
Data models:

- MUST be optional
- MUST NOT be required to create or query graphs
- MUST NOT prevent insertion of incomplete or uncertain data by default
Schema enforcement SHOULD be:

- Advisory rather than mandatory
- Configurable by the user (e.g. strict vs permissive modes)
7.4 Modeling Capabilities¶
Data models SHOULD be able to define:
- Node types (conceptual classes)
- Relationship types
- Allowed properties and value types
- Optional vs required properties
- Inheritance or specialization (where supported by the model format)
These models MAY be used to:

- Validate extracted entities and relationships
- Annotate nodes and relationships with semantic meaning
- Assist in query formulation and interpretation
7.5 Relationship to Cypher Semantics¶
Data models MUST:

- Remain orthogonal to openCypher semantics
- NOT alter Cypher query meaning or execution results
- Provide metadata and validation layers only
Cypher queries MUST operate on graph data regardless of whether a data model is present.
8. Query Language Requirements¶
8.1 Supported Constructs (v1)¶
The engine MUST support the following openCypher constructs exactly as specified:
- MATCH (nodes, relationships, directionality)
- WHERE (boolean logic, comparisons, property access)
- RETURN (expressions, aliases, multiple projections)
- LIMIT
- SKIP
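A query that stays within this subset might look as follows, again using a hypothetical `db` handle.

```python
# All constructs below are in the v1-supported subset; `db` is hypothetical.
rows = db.execute("""
    MATCH (a:Person)-[:WORKS_AT]->(c:Company)
    WHERE c.name = 'Acme'
    RETURN a.name AS employee
    SKIP 10 LIMIT 5
""")
```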
8.2 Unsupported Constructs (v1)¶
The following constructs MAY be parsed but MUST fail validation or execution:
- CREATE, DELETE, SET, MERGE
- OPTIONAL MATCH
- Variable-length paths
- Aggregations
- Subqueries
- Procedures
Failures MUST be deterministic and explicitly documented.
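For illustration, a rejected construct could surface as a typed exception raised at validation time; `UnsupportedConstructError` is a placeholder name, not a committed API.

```python
# `UnsupportedConstructError` is a placeholder exception name.
from graph_engine import UnsupportedConstructError  # hypothetical

try:
    db.execute("MERGE (p:Person {name: 'Alice'}) RETURN p")
except UnsupportedConstructError as err:
    # Fails deterministically at validation, before any execution.
    print(err)  # e.g. "MERGE is not supported in v1"
```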
9. Execution Engine Requirements¶
The execution engine MUST:

- Operate on graph-native primitives
- Use adjacency-based traversal
- Implement operators for:
  - Node scanning
  - Relationship expansion
  - Filtering
  - Projection
  - Limiting
- Preserve openCypher semantics throughout execution
Query planning MAY be rule-based; cost-based planning is out of scope.
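As a non-normative sketch, such an operator pipeline can be composed from Python generators; the operator names, row format, and sample data below are illustrative only.

```python
# Non-normative sketch: each operator is a generator that consumes rows from
# its child. The row format ({variable: node_dict}) is illustrative.

def node_scan(nodes, label, var):
    # Emit one row per node that carries the requested label.
    for n in nodes:
        if label in n["labels"]:
            yield {var: n}

def expand(rows, adjacency, rel_type, src, dst):
    # Adjacency-based relationship expansion:
    # adjacency maps node id -> [(relationship type, target node), ...].
    for row in rows:
        for rtype, target in adjacency.get(row[src]["id"], []):
            if rtype == rel_type:
                yield {**row, dst: target}

def filter_rows(rows, predicate):
    # openCypher WHERE keeps a row only when the predicate is true.
    for row in rows:
        if predicate(row):
            yield row

def project(rows, expr):
    for row in rows:
        yield expr(row)

def limit(rows, n):
    for i, row in enumerate(rows):
        if i >= n:
            return
        yield row

# Sample data and a plan for:
#   MATCH (a:Person)-[:KNOWS]->(b) WHERE b.age > 30 RETURN a.name LIMIT 10
nodes = [
    {"id": 1, "labels": {"Person"}, "properties": {"name": "Alice"}},
    {"id": 2, "labels": {"Person"}, "properties": {"name": "Bob", "age": 42}},
]
adjacency = {1: [("KNOWS", nodes[1])]}

plan = limit(
    project(
        filter_rows(
            expand(node_scan(nodes, "Person", "a"), adjacency, "KNOWS", "a", "b"),
            # A missing age behaves like null here: the row is filtered out.
            lambda r: (r["b"]["properties"].get("age") or 0) > 30,
        ),
        lambda r: {"a.name": r["a"]["properties"]["name"]},
    ),
    10,
)
print(list(plan))  # [{'a.name': 'Alice'}]
```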
10. Storage Engine Requirements¶
10.1 Implementation Approach¶
The storage layer SHALL use SQLite as the persistence backend.
Rationale:

- SQLite provides ACID transactions, WAL mode, and crash recovery out of the box
- Zero operational overhead (embedded, single-file, zero-config)
- Battle-tested durability (20+ years, billions of deployments)
- Cross-platform compatibility with no external dependencies
- Aligns with the "mirrors SQLite" design philosophy
- Allows focus on openCypher execution rather than storage implementation
See docs/storage-architecture-analysis.md for detailed analysis.
10.2 Storage Requirements¶
The storage layer MUST:
- Be durable across crashes (SQLite WAL mode)
- Support atomic commits (SQLite transactions)
- Use WAL journaling (SQLite PRAGMA journal_mode=WAL)
- Support snapshot isolation for readers (SQLite WAL mode)
- Store adjacency lists explicitly (graph-specific schema design)
- Preserve stable internal IDs (application-managed ID generation)
The storage engine MUST remain opaque to Cypher semantics.
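One possible SQLite layout, shown below, stores adjacency explicitly in an indexed relationship table; this is illustrative only, and the project's actual schema may differ.

```python
# Illustrative schema sketch; not the project's actual storage layout.
import sqlite3

conn = sqlite3.connect("graph.db")
conn.execute("PRAGMA journal_mode=WAL")  # durability + snapshot-isolated reads
conn.executescript("""
    CREATE TABLE IF NOT EXISTS nodes (
        id INTEGER PRIMARY KEY,   -- stable internal id
        labels TEXT NOT NULL,     -- e.g. JSON-encoded list of labels
        properties TEXT NOT NULL  -- e.g. JSON-encoded property map
    );
    CREATE TABLE IF NOT EXISTS relationships (
        id INTEGER PRIMARY KEY,
        source INTEGER NOT NULL REFERENCES nodes(id),
        target INTEGER NOT NULL REFERENCES nodes(id),
        type TEXT NOT NULL,
        properties TEXT NOT NULL
    );
    -- Explicit adjacency: index both traversal directions.
    CREATE INDEX IF NOT EXISTS rel_out ON relationships(source, type);
    CREATE INDEX IF NOT EXISTS rel_in  ON relationships(target, type);
""")
```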
11. Concurrency Model¶
- Single writer at a time
- Multiple concurrent readers
- Readers MUST see only committed state
- Partial writes MUST NOT be observable by other transactions
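Continuing the illustrative SQLite layout above, WAL mode provides exactly this model: a reader whose transaction has pinned a snapshot does not observe a concurrent writer's commit until it starts a new read transaction.

```python
# Illustrative snapshot-isolation check against the sketched schema above.
import sqlite3

writer = sqlite3.connect("graph.db")
reader = sqlite3.connect("graph.db")

reader.execute("BEGIN")  # deferred transaction; snapshot pinned at first read
before = reader.execute("SELECT COUNT(*) FROM nodes").fetchone()[0]

writer.execute("INSERT INTO nodes (labels, properties) VALUES ('[]', '{}')")
writer.commit()

after = reader.execute("SELECT COUNT(*) FROM nodes").fetchone()[0]
assert before == after  # reader still sees its pinned snapshot
reader.rollback()       # ending the read transaction releases the snapshot
```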
12. Python API Requirements¶
API guarantees:

- Synchronous execution
- Deterministic results
- Typed exceptions for parse, validation, and execution errors
- Reusable database handle
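A usage sketch under these guarantees might look as follows; `GraphDB` and the exception names are illustrative, not a committed interface.

```python
# Illustrative usage; GraphDB and the exception types are placeholder names.
from graph_engine import (  # hypothetical package
    GraphDB, ParseError, ValidationError, ExecutionError,
)

db = GraphDB("investigation.db")  # reusable handle: embedded, no server
try:
    # Synchronous execution; deterministic results for identical inputs.
    rows = db.execute("MATCH (n:Person) RETURN n.name LIMIT 5")
    for row in rows:
        print(row["n.name"])
except (ParseError, ValidationError, ExecutionError) as err:
    print(f"query failed: {err}")
```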
13. Testing & Validation¶
13.1 openCypher TCK Integration¶
- The openCypher TCK MUST be integrated into continuous integration (CI)
- TCK tests MUST be runnable in an automated, reproducible manner
- Each TCK test MUST be explicitly classified as:
  - Pass (fully supported and compliant)
  - Skip (feature intentionally unsupported)
  - Expected Failure (known limitation, documented)
A public TCK Coverage Matrix MUST be maintained and versioned with the codebase.
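One way to encode these three classifications is with standard pytest markers; the test names below are illustrative, and the project may integrate the TCK differently.

```python
# Illustrative classification of TCK-derived tests with pytest markers.
import pytest

def test_match_basic():
    # Pass: fully supported and compliant
    ...

@pytest.mark.skip(reason="MERGE intentionally unsupported in v1")
def test_merge_basic():
    # Skip: feature intentionally unsupported
    ...

@pytest.mark.xfail(reason="known limitation; tracked in the coverage matrix")
def test_known_limitation():
    # Expected Failure: documented limitation
    ...
```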
13.2 Regression Testing¶
- All supported openCypher features MUST have regression tests
- Regression tests MUST ensure semantic stability across releases
- Storage durability MUST be tested across process restarts
14. Non-Functional Requirements¶
Performance¶
- Correctness prioritized over throughput
- Target scale (best effort):
  - ~10^6 nodes
  - ~10^7 relationships
Portability¶
- macOS, Linux, Windows
- Python 3.10 or newer
Observability¶
- Inspectable query plans
- Configurable debug logging
- Documented storage layout
15. Explicit Non-Goals¶
This system is NOT intended to:

- Fully implement the openCypher language
- Replace production graph databases
- Serve as a long-running graph service
- Achieve full TCK coverage in v1
- Support high-concurrency OLTP workloads
- Introduce Cypher dialect fragmentation
16. Success Criteria (v1)¶
The project is successful if:

- A user can materialize a graph from extracted entities and relationships
- The engine executes valid openCypher MATCH queries within the declared feature set
- The engine passes the corresponding subset of the openCypher TCK
- Graphs persist across restarts
- The system runs entirely embedded, without external services
17. Comparison to Existing Approaches¶
This project intentionally occupies a middle ground between in-memory graph libraries and production-scale graph databases. The following comparisons clarify why neither "just using NetworkX" nor running an external graph database fully satisfies the intended use cases.
17.1 Comparison: Using NetworkX Alone¶
What NetworkX Provides
NetworkX is an excellent Python library for:

- Graph algorithms (centrality, clustering, paths)
- Rapid prototyping
- In-memory graph manipulation
However, NetworkX is explicitly not a graph engine. It lacks several capabilities that are critical for investigative and analytical workflows at scale.
Limitations of NetworkX for These Use Cases
- No durable storage (graphs must be serialized manually)
- No standardized query language
- No declarative pattern matching
- No snapshot isolation or transactional semantics
- No schema or semantic enforcement
- Poor reproducibility across sessions without custom glue code
As a result, NetworkX graphs tend to become:

- Ephemeral
- Ad-hoc
- Difficult to share or reproduce
- Tightly coupled to specific scripts or notebooks
How This Project Differs
This system complements NetworkX rather than replacing it:

- Provides durable, inspectable graph storage
- Supports declarative pattern matching via openCypher
- Enforces consistent graph semantics
- Enables reproducible analytical workflows
NetworkX is expected to remain a downstream consumer of graphs produced by this system, particularly for algorithmic analysis.
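For example, query results could be handed to NetworkX for centrality analysis; the `db` handle is hypothetical, while the networkx calls are the real library API.

```python
# `db` is a hypothetical handle; the networkx calls are the real library API.
import networkx as nx

G = nx.MultiDiGraph()
rows = db.execute("""
    MATCH (a:Person)-[:KNOWS]->(b:Person)
    RETURN a.name AS src, b.name AS dst
""")
for row in rows:
    G.add_edge(row["src"], row["dst"], type="KNOWS")

print(nx.degree_centrality(G))  # hand off to NetworkX algorithms
```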
17.2 Comparison: External Graph Databases (Neo4j, Memgraph, etc.)¶
What External Graph Databases Provide
Production graph databases excel at:

- Long-lived, authoritative graph storage
- High-performance traversals
- Concurrent multi-user access
- Operational robustness
They are optimized for serving applications, not exploratory analysis.
Limitations in Research & Investigative Contexts
For Python-based research workflows, external graph databases introduce significant friction:
- Require separate processes or services
- Impose operational overhead (installation, configuration, lifecycle)
- Break notebook-local execution models
- Encourage premature schema and data-model finalization
- Make iterative or disposable graphs costly to manage
These systems assume that the graph is:

- Clean
- Stable
- Long-lived
- Worth operational investment
This assumption does not hold during information extraction or investigative analysis.
How This Project Differs
This system:

- Runs entirely embedded within Python
- Requires no external services
- Encourages iterative, revisable graph construction
- Treats graphs as analytical artifacts, not systems of record
- Optimizes for semantic correctness and reproducibility over throughput
Export to production graph databases is explicitly supported after analytical refinement.
17.3 Summary Comparison¶
| Dimension | NetworkX | External Graph DBs | This Project |
|---|---|---|---|
| Execution Model | In-memory | External service | Embedded |
| Durability | None (manual) | Persistent | Persistent |
| Query Language | None | Cypher | openCypher |
| Graph Semantics | Weakly enforced | Strong | Strong |
| Iterative Analysis | Excellent | Poor | Excellent |
| Operational Overhead | Minimal | High | Minimal |
| Notebook-Friendly | Yes | No | Yes |
| Production Serving | No | Yes | No |
This project exists specifically to fill the analytical gap between these two extremes.
18. Cypher Support Clarification (Non-Normative)¶
To avoid ambiguity, the following clarifications apply:
- openCypher compatibility refers to semantic correctness for supported features, not total language coverage
- Unsupported clauses and expressions MUST fail explicitly and deterministically
- The absence of a feature does not imply partial or degraded semantics for supported features
Users should expect:

- Strong semantic guarantees within the supported subset
- Clear error messages for unsupported syntax
- Gradual, explicit expansion of Cypher coverage over time
19. Why This Exists (README Excerpt)¶
The Problem¶
Modern data science, machine learning, and investigative workflows increasingly rely on entities and relationships extracted from messy, probabilistic sources: text, tables, OCR, logs, and LLM outputs. While these workflows naturally produce graph-shaped data, practitioners are forced into poor tooling choices:
- In-memory graph libraries (e.g. NetworkX) that lack durability, semantics, and declarative querying
- Production graph databases that impose operational overhead, rigidity, and premature commitment
As a result, graph-based analysis during research and investigation is often ad-hoc, non-reproducible, and tightly coupled to one-off scripts.
The Gap¶
There is a missing middle layer between:
- Ephemeral in-memory graphs used for algorithms
- Production graph databases used as systems of record
This gap is where most investigative and information-extraction work actually happens.
Researchers and ML engineers need a way to:

- Materialize extracted entities and relationships into a graph
- Iteratively revise and explore that graph
- Ask declarative, pattern-based questions
- Do all of this inside Python, without running external services
The Idea¶
This project provides a lightweight, embedded, openCypher-compatible graph engine designed specifically for that gap.
It is:

- Embedded and local-first (no server)
- Graph-native (adjacency-based execution)
- Declarative (openCypher subset)
- Durable but disposable
- Designed for analytical, not operational, workloads
Rather than replacing production graph databases, it acts as a graph workbench:
- Build and revise graphs during extraction and investigation
- Explore structure, patterns, and anomalies
- Export refined results into systems like Neo4j or Memgraph after analysis
What This Is (and Is Not)¶
This is:

- A standardized, portable environment for graph-based analysis
- A Cypher-compatible execution engine for research workflows
- A bridge between extraction pipelines and production systems
This is not:

- A production graph database
- A high-concurrency graph service
- A replacement for Neo4j, Memgraph, or TigerGraph
Why openCypher¶
openCypher provides a widely understood, declarative way to reason about graphs.
By aligning with the openCypher specification and validating behavior with the openCypher TCK (for supported features), this project ensures:

- Semantic correctness
- Portability of queries
- Low friction when moving results to production systems
Philosophy¶
We are not building a database for applications. We are building a graph execution environment for thinking.
20. Future Considerations (Non-Binding)¶
- Incremental TCK coverage expansion
- Write support (CREATE, MERGE)
- Aggregations and grouping
- Variable-length path traversal
- Native execution core
- Interoperability with in-memory graph libraries