Graph Scaling

Current Architecture

PG Atlas operates with PostgreSQL 18 + NetworkX for graph storage and computation:

Storage — PostgreSQL tables
Computation — NetworkX in-memory graph projections for metric calculation
Scale — Fits comfortably in RAM; full metrics recomputation in under a minute

This hybrid approach balances simplicity (single managed database, familiar SQL) with performance (efficient in-memory graph algorithms).
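As a rough illustration of the storage/computation split, the sketch below reads edge rows from a relational table and builds in-memory adjacency maps before computing a simple metric. It makes several assumptions: `sqlite3` stands in for PostgreSQL so the example is self-contained, the `depends_on` table and its columns are hypothetical, and plain dictionaries stand in for the NetworkX graph.

```python
import sqlite3
from collections import defaultdict


def load_graph(conn):
    """Read edge rows and build forward/reverse adjacency maps in memory."""
    dependencies = defaultdict(set)  # project -> projects it depends on
    dependents = defaultdict(set)    # project -> projects that depend on it
    for source, target in conn.execute("SELECT source, target FROM depends_on"):
        dependencies[source].add(target)
        dependents[target].add(source)
    return dependencies, dependents


# Assumption: an in-memory SQLite database stands in for the PostgreSQL edge table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE depends_on (source TEXT, target TEXT)")
conn.executemany(
    "INSERT INTO depends_on VALUES (?, ?)",
    [("app", "sdk"), ("sdk", "core"), ("cli", "sdk")],
)

dependencies, dependents = load_graph(conn)

# A trivial metric: how many projects depend directly on each node.
direct_dependents = {node: len(users) for node, users in dependents.items()}
```

The point of the split is that the database only has to serve flat row scans, while every graph algorithm runs against the in-memory structure.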

Scaling characteristics:

Read scaling — PostgreSQL read replicas for API query distribution
Write scaling — Single-node PostgreSQL handles current ingestion volume (bootstrap weekly, sbom-queue hourly, gitlog every 3 days)
Computation scaling — NetworkX loads active subgraph into memory; O(nodes + edges) for traversals
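To make the O(nodes + edges) traversal cost concrete, here is a minimal breadth-first search over an adjacency map. It is a sketch with hypothetical project names, not the PG Atlas implementation; NetworkX offers equivalent functionality through functions such as `nx.descendants`.

```python
from collections import deque


def reachable(adjacency, start):
    """Return all nodes reachable from start.

    Each node is enqueued at most once and each edge is inspected at most
    once, which gives the O(nodes + edges) bound.
    """
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    seen.discard(start)  # report only the nodes other than the start itself
    return seen


# Hypothetical dependency edges: project -> projects it depends on.
deps = {"app": ["sdk", "http"], "sdk": ["core"], "http": ["core"], "core": []}
transitive = reachable(deps, "app")
```

Because the cost is linear in the size of the loaded subgraph, restricting the load to the active subgraph directly reduces traversal time.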

When growth exceeds single-node PostgreSQL capabilities, scaling options include native graph databases and distributed systems.

Native Property Graph Databases

JanusGraph (BerkeleyDB backend, single-node)

Pros:

Native TinkerPop/Gremlin — optimal for traversals (active subgraph upstream propagation) and OLAP batch jobs.
BerkeleyDB is embeddable/file-based — true zero-cluster persistence, low overhead.
Efficient indexed writes for per-project updates; supports batch transactions.
Proven for property graphs with versioning/multi-properties.
Seamless future scaling (swap to Cassandra/Scylla).

Cons:

Java ecosystem — communication between the JVM and FastAPI/Procrastinate adds friction.
Slightly higher memory footprint than pure relational for small graphs.

Sqlg (over PostgreSQL)

Pros:

Gremlin queries on familiar relational backend — no new infrastructure.
PostgreSQL excels at mixed workloads: graph edges + tabular audit records in separate schemas.
Excellent batch update performance (SQL SET for activity flags across thousands of rows).
Fast incremental writes via standard ORM.
Easy audit/export with SQL tools.
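The batch-update advantage can be sketched with a single set-based `UPDATE`. This example assumes a hypothetical `project` table with an `active` flag and uses `sqlite3` as a stand-in for PostgreSQL; it does not involve Sqlg itself.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for PostgreSQL; the schema is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE project (name TEXT PRIMARY KEY, active INTEGER NOT NULL DEFAULT 0)"
)
conn.executemany(
    "INSERT INTO project (name) VALUES (?)",
    [("core",), ("sdk",), ("cli",), ("legacy",)],
)


def mark_active(conn, names):
    """Set the activity flag for all listed projects in one statement."""
    names = list(names)
    if not names:
        return 0
    placeholders = ", ".join("?" for _ in names)
    with conn:  # commits on success, rolls back on error
        cursor = conn.execute(
            f"UPDATE project SET active = 1 WHERE name IN ({placeholders})",
            names,
        )
    return cursor.rowcount


updated = mark_active(conn, ["core", "sdk", "cli"])
active = [
    row[0]
    for row in conn.execute("SELECT name FROM project WHERE active = 1 ORDER BY name")
]
```

For very large sets, a temporary table or an array parameter (for example `WHERE name = ANY(%s)` with psycopg) is probably a better fit than a long placeholder list.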

Cons:

Graph traversals less optimized than native graph DBs (recursive CTEs slower for deep queries).
Potential impedance mismatch for complex OLAP (still relies on NetworkX load for heavy metrics).
Migration to full distributed graph later requires data export.
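For reference, a transitive dependency lookup expressed as a recursive CTE looks roughly like the sketch below. The `depends_on` table is hypothetical and `sqlite3` stands in for PostgreSQL; the `WITH RECURSIVE` syntax is the same in both, apart from the placeholder style.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for PostgreSQL; the schema is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE depends_on (source TEXT, target TEXT)")
conn.executemany(
    "INSERT INTO depends_on VALUES (?, ?)",
    [("app", "sdk"), ("sdk", "core"), ("core", "crypto"), ("cli", "sdk")],
)

# Each recursion step joins the edge table against the rows found so far,
# so the work grows with traversal depth. The depth bound also guarantees
# termination if the graph contains cycles.
TRANSITIVE_DEPS = """
WITH RECURSIVE reach(node, depth) AS (
    SELECT target, 1 FROM depends_on WHERE source = ?
    UNION
    SELECT d.target, r.depth + 1
    FROM depends_on AS d
    JOIN reach AS r ON d.source = r.node
    WHERE r.depth < ?
)
SELECT node, MIN(depth) FROM reach GROUP BY node ORDER BY MIN(depth), node
"""


def transitive_dependencies(conn, root, max_depth):
    """Return (node, shortest depth) pairs reachable from root within max_depth hops."""
    return conn.execute(TRANSITIVE_DEPS, (root, max_depth)).fetchall()


rows = transitive_dependencies(conn, "app", 5)
shallow = transitive_dependencies(conn, "app", 2)
```

Every additional hop is another join iteration, which is why deep queries are where a native graph store is expected to pull ahead.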

HugeGraph (RocksDB backend — single-node)

Pros:

Native Gremlin with strong OLTP/OLAP support.
RocksDB embeddable and high-performance for writes.
Built-in schema flexibility for versioning.
Good batch transaction support.

Cons:

Less mature community/maintenance than JanusGraph.
Configuration overhead higher than BerkeleyDB embed.
Scaling path less standardized than JanusGraph.

SurrealDB

Overview: SurrealDB is a multi-model database (document, graph, relational) with a SQL-like query language (SurrealQL) that supports graph traversals natively. Single Rust binary, no JVM, embeddable or client-server.

Introduced by @waldmatias during the v0 storage decision (issue #2).

Pros:

Unified multi-model: Graph traversals and tabular data in one system. No NetworkX sidecar, no separate tables for audit records — everything in SurrealQL.
SQL-like syntax: Potentially lower learning curve than Gremlin for contributors with SQL background. Graph traversals use the ->, <-, and <-> operators in queries.
Single binary deployment: Aligns with the minimal-DevOps constraint. No JVM memory overhead.
Rust performance: Low memory footprint, fast concurrent reads, good write throughput.
Schema flexibility: Schemaless by default but supports strict schema definitions. Good for rapid iteration.
Built-in features: Change feeds (for real-time updates), full-text search, multiple storage backends (memory, file, TiKV).

Cons:

Zero team experience: No PG Atlas contributors have used SurrealDB in production. For a system that factors into funding decisions, this is a meaningful risk.
Project maturity: Post-v1.0 but younger than PostgreSQL (30+ years) or TinkerPop (10+ years). Fewer production war stories, smaller community, less StackOverflow coverage.
Ecosystem tooling: Python client exists but is less mature than psycopg3 or SQLAlchemy. Integration with FastAPI/Pydantic requires custom work.
No TinkerPop compatibility: If we later migrate to JanusGraph or another TinkerPop backend, SurrealQL queries would have to be rewritten in Gremlin, much as SQL + NetworkX code would (possibly with slightly less effort, since SurrealQL has native graph operators).
Uncertain scaling path: While SurrealDB claims horizontal scaling via TiKV backend, production evidence at scale is limited compared to Cassandra/Scylla (JanusGraph’s proven path).

When SurrealDB makes sense:

If the team is willing to invest learning time upfront.
If we want to avoid the dual PostgreSQL + NetworkX architecture and prefer native graph traversals in the storage layer.
If we’re comfortable with a newer tool and can contribute back to the ecosystem (Scientific Python ethos).
As a migration target post-v0 if PostgreSQL + NetworkX hits scaling limits and we want to avoid JVM operational overhead.

Decision context: During the v0 storage discussion, @waldmatias introduced SurrealDB but recommended Option B (PostgreSQL + NetworkX) for v0, noting that SurrealDB remains an interesting option to revisit during scaling discussions. The team agreed this was the pragmatic path: ship fast with known tools, reevaluate (TinkerPop vs. SurrealDB vs. PostgreSQL extensions) when we hit actual scaling constraints.

Migration Decision Criteria

Consider migrating from PostgreSQL + NetworkX when:

Graph size — Exceeds 100K nodes, or in-memory computation takes longer than 5 minutes
Query complexity — Deep traversals (>5 hops) become performance bottlenecks
Real-time requirements — Need low-latency transitive queries via API (not pre-computed)
Distributed needs — Multi-region deployment or horizontal scaling required

Current assessment (after Build Award completion): PostgreSQL + NetworkX meets all performance requirements. No immediate migration needed.

Recommended Migration Path

When scaling becomes necessary:

JanusGraph + BerkeleyDB (single-node) — Migrate to TinkerPop/Gremlin for native graph traversals while maintaining single-node simplicity
JanusGraph + Cassandra/Scylla (distributed) — Scale horizontally when BerkeleyDB limits reached
Alternative: SurrealDB — Consider if a multi-model database appeals and the team is willing to invest in a newer ecosystem

The first two options preserve TinkerPop compatibility:

Start with the chosen single-node backend, then add a distributed backend later if needed.
Export path: Gremlin bulk dump or standard serialization.
Traversals stay in Gremlin, so no major rewrite is needed when scaling.

We can investigate adding a TinkerPop-compatible interface to SurrealDB, which would allow us to write Gremlin in Python without adding a JVM dependency.