Graph Scaling
Native Property Graph DB Options
JanusGraph (BerkeleyDB backend, single-node)
Pros:
- Native TinkerPop/Gremlin — optimal for traversals (active subgraph upstream propagation) and OLAP batch jobs.
- BerkeleyDB is embeddable/file-based — true zero-cluster persistence, low overhead.
- Efficient indexed writes for per-project updates; supports batch transactions.
- Proven for property graphs with versioning/multi-properties.
- Clear future scaling path: the storage backend can be swapped for Cassandra/Scylla while traversals stay in Gremlin.
Cons:
- Java ecosystem (vs. Python-native tooling).
- Slightly higher memory footprint than pure relational for small graphs.
- Git log/pony factor data needs separate storage or JSON properties.
Sqlg (over PostgreSQL)
Pros:
- Gremlin queries on familiar relational backend — no new infrastructure.
- PostgreSQL excels at mixed workloads: graph edges + tabular pony factor/git logs in separate schemas.
- Excellent batch update performance (a single SQL UPDATE ... SET can flip activity flags across thousands of rows).
- Fast incremental writes via standard ORM.
- Easy audit/export with SQL tools.
Cons:
- Graph traversals less optimized than native graph DBs (recursive CTEs slower for deep queries).
- Potential impedance mismatch for complex OLAP (heavy metrics still require loading the graph into NetworkX).
- Migration to full distributed graph later requires data export.
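To make the relational trade-offs above concrete, here is a minimal sketch using Python's built-in sqlite3 as a stand-in for PostgreSQL (both support WITH RECURSIVE). The project and depends_on tables, their columns, the helper names, and the sample rows are all hypothetical. It shows a single UPDATE ... SET flipping activity flags in bulk, and a recursive CTE counting transitive dependents, which is the kind of deep query where a relational backend tends to lag a native graph store.

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL; schema and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE project (
        id TEXT PRIMARY KEY,
        activity_status TEXT NOT NULL DEFAULT 'live'
    );
    -- One row per edge: src depends on dst.
    CREATE TABLE depends_on (
        src TEXT NOT NULL REFERENCES project(id),
        dst TEXT NOT NULL REFERENCES project(id)
    );
""")
conn.executemany(
    "INSERT INTO project (id) VALUES (?)",
    [("numpy",), ("scipy",), ("pandas",), ("sklearn",), ("oldlib",)],
)
conn.executemany(
    "INSERT INTO depends_on (src, dst) VALUES (?, ?)",
    [("scipy", "numpy"), ("pandas", "numpy"),
     ("sklearn", "scipy"), ("oldlib", "numpy")],
)


def mark_inactive(conn, project_ids):
    """Flip activity flags for many rows in one UPDATE ... SET statement."""
    placeholders = ",".join("?" for _ in project_ids)
    cur = conn.execute(
        f"UPDATE project SET activity_status = 'inactive' "
        f"WHERE id IN ({placeholders})",
        list(project_ids),
    )
    return cur.rowcount


def count_live_dependents(conn, project_id):
    """Count live projects that depend on project_id, directly or transitively.

    The walk passes through inactive projects; only the final count is
    filtered by status. UNION (not UNION ALL) removes duplicates, which
    also keeps the recursion finite if the graph contains a cycle.
    """
    row = conn.execute("""
        WITH RECURSIVE dependents(id) AS (
            SELECT src FROM depends_on WHERE dst = ?
            UNION
            SELECT d.src FROM depends_on AS d
            JOIN dependents AS t ON d.dst = t.id
        )
        SELECT COUNT(*) FROM dependents
        JOIN project USING (id)
        WHERE project.activity_status = 'live'
    """, (project_id,)).fetchone()
    return row[0]


updated = mark_inactive(conn, ["oldlib"])
live_dependents = count_live_dependents(conn, "numpy")
```

With a real PostgreSQL driver such as psycopg the placeholder style is %s rather than ?, but the SQL itself should carry over largely unchanged.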
HugeGraph (RocksDB backend, single-node)
Pros:
- Native Gremlin with strong OLTP/OLAP support.
- RocksDB embeddable and high-performance for writes.
- Built-in schema flexibility for versioning.
- Good batch transaction support.
Cons:
- Less mature community/maintenance than JanusGraph.
- Configuration overhead higher than BerkeleyDB embed.
- Pony factor/git logs require separate handling or JSON blobs.
- Scaling path less standardized than JanusGraph.
SurrealDB (embedded or dedicated single-node)
Overview: SurrealDB is a multi-model database (document, graph, relational) with a SQL-like query language (SurrealQL) that supports graph traversals natively. Single Rust binary, no JVM, embeddable or client-server.
Introduced by @waldmatias during the v0 storage decision (issue #2).
Pros:
- Unified multi-model: Graph traversals and tabular data in one system. No NetworkX sidecar, no separate tables for pony factor stats — everything in SurrealQL.
- SQL-like syntax: Potentially lower learning curve than Gremlin for contributors with SQL background. Graph traversals use arrow operators (->, <-, <->) in queries.
- Single binary deployment: Aligns perfectly with minimal-DevOps, <$100/month operational constraint. No JVM memory overhead.
- Rust performance: Low memory footprint, fast concurrent reads, good write throughput.
- Schema flexibility: Schemaless by default but supports strict schema definitions. Good for rapid iteration.
- Built-in features: Change feeds (for real-time updates), full-text search, multiple storage backends (memory, file, TiKV).
Cons:
- Zero team experience: Nobody on the working group has used SurrealDB in production. For a system powering funding decisions, this is a meaningful risk.
- Project maturity: Post-v1.0 but younger than PostgreSQL (30+ years) or TinkerPop (10+ years). Fewer production war stories, smaller community, less Stack Overflow coverage.
- Ecosystem tooling: Python client exists but is less mature than psycopg3 or SQLAlchemy. Integration with FastAPI/Pydantic requires custom work.
- No TinkerPop compatibility: If we later migrate to JanusGraph or another TinkerPop backend, SurrealQL queries would need a rewrite to Gremlin, much as SQL + NetworkX code would (the native graph operators may make the port slightly easier).
- Uncertain scaling path: While SurrealDB claims horizontal scaling via TiKV backend, production evidence at scale is limited compared to Cassandra/Scylla (JanusGraph’s proven path).
When SurrealDB makes sense:
- If the team is willing to invest learning time upfront.
- If we want to avoid the dual PostgreSQL + NetworkX architecture and prefer native graph traversals in the storage layer.
- If we’re comfortable with a newer tool and can contribute back to the ecosystem (Scientific Python ethos).
- As a migration target post-v0 if PostgreSQL + NetworkX hits scaling limits and we want to avoid JVM operational overhead.
Example SurrealQL graph traversal (for comparison):
-- Count dependents two hops upstream (fixed-depth approximation of transitive criticality)
SELECT count(<-depends_on<-project<-depends_on<-project) AS dependents
FROM project
WHERE activity_status = 'live' AND id = $project_id;
-- Active subgraph projection (simplified)
SELECT id, display_name, ->depends_on->project AS dependencies
FROM project
WHERE activity_status = 'live'
AND in_degree = 0;
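For cross-checking queries like the ones above, a plain-Python reference of the two metrics can be useful. This is a minimal sketch over a hypothetical in-memory edge list; DEPENDS_ON, STATUS, transitive_dependents, and active_subgraph are illustrative names, and reading "active subgraph" as "live projects plus everything they depend on" is an assumption. Unlike a fixed chain of arrow operators, which reaches a fixed depth, the breadth-first walk here follows edges to any depth.

```python
from collections import defaultdict, deque

# Hypothetical sample data: (dependent, dependency) pairs and per-project status.
DEPENDS_ON = [
    ("scipy", "numpy"), ("pandas", "numpy"), ("sklearn", "scipy"),
    ("oldlib", "numpy"), ("numpy", "blas"),
]
STATUS = {
    "numpy": "live", "scipy": "live", "pandas": "live", "sklearn": "live",
    "oldlib": "inactive", "blas": "inactive",
}


def transitive_dependents(edges, project_id):
    """Return every project that depends on project_id, at any depth."""
    reverse = defaultdict(set)  # dependency -> direct dependents
    for src, dst in edges:
        reverse[dst].add(src)
    seen, queue = set(), deque([project_id])
    while queue:
        for dependent in reverse[queue.popleft()]:
            if dependent not in seen:  # also guards against cycles
                seen.add(dependent)
                queue.append(dependent)
    seen.discard(project_id)  # a cycle could lead back to the start node
    return seen


def active_subgraph(edges, status):
    """Live projects plus everything they depend on, directly or transitively."""
    forward = defaultdict(set)  # dependent -> direct dependencies
    for src, dst in edges:
        forward[src].add(dst)
    active = {p for p, s in status.items() if s == "live"}
    queue = deque(active)
    while queue:
        for dependency in forward[queue.popleft()]:
            if dependency not in active:
                active.add(dependency)
                queue.append(dependency)
    return active


dependents = transitive_dependents(DEPENDS_ON, "numpy")
live_dependent_count = sum(STATUS[p] == "live" for p in dependents)
active = active_subgraph(DEPENDS_ON, STATUS)
```

In this sample, blas is marked inactive but still lands in the active set because a live project depends on it, while oldlib is left out because nothing live depends on it.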
Decision context: During the v0 storage discussion, @waldmatias introduced SurrealDB but recommended Option B (PostgreSQL + NetworkX) for v0, noting that SurrealDB remains an interesting option to revisit during scaling discussions. The working group agreed this was the pragmatic path: ship fast with known tools, reevaluate (TinkerPop vs. SurrealDB vs. PostgreSQL extensions) when we hit actual scaling constraints.
Recommended Path
For detailed implementation planning, assume JanusGraph + BerkeleyDB:
- Best native graph performance for metrics (transitive counts, active subgraph).
- Per-project updates: Gremlin transactions for edge/vertex changes.
- Batch activity updates: Scripted Gremlin or bulk load for flag flips.
- Pony factor: Materialize the value on repo vertices, and store the intermediate git contributor stats as an edge type or in a separate data structure.
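The pony factor bullet can be illustrated with a small sketch. It assumes the common definition (the smallest number of contributors who together account for at least half of all commits); the 0.5 threshold, the function name, and the sample counts are assumptions rather than values from this design. The per-contributor counts are the kind of intermediate data that would sit in a separate structure, while only the resulting integer is materialized on the repo vertex.

```python
from collections import Counter


def pony_factor(commit_counts, threshold=0.5):
    """Smallest number of contributors covering `threshold` of all commits.

    commit_counts maps contributor -> number of commits. The default
    threshold of 0.5 is an assumption, not a value fixed by this design.
    """
    total = sum(commit_counts.values())
    if total == 0:
        return 0
    covered = 0
    ranked = Counter(commit_counts).most_common()  # largest contributors first
    for rank, (_, count) in enumerate(ranked, start=1):
        covered += count
        if covered >= threshold * total:
            return rank
    return len(commit_counts)


# Hypothetical intermediate stats, e.g. aggregated from `git shortlog -sn`.
contributors = {"alice": 120, "bob": 45, "carol": 30, "dave": 5}

# Only the derived value is materialized as a vertex property.
repo_vertex_properties = {"pony_factor": pony_factor(contributors)}
```

Keeping the raw counts outside the vertex means the metric can be recomputed with a different threshold without another pass over the git log.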
Migration & Extensibility
The first three options (JanusGraph, Sqlg, HugeGraph) preserve TinkerPop compatibility:
- Start with the chosen single-node backend → add a distributed backend later if needed.
- Export path: Gremlin bulk dump or standard serialization. Traversals stay in Gremlin and won’t need a major port/rewrite later on.
We can investigate adding a TinkerPop-compatible interface to SurrealDB, which would allow us to write Gremlin in Python without adding a JVM dependency.