Benchmark Results
Verified benchmarks across 10 databases and GPU compute engines: xrayGraphDB vs cuGraph, Neo4j, Memgraph, Kuzu, DuckDB, GraphBLAS, PostgreSQL, MySQL, and NebulaGraph. LDBC SF1, Friendster (3.6B edges), GPU analytics on Blackwell. Apples-to-apples: same source vertex, same hardware. Every number is reproducible.
Every number on this page was measured with AES-256-GCM per-tenant encryption active.
xrayGraphDB encrypts all data at the storage layer — per tenant, per database, always on. There is no “turn off encryption” flag. These benchmarks include the full cost of encrypt-on-write and decrypt-on-read for every edge traversal, every analytic scan, and every query result. When comparing against systems that store data unencrypted, keep this in mind: xrayGraphDB is doing more work on every operation and still posting these numbers.
What Workloads Become Possible
When your graph engine survives billion-edge traversal on a single server, workloads that were previously “requires a cluster” or “not feasible” become interactive.
Feature Comparison
| Feature | xrayGraphDB | cuGraph | Neo4j | Memgraph | Kuzu | DuckDB | PostgreSQL |
|---|---|---|---|---|---|---|---|
| Cypher support | Full + Neo4j compat | — | Native | Partial | Cypher-like | — | — |
| GPU acceleration | CUDA (native) | CUDA native | — | — | — | — | — |
| Native analytics (graph algos) | 26 procedures | BFS, PR, TC, WCC... | GDS plugin | MAGE plugin | — | — | — |
| Friendster 1.8B edges | All algos work | BFS only (directed) | Hop 4 timeout | OOM at 120GB | Hop 3 timeout | Hop 2 timeout | — |
| Persistent storage | Persistent graph store | In-memory only | Disk + cache | In-memory only | Columnar disk | Embedded | Disk |
| Columnar wire protocol | xrayProtocol | Python API | — | — | — | Python API | — |
| SIMD graph operations | Yes (SIMD) | — | — | — | — | SIMD (columnar) | — |
| Graph traversal engine | Cypher + native BFS | GPU BFS kernel | Cypher planner | Cypher planner | Cypher-like | Recursive CTE | Recursive CTE |
Performance Numbers
Measured across three servers with LDBC SF1 (3.18M nodes, 17.2M edges) and Friendster (65.6M nodes, 3.6B undirected edges). May 2026.
Headline figures: 65.6M vertices, 3.6B edges · 4.17B triangles, SNAP verified · vs cuGraph 1/8, others 0/8 · Friendster load, 6 min
Server A (.187): 503GB RAM, 64-core EPYC, no GPU. Server B: Docker container on production host. Server C (.68): 62GB RAM, 28-core Xeon E5-2650L, T1000 8GB GPU. Blackwell GPU: RTX PRO 6000 96GB VRAM, 16 vCPU, 144GB RAM. Competitor server: 187GB RAM, 44-core Xeon Gold 6152, Tesla T4. xrayGraphDB v4.9.5+.
LDBC SF1 Interactive Queries — 7-Database Comparison
3.18M nodes, 17.2M edges. All databases on identical LDBC SF1 dataset. Times are warm p50. Units: milliseconds.
| Query | xrayGraphDB | Memgraph | Neo4j | NebulaGraph | DuckDB | PostgreSQL | MySQL |
|---|---|---|---|---|---|---|---|
| IS1 — Profile lookup | 0.7ms | 1.1ms | 2.4ms | 1.6ms | 0.7ms | 55ms | 7.7ms |
| IS3 — Friends of person | 0.9ms | 1.1ms | 2.0ms | 2.2ms | 1.9ms | 47ms | 8.3ms |
| IC5 — New groups of friends | 1.1ms | 1,078ms | 707ms | 1,428ms | 78ms | 3,017ms | 1,959ms |
| IC11 — Job referral | 1.0ms | 2.7ms | 3.5ms | 80.8ms | N/A | N/A | N/A |
| Edge count | 0.5ms | 731ms | 1.5ms | 1.7ms | 0.5ms | 62ms | 68ms |
| Node count | 0.5ms | 470ms | 1.2ms | 1.3ms | 0.4ms | 49ms | 9.2ms |
IC5 is the key differentiator: multi-hop join with grouping. xrayGraphDB completes in 1.1ms where competitors need 78ms–3,017ms. IC11 and graph-specific queries are N/A for SQL databases that lack native traversal. Competitor server: 187GB RAM, 44-core Xeon Gold 6152, Tesla T4.
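The table reports warm p50 latency. A minimal sketch of how such a figure can be collected, assuming a hypothetical `run_query` callable that executes one query end to end; the warm-up and sample counts are illustrative and are not the values used for this page:

```python
import statistics
import time

def warm_p50_ms(run_query, warmup=5, samples=50):
    """Return the median latency in milliseconds after discarding warm-up runs."""
    for _ in range(warmup):
        run_query()  # populate caches; these timings are thrown away
    timings = []
    for _ in range(samples):
        start = time.perf_counter()
        run_query()
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings)

# Stand-in workload; replace the lambda with a real client call.
print(f"p50 = {warm_p50_ms(lambda: sum(range(1000))):.3f} ms")
```

Reporting the median rather than the mean keeps a single slow outlier from dominating sub-millisecond measurements.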
Apples-to-Apples: Same Vertex, Same Dataset, 7 Systems
BFS from vertex 71768986 (undirected degree 5,214). Friendster: 65.6M vertices, 1.8B edges. RTX PRO 6000 Blackwell (96 GB), 16 vCPU, 144 GB RAM. May 2026.
| System | Hop 1 | Hop 2 | Hop 3 | Hop 4+ | Result |
|---|---|---|---|---|---|
| xrayGraphDB | 5,215 | 2,151,463 | 35,113,876 | ...10 hops | 24s total, 65.6M reached |
| cuGraph 26.02 | BFS kernel: 82ms (20 GTEPS, directed only) | | | | 7/8 algorithms FAILED |
| Kuzu 0.11 | 5,214 | 2,151,463 | TIMEOUT >600s | | Did not complete hop 3 |
| DuckDB 1.5 | 5,215 | TIMEOUT >600s | | | Did not complete hop 2 |
| Neo4j 2025.04 | 203 | 7,718 | 461,029 | TIMEOUT | Did not complete hop 4 |
| Memgraph 2.22 | OOM-killed loading 1.8B edges (120 GB) | | | | Cannot load Friendster |
| GraphBLAS 9.4 | OOM during BFS (142 GB RAM insufficient) | | | | Cannot run BFS |
Vertex counts verified across systems: hop 1 = 5,214–5,215, hop 2 = 2,151,463 (exact match).
cuGraph is NVIDIA's GPU graph analytics library. It achieves 20 GTEPS on pure BFS — but cannot build an undirected Friendster graph on a single 96 GB GPU (cuDF CSV OOM, CSR sort OOM, int32 size_type limit). xrayGraphDB is a persistent graph runtime, not a BFS kernel. Different workload classes, honestly compared.
GPU Compute Engine Head-to-Head
Same RTX PRO 6000 Blackwell (96 GB VRAM). Same Friendster dataset. Courtroom-clean methodology.
| Algorithm | xrayGraphDB | cuGraph 26.02 | GraphBLAS 9.4 |
|---|---|---|---|
| Load raw SNAP file | Direct, no preprocess | 3 paths OOM'd | numpy + scipy (102s) |
| Undirected graph | 3.6B edges | OOM (96 GB VRAM) | 3.6B entries |
| BFS (15 hops) | 5.5s (449 MTEPS) | 82ms (20 GTEPS)* | OOM |
| PageRank (20 iter) | 94s | Failed (convergence) | OOM |
| Triangle Count | 38s (4.17B, GPU 85%) | Failed (needs undirected) | OOM |
| BC Pair-Sampled (ε=0.05) | 5.1s (warm) | Failed | OOM |
| WCC / K-Core / Louvain | All completed | All failed | All OOM |
| Algorithms completed | 8 / 8 | 1 / 8 | 0 / 8 |
*cuGraph BFS ran on directed-only graph (1.8B edges). cuGraph could not build undirected Friendster on 96 GB GPU due to: cuDF CSV parser OOM, CSR sort+symmetrize OOM, and cudf int32 size_type limit (3.6B rows > 2.1B max). cuGraph is optimized for specialized GPU graph kernels; xrayGraphDB is optimized for persistent relationship-intelligence workloads at billion-edge scale.
Sub-Second Betweenness Centrality on 3.6 Billion Edges
Friendster: 65.6 million vertices, 3.6 billion undirected edges. Approximate betweenness centrality (ABRA pair-sampled, ε=0.05) completed in 977 milliseconds (warm, production CPU). On the Blackwell GPU server (Docker): 5.1 seconds warm (ε=0.05), 1.5 seconds (ε=0.10).
Approximate Betweenness Centrality — Friendster undirected graph
Parameters: epsilon=0.05, 95% confidence, deterministic seed
Server: 503GB RAM, 64-core AMD EPYC (.187), no GPU
Context: We tested six competitor systems on the same Blackwell GPU hardware with the same dataset: cuGraph, Neo4j, Memgraph, Kuzu, DuckDB, and GraphBLAS. None completed betweenness centrality on Friendster, and some could not even load it. “Impossible” is not a claim; it is a measured result.
Where xrayGraphDB Leaves Everyone Behind
What this measures: Starting from a single person in a social network, how many people can you reach at each degree of separation? Hop 1 = direct friends. Hop 2 = friends of friends. Hop 3 = three degrees out. By hop 4, you've touched 93% of the entire 65-million-person graph.
Why it's so hard: At each hop, the frontier explodes. Hop 3 adds 33 million new vertices. Hop 4 adds another 26 million. The system must track which of 65 million vertices have already been visited, expand every edge from millions of frontier vertices simultaneously, deduplicate the results, and do it all without running out of memory. Most databases crash, OOM, or timeout before hop 4.
xrayGraphDB solves this with a patent-pending native traversal engine. Frontier expansion happens directly against the resident graph — no query parser overhead, no per-call planning. Memory stays bounded as the frontier grows, and the storage layout keeps neighbor lookups cache-friendly even as the graph evolves. The result: full-graph traversal in 5.5 seconds where every competitor tested either crashed or timed out.
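The bookkeeping described above (a visited set, a frontier expanded one level at a time, and deduplication of newly found vertices) is the classic level-synchronous BFS. A minimal sketch on a toy adjacency list follows. It illustrates the general technique only and is not xrayGraphDB's traversal engine, whose internals are not described on this page:

```python
def bfs_levels(adj, source, max_hops):
    """Return the number of newly discovered vertices at each hop."""
    visited = {source}
    frontier = [source]
    new_per_hop = []
    for _ in range(max_hops):
        next_frontier = []
        for u in frontier:
            for v in adj.get(u, ()):
                if v not in visited:       # dedupe against everything seen so far
                    visited.add(v)
                    next_frontier.append(v)
        if not next_frontier:
            break                          # graph exhausted
        new_per_hop.append(len(next_frontier))
        frontier = next_frontier
    return new_per_hop

toy = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}
print(bfs_levels(toy, 0, 10))  # prints [2, 1, 1]
```

The visited set is what keeps memory proportional to the vertex count rather than to the number of paths, which is the property the paragraph above depends on.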
Friendster: 65.6 million nodes, 3.6 billion undirected edges. Blackwell GPU server (Docker container). Source: vertex 71768986 (degree 5,214).
| Hop | Cumulative Vertices | Coverage | Competitors |
|---|---|---|---|
| 1 | 5,215 | 0.0% | DuckDB: 1.8s, Kuzu: 0.2s |
| 2 | 2,151,463 | 3.3% | DuckDB: TIMEOUT. Kuzu: 0.9s |
| 3 | 35,113,876 | 53.5% | Kuzu: TIMEOUT. Neo4j: 10s |
| 4 | 61,240,094 | 93.3% | Neo4j: TIMEOUT. All others: OOM |
| 5 | 64,261,382 | 97.9% | No competitor reached hop 5 |
| 10 | 65,599,565 | 99.99% | No competitor reached hop 5 |
| 15 | 65,608,278 | 100.0% | No competitor reached hop 5 |
| Total time | 5.505 seconds | 100% | All failed or timed out |
Vertex counts verified across systems: hop 1 = 5,214–5,215 (xrayGraphDB, Kuzu, and DuckDB), hop 2 = 2,151,463 (exact match between xrayGraphDB and Kuzu; DuckDB timed out at hop 2). Same source vertex (71768986), same dataset, same hardware. Apples to apples.
Why competitors fail: Traditional databases use Cypher query planners that expand variable-length paths via depth-first search. At hop 3, the frontier is 33 million vertices — each with ~55 neighbors on average. That's 1.8 billion edge lookups in a single hop. DFS-based planners either explode in memory (tracking all paths) or degenerate into full table scans. Recursive CTEs (DuckDB, PostgreSQL) perform disjunctive joins against 1.8 billion rows per hop — each join slower than the last. In-memory databases (Memgraph) simply cannot hold 3.6 billion edges in RAM.
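The 1.8 billion figure can be checked from numbers quoted on this page. This is a rough estimate, assuming the average degree is the stored bidirectional edge count divided by the vertex count; the true number of lookups depends on the actual degrees of the frontier vertices:

```python
VERTICES = 65_608_366            # Friendster vertex count
STORED_EDGES = 3_612_134_270     # bidirectional edges as stored
HOP3_FRONTIER = 32_962_413       # new vertices discovered at hop 3

avg_degree = STORED_EDGES / VERTICES
edge_lookups = HOP3_FRONTIER * avg_degree

print(f"average degree ~ {avg_degree:.1f}")
print(f"hop-3 edge lookups ~ {edge_lookups / 1e9:.2f} billion")
```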
xrayGraphDB's patent-pending native traversal engine completed 15 hops on 3.6 billion edges in 5.5 seconds. This is not a synthetic benchmark — Friendster is a real social graph with extreme skew, giant hubs, and combinatorial frontier explosions.
Cypher BFS: 12.16 Billion Paths at Hop 9
Variable-length Cypher path expansion on Friendster, using MATCH (p)-[:KNOWS*1..N]-(f) RETURN count(f), which returns the raw path count with no DISTINCT. This is the combinatorial explosion that destroys every other database.
12.16 billion paths
Measured, not estimated. Cypher variable-length expansion. 28.4 minutes at hop 9.
Each hop multiplies the frontier by the average degree (~55). By hop 9, the path count exceeds 12 billion. No other database we tested survived past hop 4 on Friendster using Cypher-style path expansion. xrayGraphDB kept going — we stopped at hop 9 because the result was proven, not because the engine failed.
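To see why a raw path count grows so much faster than the number of reachable vertices, here is a toy comparison. It assumes Cypher's usual rule that a relationship may not repeat within a single path, and it is an illustration of the counting difference, not a model of any engine's planner:

```python
from collections import deque

def build_adjacency(edges):
    """Undirected adjacency list that remembers each edge's id."""
    adj = {}
    for eid, (u, v) in enumerate(edges):
        adj.setdefault(u, []).append((v, eid))
        adj.setdefault(v, []).append((u, eid))
    return adj

def count_paths(adj, source, max_hops):
    """Count paths of length 1..max_hops that never reuse an edge."""
    def walk(node, used, depth):
        if depth == max_hops:
            return 0
        total = 0
        for nxt, eid in adj.get(node, []):
            if eid not in used:
                total += 1 + walk(nxt, used | {eid}, depth + 1)
        return total
    return walk(source, frozenset(), 0)

def count_distinct(adj, source, max_hops):
    """Count distinct vertices within max_hops of source (source excluded)."""
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue
        for nxt, _ in adj.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return len(seen) - 1

adj = build_adjacency([(0, 1), (0, 2), (1, 2), (2, 3)])
print(count_paths(adj, 0, 3), count_distinct(adj, 0, 3))  # prints 8 3
```

Even on four vertices the path count (8) is well above the distinct-vertex count (3); cycles are what drive the gap, and Friendster has billions of them.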
GPU-Accelerated Analytics
GPU analytics on Friendster (3.6B edges) using the RTX PRO 6000 Blackwell (96 GB VRAM, Docker container). GPU kernels compiled at startup — no CUDA toolkit dependency at runtime.
| Analytics Procedure | Dataset | Time | GPU Util | VRAM |
|---|---|---|---|---|
| Triangle Count | Friendster 3.6B | 38.0s | 85% | 15 GB |
| K-Core Decomposition | Friendster 3.6B | 111.1s | 71% | 15 GB |
| PageRank (20 iterations) | Friendster 3.6B | 94.1s | CPU | — |
| Connected Components | Friendster 3.6B | 38.2s | CPU | — |
| Community Detection (20 iter) | Friendster 3.6B | 274.2s | CPU | — |
| BFS (15 hops, full graph) | Friendster 3.6B | 5.5s | CPU | — |
RTX PRO 6000 Blackwell Server Edition, 96 GB VRAM, SM 12.0, 188 SMs. Docker container. Triangle Count and K-Core dispatch to GPU; PageRank, WCC, Community, and BFS currently run on CPU. Triangle count verified against SNAP ground truth: 4,173,724,142 (exact match).
cuGraph — NVIDIA's own GPU graph library — could not even build the undirected Friendster graph on this same 96 GB GPU. xrayGraphDB ran all algorithms successfully.
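For readers unfamiliar with the workload, triangle counting on an undirected graph can be sketched with neighbor-set intersections. This toy version illustrates the problem being measured; it is not the GPU kernel used for the numbers above:

```python
def count_triangles(edges):
    """Count triangles in an undirected simple graph given as (u, v) pairs."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue                       # ignore self-loops
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    triangles = 0
    for u, nbrs in adj.items():
        for v in nbrs:
            if u < v:
                # Only count a common neighbor w with w > v, so each
                # triangle u < v < w is counted exactly once.
                triangles += sum(1 for w in nbrs & adj[v] if w > v)
    return triangles

print(count_triangles([(0, 1), (1, 2), (0, 2), (2, 3), (3, 0)]))  # prints 2
```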
Friendster: 3.6 Billion Undirected Edges, One Server
65.6 million vertices. 3.6 billion undirected edges. Loaded in 6.25 minutes at 1.8 million edges/sec. Full analytics suite completed. RTX PRO 6000 Blackwell (96 GB VRAM). May 2026.
BFS from vertex 71768986 (degree 5,214). Native frontier expansion against the resident graph. The graph is fully exhausted — 100% of vertices reached. We could keep going; there's nothing left to find.
| Hop | New Vertices (frontier) | Cumulative Reached | Coverage |
|---|---|---|---|
| 1 | 5,214 | 5,215 | 0.0% |
| 2 | 2,146,248 | 2,151,463 | 3.3% |
| 3 | 32,962,413 | 35,113,876 | 53.5% |
| 4 | 26,126,218 | 61,240,094 | 93.3% |
| 5 | 3,021,288 | 64,261,382 | 97.9% |
| 6 | 892,650 | 65,154,032 | 99.3% |
| 7 | 286,864 | 65,440,896 | 99.7% |
| 8 | 104,146 | 65,545,042 | 99.9% |
| 9 | 39,421 | 65,584,463 | 99.96% |
| 10 | 15,102 | 65,599,565 | 99.99% |
| 11 | 5,526 | 65,605,091 | 99.995% |
| 12 | 2,061 | 65,607,152 | 99.998% |
| 13 | 745 | 65,607,897 | 99.999% |
| 14 | 269 | 65,608,166 | 100.0% |
| 15 | 112 | 65,608,278 | 100.0% |
| Total | graph exhausted | 65,608,278 | 100% |
5.505 seconds
15 hops. 65.6 million vertices. 3.6 billion edges. One server. Docker container. Graph exhausted.
The peak frontier explosion is at hops 3–4: 59 million new vertices discovered in two levels. That's 1.8 billion edge lookups per level, resolved in under a second each. We stopped at hop 15 because the graph was exhausted — not because the engine couldn't continue.
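The cumulative and coverage columns follow directly from the per-hop frontier sizes. A short check using the values from the table and the Friendster vertex count of 65,608,366:

```python
from itertools import accumulate

TOTAL_VERTICES = 65_608_366
NEW_PER_HOP = [5_214, 2_146_248, 32_962_413, 26_126_218, 3_021_288,
               892_650, 286_864, 104_146, 39_421, 15_102,
               5_526, 2_061, 745, 269, 112]

# The source vertex itself counts as reached before hop 1.
cumulative = list(accumulate(NEW_PER_HOP, initial=1))[1:]

for hop, reached in enumerate(cumulative, start=1):
    print(f"hop {hop:2d}: {reached:>10,} reached ({reached / TOTAL_VERTICES:.3%})")

# Hops 3 and 4 together account for the "59 million" figure.
print(f"hops 3+4: {NEW_PER_HOP[2] + NEW_PER_HOP[3]:,} new vertices")
```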
10-hop BFS on 3.6 billion edges: not feasible for everyone else. xrayGraphDB: 476ms.
Friendster Analytics Suite — Three Servers
65.6M vertices, 3.6B undirected edges. Three servers tested: Blackwell GPU (Docker container, 96GB VRAM, 16-core EPYC), Production CPU (503GB, 64-core EPYC, bare-metal), and Budget (62GB, T1000 8GB). The Blackwell numbers are from a Docker container, with zero overhead verified. The budget server runs out of memory on four of the ten procedures, but still completes triangle count, betweenness centrality, shortest path, Jaccard similarity, and link prediction, which no tested competitor completed.
| Procedure | Blackwell GPU (Docker) | Production CPU (503GB) | Budget (62GB, T1000) | Any Competitor* |
|---|---|---|---|---|
| Triangle Count | 38s GPU 85% | 142s | 537s | did not complete* |
| Connected Components | 38s | 75s | OOM (62GB) | did not complete* |
| PageRank (20 iter) | 94s | 231s | OOM (62GB) | did not complete* |
| K-Core | 111s GPU 71% | 125s | OOM (62GB) | did not complete* |
| Community (20 iter) | 274s | 215s | OOM (62GB) | did not complete* |
| BC Pair-Sampled (ε=0.05) | 5.1s | 6.2s | 2.8s | did not complete* |
| BC Pair-Sampled (ε=0.10) | 1.5s | 1.9s | 1.4s | did not complete* |
| Shortest Path (hub-to-hub) | 226ms | 226ms | 439ms | did not complete* |
| Jaccard Similarity | 2.0ms | 2.0ms | 2.2ms | did not complete* |
| Link Prediction | 1ms | 1ms | 3ms | did not complete* |
Triangle count verified against SNAP ground truth: 4,173,724,142 triangles (exact match). Blackwell column is a Docker container — not bare-metal. Zero Docker overhead verified. BC uses the ABRA pair-sampled algorithm. First call initializes; subsequent calls are faster.
38-second triangle count on 3.6 billion edges. GPU at 85%. SNAP-verified exact match.
*We tested six competitors on the same Blackwell hardware with the same Friendster dataset. None completed these workloads under the tested single-GPU configuration: cuGraph (OOM building undirected graph), Kuzu (hop 3 timeout), DuckDB (hop 2 timeout), Neo4j (hop 4 timeout), Memgraph (OOM loading), GraphBLAS (OOM during BFS). All scripts and logs published at github.com/eMTAi-Labs/xraygraph-bench.
Data Loading Speed Comparison
LDBC SF1 dataset. Measured on competitor server (187GB RAM, 44-core Xeon Gold 6152). xrayGraphDB Friendster on .187 (503GB RAM, 64-core EPYC).
| Database | Load Rate | Notes |
|---|---|---|
| xrayGraphDB (Bolt) | 261–598K/s | Bolt UNWIND batch loading |
| xrayGraphDB (native) | 6.25 min (Friendster) | 1.8M edges/sec, bulk import |
| DuckDB | 1–5M/s | Columnar bulk COPY, fastest ingest |
| PostgreSQL | 270K–1.2M/s | COPY command, index rebuild |
| MySQL | 100–266K/s | LOAD DATA INFILE |
| Neo4j | 12–14K/s | Cypher LOAD CSV (not admin import) |
| Memgraph | 8–26K/s | LOAD CSV, severe bottleneck |
DuckDB is legitimately fast at bulk ingest — it is a columnar analytics engine optimized for COPY. Graph databases (Neo4j, Memgraph) are 10–100x slower at loading due to index maintenance during insert.
Docker Performance — Zero Overhead
Docker container vs bare-metal on identical hardware and dataset. No performance penalty.
| Metric | Docker | Bare-Metal | Difference |
|---|---|---|---|
| Protocol latency (RETURN 1) | 0.24ms | 0.47ms | Docker faster (noise) |
| Analytics performance | Identical | Identical | Within noise |
Docker uses Linux namespaces and cgroups — no hypervisor, no VM overhead. The 0.24ms vs 0.47ms difference is TCP stack variance, not container overhead.
xrayProtocol vs Bolt — Same Database
xrayGraphDB v4.9.5, same queries, same data. Bare-metal measurements.
| Query | Bolt (7687) | xrayProtocol (7689) | Speedup |
|---|---|---|---|
| RETURN 1 | 0.94ms | 0.47ms | ~2x |
| COUNT all nodes | 1.94ms | 1.20ms | 1.6x |
| LIMIT 100 | 4.75ms | 0.27ms | 17.6x |
xrayProtocol p50 for RETURN 1: 0.47ms (bare-metal). Bolt overhead is approximately 2x on trivial queries. The gap widens dramatically on result-heavy queries (LIMIT 100: 17.6x) due to columnar serialization.
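The speedup column is the Bolt latency divided by the xrayProtocol latency. A short check against the rows above:

```python
# (query, Bolt ms, xrayProtocol ms) copied from the table above
ROWS = [
    ("RETURN 1", 0.94, 0.47),
    ("COUNT all nodes", 1.94, 1.20),
    ("LIMIT 100", 4.75, 0.27),
]

def speedup(baseline_ms, candidate_ms):
    """How many times faster the candidate is than the baseline."""
    return baseline_ms / candidate_ms

for name, bolt_ms, xray_ms in ROWS:
    print(f"{name:16s} {speedup(bolt_ms, xray_ms):.1f}x")
```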
Cross-Database Comparison — LiveJournal (4.8M nodes, 69M edges)
Protocol: Bolt (common denominator for fair comparison)
| Query | xrayGraphDB | Memgraph 2.22 | Speedup |
|---|---|---|---|
| RETURN 1+1 | 0.77ms | 0.89ms | 1.2x |
| COUNT all nodes | 1.94ms | 19.13ms | 9.9x |
| LIMIT 100 | 4.75ms | 5.09ms | 1.1x |
| LIMIT 10,000 | 28.61ms | 449.02ms | 15.7x |
| 1-hop traversal | 1.29ms | 42.96ms | 33.3x |
| 2-hop traversal | 1.43ms | 49.66ms | 34.7x |
| COUNT all edges | 1.53ms | 84.06ms | 54.9x |
Data load: xrayGraphDB 1.4s (Persistent graph store) vs Memgraph 9,668s (failed at 150K of 69M edges)
Competitor Limitations at Friendster Scale
We tested every major graph database, GPU compute engine, and analytics library on Friendster (65.6M vertices, 1.8B edges) with the same Blackwell GPU hardware. Results published transparently — including wins AND failures.
| System | Type | Friendster Result |
|---|---|---|
| cuGraph 26.02 | GPU compute | BFS 82ms (20 GTEPS) but 7/8 algorithms failed — cannot build undirected graph on 96 GB GPU |
| Kuzu 0.11 | Embedded graph DB | Loaded in 218s, hop 1–2 worked (0.2s, 0.9s), hop 3 timed out >600s |
| DuckDB 1.5 | Analytical engine | CSV load in 28s (fastest ingest), hop 1 in 1.8s, hop 2 timed out >600s |
| Neo4j 2025.04 | Graph database | Import 13.6 min, hop 3 in 10s, hop 4 timed out. No GDS in Community edition |
| Memgraph 2.22 | In-memory graph | OOM-killed loading 1.8B edges (exceeded 120 GB MemoryMax in 5 min) |
| GraphBLAS 9.4 | CPU sparse matrix | Loaded 3.6B entries (102s) but OOM during BFS (142 GB RAM insufficient) |
| TigerGraph | Distributed graph | Registration wall — cannot download without enterprise contact |
| FalkorDB | Redis-based graph | Skipped — Redis in-memory architecture will OOM on 1.8B edges |
All scripts, logs, and raw data available at github.com/eMTAi-Labs/xraygraph-bench. Methodology: same hardware, same dataset, same source vertex. Wins AND losses published.
Where the Speed Comes From
Vectorized Pipeline
Column-oriented batch processing tuned for modern CPU caches and vectorized execution.
xrayProtocol
Columnar wire format with LZ4 compression. Results stream column-by-column instead of row-by-row. Up to 17.6x faster than Bolt in the protocol comparison above (LIMIT 100).
Plan Cache
AST fingerprinting with 425x speedup. Parameterized queries hit cache immediately. Auto-invalidation on schema changes.
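To illustrate the idea, here is a minimal plan-cache sketch. It fingerprints queries by replacing literals in the query text with placeholders, which is a simpler stand-in for AST fingerprinting. The class name, the `planner` callable, and the `schema_version` invalidation key are all hypothetical and do not describe xrayGraphDB internals:

```python
import hashlib
import re

# Matches single-quoted strings and standalone numeric literals.
_LITERAL = re.compile(r"'(?:[^'\\]|\\.)*'|\b\d+(?:\.\d+)?\b")

def fingerprint(query):
    """Hash the query text with literals replaced by '?' and whitespace collapsed."""
    normalized = _LITERAL.sub("?", query)
    normalized = " ".join(normalized.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

class PlanCache:
    def __init__(self, planner):
        self._planner = planner      # hypothetical: turns query text into a plan
        self._plans = {}
        self.schema_version = 0

    def get_plan(self, query):
        key = (self.schema_version, fingerprint(query))
        if key not in self._plans:
            self._plans[key] = self._planner(query)   # cache miss: plan once
        return self._plans[key]

    def invalidate(self):
        """Call on schema change so stale plans are never reused."""
        self.schema_version += 1
        self._plans.clear()
```

Two queries that differ only in a literal, such as `n.id = 42` and `n.id = 7`, share one fingerprint and therefore one cached plan.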
Zero-GC Memory
Per-query memory allocation with zero GC pauses. Deterministic cleanup. No fragmentation, no leaks, no stop-the-world.
SIMD + GPU
SIMD-accelerated graph operations on CPU. In the benchmarks on this page, triangle count and K-core dispatch to the GPU; PageRank, WCC, community detection (label propagation), and BFS currently run on CPU. Falls back to CPU when no GPU is available.
Streaming Bulk Import
Builds billion-edge graphs on a single server with bounded peak memory. Tuned for NVMe with kernel-level I/O acceleration.
Total Geekout: Reproduce Every Number
Every number on this page is reproducible. Here is exactly how.
Hardware
| Server | Configuration |
|---|---|
| Blackwell GPU Server | RTX PRO 6000 Blackwell Server Edition (96 GB VRAM, SM 12.0, 188 SMs); 16 vCPU AMD EPYC 9355, 144 GB RAM, 725 GB SSD; Ubuntu 22.04, CUDA driver 580.126.20, Docker container |
| Production CPU Server | 64-core AMD EPYC @ 2.9 GHz, 503 GB RAM, no GPU; bare-metal, Ubuntu 24 |
| Budget Server | 28-core Xeon E5-2650L @ 1.7 GHz, 62 GB RAM, NVIDIA T1000 8 GB; bare-metal |
| Competitor Server | 44-core Xeon Gold 6152, 187 GB RAM, Tesla T4 16 GB; bare-metal. Used for LDBC SF1 competitor testing. |
Dataset
# Friendster (SNAP)
wget https://snap.stanford.edu/data/bigdata/communities/com-friendster.ungraph.txt.gz
gunzip com-friendster.ungraph.txt.gz
# 65,608,366 vertices, 1,806,067,135 undirected edges
# Stored as 3,612,134,270 bidirectional edges
# File: 31 GB, tab-separated, # comment lines
# SHA-256 of uncompressed: verify with sha256sum
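A small helper for checking a downloaded copy, assuming the SNAP edge-list layout described above (one tab-separated vertex pair per line, with # comment lines). It streams the file once, so it can handle the 31 GB file without loading it into memory. The function name is illustrative, and no reference hash is shown because this page does not publish one:

```python
import hashlib

def checksum_and_count_edges(path):
    """Return (sha256 hex digest, number of non-comment edge lines)."""
    digest = hashlib.sha256()
    edges = 0
    with open(path, "rb") as f:
        for line in f:                       # streams line by line
            digest.update(line)
            if line.strip() and not line.startswith(b"#"):
                edges += 1
    return digest.hexdigest(), edges
```

Calling `checksum_and_count_edges("com-friendster.ungraph.txt")` should report 1,806,067,135 edge lines if the file matches the counts listed above, and the digest should equal the output of sha256sum on the same file.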
xrayGraphDB Setup
# 1. Set kernel parameter (required for large graph builds)
sudo sysctl -w vm.max_map_count=1048576
# 2. Start xrayGraphDB (Docker, GPU-enabled)
docker run -d --user 0:0 --gpus all --shm-size 10g --network=host \
-v /var/lib/xraygraphdb:/var/lib/xraygraphdb \
-v /usr/local/cuda-12.4/targets/x86_64-linux/lib:/usr/local/cuda/lib64:ro \
-e LD_LIBRARY_PATH=/usr/lib/xraygraphdb/lib:/usr/local/cuda/lib64 \
--name xg-bench \
xraygraphdb.emtailabs.com/xraygraphdb:latest \
--data-directory=/var/lib/xraygraphdb \
--bolt-port=7687 --xray-port=7689 \
--storage-engine=mmap --storage-properties-on-edges=true \
--log-level=INFO --also-log-to-stderr=true \
--license-acknowledge-saved=true \
--init-admin-user=admin --init-admin-password=xraygraphdb \
--init-admin-tenant=xraygraphdb
# 3. Load Friendster (copy into container, then import)
docker cp com-friendster.ungraph.txt xg-bench:/tmp/xraygraphdb-import/
# Then via xgdb_connect Python client:
from xgdb_connect.protocol import XrayProtocolClient
c = XrayProtocolClient(host="127.0.0.1", port=7689,
auth_token="admin:xraygraphdb", database="xraygraphdb")
result = c.bulk_import_file("/tmp/xraygraphdb-import/com-friendster.ungraph.txt")
# See scripts/ directory for exact commands
# Result: 65,608,366 vertices, 3,612,134,270 edges in ~375s
Running the Benchmarks
# All scripts at: github.com/eMTAi-Labs/xraygraph-bench/scripts/
# GPU analytics suite (PageRank, TriangleCount, WCC, K-Core, BC, Community)
python3 blackwell_gpu_rerun.py
# Graph500-style TEPS measurement (16 BFS sources)
python3 graph500_teps.py
# Apples-to-apples competitor comparison (same source vertex 71768986)
python3 apples_to_apples.py
# cuGraph head-to-head (same hardware, same dataset)
python3 cugraph_bench.py
# Individual competitor benchmarks
python3 duckdb_bench.py
python3 kuzu_bench.py
python3 neo4j_gds_bench.py
python3 memgraph_bench.py
python3 graphblas_bench.py
Standard Source Vertex
All BFS comparisons use vertex 71768986 (undirected degree 5,214, the highest in Friendster). This ensures apples-to-apples comparison across systems. Vertex counts were cross-checked against Kuzu and DuckDB for every hop those systems completed:
- Hop 1: 5,214–5,215 vertices (xrayGraphDB, Kuzu, and DuckDB agree)
- Hop 2: 2,151,463 vertices (exact match between xrayGraphDB and Kuzu; DuckDB timed out)
- Hop 3: 35,113,876 vertices (only xrayGraphDB reported this count; Kuzu timed out)
Algorithm Parameters
| Algorithm | Parameters |
|---|---|
| PageRank | 20 iterations, damping=0.85, tolerance=0.0 |
| Triangle Count | Undirected, verified against SNAP ground truth (4,173,724,142) |
| Betweenness Centrality | Approximate, 50 sampled sources, epsilon=0.05 |
| Community Detection | Label propagation, 20 iterations |
| K-Core | Full core decomposition (max core = 304) |
| BFS / TEPS | Native BFS, OUTGOING direction, up to 20 hops |
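With tolerance=0.0 a convergence check never triggers, so PageRank runs for exactly the configured 20 iterations. A minimal power-iteration sketch with those parameters on a toy graph follows. This is the textbook algorithm and makes no claim about how xrayGraphDB implements it:

```python
def pagerank(out_edges, iterations=20, damping=0.85):
    """Fixed-iteration PageRank; out_edges maps vertex -> list of targets."""
    nodes = set(out_edges) | {v for targets in out_edges.values() for v in targets}
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Rank held by vertices with no out-edges is spread uniformly.
        dangling = sum(rank[v] for v in nodes if not out_edges.get(v))
        base = (1.0 - damping) / n + damping * dangling / n
        nxt = {v: base for v in nodes}
        for u, targets in out_edges.items():
            if targets:
                share = damping * rank[u] / len(targets)
                for v in targets:
                    nxt[v] += share
        rank = nxt
    return rank

ranks = pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
print({v: round(r, 3) for v, r in sorted(ranks.items())})
```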
Competitor Versions Tested
| System | Version and install |
|---|---|
| cuGraph | 26.02.00 (RAPIDS), cuDF 26.02.01, pip install cugraph-cu12 |
| Neo4j | 2025.04.0 Community (tarball), no GDS plugin available |
| Memgraph | 2.22.0 (deb package), native install, no MAGE |
| Kuzu | 0.11.3 (pip install kuzu), embedded |
| DuckDB | 1.5.2 (pip install duckdb), embedded |
| GraphBLAS | SuiteSparse 9.4.5 via python-graphblas 2025.2.0 |
cuGraph Failure Analysis
cuGraph failed 7 of 8 algorithms on the same 96 GB Blackwell GPU. Three separate failures prevented undirected graph construction:
- cuDF CSV parser OOM: cudf.read_csv() consumed 92 GB of 96 GB VRAM parsing the 31 GB text file before crashing.
- CSR sort+symmetrize OOM: After CPU-read fallback, cuGraph's undirected CSR builder OOM'd during radix sort — even with 82 GB VRAM free.
- cudf int32 size_type limit: Pre-symmetrized 3.6B rows exceed cudf's int32 offset maximum (2,147,483,647). Fundamental limitation.
Only a directed graph (1.8B edges) could be constructed. On that directed graph, cuGraph BFS achieved 20.3 GTEPS (82ms) — but PageRank failed (FailedToConvergeError on asymmetric link structure), and all algorithms requiring undirected input returned “input graph must be undirected.”
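The int32 limit is plain arithmetic on the edge counts listed in the Dataset section. A quick check of why the directed edge list fits and the symmetrized one does not:

```python
INT32_MAX = 2**31 - 1                 # 2,147,483,647
UNDIRECTED_EDGES = 1_806_067_135      # Friendster edge list as published by SNAP
symmetrized_rows = 2 * UNDIRECTED_EDGES

def fits_int32(rows):
    """True if a row count can be indexed with a signed 32-bit integer."""
    return rows <= INT32_MAX

print(f"directed rows:    {UNDIRECTED_EDGES:,} fits={fits_int32(UNDIRECTED_EDGES)}")
print(f"symmetrized rows: {symmetrized_rows:,} fits={fits_int32(symmetrized_rows)}")
```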
Important: cuGraph is not “bad.” It is optimized for specialized GPU graph kernels with maximum throughput. xrayGraphDB is optimized for persistent relationship-intelligence workloads at billion-edge scale. These are different system categories with different tradeoffs, and the comparison reflects that distinction honestly.
Raw Data & Scripts
Everything is published at github.com/eMTAi-Labs/xraygraph-bench:
- results/BLACKWELL-GPU-RERUN-20260510.md: full GPU rerun analysis
- results/CUGRAPH-COMPARISON-20260509.md: cuGraph head-to-head writeup
- results/blackwell_gpu_rerun_20260510.json: raw JSON results
- results/cugraph_blackwell.json: cuGraph raw JSON
- results/apples_to_apples_blackwell.log: 7-system BFS comparison log
- BENCHMARK-METHODOLOGY.md: 15-rule courtroom-clean methodology
- REPRODUCIBILITY.md: step-by-step reproduction guide
If you can reproduce a different result, we want to know. File an issue.