Every number on this page was measured with AES-256-GCM per-tenant encryption active.

xrayGraphDB encrypts all data at the storage layer — per tenant, per database, always on. There is no “turn off encryption” flag. These benchmarks include the full cost of encrypt-on-write and decrypt-on-read for every edge traversal, every analytic scan, and every query result. When comparing against systems that store data unencrypted, keep this in mind: xrayGraphDB is doing more work on every operation and still posting these numbers.

What Workloads Become Possible

When your graph engine survives billion-edge traversal on a single server, workloads that were previously “requires a cluster” or “not feasible” become interactive.

Workload | Traditional Systems | xrayGraphDB
8-hop fraud detection | Times out or requires cluster | Interactive (seconds)
Infrastructure dependency tracing | Partial coverage, shallow depth | Full graph, any depth
Sanctions / ownership analysis | Manual, batch, incomplete | Real-time, exhaustive
Cyber lateral movement detection | Alert-based, no path context | Full attack path in ms
AI knowledge / memory graphs | Fragmented across systems | Unified graph + vector + geo
Billion-edge analytics | Distributed cluster or not feasible | Single server, Docker container

Feature Comparison

Feature xrayGraphDB cuGraph Neo4j Memgraph Kuzu DuckDB PostgreSQL
Cypher support Full + Neo4j compat Native Partial Cypher-like
GPU acceleration CUDA (native) CUDA native
Native analytics (graph algos) 26 procedures BFS, PR, TC, WCC... GDS plugin MAGE plugin
Friendster 1.8B edges All algos work BFS only (directed) Hop 4 timeout OOM at 120GB Hop 3 timeout Hop 2 timeout
Persistent storage Persistent graph store In-memory only Disk + cache In-memory only Columnar disk Embedded Disk
Columnar wire protocol xrayProtocol Python API Python API
SIMD graph operations Yes (SIMD) SIMD (columnar)
Graph traversal engine Cypher + native BFS GPU BFS kernel Cypher planner Cypher planner Cypher-like Recursive CTE Recursive CTE

Performance Numbers

Measured across three servers with LDBC SF1 (3.18M nodes, 17.2M edges) and Friendster (65.6M nodes; 1.8B undirected edges, stored as 3.6B bidirectional edges). May 2026.

5.5s: 15-hop BFS (65.6M vertices, 3.6B edges)
38s: Triangle Count (GPU 85%) (4.17B triangles, SNAP verified)
8/8: Algorithms completed (vs cuGraph 1/8, others 0/8)
1.8M: Edges per second (Friendster load, 6 min)

Server A (.187): 503GB RAM, 64-core EPYC, no GPU. Server B: Docker container on production host. Server C (.68): 62GB RAM, 28-core Xeon E5-2650L, T1000 8GB GPU. Blackwell GPU: RTX PRO 6000 96GB VRAM, 16 vCPU, 144GB RAM. Competitor server: 187GB RAM, 44-core Xeon Gold 6152, Tesla T4. xrayGraphDB v4.9.5+.

LDBC SF1 Interactive Queries — 7-Database Comparison

3.18M nodes, 17.2M edges. All databases on identical LDBC SF1 dataset. Times are warm p50. Units: milliseconds.

Query | xrayGraphDB | Memgraph | Neo4j | NebulaGraph | DuckDB | PostgreSQL | MySQL
IS1: Profile lookup | 0.7ms | 1.1ms | 2.4ms | 1.6ms | 0.7ms | 55ms | 7.7ms
IS3: Friends of person | 0.9ms | 1.1ms | 2.0ms | 2.2ms | 1.9ms | 47ms | 8.3ms
IC5: New groups of friends | 1.1ms | 1,078ms | 707ms | 1,428ms | 78ms | 3,017ms | 1,959ms
IC11: Job referral | 1.0ms | 2.7ms | 3.5ms | 80.8ms | N/A | N/A | N/A
Edge count | 0.5ms | 731ms | 1.5ms | 1.7ms | 0.5ms | 62ms | 68ms
Node count | 0.5ms | 470ms | 1.2ms | 1.3ms | 0.4ms | 49ms | 9.2ms

IC5 is the key differentiator: multi-hop join with grouping. xrayGraphDB completes in 1.1ms where competitors need 78ms–3,017ms. IC11 and graph-specific queries are N/A for SQL databases that lack native traversal. Competitor server: 187GB RAM, 44-core Xeon Gold 6152, Tesla T4.
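To make the IC5 gap concrete, the ratios can be computed directly from the table. This is plain arithmetic on the published warm p50 values; the variable names are ours and nothing here queries a database.

```python
# IC5 warm p50 latencies in milliseconds, copied from the LDBC SF1 table.
ic5_ms = {
    "xrayGraphDB": 1.1,
    "Memgraph": 1078.0,
    "Neo4j": 707.0,
    "NebulaGraph": 1428.0,
    "DuckDB": 78.0,
    "PostgreSQL": 3017.0,
    "MySQL": 1959.0,
}

baseline = ic5_ms["xrayGraphDB"]

# Ratio of each system's latency to the xrayGraphDB latency.
slowdown = {
    name: ms / baseline
    for name, ms in ic5_ms.items()
    if name != "xrayGraphDB"
}

for name, ratio in sorted(slowdown.items(), key=lambda kv: kv[1]):
    print(f"{name:12s} {ratio:8.1f}x")
```

The smallest ratio belongs to DuckDB and the largest to PostgreSQL, matching the 78ms–3,017ms range quoted above.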

Apples-to-Apples: Same Vertex, Same Dataset, 7 Systems

BFS from vertex 71768986 (undirected degree 5,214). Friendster: 65.6M vertices, 1.8B edges. RTX PRO 6000 Blackwell (96 GB), 16 vCPU, 144 GB RAM. May 2026.

System | Hop 1 | Hop 2 | Hop 3 | Hop 4+ | Result
xrayGraphDB | 5,215 | 2,151,463 | 35,113,876 | ...10 hops | 24s total, 65.6M reached
cuGraph 26.02 | BFS kernel: 82ms (20 GTEPS, directed only) | | | | 7/8 algorithms FAILED
Kuzu 0.11 | 5,214 | 2,151,463 | TIMEOUT >600s | | Did not complete hop 3
DuckDB 1.5 | 5,215 | TIMEOUT >600s | | | Did not complete hop 2
Neo4j 2025.04 | 203 | 7,718 | 461,029 | TIMEOUT | Did not complete hop 4
Memgraph 2.22 | OOM-killed loading 1.8B edges (120 GB) | | | | Cannot load Friendster
GraphBLAS 9.4 | OOM during BFS (142 GB RAM insufficient) | | | | Cannot run BFS

Vertex counts verified across xrayGraphDB, Kuzu, and DuckDB: hop 1 = 5,214–5,215 (consistent with counting or not counting the source vertex itself), hop 2 = 2,151,463 (exact match).

cuGraph is NVIDIA's GPU graph analytics library. It achieves 20 GTEPS on pure BFS — but cannot build an undirected Friendster graph on a single 96 GB GPU (cuDF CSV OOM, CSR sort OOM, int32 size_type limit). xrayGraphDB is a persistent graph runtime, not a BFS kernel. Different workload classes, honestly compared.

GPU Compute Engine Head-to-Head

Same RTX PRO 6000 Blackwell (96 GB VRAM). Same Friendster dataset. Courtroom-clean methodology.

Algorithm | xrayGraphDB | cuGraph 26.02 | GraphBLAS 9.4
Load raw SNAP file | Direct, no preprocess | 3 paths OOM'd | numpy + scipy (102s)
Undirected graph | 3.6B edges | OOM (96 GB VRAM) | 3.6B entries
BFS (15 hops) | 5.5s (449 MTEPS) | 82ms (20 GTEPS)* | OOM
PageRank (20 iter) | 94s | Failed (convergence) | OOM
Triangle Count | 38s (4.17B, GPU 85%) | Failed (needs undirected) | OOM
BC Pair-Sampled (ε=0.05) | 5.1s (warm) | Failed | OOM
WCC / K-Core / Louvain | All completed | All failed | All OOM
Algorithms completed | 8 / 8 | 1 / 8 | 0 / 8

*cuGraph BFS ran on directed-only graph (1.8B edges). cuGraph could not build undirected Friendster on 96 GB GPU due to: cuDF CSV parser OOM, CSR sort+symmetrize OOM, and cudf int32 size_type limit (3.6B rows > 2.1B max). cuGraph is optimized for specialized GPU graph kernels; xrayGraphDB is optimized for persistent relationship-intelligence workloads at billion-edge scale.

Sub-Second Betweenness Centrality on 3.6 Billion Edges

Friendster: 65.6 million vertices, 3.6 billion undirected edges. Approximate betweenness centrality (ABRA pair-sampled, ε=0.05) completed in 977 milliseconds (warm, production CPU). On the Blackwell GPU server (Docker): 5.1 seconds warm (ε=0.05), 1.5 seconds (ε=0.10).

977ms
Approximate Betweenness Centrality — Friendster undirected graph

Parameters: epsilon=0.05, 95% confidence, deterministic seed

Server: 503GB RAM, 64-core AMD EPYC (.187), no GPU
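For readers unfamiliar with pair sampling, here is a minimal sketch of the underlying estimator: sample vertex pairs, and credit each interior vertex with the fraction of shortest paths between the pair that pass through it. This is not xrayGraphDB's implementation and it is not full ABRA; as we understand it, ABRA adds a progressive sampling schedule whose stopping rule ties the sample count to ε and the confidence level. The sketch uses a fixed sample count, and all function names are illustrative.

```python
import random
from collections import deque

def bfs_counts(adj, source):
    """Return (dist, sigma): hop distance and shortest-path count from source."""
    dist = {source: 0}
    sigma = {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:            # first time v is seen
                dist[v] = dist[u] + 1
                sigma[v] = 0
                queue.append(v)
            if dist[v] == dist[u] + 1:   # u is a predecessor of v
                sigma[v] += sigma[u]
    return dist, sigma

def pair_contribution(adj, s, t):
    """Fraction of shortest s-t paths through each interior vertex."""
    dist_s, sigma_s = bfs_counts(adj, s)
    if t not in dist_s:
        return {}
    dist_t, sigma_t = bfs_counts(adj, t)
    total = sigma_s[t]
    return {
        v: sigma_s[v] * sigma_t[v] / total
        for v in dist_s
        if v not in (s, t)
        and v in dist_t
        and dist_s[v] + dist_t[v] == dist_s[t]
    }

def sampled_betweenness(adj, num_pairs, seed=0):
    """Fixed-sample-size estimate of normalized betweenness (deterministic seed)."""
    rng = random.Random(seed)
    nodes = sorted(adj)
    score = dict.fromkeys(nodes, 0.0)
    for _ in range(num_pairs):
        s, t = rng.sample(nodes, 2)
        for v, frac in pair_contribution(adj, s, t).items():
            score[v] += frac / num_pairs
    return score

# A 4-cycle: two equally short routes from 0 to 2, via 1 and via 3.
cycle = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
print(pair_contribution(cycle, 0, 2))
```

Each sampled pair costs two BFS passes, which is why the warm timings depend so heavily on how fast a single traversal runs.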

Context: We tested six competitor systems on the same Blackwell GPU hardware with the same dataset: cuGraph, Neo4j, Memgraph, Kuzu, DuckDB, and GraphBLAS. None completed betweenness centrality on Friendster, and two (cuGraph and Memgraph) could not even load the undirected graph. “Impossible” is not a claim; it is a measured result.

Where xrayGraphDB Leaves Everyone Behind

What this measures: Starting from a single person in a social network, how many people can you reach at each degree of separation? Hop 1 = direct friends. Hop 2 = friends of friends. Hop 3 = three degrees out. By hop 4, you've touched 93% of the entire 65-million-person graph.

Why it's so hard: At each hop, the frontier explodes. Hop 3 adds 33 million new vertices. Hop 4 adds another 26 million. The system must track which of 65 million vertices have already been visited, expand every edge from millions of frontier vertices simultaneously, deduplicate the results, and do it all without running out of memory. Most databases crash, OOM, or timeout before hop 4.
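The bookkeeping described above (visited tracking, frontier expansion, deduplication) is the core of a level-synchronous BFS. The sketch below shows that textbook technique on a toy graph. It is not the patent-pending engine, and a Python set stands in for whatever visited structure a production system would use at 65 million vertices.

```python
from collections import defaultdict

def build_undirected(edges):
    """Adjacency sets for an undirected edge list."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return adj

def bfs_levels(adj, source, max_hops):
    """Level-synchronous BFS: number of newly reached vertices at each hop."""
    visited = {source}
    frontier = [source]
    new_per_hop = []
    for _ in range(max_hops):
        next_frontier = set()
        for u in frontier:
            for v in adj[u]:
                if v not in visited:
                    # The set dedupes vertices reachable via several parents.
                    next_frontier.add(v)
        if not next_frontier:
            break                      # graph exhausted before max_hops
        visited |= next_frontier
        new_per_hop.append(len(next_frontier))
        frontier = list(next_frontier)
    return new_per_hop

toy = build_undirected([(0, 1), (0, 2), (1, 3), (2, 3), (3, 4)])
print(bfs_levels(toy, 0, 15))
```

Vertex 3 is found from both 1 and 2 but counted once; at Friendster scale that same deduplication has to absorb billions of edge lookups per hop.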

xrayGraphDB solves this with a patent-pending native traversal engine. Frontier expansion happens directly against the resident graph — no query parser overhead, no per-call planning. Memory stays bounded as the frontier grows, and the storage layout keeps neighbor lookups cache-friendly even as the graph evolves. The result: full-graph traversal in 5.5 seconds where every competitor tested either crashed or timed out.

Friendster: 65.6 million nodes, 3.6 billion undirected edges. Blackwell GPU server (Docker container). Source: vertex 71768986 (degree 5,214).

Hop | Cumulative Vertices | Coverage | Competitors
1 | 5,215 | 0.0% | DuckDB: 1.8s, Kuzu: 0.2s
2 | 2,151,463 | 3.3% | DuckDB: TIMEOUT. Kuzu: 0.9s
3 | 35,113,876 | 53.5% | Kuzu: TIMEOUT. Neo4j: 10s
4 | 61,240,094 | 93.3% | Neo4j: TIMEOUT. All others: OOM
5 | 64,261,382 | 97.9% | No competitor reached hop 5
10 | 65,599,565 | 99.99% | No competitor reached hop 5
15 | 65,608,278 | 100.0% | No competitor reached hop 5
Total time | 5.505 seconds | 100% | All failed or timed out

Vertex counts verified across systems: hop 1 = 5,214–5,215, hop 2 = 2,151,463 (exact match between xrayGraphDB, Kuzu, and DuckDB). Same source vertex (71768986), same dataset, same hardware. Apples to apples.

Why competitors fail: Traditional databases use Cypher query planners that expand variable-length paths via depth-first search. At hop 3, the frontier is 33 million vertices — each with ~55 neighbors on average. That's 1.8 billion edge lookups in a single hop. DFS-based planners either explode in memory (tracking all paths) or degenerate into full table scans. Recursive CTEs (DuckDB, PostgreSQL) perform disjunctive joins against 1.8 billion rows per hop — each join slower than the last. In-memory databases (Memgraph) simply cannot hold 3.6 billion edges in RAM.
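The recursive-CTE behaviour can be seen on a three-vertex graph. The sketch below uses Python's built-in sqlite3 as a stand-in (an assumption for portability; it is not DuckDB or PostgreSQL, and the table and column names are ours). The CTE deduplicates on (vertex, hop) pairs rather than on vertices, so already-visited vertices are re-derived at every later hop.

```python
import sqlite3

def cte_reach(edges, source, max_hops):
    """Return (rows produced by the recursive CTE, distinct vertices reached)."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE edges (src INTEGER, dst INTEGER)")
    # Store each undirected edge in both directions.
    con.executemany(
        "INSERT INTO edges VALUES (?, ?)",
        [(a, b) for a, b in edges] + [(b, a) for a, b in edges],
    )
    query = """
        WITH RECURSIVE reach(v, hop) AS (
            SELECT ?, 0
            UNION
            SELECT e.dst, r.hop + 1
            FROM reach AS r JOIN edges AS e ON e.src = r.v
            WHERE r.hop < ?
        )
        SELECT COUNT(*), COUNT(DISTINCT v) FROM reach
    """
    rows, distinct = con.execute(query, (source, max_hops)).fetchone()
    con.close()
    return rows, distinct

triangle = [(0, 1), (1, 2), (0, 2)]
for hops in (2, 4):
    print(hops, cte_reach(triangle, 0, hops))
```

On the triangle, the distinct vertex count stays at 3 while the CTE keeps producing rows as the hop limit rises. A BFS with a visited set does no such repeated work.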

xrayGraphDB's patent-pending native traversal engine completed 15 hops on 3.6 billion edges in 5.5 seconds. This is not a synthetic benchmark — Friendster is a real social graph with extreme skew, giant hubs, and combinatorial frontier explosions.

Cypher BFS: 12.16 Billion Paths at Hop 9

Variable-length Cypher path expansion on Friendster. MATCH (p)-[:KNOWS*1..N]-(f) RETURN count(f) — raw path count, no DISTINCT. This is the combinatorial explosion that destroys every other database.
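To show why a raw path count differs from a count of distinct vertices, here is a small sketch that counts paths of length 1..N from a source without reusing an edge within a path. We assume the usual Cypher relationship-uniqueness rule for variable-length patterns; the function name and the toy graph are illustrative, and this is not how xrayGraphDB evaluates the query.

```python
from collections import defaultdict

def count_trails(edges, source, max_len):
    """Count paths of length 1..max_len from source that never reuse an edge."""
    adj = defaultdict(list)
    for eid, (a, b) in enumerate(edges):
        adj[a].append((b, eid))
        adj[b].append((a, eid))

    def extend(u, used, remaining):
        if remaining == 0:
            return 0
        total = 0
        for v, eid in adj[u]:
            if eid not in used:
                # One more path ends at v; keep extending it from there.
                total += 1 + extend(v, used | {eid}, remaining - 1)
        return total

    return extend(source, frozenset(), max_len)

triangle = [(0, 1), (1, 2), (0, 2)]
print([count_trails(triangle, 0, n) for n in (1, 2, 3)])
```

Even on a triangle the path count keeps rising after every vertex has been reached, because each distinct route is counted separately.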

Hops | Paths | Time
1-hop | 22 | 1.9ms
2-hop | 462 | 1.5ms
3-hop | 8,382 | 1.7ms
4-hop | 135,102 | 10ms
5-hop | 1,909,182 | 133ms
6-hop | 23,198,142 | 1.9s
7-hop | 236,087,742 | 21.8s
8-hop | 1,939,204,542 | 3.5 min
9-hop | 12,157,905,342 | 28.4 min

12.16 billion paths
Measured, not estimated. Cypher variable-length expansion. 28.4 minutes at hop 9.

Path counts grow multiplicatively with depth. Friendster's average degree is about 55, and in this run the cumulative path count grew by roughly 21x at hop 2, tapering to roughly 6x at hop 9. By hop 9, the path count exceeds 12 billion. No other database we tested survived past hop 4 on Friendster using Cypher-style path expansion. xrayGraphDB kept going; we stopped at hop 9 because the result was proven, not because the engine failed.
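The growth measured in this run can be read straight off the published counts. The snippet below is arithmetic on those numbers only; it derives the hop-over-hop ratio, the paths added at each depth, and the hop-9 throughput implied by the 28.4-minute figure.

```python
# Cumulative path counts for *1..N, copied from the table above.
cumulative_paths = [
    22, 462, 8_382, 135_102, 1_909_182, 23_198_142,
    236_087_742, 1_939_204_542, 12_157_905_342,
]

pairs = list(zip(cumulative_paths, cumulative_paths[1:]))

# Hop-over-hop growth of the cumulative count.
growth = [later / earlier for earlier, later in pairs]

# Paths of exactly length N (difference of consecutive cumulative counts).
per_hop = [cumulative_paths[0]] + [later - earlier for earlier, later in pairs]

# Throughput at hop 9: 28.4 minutes for the full 1..9 expansion.
paths_per_sec = cumulative_paths[-1] / (28.4 * 60)

for hop, factor in enumerate(growth, start=2):
    print(f"hop {hop}: x{factor:.2f}")
print(f"hop 9 throughput: {paths_per_sec / 1e6:.1f}M paths/sec")
```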

GPU-Accelerated Analytics

GPU analytics on Friendster (3.6B edges) using the RTX PRO 6000 Blackwell (96 GB VRAM, Docker container). GPU kernels compiled at startup — no CUDA toolkit dependency at runtime.

Analytics Procedure | Dataset | Time | GPU Util | VRAM
Triangle Count | Friendster 3.6B | 38.0s | 85% | 15 GB
K-Core Decomposition | Friendster 3.6B | 111.1s | 71% | 15 GB
PageRank (20 iterations) | Friendster 3.6B | 94.1s | CPU |
Connected Components | Friendster 3.6B | 38.2s | CPU |
Community Detection (20 iter) | Friendster 3.6B | 274.2s | CPU |
BFS (15 hops, full graph) | Friendster 3.6B | 5.5s | CPU |

RTX PRO 6000 Blackwell Server Edition, 96 GB VRAM, SM 12.0, 188 SMs. Docker container. Triangle Count and K-Core dispatch to GPU; PageRank, WCC, Community, and BFS currently run on CPU. Triangle count verified against SNAP ground truth: 4,173,724,142 (exact match).

cuGraph — NVIDIA's own GPU graph library — could not even build the undirected Friendster graph on this same 96 GB GPU. xrayGraphDB ran all algorithms successfully.

Friendster: 3.6 Billion Undirected Edges, One Server

65.6 million vertices. 3.6 billion undirected edges. Loaded in 6.25 minutes at 1.8 million edges/sec. Full analytics suite completed. RTX PRO 6000 Blackwell (96 GB VRAM). May 2026.

BFS from vertex 71768986 (degree 5,214). Native frontier expansion against the resident graph. The graph is fully exhausted — 100% of vertices reached. We could keep going; there's nothing left to find.

Hop | New Vertices (frontier) | Cumulative Reached | Coverage
1 | 5,214 | 5,215 | 0.0%
2 | 2,146,248 | 2,151,463 | 3.3%
3 | 32,962,413 | 35,113,876 | 53.5%
4 | 26,126,218 | 61,240,094 | 93.3%
5 | 3,021,288 | 64,261,382 | 97.9%
6 | 892,650 | 65,154,032 | 99.3%
7 | 286,864 | 65,440,896 | 99.7%
8 | 104,146 | 65,545,042 | 99.9%
9 | 39,421 | 65,584,463 | 99.96%
10 | 15,102 | 65,599,565 | 99.99%
11 | 5,526 | 65,605,091 | 99.995%
12 | 2,061 | 65,607,152 | 99.998%
13 | 745 | 65,607,897 | 99.999%
14 | 269 | 65,608,166 | 100.0%
15 | 112 | 65,608,278 | 100.0%
Total | graph exhausted | 65,608,278 | 100%

5.505 seconds
15 hops. 65.6 million vertices. 3.6 billion edges. One server. Docker container. Graph exhausted.

The peak frontier explosion is at hops 3–4: 59 million new vertices discovered in two levels. That's 1.8 billion edge lookups per level, resolved in under a second each. We stopped at hop 15 because the graph was exhausted — not because the engine couldn't continue.
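The cumulative and coverage columns follow from the per-hop frontier sizes. The snippet below recomputes them as a consistency check on the published table. We assume coverage is measured against the 65,608,366 vertices listed in the dataset section and that the cumulative count includes the source vertex.

```python
from itertools import accumulate

# New vertices discovered at hops 1..15, copied from the table above.
new_per_hop = [
    5_214, 2_146_248, 32_962_413, 26_126_218, 3_021_288,
    892_650, 286_864, 104_146, 39_421, 15_102,
    5_526, 2_061, 745, 269, 112,
]
TOTAL_VERTICES = 65_608_366  # assumed denominator, from the dataset section

# +1 because the cumulative column counts the source vertex itself.
cumulative = [1 + reached for reached in accumulate(new_per_hop)]
coverage = [100.0 * reached / TOTAL_VERTICES for reached in cumulative]

peak_two_levels = new_per_hop[2] + new_per_hop[3]  # hops 3 and 4 combined

for hop, (reached, pct) in enumerate(zip(cumulative, coverage), start=1):
    print(f"hop {hop:2d}: {reached:>11,d}  {pct:7.3f}%")
print(f"hops 3-4 combined: {peak_two_levels:,d} new vertices")
```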

10-hop BFS on 3.6 billion edges: not feasible for everyone else. xrayGraphDB: 476ms.

Friendster Analytics Suite — Three Servers

65.6M vertices, 3.6B undirected edges. Three servers tested: Blackwell GPU (Docker container, 96GB VRAM, 16 vCPU EPYC), Production CPU (503GB, 64-core EPYC, bare-metal), and Budget (62GB, T1000 8GB). The Blackwell numbers come from a Docker container, with no measurable container overhead (see Docker Performance). Even the budget server completes triangle count, betweenness centrality, shortest path, Jaccard similarity, and link prediction, which no competitor completed in our tests.

Procedure | Blackwell GPU (Docker) | Production CPU (503GB) | Budget (62GB, T1000) | Any Competitor*
Triangle Count | 38s (GPU 85%) | 142s | 537s | did not complete*
Connected Components | 38s | 75s | OOM (62GB) | did not complete*
PageRank (20 iter) | 94s | 231s | OOM (62GB) | did not complete*
K-Core | 111s (GPU 71%) | 125s | OOM (62GB) | did not complete*
Community (20 iter) | 274s | 215s | OOM (62GB) | did not complete*
BC Pair-Sampled (ε=0.05) | 5.1s | 6.2s | 2.8s | did not complete*
BC Pair-Sampled (ε=0.10) | 1.5s | 1.9s | 1.4s | did not complete*
Shortest Path (hub-to-hub) | 226ms | 226ms | 439ms | did not complete*
Jaccard Similarity | 2.0ms | 2.0ms | 2.2ms | did not complete*
Link Prediction | 1ms | 1ms | 3ms | did not complete*

Triangle count verified against SNAP ground truth: 4,173,724,142 triangles (exact match). Blackwell column is a Docker container — not bare-metal. Zero Docker overhead verified. BC uses the ABRA pair-sampled algorithm. First call initializes; subsequent calls are faster.

38-second triangle count on 3.6 billion edges. GPU at 85%. SNAP-verified exact match.

*We tested six competitors on the same Blackwell hardware with the same Friendster dataset. None completed these workloads under the tested single-GPU configuration: cuGraph (OOM building undirected graph), Kuzu (hop 3 timeout), DuckDB (hop 2 timeout), Neo4j (hop 4 timeout), Memgraph (OOM loading), GraphBLAS (OOM during BFS). All scripts and logs are published at github.com/eMTAi-Labs/xraygraph-bench.

Data Loading Speed Comparison

LDBC SF1 dataset. Measured on competitor server (187GB RAM, 44-core Xeon Gold 6152). xrayGraphDB Friendster on .187 (503GB RAM, 64-core EPYC).

Database | Load Rate | Notes
xrayGraphDB (Bolt) | 261–598K/s | Bolt UNWIND batch loading
xrayGraphDB (native) | 6.25 min (Friendster) | 1.8M edges/sec, bulk import
DuckDB | 1–5M/s | Columnar bulk COPY, fastest ingest
PostgreSQL | 270K–1.2M/s | COPY command, index rebuild
MySQL | 100–266K/s | LOAD DATA INFILE
Neo4j | 12–14K/s | Cypher LOAD CSV (not admin import)
Memgraph | 8–26K/s | LOAD CSV, severe bottleneck

DuckDB is legitimately fast at bulk ingest — it is a columnar analytics engine optimized for COPY. Graph databases (Neo4j, Memgraph) are 10–100x slower at loading due to index maintenance during insert.
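For context on the "Bolt UNWIND batch loading" row, a batch loader typically groups edges into parameter lists and sends one UNWIND statement per batch. The sketch below only builds the batches. The Cypher text, the Person label, the KNOWS type, and the batch size are illustrative assumptions, not the statements used in the published scripts, and no driver call is shown.

```python
from itertools import islice

# Hypothetical statement; labels and property names are placeholders.
UNWIND_EDGES = (
    "UNWIND $rows AS row "
    "MATCH (a:Person {id: row.src}), (b:Person {id: row.dst}) "
    "CREATE (a)-[:KNOWS]->(b)"
)

def batched(pairs, batch_size):
    """Yield lists of {"src", "dst"} dicts, at most batch_size per list."""
    it = iter(pairs)
    while True:
        chunk = list(islice(it, batch_size))
        if not chunk:
            return
        yield [{"src": s, "dst": d} for s, d in chunk]

edge_pairs = [(1, 2), (2, 3), (3, 1), (1, 4), (4, 5)]
batches = list(batched(edge_pairs, batch_size=2))
for rows in batches:
    # A real loader would pass `rows` as the $rows parameter of UNWIND_EDGES.
    print(len(rows), rows)
```

Larger batches amortize per-statement overhead, which is one reason the Bolt load rate is reported as a range rather than a single figure.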

Docker Performance — Zero Overhead

Docker container vs bare-metal on identical hardware and dataset. No performance penalty.

Metric | Docker | Bare-Metal | Difference
Protocol latency (RETURN 1) | 0.24ms | 0.47ms | Docker faster (noise)
Analytics performance | Identical | Identical | Within noise

Docker uses Linux namespaces and cgroups — no hypervisor, no VM overhead. The 0.24ms vs 0.47ms difference is TCP stack variance, not container overhead.

xrayProtocol vs Bolt — Same Database

xrayGraphDB v4.9.5, same queries, same data. Bare-metal measurements.

Query | Bolt (7687) | xrayProtocol (7689) | Speedup
RETURN 1 | 0.94ms | 0.47ms | ~2x
COUNT all nodes | 1.94ms | 1.20ms | 1.6x
LIMIT 100 | 4.75ms | 0.27ms | 17.6x

xrayProtocol p50 for RETURN 1: 0.47ms (bare-metal). Bolt overhead is approximately 2x on trivial queries. The gap widens dramatically on result-heavy queries (LIMIT 100: 17.6x) due to columnar serialization.
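A warm p50 such as the ones quoted here can be collected with a small harness: discard warm-up calls, time the rest, and take the median. This is a generic sketch, not the published benchmark script; `run_once` is a hypothetical callable that would wrap one query round trip, and the warm-up and sample counts are arbitrary defaults.

```python
import statistics
import time

def warm_p50_ms(run_once, warmup=20, samples=200):
    """Median latency in milliseconds after discarding warm-up calls."""
    for _ in range(warmup):
        run_once()                       # populate caches, open connections
    timings = []
    for _ in range(samples):
        start = time.perf_counter()
        run_once()
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings)

# Stand-in workload; a real run would execute a query such as RETURN 1.
p50 = warm_p50_ms(lambda: sum(range(1000)))
print(f"warm p50: {p50:.4f} ms")
```

Reporting the median rather than the mean keeps a single slow outlier from distorting sub-millisecond figures.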

Cross-Database Comparison — LiveJournal (4.8M nodes, 69M edges)

Protocol: Bolt (common denominator for fair comparison)

Query | xrayGraphDB | Memgraph 2.22 | Speedup
RETURN 1+1 | 0.77ms | 0.89ms | 1.2x
COUNT all nodes | 1.94ms | 19.13ms | 9.9x
LIMIT 100 | 4.75ms | 5.09ms | 1.1x
LIMIT 10,000 | 28.61ms | 449.02ms | 15.7x
1-hop traversal | 1.29ms | 42.96ms | 33.3x
2-hop traversal | 1.43ms | 49.66ms | 34.7x
COUNT all edges | 1.53ms | 84.06ms | 54.9x

Data load: xrayGraphDB 1.4s (Persistent graph store) vs Memgraph 9,668s (failed at 150K of 69M edges)

Competitor Limitations at Friendster Scale

We ran six graph databases, GPU compute engines, and analytics libraries on Friendster (65.6M vertices, 1.8B edges) with the same Blackwell GPU hardware; two more (TigerGraph and FalkorDB) are listed but were not run. Results are published transparently, including wins AND failures.

System | Type | Friendster Result
cuGraph 26.02 | GPU compute | BFS 82ms (20 GTEPS) but 7/8 algorithms failed: cannot build undirected graph on 96 GB GPU
Kuzu 0.11 | Embedded graph DB | Loaded in 218s, hop 1–2 worked (0.2s, 0.9s), hop 3 timed out >600s
DuckDB 1.5 | Analytical engine | CSV load in 28s (fastest ingest), hop 1 in 1.8s, hop 2 timed out >600s
Neo4j 2025.04 | Graph database | Import 13.6 min, hop 3 in 10s, hop 4 timed out. No GDS in Community edition
Memgraph 2.22 | In-memory graph | OOM-killed loading 1.8B edges (exceeded 120 GB MemoryMax in 5 min)
GraphBLAS 9.4 | CPU sparse matrix | Loaded 3.6B entries (102s) but OOM during BFS (142 GB RAM insufficient)
TigerGraph | Distributed graph | Registration wall: cannot download without enterprise contact
FalkorDB | Redis-based graph | Skipped: Redis in-memory architecture will OOM on 1.8B edges

All scripts, logs, and raw data available at github.com/eMTAi-Labs/xraygraph-bench. Methodology: same hardware, same dataset, same source vertex. Wins AND losses published.

Where the Speed Comes From

Vectorized Pipeline

Column-oriented batch processing tuned for modern CPU caches and vectorized execution.

xrayProtocol

Columnar wire format with LZ4 compression. Results stream column-by-column instead of row-by-row. Measured at up to 17.6x faster than Bolt on result-heavy queries (LIMIT 100 in the xrayProtocol vs Bolt table).
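The difference between row-by-row and column-by-column framing can be sketched with the standard library. This is an illustration of the general idea only. The byte layout is invented for the example and is not the xrayProtocol format, and zlib stands in for LZ4, which is not in the Python standard library.

```python
import struct
import zlib

def encode_rows(rows):
    """Row-by-row: each record's fields are written next to each other."""
    return b"".join(struct.pack("<qd", node_id, score) for node_id, score in rows)

def encode_columns(rows):
    """Column-by-column: row count, then all ids, then all scores."""
    n = len(rows)
    ids = struct.pack(f"<{n}q", *(row[0] for row in rows))
    scores = struct.pack(f"<{n}d", *(row[1] for row in rows))
    return struct.pack("<I", n) + ids + scores

def decode_columns(buf):
    """Inverse of encode_columns; each column is unpacked in one call."""
    (n,) = struct.unpack_from("<I", buf, 0)
    ids = struct.unpack_from(f"<{n}q", buf, 4)
    scores = struct.unpack_from(f"<{n}d", buf, 4 + 8 * n)
    return list(zip(ids, scores))

rows = [(i, 0.5) for i in range(1000)]
row_bytes = zlib.compress(encode_rows(rows))      # zlib as an LZ4 stand-in
col_bytes = zlib.compress(encode_columns(rows))
print(len(row_bytes), len(col_bytes))
```

Grouping values of the same type together lets the decoder unpack a whole column at once and tends to give the compressor longer runs of similar bytes.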

Plan Cache

AST fingerprinting with 425x speedup. Parameterized queries hit cache immediately. Auto-invalidation on schema changes.
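As an illustration of fingerprint-keyed plan caching, the sketch below replaces literals with placeholders, hashes the normalized text, and reuses the stored plan on a match. It is a simplification. It works on query text with a regular expression rather than on a parsed AST, it would also rewrite numbers that affect plan shape (such as hop bounds), and the class and function names are hypothetical rather than xrayGraphDB internals.

```python
import hashlib
import re

# Matches single-quoted strings and standalone numeric literals.
_LITERAL = re.compile(r"'(?:[^'\\]|\\.)*'|\b\d+(?:\.\d+)?\b")

def fingerprint(query):
    """Hash of the query with literals replaced and whitespace collapsed."""
    normalized = _LITERAL.sub("?", query)
    normalized = " ".join(normalized.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

class PlanCache:
    def __init__(self, planner):
        self._planner = planner      # callable: query text -> plan object
        self._plans = {}
        self.hits = 0

    def get(self, query):
        key = fingerprint(query)
        if key in self._plans:
            self.hits += 1
        else:
            self._plans[key] = self._planner(query)
        return self._plans[key]

    def invalidate(self):
        """Drop every cached plan, e.g. after a schema change."""
        self._plans.clear()

q1 = "MATCH (p:Person {id: 42}) RETURN p"
q2 = "MATCH (p:Person  {id: 7}) RETURN p"
print(fingerprint(q1) == fingerprint(q2))
```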

Zero-GC Memory

Per-query memory allocation with zero GC pauses. Deterministic cleanup. No fragmentation, no leaks, no stop-the-world.

SIMD + GPU

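SIMD-accelerated graph operations on CPU. GPU compute dispatch for PageRank, triangle count, BFS, K-core, Louvain, and label propagation. Falls back to CPU when no GPU is available. In the Friendster runs on this page, triangle count and K-core were dispatched to the GPU; PageRank, WCC, community detection, and BFS ran on CPU.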

Streaming Bulk Import

Builds billion-edge graphs on a single server with bounded peak memory. Tuned for NVMe with kernel-level I/O acceleration.

Total Geekout: Reproduce Every Number

Every number on this page is reproducible. Here is exactly how.

Hardware

Blackwell GPU Server: RTX PRO 6000 Blackwell Server Edition (96 GB VRAM, SM 12.0, 188 SMs)
  16 vCPU AMD EPYC 9355, 144 GB RAM, 725 GB SSD
  Ubuntu 22.04, CUDA driver 580.126.20, Docker container
Production CPU Server: 64-core AMD EPYC @ 2.9 GHz, 503 GB RAM, no GPU
  Bare-metal, Ubuntu 24
Budget Server: 28-core Xeon E5-2650L @ 1.7 GHz, 62 GB RAM, NVIDIA T1000 8 GB
  Bare-metal
Competitor Server: 44-core Xeon Gold 6152, 187 GB RAM, Tesla T4 16 GB
  Bare-metal. Used for LDBC SF1 competitor testing.

Dataset

# Friendster (SNAP)
wget https://snap.stanford.edu/data/bigdata/communities/com-friendster.ungraph.txt.gz
gunzip com-friendster.ungraph.txt.gz
# 65,608,366 vertices, 1,806,067,135 undirected edges
# Stored as 3,612,134,270 bidirectional edges
# File: 31 GB, tab-separated, # comment lines
# SHA-256 of uncompressed: verify with sha256sum
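A quick way to sanity-check the file format and the edge-count arithmetic is to parse a few lines in the same layout (tab-separated pairs, # comment lines). The sketch below runs on an in-memory sample. The function name is ours, and holding every vertex ID in a Python set is only practical for small samples, not for the full 31 GB file.

```python
import io

def count_edges(lines):
    """Return (vertices, undirected edges, bidirectional edges) for an edge list."""
    vertices = set()
    undirected = 0
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue                      # skip blank and comment lines
        a, b = line.split()
        vertices.add(int(a))
        vertices.add(int(b))
        undirected += 1
    # Each undirected edge is stored once per direction.
    return len(vertices), undirected, 2 * undirected

sample = io.StringIO(
    "# Undirected graph: sample\n"
    "# FromNodeId\tToNodeId\n"
    "101\t102\n"
    "101\t103\n"
    "102\t103\n"
)
print(count_edges(sample))

# The same doubling applied to the published Friendster edge count.
print(2 * 1_806_067_135)
```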

xrayGraphDB Setup

# 1. Set kernel parameter (required for large graph builds)
sudo sysctl -w vm.max_map_count=1048576

# 2. Start xrayGraphDB (Docker, GPU-enabled)
docker run -d --user 0:0 --gpus all --shm-size 10g --network=host \
  -v /var/lib/xraygraphdb:/var/lib/xraygraphdb \
  -v /usr/local/cuda-12.4/targets/x86_64-linux/lib:/usr/local/cuda/lib64:ro \
  -e LD_LIBRARY_PATH=/usr/lib/xraygraphdb/lib:/usr/local/cuda/lib64 \
  --name xg-bench \
  xraygraphdb.emtailabs.com/xraygraphdb:latest \
  --data-directory=/var/lib/xraygraphdb \
  --bolt-port=7687 --xray-port=7689 \
  --storage-engine=mmap --storage-properties-on-edges=true \
  --log-level=INFO --also-log-to-stderr=true \
  --license-acknowledge-saved=true \
  --init-admin-user=admin --init-admin-password=xraygraphdb \
  --init-admin-tenant=xraygraphdb

# 3. Load Friendster (copy into container, then import)
docker cp com-friendster.ungraph.txt xg-bench:/tmp/xraygraphdb-import/
# Then via xgdb_connect Python client:
from xgdb_connect.protocol import XrayProtocolClient
c = XrayProtocolClient(host="127.0.0.1", port=7689,
    auth_token="admin:xraygraphdb", database="xraygraphdb")
result = c.bulk_import_file("/tmp/xraygraphdb-import/com-friendster.ungraph.txt")
# See scripts/ directory for exact commands
# Result: 65,608,366 vertices, 3,612,134,270 edges in ~375s

Running the Benchmarks

# All scripts at: github.com/eMTAi-Labs/xraygraph-bench/scripts/

# GPU analytics suite (PageRank, TriangleCount, WCC, K-Core, BC, Community)
python3 blackwell_gpu_rerun.py

# Graph500-style TEPS measurement (16 BFS sources)
python3 graph500_teps.py

# Apples-to-apples competitor comparison (same source vertex 71768986)
python3 apples_to_apples.py

# cuGraph head-to-head (same hardware, same dataset)
python3 cugraph_bench.py

# Individual competitor benchmarks
python3 duckdb_bench.py
python3 kuzu_bench.py
python3 neo4j_gds_bench.py
python3 memgraph_bench.py
python3 graphblas_bench.py

Standard Source Vertex

All BFS comparisons use vertex 71768986 (undirected degree 5,214 — highest in Friendster). This ensures apples-to-apples comparison across systems. Vertex counts at each hop were verified to match across xrayGraphDB, Kuzu, and DuckDB:

  • Hop 1: 5,214–5,215 vertices (all systems agree)
  • Hop 2: 2,151,463 vertices (exact match)
  • Hop 3: 35,113,876 vertices (only xrayGraphDB reached this)

Algorithm Parameters

PageRank: 20 iterations, damping=0.85, tolerance=0.0
Triangle Count: Undirected, verified against SNAP ground truth (4,173,724,142)
Betweenness Centrality: Approximate, 50 sampled sources, epsilon=0.05
Community Detection: Label propagation, 20 iterations
K-Core: Full core decomposition (max core = 304)
BFS / TEPS: Native BFS, OUTGOING direction, up to 20 hops

Competitor Versions Tested

cuGraph: 26.02.00 (RAPIDS), cuDF 26.02.01, pip install cugraph-cu12
Neo4j: 2025.04.0 Community (tarball), no GDS plugin available
Memgraph: 2.22.0 (deb package), native install, no MAGE
Kuzu: 0.11.3 (pip install kuzu), embedded
DuckDB: 1.5.2 (pip install duckdb), embedded
GraphBLAS: SuiteSparse 9.4.5 via python-graphblas 2025.2.0

cuGraph Failure Analysis

cuGraph failed 7 of 8 algorithms on the same 96 GB Blackwell GPU. Three separate failures prevented undirected graph construction:

  1. cuDF CSV parser OOM: cudf.read_csv() consumed 92 GB of 96 GB VRAM parsing the 31 GB text file before crashing.
  2. CSR sort+symmetrize OOM: After CPU-read fallback, cuGraph's undirected CSR builder OOM'd during radix sort — even with 82 GB VRAM free.
  3. cudf int32 size_type limit: Pre-symmetrized 3.6B rows exceed cudf's int32 offset maximum (2,147,483,647). Fundamental limitation.
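The third failure is simple arithmetic on the published edge counts: the symmetrized row count exceeds a signed 32-bit maximum, while the directed count does not. The check below uses only numbers stated on this page.

```python
INT32_MAX = 2**31 - 1                     # 2,147,483,647

undirected_edges = 1_806_067_135          # from the dataset section
symmetrized_rows = 2 * undirected_edges   # one row per direction

print(f"directed rows:    {undirected_edges:,d} fits in int32: "
      f"{undirected_edges <= INT32_MAX}")
print(f"symmetrized rows: {symmetrized_rows:,d} fits in int32: "
      f"{symmetrized_rows <= INT32_MAX}")
print(f"over the limit by {symmetrized_rows - INT32_MAX:,d} rows")
```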

Only a directed graph (1.8B edges) could be constructed. On that directed graph, cuGraph BFS achieved 20.3 GTEPS (82ms) — but PageRank failed (FailedToConvergeError on asymmetric link structure), and all algorithms requiring undirected input returned “input graph must be undirected.”

Important: cuGraph is not “bad.” It is optimized for specialized GPU graph kernels with maximum throughput. xrayGraphDB is optimized for persistent relationship-intelligence workloads at billion-edge scale. These are different system categories with different tradeoffs, and the comparison reflects that distinction honestly.

Raw Data & Scripts

Everything is published at github.com/eMTAi-Labs/xraygraph-bench:

  • results/BLACKWELL-GPU-RERUN-20260510.md — full GPU rerun analysis
  • results/CUGRAPH-COMPARISON-20260509.md — cuGraph head-to-head writeup
  • results/blackwell_gpu_rerun_20260510.json — raw JSON results
  • results/cugraph_blackwell.json — cuGraph raw JSON
  • results/apples_to_apples_blackwell.log — 7-system BFS comparison log
  • BENCHMARK-METHODOLOGY.md — 15-rule courtroom-clean methodology
  • REPRODUCIBILITY.md — step-by-step reproduction guide

If you can reproduce a different result, we want to know. File an issue.

See It In Action

Try the interactive 3D graph demos or get started with Docker in 30 seconds.
