Benchmarks - xrayGraphDB vs 10 Databases & GPU Compute Engines

🔒

Every number on this page was measured with AES-256-GCM per-tenant encryption active.

xrayGraphDB encrypts all data at the storage layer — per tenant, per database, always on. There is no “turn off encryption” flag. These benchmarks include the full cost of encrypt-on-write and decrypt-on-read for every edge traversal, every analytic scan, and every query result. When comparing against systems that store data unencrypted, keep this in mind: xrayGraphDB is doing more work on every operation and still posting these numbers.

What Workloads Become Possible

When your graph engine survives billion-edge traversal on a single server, workloads that were previously “requires a cluster” or “not feasible” become interactive.

Workload	Traditional Systems	xrayGraphDB
8-hop fraud detection	Times out or requires cluster	Interactive (seconds)
Infrastructure dependency tracing	Partial coverage, shallow depth	Full graph, any depth
Sanctions / ownership analysis	Manual, batch, incomplete	Real-time, exhaustive
Cyber lateral movement detection	Alert-based, no path context	Full attack path in ms
AI knowledge / memory graphs	Fragmented across systems	Unified graph + vector + geo
Billion-edge analytics	Distributed cluster or not feasible	Single server, Docker container

Feature Comparison

Feature	xrayGraphDB	cuGraph	Neo4j	Memgraph	Kuzu	DuckDB	PostgreSQL
Cypher support	Full + Neo4j compat	—	Native	Partial	Cypher-like	—	—
GPU acceleration	CUDA (native)	CUDA native	—	—	—	—	—
Native analytics (graph algos)	26 procedures	BFS, PR, TC, WCC...	GDS plugin	MAGE plugin	—	—	—
Friendster 1.8B edges	All algos work	BFS only (directed)	Hop 4 timeout	OOM at 120GB	Hop 3 timeout	Hop 2 timeout	—
Persistent storage	Persistent graph store	In-memory only	Disk + cache	In-memory only	Columnar disk	Embedded	Disk
Columnar wire protocol	xrayProtocol	Python API	—	—	—	Python API	—
SIMD graph operations	Yes (SIMD)	—	—	—	—	SIMD (columnar)	—
Graph traversal engine	Cypher + native BFS	GPU BFS kernel	Cypher planner	Cypher planner	Cypher-like	Recursive CTE	Recursive CTE

Performance Numbers

Measured across three servers with LDBC SF1 (3.18M nodes, 17.2M edges) and Friendster (65.6M nodes, 3.6B undirected edges). May 2026.

Metric	Value	Context
15-hop BFS	5.5s	65.6M vertices, 3.6B edges
Triangle Count (GPU 85%)	38s	4.17B triangles, SNAP verified
Algorithms completed	8/8	vs cuGraph 1/8, others 0/8
Edges per second	1.8M	Friendster load, 6 min

Server A (.187): 503GB RAM, 64-core EPYC, no GPU. Server B: Docker container on production host. Server C (.68): 62GB RAM, 28-core Xeon E5-2650L, T1000 8GB GPU. Blackwell GPU: RTX PRO 6000 96GB VRAM, 16 vCPU, 144GB RAM. Competitor server: 187GB RAM, 44-core Xeon Gold 6152, Tesla T4. xrayGraphDB v4.9.4+.

LDBC SF1 Interactive Queries — 7-Database Comparison

3.18M nodes, 17.2M edges. All databases on identical LDBC SF1 dataset. Times are warm p50. Units: milliseconds.

Query	xrayGraphDB	Memgraph	Neo4j	NebulaGraph	DuckDB	PostgreSQL	MySQL
IS1 — Profile lookup	0.7ms	1.1ms	2.4ms	1.6ms	0.7ms	55ms	7.7ms
IS3 — Friends of person	0.9ms	1.1ms	2.0ms	2.2ms	1.9ms	47ms	8.3ms
IC5 — New groups of friends	1.1ms	1,078ms	707ms	1,428ms	78ms	3,017ms	1,959ms
IC11 — Job referral	1.0ms	2.7ms	3.5ms	80.8ms	N/A	N/A	N/A
Edge count	0.5ms	731ms	1.5ms	1.7ms	0.5ms	62ms	68ms
Node count	0.5ms	470ms	1.2ms	1.3ms	0.4ms	49ms	9.2ms

IC5 is the key differentiator: multi-hop join with grouping. xrayGraphDB completes in 1.1ms where competitors need 78ms–3,017ms. IC11 and graph-specific queries are N/A for SQL databases that lack native traversal. Competitor server: 187GB RAM, 44-core Xeon Gold 6152, Tesla T4.

Apples-to-Apples: Same Vertex, Same Dataset, 7 Systems

BFS from vertex 71768986 (undirected degree 5,214). Friendster: 65.6M vertices, 1.8B edges. RTX PRO 6000 Blackwell (96 GB), 16 vCPU, 144 GB RAM. May 2026.

System	Hop 1	Hop 2	Hop 3	Hop 4+	Result
xrayGraphDB	5,215	2,151,463	35,113,876	...10 hops	24s total, 65.6M reached
cuGraph 26.02	BFS kernel: 82ms (20 GTEPS, directed only)				7/8 algorithms FAILED
Kuzu 0.11	5,214	2,151,463	TIMEOUT >600s		Did not complete hop 3
DuckDB 1.5	5,215	TIMEOUT >600s			Did not complete hop 2
Neo4j 2025.04	203	7,718	461,029	TIMEOUT	Did not complete hop 4
Memgraph 2.22	OOM-killed loading 1.8B edges (120 GB)				Cannot load Friendster
GraphBLAS 9.4	OOM during BFS (142 GB RAM insufficient)				Cannot run BFS

Vertex counts verified across systems: hop 1 = 5,214–5,215, hop 2 = 2,151,463 (exact match).

cuGraph is NVIDIA's GPU graph analytics library. It achieves 20 GTEPS on pure BFS — but cannot build an undirected Friendster graph on a single 96 GB GPU (cuDF CSV OOM, CSR sort OOM, int32 size_type limit). xrayGraphDB is a persistent graph runtime, not a BFS kernel. Different workload classes, honestly compared.

GPU Compute Engine Head-to-Head

Same RTX PRO 6000 Blackwell (96 GB VRAM). Same Friendster dataset. Courtroom-clean methodology.

Algorithm	xrayGraphDB	cuGraph 26.02	GraphBLAS 9.4
Load raw SNAP file	Direct, no preprocess	3 paths OOM'd	numpy + scipy (102s)
Undirected graph	3.6B edges	OOM (96 GB VRAM)	3.6B entries
BFS (15 hops)	5.5s (449 MTEPS)	82ms (20 GTEPS)*	OOM
PageRank (20 iter)	94s	Failed (convergence)	OOM
Triangle Count	38s (4.17B, GPU 85%)	Failed (needs undirected)	OOM
BC Pair-Sampled (ε=0.05)	5.1s (warm)	Failed	OOM
WCC / K-Core / Louvain	All completed	All failed	All OOM
Algorithms completed	8 / 8	1 / 8	0 / 8

*cuGraph BFS ran on directed-only graph (1.8B edges). cuGraph could not build undirected Friendster on 96 GB GPU due to: cuDF CSV parser OOM, CSR sort+symmetrize OOM, and cudf int32 size_type limit (3.6B rows > 2.1B max). cuGraph is optimized for specialized GPU graph kernels; xrayGraphDB is optimized for persistent relationship-intelligence workloads at billion-edge scale.

Sub-Second Betweenness Centrality on 3.6 Billion Edges

Friendster: 65.6 million vertices, 3.6 billion undirected edges. Approximate betweenness centrality (ABRA pair-sampled, ε=0.05) completed in 977 milliseconds (warm, production CPU). On the Blackwell GPU server (Docker): 5.1 seconds warm (ε=0.05), 1.5 seconds (ε=0.10).

977ms
Approximate Betweenness Centrality — Friendster undirected graph

Parameters: epsilon=0.05, 95% confidence, deterministic seed

Server: 503GB RAM, 64-core AMD EPYC (.187), no GPU
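
To make the pair-sampled idea concrete, here is a minimal sketch of a betweenness estimator that samples (s, t) pairs and credits every vertex lying on a shortest s-t path. It is an illustration only, not xrayGraphDB's implementation: it uses a fixed sample count instead of ABRA's adaptive stopping rule for a target epsilon, and the names `sigma_bfs` and `pair_sampled_bc` are hypothetical.

```python
import random
from collections import deque

def sigma_bfs(adj, source):
    """Return (dist, sigma): BFS distance and shortest-path count from source."""
    dist = {source: 0}
    sigma = {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:              # first time v is seen
                dist[v] = dist[u] + 1
                sigma[v] = 0
                queue.append(v)
            if dist[v] == dist[u] + 1:     # u is a shortest-path predecessor of v
                sigma[v] += sigma[u]
    return dist, sigma

def pair_sampled_bc(adj, samples, seed=0):
    """Estimate normalized betweenness from uniformly sampled vertex pairs."""
    rng = random.Random(seed)              # deterministic seed, as in the run above
    nodes = sorted(adj)
    score = {v: 0.0 for v in nodes}
    for _ in range(samples):
        s, t = rng.sample(nodes, 2)
        dist_s, sigma_s = sigma_bfs(adj, s)
        if t not in dist_s:                # s and t are disconnected
            continue
        dist_t, sigma_t = sigma_bfs(adj, t)
        for v in dist_s:
            on_path = v in dist_t and dist_s[v] + dist_t[v] == dist_s[t]
            if v not in (s, t) and on_path:
                # fraction of shortest s-t paths that pass through v
                score[v] += sigma_s[v] * sigma_t[v] / sigma_s[t]
    return {v: score[v] / samples for v in nodes}

# Toy star graph: vertex 0 is the hub, 1..4 are leaves.
star = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}
estimate = pair_sampled_bc(star, samples=200)
print(estimate)
```

On the toy star graph only the hub can sit between two other vertices, so every leaf scores exactly zero while the hub's estimate approaches the fraction of sampled pairs that are both leaves.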

Context: We tested six competitor systems on the same Blackwell GPU hardware with the same dataset: cuGraph, Neo4j, Memgraph, Kuzu, DuckDB, and GraphBLAS. None could complete betweenness centrality on Friendster. Two of them (cuGraph for the undirected graph, and Memgraph) could not even load it. “Impossible” is not a claim; it is a measured result.

Where xrayGraphDB Leaves Everyone Behind

What this measures: Starting from a single person in a social network, how many people can you reach at each degree of separation? Hop 1 = direct friends. Hop 2 = friends of friends. Hop 3 = three degrees out. By hop 4, you've touched 93% of the entire 65-million-person graph.

Why it's so hard: At each hop, the frontier explodes. Hop 3 adds 33 million new vertices. Hop 4 adds another 26 million. The system must track which of 65 million vertices have already been visited, expand every edge from millions of frontier vertices simultaneously, deduplicate the results, and do it all without running out of memory. Most databases crash, run out of memory, or time out before hop 4.

xrayGraphDB solves this with a high-performance native traversal engine. Frontier expansion happens directly against the resident graph — no query parser overhead, no per-call planning. Memory stays bounded as the frontier grows, and the storage layout keeps neighbor lookups cache-friendly even as the graph evolves. The result: full-graph traversal in 5.5 seconds where every competitor tested either crashed or timed out.
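
The general technique described here (expand a frontier, mark visited vertices, deduplicate) can be sketched as a level-synchronous BFS over a CSR adjacency layout. This is a toy illustration under stated assumptions, not the engine's actual code: the CSR layout, the one-byte-per-vertex visited array, and the names `to_csr` and `bfs_levels` are all hypothetical.

```python
from array import array

def to_csr(n, edges):
    """Build CSR arrays from undirected edge pairs (stored in both directions)."""
    degree = [0] * n
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    offsets = array("q", [0] * (n + 1))
    for v in range(n):
        offsets[v + 1] = offsets[v] + degree[v]
    targets = array("q", [0] * offsets[n])
    cursor = list(offsets[:n])             # next free slot per vertex
    for a, b in edges:
        targets[cursor[a]] = b
        cursor[a] += 1
        targets[cursor[b]] = a
        cursor[b] += 1
    return offsets, targets

def bfs_levels(offsets, targets, source):
    """Level-synchronous BFS; returns the number of new vertices found per hop."""
    n = len(offsets) - 1
    visited = bytearray(n)                 # fixed-size visited map: memory stays bounded
    visited[source] = 1
    frontier = [source]
    per_hop = []
    while frontier:
        next_frontier = []
        for u in frontier:
            for i in range(offsets[u], offsets[u + 1]):   # contiguous neighbor scan
                v = targets[i]
                if not visited[v]:         # dedup: a vertex joins a frontier once
                    visited[v] = 1
                    next_frontier.append(v)
        if next_frontier:
            per_hop.append(len(next_frontier))
        frontier = next_frontier
    return per_hop

edges = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4)]
offsets, targets = to_csr(5, edges)
print(bfs_levels(offsets, targets, 0))     # [2, 1, 1]
```

The visited array is sized by vertex count rather than by path count, which is why this style of traversal does not blow up the way path enumeration does.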

Friendster: 65.6 million nodes, 3.6 billion undirected edges. Blackwell GPU server (Docker container). Source: vertex 71768986 (degree 5,214).

Hop	Cumulative Vertices	Coverage	Competitors
1	5,215	0.0%	DuckDB: 1.8s, Kuzu: 0.2s
2	2,151,463	3.3%	DuckDB: TIMEOUT. Kuzu: 0.9s
3	35,113,876	53.5%	Kuzu: TIMEOUT. Neo4j: 10s
4	61,240,094	93.3%	Neo4j: TIMEOUT. All others: OOM
5	64,261,382	97.9%	No competitor reached hop 5
10	65,599,565	99.99%	No competitor reached hop 5
15	65,608,278	100.0%	No competitor reached hop 5
Total time	5.505 seconds	100%	All failed or timed out

Vertex counts verified across systems: hop 1 = 5,214–5,215 (xrayGraphDB, Kuzu, and DuckDB), hop 2 = 2,151,463 (exact match between xrayGraphDB and Kuzu; DuckDB timed out at hop 2). Same source vertex (71768986), same dataset, same hardware. Apples to apples.

Why competitors fail: Traditional databases use Cypher query planners that expand variable-length paths via depth-first search. At hop 3, the frontier is 33 million vertices — each with ~55 neighbors on average. That's 1.8 billion edge lookups in a single hop. DFS-based planners either explode in memory (tracking all paths) or degenerate into full table scans. Recursive CTEs (DuckDB, PostgreSQL) perform disjunctive joins against 1.8 billion rows per hop — each join slower than the last. In-memory databases (Memgraph) simply cannot hold 3.6 billion edges in RAM.
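
For readers unfamiliar with the recursive-CTE approach, the sketch below shows what hop-limited reachability looks like in SQL. It uses Python's built-in sqlite3 as a stand-in for DuckDB or PostgreSQL, with a toy five-vertex graph; the table name `edge` and the helper `reached_within` are hypothetical, and this is not the query from the benchmark scripts.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edge (src INTEGER, dst INTEGER)")
pairs = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4)]
# Store each undirected edge in both directions.
conn.executemany("INSERT INTO edge VALUES (?, ?)", pairs + [(b, a) for a, b in pairs])

def reached_within(conn, source, max_hops):
    """Count distinct vertices within max_hops of source (source included)."""
    sql = """
    WITH RECURSIVE reach(v, hop) AS (
        SELECT ?, 0
        UNION
        SELECT e.dst, r.hop + 1
        FROM reach AS r JOIN edge AS e ON e.src = r.v
        WHERE r.hop < ?
    )
    SELECT COUNT(DISTINCT v) FROM reach
    """
    return conn.execute(sql, (source, max_hops)).fetchone()[0]

for hops in (1, 2, 3):
    print(hops, reached_within(conn, 0, hops))
```

Note that UNION deduplicates (vertex, hop) rows, not vertices: a vertex reached again at a later hop is re-expanded, and every level is another join against the full edge table. On a toy graph that is free; at 1.8 billion rows it is the cost described above.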

xrayGraphDB's purpose-built native traversal engine completed 15 hops on 3.6 billion edges in 5.5 seconds. This is not a synthetic benchmark — Friendster is a real social graph with extreme skew, giant hubs, and combinatorial frontier explosions.

Cypher BFS: 12.16 Billion Paths at Hop 9

Variable-length Cypher path expansion on Friendster. MATCH (p)-[:KNOWS*1..N]-(f) RETURN count(f) — raw path count, no DISTINCT. This is the combinatorial explosion that destroys every other database.

Hops	Path count	Time
1-hop		1.9ms
2-hop	462	1.5ms
3-hop	8,382	1.7ms
4-hop	135,102	10ms
5-hop	1,909,182	133ms
6-hop	23,198,142	1.9s
7-hop	236,087,742	21.8s
8-hop	1,939,204,542	3.5 min
9-hop	12,157,905,342	28.4 min

12.16 billion paths
Measured, not estimated. Cypher variable-length expansion. 28.4 minutes at hop 9.

Each hop multiplies the path count rather than adding to it: by about 18x from hop 2 to hop 3, and still by about 6x from hop 8 to hop 9 in the measurements above (the graph-wide average degree is ~55). By hop 9, the path count exceeds 12 billion. No other database we tested survived past hop 4 on Friendster using Cypher-style path expansion. xrayGraphDB kept going; we stopped at hop 9 because the result was proven, not because the engine failed.
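
To see why raw path counts grow so much faster than vertex counts, here is a small sketch of counting variable-length paths from one start vertex. It assumes the usual Cypher rule that a relationship may not repeat within a single path; the functions `build_adj` and `count_paths` are hypothetical and illustrative, not how the engine evaluates the query.

```python
def build_adj(edges):
    """Adjacency lists of (edge_id, neighbor) for an undirected graph."""
    adj = {}
    for edge_id, (a, b) in enumerate(edges):
        adj.setdefault(a, []).append((edge_id, b))
        adj.setdefault(b, []).append((edge_id, a))
    return adj

def count_paths(adj, source, max_hops):
    """Count paths of length 1..max_hops from source, no edge reused within a path."""
    def extend(node, used, depth):
        if depth == max_hops:
            return 0
        total = 0
        for edge_id, neighbor in adj[node]:
            if edge_id not in used:
                # one path ends here, plus every longer path through this edge
                total += 1 + extend(neighbor, used | {edge_id}, depth + 1)
        return total
    return extend(source, frozenset(), 0)

triangle = build_adj([("a", "b"), ("b", "c"), ("a", "c")])
for hops in (1, 2, 3):
    print(hops, count_paths(triangle, "a", hops))
```

Even a three-vertex triangle yields 2, 4, and 6 paths for 1, 2, and 3 hops, although only two other vertices exist. The same vertex is counted once per distinct path that reaches it, which is what "no DISTINCT" means in the query above.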

GPU-Accelerated Analytics

GPU analytics on Friendster (3.6B edges) using the RTX PRO 6000 Blackwell (96 GB VRAM, Docker container). GPU kernels compiled at startup — no CUDA toolkit dependency at runtime.

Analytics Procedure	Dataset	Time	GPU Util	VRAM
Triangle Count	Friendster 3.6B	38.0s	85%	15 GB
K-Core Decomposition	Friendster 3.6B	111.1s	71%	15 GB
PageRank (20 iterations)	Friendster 3.6B	94.1s	CPU	—
Connected Components	Friendster 3.6B	38.2s	CPU	—
Community Detection (20 iter)	Friendster 3.6B	274.2s	CPU	—
BFS (15 hops, full graph)	Friendster 3.6B	5.5s	CPU	—

RTX PRO 6000 Blackwell Server Edition, 96 GB VRAM, SM 12.0, 188 SMs. Docker container. Triangle Count and K-Core dispatch to GPU; PageRank, WCC, Community, and BFS currently run on CPU. Triangle count verified against SNAP ground truth: 4,173,724,142 (exact match).

cuGraph — NVIDIA's own GPU graph library — could not even build the undirected Friendster graph on this same 96 GB GPU. xrayGraphDB ran all algorithms successfully.

Friendster: 3.6 Billion Undirected Edges, One Server

65.6 million vertices. 3.6 billion undirected edges. Loaded in 6.25 minutes at 1.8 million edges/sec. Full analytics suite completed. RTX PRO 6000 Blackwell (96 GB VRAM). May 2026.

BFS from vertex 71768986 (degree 5,214). Native frontier expansion against the resident graph. The graph is fully exhausted — 100% of vertices reached. We could keep going; there's nothing left to find.

Hop	New Vertices (frontier)	Cumulative Reached	Coverage
1	5,214	5,215	0.0%
2	2,146,248	2,151,463	3.3%
3	32,962,413	35,113,876	53.5%
4	26,126,218	61,240,094	93.3%
5	3,021,288	64,261,382	97.9%
6	892,650	65,154,032	99.3%
7	286,864	65,440,896	99.7%
8	104,146	65,545,042	99.9%
9	39,421	65,584,463	99.96%
10	15,102	65,599,565	99.99%
11	5,526	65,605,091	99.995%
12	2,061	65,607,152	99.998%
13	745	65,607,897	99.999%
14	269	65,608,166	100.0%
15	112	65,608,278	100.0%
Total	graph exhausted	65,608,278	100%

5.505 seconds
15 hops. 65.6 million vertices. 3.6 billion edges. One server. Docker container. Graph exhausted.

The peak frontier explosion is at hops 3–4: 59 million new vertices discovered in two levels. That's 1.8 billion edge lookups per level, resolved in under a second each. We stopped at hop 15 because the graph was exhausted — not because the engine couldn't continue.
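
The cumulative and coverage columns follow directly from the per-hop frontier sizes. The short check below recomputes the first five rows, using the vertex total of 65,608,366 given in the Dataset section; the function name `cumulative_reach` is just a label for this sketch.

```python
TOTAL_VERTICES = 65_608_366      # Friendster vertex count from the Dataset section
NEW_PER_HOP = [5_214, 2_146_248, 32_962_413, 26_126_218, 3_021_288]

def cumulative_reach(new_per_hop):
    """Running total of reached vertices, counting the source vertex itself."""
    reached = 1
    totals = []
    for new in new_per_hop:
        reached += new
        totals.append(reached)
    return totals

totals = cumulative_reach(NEW_PER_HOP)
for hop, reached in enumerate(totals, start=1):
    print(hop, f"{reached:,}", f"{100 * reached / TOTAL_VERTICES:.1f}%")

# Hops 3 and 4 together account for the "59 million new vertices" figure.
print(f"{NEW_PER_HOP[2] + NEW_PER_HOP[3]:,}")   # 59,088,631
```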

10-hop BFS on 3.6 billion edges: not feasible for everyone else. xrayGraphDB: 476ms.

Friendster Analytics Suite — Three Servers

65.6M vertices, 3.6B undirected edges. Three servers tested: Blackwell GPU (Docker container, 96GB VRAM, 16-core EPYC), Production CPU (503GB, 64-core EPYC, bare-metal), and Budget (62GB, T1000 8GB). The Blackwell numbers are from a Docker container — zero overhead verified. Even the budget server completes analytics that no competitor can attempt on any hardware.

Procedure	Blackwell GPU (Docker)	Production CPU (503GB)	Budget (62GB, T1000)	Any Competitor*
Triangle Count	38s GPU 85%	142s	537s	did not complete*
Connected Components	38s	75s	OOM (62GB)	did not complete*
PageRank (20 iter)	94s	231s	OOM (62GB)	did not complete*
K-Core	111s GPU 71%	125s	OOM (62GB)	did not complete*
Community (20 iter)	274s	215s	OOM (62GB)	did not complete*
BC Pair-Sampled (ε=0.05)	5.1s	6.2s	2.8s	did not complete*
BC Pair-Sampled (ε=0.10)	1.5s	1.9s	1.4s	did not complete*
Shortest Path (hub-to-hub)	226ms	226ms	439ms	did not complete*
Jaccard Similarity	2.0ms	2.0ms	2.2ms	did not complete*
Link Prediction	1ms	1ms	3ms	did not complete*

Triangle count verified against SNAP ground truth: 4,173,724,142 triangles (exact match). Blackwell column is a Docker container — not bare-metal. Zero Docker overhead verified. BC uses the ABRA pair-sampled algorithm. First call initializes; subsequent calls are faster.

38-second triangle count on 3.6 billion edges. GPU at 85%. SNAP-verified exact match.

*We tested six competitors on the same Blackwell hardware with the same Friendster dataset. None completed these workloads under the tested single-GPU configuration: cuGraph (OOM building undirected graph), Kuzu (hop 3 timeout), DuckDB (hop 2 timeout), Neo4j (hop 4 timeout), Memgraph (OOM loading), GraphBLAS (OOM during BFS). All scripts and logs published at github.com/eMTAi-Labs/xraygraph-bench.

Data Loading Speed Comparison

LDBC SF1 dataset. Measured on competitor server (187GB RAM, 44-core Xeon Gold 6152). xrayGraphDB Friendster on .187 (503GB RAM, 64-core EPYC).

Database	Load Rate	Notes
xrayGraphDB (Bolt)	261–598K/s	Bolt UNWIND batch loading
xrayGraphDB (native)	6.25 min (Friendster)	1.8M edges/sec, bulk import
DuckDB	1–5M/s	Columnar bulk COPY, fastest ingest
PostgreSQL	270K–1.2M/s	COPY command, index rebuild
MySQL	100–266K/s	LOAD DATA INFILE
Neo4j	12–14K/s	Cypher LOAD CSV (not admin import)
Memgraph	8–26K/s	LOAD CSV, severe bottleneck

DuckDB is legitimately fast at bulk ingest — it is a columnar analytics engine optimized for COPY. Graph databases (Neo4j, Memgraph) are 10–100x slower at loading due to index maintenance during insert.

Docker Performance — Zero Overhead

Docker container vs bare-metal on identical hardware and dataset. No performance penalty.

Metric	Docker	Bare-Metal	Difference
Protocol latency (RETURN 1)	0.24ms	0.47ms	Docker faster (noise)
Analytics performance	Identical	Identical	Within noise

Docker uses Linux namespaces and cgroups — no hypervisor, no VM overhead. The 0.24ms vs 0.47ms difference is TCP stack variance, not container overhead.

xrayProtocol vs Bolt — Same Database

xrayGraphDB v4.9.4, same queries, same data. Bare-metal measurements.

Query	Bolt (7687)	xrayProtocol (7689)	Speedup
RETURN 1	0.94ms	0.47ms	~2x
COUNT all nodes	1.94ms	1.20ms	1.6x
LIMIT 100	4.75ms	0.27ms	17.6x

xrayProtocol p50 for RETURN 1: 0.47ms (bare-metal). Bolt overhead is approximately 2x on trivial queries. The gap widens dramatically on result-heavy queries (LIMIT 100: 17.6x) due to columnar serialization.

Cross-Database Comparison — LiveJournal (4.8M nodes, 69M edges)

Protocol: Bolt (common denominator for fair comparison)

Query	xrayGraphDB	Memgraph 2.22	Speedup
RETURN 1+1	0.77ms	0.89ms	1.2x
COUNT all nodes	1.94ms	19.13ms	9.9x
LIMIT 100	4.75ms	5.09ms	1.1x
LIMIT 10,000	28.61ms	449.02ms	15.7x
1-hop traversal	1.29ms	42.96ms	33.3x
2-hop traversal	1.43ms	49.66ms	34.7x
COUNT all edges	1.53ms	84.06ms	54.9x

Data load: xrayGraphDB 1.4s (Persistent graph store) vs Memgraph 9,668s (failed at 150K of 69M edges)

Competitor Limitations at Friendster Scale

We evaluated eight graph databases, GPU compute engines, and analytics libraries against Friendster (65.6M vertices, 1.8B edges). Six ran on the same Blackwell GPU hardware; two (TigerGraph and FalkorDB) were not run, for the reasons listed below. Results are published transparently, including wins AND failures.

System	Type	Friendster Result
cuGraph 26.02	GPU compute	BFS 82ms (20 GTEPS) but 7/8 algorithms failed — cannot build undirected graph on 96 GB GPU
Kuzu 0.11	Embedded graph DB	Loaded in 218s, hop 1–2 worked (0.2s, 0.9s), hop 3 timed out >600s
DuckDB 1.5	Analytical engine	CSV load in 28s (fastest ingest), hop 1 in 1.8s, hop 2 timed out >600s
Neo4j 2025.04	Graph database	Import 13.6 min, hop 3 in 10s, hop 4 timed out. No GDS in Community edition
Memgraph 2.22	In-memory graph	OOM-killed loading 1.8B edges (exceeded 120 GB MemoryMax in 5 min)
GraphBLAS 9.4	CPU sparse matrix	Loaded 3.6B entries (102s) but OOM during BFS (142 GB RAM insufficient)
TigerGraph	Distributed graph	Registration wall — cannot download without enterprise contact
FalkorDB	Redis-based graph	Skipped — Redis in-memory architecture will OOM on 1.8B edges

All scripts, logs, and raw data available at github.com/eMTAi-Labs/xraygraph-bench. Methodology: same hardware, same dataset, same source vertex. Wins AND losses published.

Where the Speed Comes From

Vectorized Pipeline

Column-oriented batch processing tuned for modern CPU caches and vectorized execution.

xrayProtocol

Columnar wire format with LZ4 compression. Results stream column-by-column instead of row-by-row. 24x faster than Bolt.

Plan Cache

AST fingerprinting with 425x speedup. Parameterized queries hit cache immediately. Auto-invalidation on schema changes.

Zero-GC Memory

Per-query memory allocation with zero GC pauses. Deterministic cleanup. No fragmentation, no leaks, no stop-the-world.

SIMD + GPU

SIMD-accelerated graph operations on CPU. GPU-accelerated PageRank, triangle count, BFS, K-core, Louvain, and label propagation. Falls back to CPU when no GPU is available.

Streaming Bulk Import

Builds billion-edge graphs on a single server with bounded peak memory. Tuned for NVMe with kernel-level I/O acceleration.
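
As an illustration of the Plan Cache idea above, here is a minimal sketch of a cache keyed by a query fingerprint. It is a simplification under stated assumptions: it hashes whitespace-normalized query text as a stand-in for real AST fingerprinting, it assumes values arrive as $parameters rather than inline literals, and the `PlanCache` class and its methods are hypothetical names, not xrayGraphDB APIs.

```python
import hashlib

def fingerprint(query):
    """Hash the query text with whitespace normalized (stand-in for an AST hash)."""
    normalized = " ".join(query.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

class PlanCache:
    def __init__(self, planner):
        self._planner = planner      # callable: query text -> plan object
        self._plans = {}
        self.hits = 0
        self.misses = 0

    def get(self, query):
        """Return a cached plan, planning only on the first sighting of a shape."""
        key = fingerprint(query)
        plan = self._plans.get(key)
        if plan is None:
            self.misses += 1
            plan = self._plans[key] = self._planner(query)
        else:
            self.hits += 1
        return plan

    def invalidate(self):
        """Drop all plans, e.g. after a schema change."""
        self._plans.clear()

cache = PlanCache(planner=lambda q: ("plan-for", " ".join(q.split())))
cache.get("MATCH (n:Person) WHERE n.id = $id RETURN n")
cache.get("MATCH (n:Person)   WHERE n.id = $id\nRETURN n")   # same shape, new layout
print(cache.hits, cache.misses)   # 1 1
```

Because parameter values never appear in the key, the same query shape with a different `$id` reuses the plan without touching the planner.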

Total Geekout: Reproduce Every Number

Every number on this page is reproducible. Here is exactly how.

Hardware

Blackwell GPU Server	RTX PRO 6000 Blackwell Server Edition (96 GB VRAM, SM 12.0, 188 SMs); 16 vCPU AMD EPYC 9355, 144 GB RAM, 725 GB SSD; Ubuntu 22.04, CUDA driver 580.126.20, Docker container
Production CPU Server	64-core AMD EPYC @ 2.9 GHz, 503 GB RAM, no GPU; bare-metal, Ubuntu 24
Budget Server	28-core Xeon E5-2650L @ 1.7 GHz, 62 GB RAM, NVIDIA T1000 8 GB; bare-metal
Competitor Server	44-core Xeon Gold 6152, 187 GB RAM, Tesla T4 16 GB; bare-metal. Used for LDBC SF1 competitor testing.

Dataset

# Friendster (SNAP)
wget https://snap.stanford.edu/data/bigdata/communities/com-friendster.ungraph.txt.gz
gunzip com-friendster.ungraph.txt.gz
# 65,608,366 vertices, 1,806,067,135 undirected edges
# Stored as 3,612,134,270 bidirectional edges
# File: 31 GB, tab-separated, # comment lines
# SHA-256 of uncompressed: verify with sha256sum
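
The relationship between the two edge counts above is simply "each undirected edge stored once per direction". The sketch below parses SNAP-style edge-list text and symmetrizes it; the two sample edges are made-up toy data, and `read_snap_edges` and `symmetrize` are hypothetical helper names, not part of the published scripts.

```python
import io

def read_snap_edges(lines):
    """Yield (u, v) integer pairs from SNAP edge-list text, skipping '#' comments."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        u, v = line.split()
        yield int(u), int(v)

def symmetrize(edges):
    """Emit each undirected edge in both directions."""
    for u, v in edges:
        yield u, v
        yield v, u

# Toy stand-in for com-friendster.ungraph.txt (tab-separated, '#' comment lines).
sample = io.StringIO("# Undirected graph: sample\n# FromNodeId\tToNodeId\n101\t102\n101\t104\n")
stored = list(symmetrize(read_snap_edges(sample)))
print(len(stored))                        # 4

# The same doubling applied to the real edge count:
print(2 * 1_806_067_135)                  # 3612134270
```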

xrayGraphDB Setup

# 1. Set kernel parameter (required for large graph builds)
sudo sysctl -w vm.max_map_count=1048576

# 2. Start xrayGraphDB (Docker, GPU-enabled)
docker run -d --user 0:0 --gpus all --shm-size 10g --network=host \
  -v /var/lib/xraygraphdb:/var/lib/xraygraphdb \
  -v /usr/local/cuda-12.4/targets/x86_64-linux/lib:/usr/local/cuda/lib64:ro \
  -e LD_LIBRARY_PATH=/usr/lib/xraygraphdb/lib:/usr/local/cuda/lib64 \
  --name xg-bench \
  xraygraphdb.emtailabs.com/xraygraphdb:latest \
  --data-directory=/var/lib/xraygraphdb \
  --bolt-port=7687 --xray-port=7689 \
  --storage-properties-on-edges=true \
  --log-level=INFO --also-log-to-stderr=true \
  --license-acknowledge-saved=true \
  --init-admin-user=admin --init-admin-password=xraygraphdb \
  --init-admin-tenant=xraygraphdb

# 3. Load Friendster (copy into container, then import)
docker cp com-friendster.ungraph.txt xg-bench:/tmp/xraygraphdb-import/
# Then via xgdb_connect Python client:
from xgdb_connect.protocol import XrayProtocolClient
c = XrayProtocolClient(host="127.0.0.1", port=7689,
    auth_token="admin:xraygraphdb", database="xraygraphdb")
result = c.bulk_import_file("/tmp/xraygraphdb-import/com-friendster.ungraph.txt")
# See scripts/ directory for exact commands
# Result: 65,608,366 vertices, 3,612,134,270 edges in ~375s

Running the Benchmarks

# All scripts at: github.com/eMTAi-Labs/xraygraph-bench/scripts/

# GPU analytics suite (PageRank, TriangleCount, WCC, K-Core, BC, Community)
python3 blackwell_gpu_rerun.py

# Graph500-style TEPS measurement (16 BFS sources)
python3 graph500_teps.py

# Apples-to-apples competitor comparison (same source vertex 71768986)
python3 apples_to_apples.py

# cuGraph head-to-head (same hardware, same dataset)
python3 cugraph_bench.py

# Individual competitor benchmarks
python3 duckdb_bench.py
python3 kuzu_bench.py
python3 neo4j_gds_bench.py
python3 memgraph_bench.py
python3 graphblas_bench.py

Standard Source Vertex

All BFS comparisons use vertex 71768986 (undirected degree 5,214, the highest in Friendster), so every system starts from the same point. Vertex counts at each hop were cross-checked against the systems that completed that hop:

Hop 1: 5,214–5,215 vertices (xrayGraphDB, Kuzu, and DuckDB agree)
Hop 2: 2,151,463 vertices (exact match between xrayGraphDB and Kuzu; DuckDB timed out at hop 2)
Hop 3: 35,113,876 vertices (only xrayGraphDB reached this)

Algorithm Parameters

PageRank	20 iterations, damping=0.85, tolerance=0.0
Triangle Count	Undirected, verified against SNAP ground truth (4,173,724,142)
Betweenness Centrality	Approximate, 50 sampled sources, epsilon=0.05
Community Detection	Label propagation, 20 iterations
K-Core	Full core decomposition (max core = 304)
BFS / TEPS	Native BFS, OUTGOING direction, up to 20 hops
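
To show how the PageRank parameters above fit together, here is a minimal power-iteration sketch with a fixed iteration count and damping of 0.85. Reading tolerance=0.0 as "never stop early" is an assumption on our part, and the function below is a toy illustration on a three-vertex graph, not the engine's procedure.

```python
def pagerank(out_edges, iterations=20, damping=0.85):
    """Fixed-iteration PageRank over a dict of vertex -> list of out-neighbors."""
    nodes = list(out_edges)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):           # no convergence check: always runs all rounds
        # Rank held by vertices with no out-edges is spread evenly over all vertices.
        dangling = sum(rank[v] for v in nodes if not out_edges[v])
        nxt = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for u in nodes:
            if out_edges[u]:
                share = damping * rank[u] / len(out_edges[u])
                for v in out_edges[u]:
                    nxt[v] += share
        rank = nxt
    return rank

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
print(ranks)
```

With a fixed iteration count the running time depends only on graph size, which makes runs comparable across systems regardless of how quickly each would converge.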

Competitor Versions Tested

cuGraph	26.02.00 (RAPIDS), cuDF 26.02.01, pip install cugraph-cu12
Neo4j	2025.04.0 Community (tarball), no GDS plugin available
Memgraph	2.22.0 (deb package), native install, no MAGE
Kuzu	0.11.3 (pip install kuzu), embedded
DuckDB	1.5.2 (pip install duckdb), embedded
GraphBLAS	SuiteSparse 9.4.5 via python-graphblas 2025.2.0

cuGraph Failure Analysis

cuGraph failed 7 of 8 algorithms on the same 96 GB Blackwell GPU. Three separate failures prevented undirected graph construction:

cuDF CSV parser OOM: cudf.read_csv() consumed 92 GB of 96 GB VRAM parsing the 31 GB text file before crashing.
CSR sort+symmetrize OOM: After CPU-read fallback, cuGraph's undirected CSR builder OOM'd during radix sort — even with 82 GB VRAM free.
cudf int32 size_type limit: Pre-symmetrized 3.6B rows exceed cudf's int32 offset maximum (2,147,483,647). Fundamental limitation.
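
The third failure is plain arithmetic, shown below using the edge count from the Dataset section. The directed edge list fits under the signed 32-bit maximum; the symmetrized one does not.

```python
INT32_MAX = 2**31 - 1                      # 2,147,483,647
undirected_edges = 1_806_067_135           # from the Dataset section
symmetrized_rows = 2 * undirected_edges    # one row per direction

print(symmetrized_rows)                    # 3612134270
print(symmetrized_rows > INT32_MAX)        # True: too many rows for an int32 index
print(undirected_edges <= INT32_MAX)       # True: the directed edge list still fits
```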

Only a directed graph (1.8B edges) could be constructed. On that directed graph, cuGraph BFS achieved 20.3 GTEPS (82ms) — but PageRank failed (FailedToConvergeError on asymmetric link structure), and all algorithms requiring undirected input returned “input graph must be undirected.”

Important: cuGraph is not “bad.” It is optimized for specialized GPU graph kernels with maximum throughput. xrayGraphDB is optimized for persistent relationship-intelligence workloads at billion-edge scale. These are different system categories with different tradeoffs, and the comparison reflects that distinction honestly.

Raw Data & Scripts

Everything is published at github.com/eMTAi-Labs/xraygraph-bench:

results/BLACKWELL-GPU-RERUN-20260510.md — full GPU rerun analysis
results/CUGRAPH-COMPARISON-20260509.md — cuGraph head-to-head writeup
results/blackwell_gpu_rerun_20260510.json — raw JSON results
results/cugraph_blackwell.json — cuGraph raw JSON
results/apples_to_apples_blackwell.log — 7-system BFS comparison log
BENCHMARK-METHODOLOGY.md — 15-rule courtroom-clean methodology
REPRODUCIBILITY.md — step-by-step reproduction guide

If you can reproduce a different result, we want to know. File an issue.
