NVIDIA and CoreWeave Achieve Record-Breaking Graph Processing Performance

In a significant advancement for high-performance computing, NVIDIA has set a record on the Graph500 benchmark, demonstrating graph processing speed and efficiency at scale. The winning run clocked in at 410 trillion traversed edges per second (TEPS), placing it at the top of the 31st Graph500 breadth-first search (BFS) list with more than double the performance of comparable solutions, including those used by national laboratories.

This remarkable feat was made possible through a commercially available cluster hosted in a CoreWeave data center. The system utilized 8,192 NVIDIA H100 GPUs to process an immense graph containing 2.2 trillion vertices and 35 trillion edges. To put this performance into perspective: if every person on Earth had 150 friends (representing 1.2 trillion connections), the system could search through all those relationships in approximately three milliseconds.
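The three-millisecond figure follows directly from the numbers quoted above, and a short calculation makes that explicit. The sketch below also checks that the graph size is consistent with a Graph500 problem of scale 41 and edge factor 16; those two parameters are an inference from the quoted vertex and edge counts, not something the article states.

```python
# Figures quoted in the article.
TEPS = 410e12          # traversed edges per second
FRIEND_EDGES = 1.2e12  # roughly 8 billion people x 150 friends


def traversal_time_ms(edges: float, teps: float) -> float:
    """Milliseconds needed to traverse `edges` edges at `teps` edges per second."""
    return edges / teps * 1e3


social_graph_ms = traversal_time_ms(FRIEND_EDGES, TEPS)

# Assumption (not stated in the article): Graph500 scale 41, edge factor 16.
SCALE, EDGE_FACTOR = 41, 16
vertices = 2 ** SCALE
edges = EDGE_FACTOR * vertices

print(f"social graph search: {social_graph_ms:.2f} ms")
print(f"{vertices / 1e12:.2f} trillion vertices, {edges / 1e12:.1f} trillion edges")
```

Under those assumptions the vertex and edge counts round to the 2.2 trillion and 35 trillion reported for the run.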

Efficiency Beyond Speed

Raw speed is only part of the achievement; the efficiency of the solution stands out as well. Other top contenders on the Graph500 list required around 9,000 nodes, while the winning run used just over 1,000 nodes, delivering three times better performance per dollar.

This efficiency is attributed to NVIDIA's full-stack approach, combining advanced compute (H100 GPUs), networking (Spectrum-X), and software technologies, including the CUDA platform and a new active messaging library. This integrated design minimizes hardware footprint while maximizing processing power.

Understanding Graph Processing at Scale

Graphs are fundamental to many modern technologies, underpinning everything from social networks and banking applications to cybersecurity systems. They represent relationships between data points in complex webs of information. For example, on LinkedIn, a user's profile is a vertex, and connections to other users are edges.
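To make the vertex and edge terminology concrete, here is a minimal sketch of how a tiny social graph might be stored as an adjacency list. The user names and the `build_adjacency` helper are invented for illustration and do not come from any real service.

```python
from collections import defaultdict


def build_adjacency(edge_list):
    """Build an undirected adjacency list from (u, v) pairs."""
    adjacency = defaultdict(set)
    for u, v in edge_list:
        adjacency[u].add(v)  # u is connected to v
        adjacency[v].add(u)  # and v is connected to u
    return adjacency


# Hypothetical users (vertices) and their connections (edges).
connections = [("ana", "ben"), ("ana", "chen"), ("ben", "dee")]
graph = build_adjacency(connections)

num_vertices = len(graph)
# Each undirected edge is stored twice, once per endpoint.
num_edges = sum(len(neighbors) for neighbors in graph.values()) // 2

print(num_vertices, "vertices,", num_edges, "edges")
```

Most vertices here link to only one or two others, which is the sparse and irregular structure that makes large graphs hard to process efficiently.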

The Graph500 BFS benchmark serves as the industry standard for measuring a system’s ability to navigate these irregular and often sparse graphs at scale. A high TEPS score indicates superior interconnectivity between compute nodes, ample memory bandwidth, and software optimized to leverage the system's capabilities – validating the entire engineering of the system.
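As a rough illustration of what the benchmark measures, the following single-machine sketch runs a breadth-first search over a small synthetic ring graph and reports a TEPS-style rate. It is a simplification: it counts every edge inspected during the search, whereas the official benchmark has its own rules for graph generation, edge counting, and result validation, none of which are reproduced here.

```python
import time
from collections import deque


def bfs(adjacency, root):
    """Queue-based BFS. Returns (parent map, number of edges inspected)."""
    parent = {root: root}
    frontier = deque([root])
    edges_traversed = 0
    while frontier:
        u = frontier.popleft()
        for v in adjacency[u]:
            edges_traversed += 1
            if v not in parent:  # first time v is reached
                parent[v] = u
                frontier.append(v)
    return parent, edges_traversed


# Synthetic test graph (an assumption): a ring of n vertices.
n = 10_000
adjacency = {i: [(i + 1) % n, (i - 1) % n] for i in range(n)}

start = time.perf_counter()
parent, edges_traversed = bfs(adjacency, root=0)
elapsed = time.perf_counter() - start

# Guard against a zero reading from the timer on very fast runs.
teps = edges_traversed / max(elapsed, 1e-9)
print(f"{edges_traversed} edges in {elapsed:.4f} s -> {teps:,.0f} TEPS")
```

The rate printed by a script like this depends entirely on the machine and the interpreter, so it is only useful for seeing how the metric is formed, not for comparison with Graph500 results.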

Traditional Challenges in Graph Processing

Historically, graph processing has run on CPUs, with data moved across compute nodes to wherever the computation takes place. As graphs grow to trillions of edges, this constant movement creates significant bottlenecks. Developers have mitigated the problem with techniques like active messaging, in which small messages are sent to the node that holds the data and processed there in place. However, existing active messaging approaches were designed for CPUs and are inherently limited by CPU capabilities.
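The active messaging idea can be sketched in a single process: vertices are partitioned across simulated ranks, and instead of fetching remote data, a rank sends a small "visit" message that the owning rank handles against its own local state. Everything below is illustrative, including the `Rank` class, the modulo partitioning, and the inbox model; it is not NVIDIA's library and involves no real network.

```python
from collections import defaultdict, deque

NUM_RANKS = 2  # hypothetical number of compute nodes


def owner(vertex: int) -> int:
    """Rank that stores a vertex (simple modulo partitioning, an assumption)."""
    return vertex % NUM_RANKS


class Rank:
    def __init__(self, rank_id):
        self.rank_id = rank_id
        self.adjacency = defaultdict(list)  # neighbors of owned vertices only
        self.parent = {}                    # BFS parents of owned vertices only
        self.inbox = deque()                # pending active messages

    def handle_visit(self, vertex, via, ranks):
        """Message handler: runs where `vertex` lives and touches only local data."""
        if vertex in self.parent:
            return  # already visited, drop the message
        self.parent[vertex] = via
        for neighbor in self.adjacency[vertex]:
            # Send a small message to the neighbor's owner instead of pulling data.
            ranks[owner(neighbor)].inbox.append((neighbor, vertex))


def distributed_bfs(edges, root):
    ranks = [Rank(r) for r in range(NUM_RANKS)]
    for u, v in edges:
        ranks[owner(u)].adjacency[u].append(v)
        ranks[owner(v)].adjacency[v].append(u)
    ranks[owner(root)].inbox.append((root, root))
    while any(rank.inbox for rank in ranks):
        # Snapshot every inbox so that one round corresponds to one BFS level.
        batches = []
        for rank in ranks:
            batches.append(list(rank.inbox))
            rank.inbox.clear()
        for rank, batch in zip(ranks, batches):
            for vertex, via in batch:
                rank.handle_visit(vertex, via, ranks)
    parent = {}
    for rank in ranks:
        parent.update(rank.parent)
    return parent


tree = distributed_bfs([(0, 1), (1, 2), (2, 3), (0, 4)], root=0)
print(tree)
```

Each message carries only a vertex id and a candidate parent, so the bulk adjacency data never leaves the rank that owns it. In a real cluster the cost then shifts to how quickly many such small messages can be injected into the network.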

GPU-Accelerated Active Messaging: A Breakthrough

NVIDIA’s solution reimagines graph processing by engineering a full-stack, GPU-only system that enables GPU-to-GPU active messaging. This was achieved through a custom software framework utilizing InfiniBand GPUDirect Async (IBGDA) and the NVSHMEM parallel programming interface.

IBGDA allows GPUs to communicate directly with the network interface card, so hundreds of thousands of GPU threads can send active messages simultaneously, a significant improvement over CPU-based systems. This bypasses the CPU entirely, allowing full use of the massive parallelism and memory bandwidth of NVIDIA H100 GPUs.

The result is a system that can efficiently move data across the network, process it on the receiver, and achieve unprecedented performance while minimizing costs thanks to CoreWeave’s stable infrastructure.

Broader Implications for High-Performance Computing

This breakthrough extends beyond graph processing, impacting fields like fluid dynamics and weather forecasting that rely on similar sparse data structures and communication patterns. For decades, these areas have been constrained by CPU limitations at large scales.

NVIDIA’s success with Graph500 validates a new approach to high-performance computing, opening doors for developers to leverage NVSHMEM and IBGDA to scale their applications efficiently on commercially available infrastructure. This marks a significant step towards democratizing access to advanced computing power for a wider range of industries and research areas.