The Blackwell Ultra NVL72 platform has emerged as the performance leader in agentic AI infrastructure, according to the first round of results from AgentPerf, a new benchmark designed specifically for multi-step AI workflows. Unlike traditional benchmarks that measure single LLM calls, AgentPerf evaluates how systems handle the complex, chained operations typical of agentic AI—where dozens or hundreds of model interactions occur before a task is completed.

This shift in measurement reflects a fundamental change in how AI is being deployed at scale. Agents don't just respond to prompts; they break tasks into sub-tasks, execute tool calls like code compilation and database searches, and iterate based on results. Because every step builds on the output of the steps before it, the performance demands compound across a task rather than simply adding up, stressing infrastructure in ways that single-call benchmarks don't capture.
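One way to see why the load compounds is to count how many tokens a model has to read over the course of one task. The sketch below is a simplified illustration, not part of AgentPerf: it assumes the full context is re-read at every step with no cache reuse, and the function name and all token counts are made up for the example.

```python
def total_tokens_processed(steps: int, prompt_tokens: int, tokens_added_per_step: int) -> int:
    """Sum the context the model reads across every step of one agent task.

    Assumption: the whole context is re-read on each step (no prefix or
    KV-cache reuse), and each step appends a fixed number of tokens
    (model output plus tool result) to the context.
    """
    total = 0
    context = prompt_tokens
    for _ in range(steps):
        total += context                    # model reads the full context on this step
        context += tokens_added_per_step    # output and tool result are appended
    return total


# Hypothetical sizes: a 1,000-token prompt growing by 500 tokens per step.
single_call = total_tokens_processed(steps=1, prompt_tokens=1000, tokens_added_per_step=500)
agent_task = total_tokens_processed(steps=50, prompt_tokens=1000, tokens_added_per_step=500)

print(single_call)  # 1000
print(agent_task)   # 662500
```

Under these assumptions a 50-step task reads far more than 50 times the tokens of a single call, because the context keeps growing. Real serving stacks reduce this with caching, so the sketch shows the shape of the problem rather than an actual cost.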

On the DeepSeek V4 Pro model—a large mixture-of-experts (MoE) architecture—Blackwell Ultra NVL72 delivers up to 20x more agents per megawatt than NVIDIA's Hopper-based HGX H200 system. The advantage is consistent across service-level objectives, whether targeting 20 or 60 tokens per second per agent.
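For readers unfamiliar with the metric, agents per megawatt is simply the number of concurrent agents a system sustains divided by its power draw in megawatts. The sketch below shows the arithmetic. The function name is hypothetical, and the agent counts and power figures are placeholders chosen only so the ratio comes out to 20; they are not measured values for either system.

```python
def agents_per_megawatt(concurrent_agents: int, system_power_kw: float) -> float:
    """Normalize a concurrency figure by power draw (kilowatts in, per-megawatt out)."""
    if system_power_kw <= 0:
        raise ValueError("system_power_kw must be positive")
    return concurrent_agents * 1000.0 / system_power_kw


# Placeholder inputs, NOT benchmark data.
baseline = agents_per_megawatt(concurrent_agents=40, system_power_kw=10.0)
candidate = agents_per_megawatt(concurrent_agents=9600, system_power_kw=120.0)

print(baseline)              # 4000.0
print(candidate)             # 80000.0
print(candidate / baseline)  # 20.0
```

Normalizing by power rather than by accelerator count lets a small node and a full rack be compared on the same footing.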

That performance gap isn't just about raw throughput; it's a result of extreme codesign across the full stack. Blackwell Ultra NVL72 connects 72 GPUs into a single rack-scale system, enabling efficient distribution of model execution at scale. CUDA kernels build on this by overlapping communication with compute, so data transfers are hidden behind useful work instead of adding to latency. TensorRT LLM sustains efficiency even as concurrent sessions scale by separating input processing from output generation, so that each phase can be tuned independently.
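Separating input processing from output generation means each phase can be given its own share of hardware. The sketch below illustrates that idea as a simple capacity calculation. It is not TensorRT LLM's API or its scheduling logic; the function, parameter names, and all throughput figures are hypothetical.

```python
import math
from typing import Tuple


def size_pools(
    input_tokens_per_sec: int,
    output_tokens_per_sec: int,
    prefill_tokens_per_sec_per_gpu: int,
    decode_tokens_per_sec_per_gpu: int,
) -> Tuple[int, int]:
    """Return (prefill_gpus, decode_gpus) needed to cover each phase's load.

    Assumption: the two phases run on separate GPU pools, and each pool's
    throughput scales linearly with the number of GPUs in it.
    """
    prefill_gpus = math.ceil(input_tokens_per_sec / prefill_tokens_per_sec_per_gpu)
    decode_gpus = math.ceil(output_tokens_per_sec / decode_tokens_per_sec_per_gpu)
    return prefill_gpus, decode_gpus


# Hypothetical workload: lots of input (long contexts), less output.
prefill_gpus, decode_gpus = size_pools(
    input_tokens_per_sec=400_000,
    output_tokens_per_sec=30_000,
    prefill_tokens_per_sec_per_gpu=50_000,
    decode_tokens_per_sec_per_gpu=2_500,
)
print(prefill_gpus, decode_gpus)  # 8 12
```

Because the two pools are sized from different inputs, a change in context length moves only the first number and a change in the token-rate target moves only the second.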

Blackwell Ultra NVL72 Sets New Standard for Agentic AI Infrastructure

The benchmark's methodology is rooted in real-world agentic workflows, drawing from actual coding trajectories across 12+ programming languages. It measures how many agentic tasks a platform can support simultaneously while meeting strict thresholds for responsiveness and token output rate. Tool calls are simulated with representative CPU processing times, ensuring that performance differences reflect accelerated computing rather than external factors.
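The core measurement described above, the number of simultaneous agents a platform can sustain while meeting a threshold, can be sketched as a search over concurrency levels. This is a simplified illustration and not AgentPerf's actual harness: it checks only the per-agent token rate (not the responsiveness threshold), it assumes the per-agent rate never rises as concurrency grows, and `toy_rate` is a made-up stand-in for a real measurement.

```python
from typing import Callable


def max_agents_meeting_slo(
    tokens_per_sec_at: Callable[[int], float],
    min_tokens_per_sec: float,
    upper_bound: int,
) -> int:
    """Largest concurrency in [0, upper_bound] whose per-agent rate meets the SLO.

    Assumption: tokens_per_sec_at is non-increasing in concurrency, which is
    what makes binary search valid here.
    """
    lo, hi = 0, upper_bound
    while lo < hi:
        mid = (lo + hi + 1) // 2  # round up so the loop always makes progress
        if tokens_per_sec_at(mid) >= min_tokens_per_sec:
            lo = mid       # mid agents still meet the SLO; try more
        else:
            hi = mid - 1   # too many agents; try fewer
    return lo


def toy_rate(concurrency: int) -> float:
    """Hypothetical system: 12,000 tokens/s shared evenly, capped at 100 per agent."""
    return min(100.0, 12_000.0 / concurrency)


at_20 = max_agents_meeting_slo(toy_rate, min_tokens_per_sec=20.0, upper_bound=10_000)
at_60 = max_agents_meeting_slo(toy_rate, min_tokens_per_sec=60.0, upper_bound=10_000)
print(at_20)  # 600
print(at_60)  # 200
```

In this toy model, tightening the target from 20 to 60 tokens per second per agent cuts the supported concurrency to a third, which is why results are reported at more than one service-level objective.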

For enterprises deploying AI agents at scale, the implications are clear: infrastructure choices now hinge on how many concurrent agentic tasks can be run per accelerator and per megawatt of power. The numbers directly translate to productivity gains—how much useful work an investment in AI infrastructure can actually deliver.

Early adopters like Baseten, DeepInfra, and Together AI are already leveraging Blackwell's performance for production applications. Together AI, for example, powers Cursor, an agentic coding platform that debugs issues, generates features, and executes refactors in real time—all on Blackwell infrastructure. DeepInfra deploys agents for Pam.ai, an AI workforce platform for car dealerships, handling service appointments, calls, and sales campaigns entirely on Blackwell.

The Vera Rubin architecture, now in full production, promises to build on Blackwell's lead as agentic AI demands grow. Performance on current hardware is also expected to improve with continued software optimization, but the results already set a new baseline for what enterprises should expect from AI infrastructure today.