The supercomputing landscape is fracturing. What was once a relatively unified world of massive multi-processor x86 systems has splintered into competing architectures, each racing to serve radically different masters: traditional academic workloads, extreme-scale physics simulations, and the voracious appetite of AI training runs.
At the heart of this upheaval stands Nvidia, whose GPU revolution has not simply made inroads; it has detonated the old order entirely.
The effects are stark. Legacy storage systems that powered decades of scientific breakthroughs now buckle under AI's relentless, random I/O storms. Facilities designed for sequential throughput face a new reality where metadata can consume 20 percent of all I/O operations. And as GPU clusters scale into the thousands, a brutal economic truth emerges: every second of GPU idle time bleeds money, transforming storage from a support function into a make-or-break competitive advantage.
We sat down with Ken Claffey, CEO of VDURA, to understand how this seismic shift is forcing a complete rethink of supercomputing infrastructure, from hardware to software, from architecture to economics.
Blocks & Files: How do you define a supercomputer and an HPC system? What are the differences between them?
Ken Claffey: The lines are definitely gray and increasingly blurred. Historically the delineation has really been about the size (number of nodes) of the system, as Linux clusters of commodity servers became the de facto building block (versus the earlier custom supercomputers like the early Cray systems or NEC vector supercomputers). Today the traditional segmentation of Workgroup, Departmental, Divisional and Supercomputer probably needs further updating, as a small GPU cluster's dollar value is now such that it could be categorized by the analysts as a supercomputer sale.
Blocks & Files: What different kinds of supercomputer are there, and do they differ by workload and processors?
Ken Claffey: Not all supercomputers are the same. There are Linux Cluster supercomputers. These dominate today's Top500 list. They're built from thousands of commodity servers linked via InfiniBand, Ethernet, or proprietary interconnects. Variants include:
- Massively parallel clusters with distributed memory (e.g., the DOE's Frontier). Each node runs its own OS and communicates via message passing.
- Commodity clusters built from off-the-shelf x86/GPU servers; hyperscale AI clusters fall here.
Different workloads favor different architectures: CPU-heavy, GPU-heavy, or memory-centric. Weather and physics simulations benefit from vector or massively parallel clusters with low-latency interconnects.
Modern AI training typically uses GPU-heavy commodity clusters.
Special purpose systems serve narrow domains like cryptography or pattern matching, but are gaining traction again in AI-related use cases, especially for inference (Groq, SambaNova, etc.).
Blocks & Files: Is an Nvidia NVL72 rack-scale GPU server a supercomputer?
Ken Claffey: Nvidia describes its GB200 NVL72 as an "exascale AI supercomputer in a rack." Each NVL72 encloses 18 compute trays (72 Blackwell GPUs coupled with Grace CPUs) tied together by fifth-generation NVLink switches delivering 130 TBps of interconnect bandwidth. The NVLink fabric creates a single unified memory domain with over 1 petabyte per second of aggregate bandwidth, and one NVL72 rack can deliver 80 petaflops of AI performance with 1.7 TB of unified HBM memory.
From a purist HPC perspective, a single NVL72 is more accurately a rack-scale building block than a full supercomputer; it lacks the external storage and cluster management layers needed for full-blown HPC. But when tens or hundreds of NVL72 racks are interconnected with high-performance storage (for example, VDURA V5000), the resulting system absolutely qualifies as a supercomputer. So NVL72 sits on the boundary: an extremely dense GPU cluster that can be part of a larger HPC system.
Blocks & Files: Do you think the Nvidia GPU HBM will or can transfer to other types of supercomputer? Why did Nvidia get HBM developed and not other supercomputer types?
Ken Claffey: High bandwidth memory (HBM) stacks DRAM dies with through-silicon vias to provide thousand-bit-wide interfaces; HBM3e can deliver up to 1.8 TB/s per GPU. HBM isn't unique to Nvidia: AMD's MI300A/MI300X, Intel's Ponte Vecchio and many AI accelerators use HBM, because streaming data at terabyte-per-second speeds is essential for feeding hungry cores. HBM adoption depends on economics and package design: GPUs can justify the cost because they deliver very high flops per watt, whereas general purpose CPUs typically rely on DDR/LPDDR memory with lower bandwidth.
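To put that gap in rough perspective, here is a minimal back-of-the-envelope comparison; the DDR5 transfer rate and channel count are illustrative assumptions about a typical server CPU, not figures from the interview:

```python
# Rough, illustrative comparison of per-device memory bandwidth.
# The DDR5 figures are assumptions for a typical server CPU, not vendor specs.

DDR5_MTS = 5600              # assumed DDR5 transfer rate (MT/s)
BYTES_PER_TRANSFER = 8       # 64-bit channel width
CHANNELS = 12                # assumed memory channels per socket

ddr5_gbs = DDR5_MTS * BYTES_PER_TRANSFER * CHANNELS / 1000   # GB/s per socket
hbm3e_gbs = 1800             # ~1.8 TB/s per GPU, as cited above

print(f"DDR5 ({CHANNELS} channels): ~{ddr5_gbs:.0f} GB/s")
print(f"HBM3e per GPU:        ~{hbm3e_gbs} GB/s")
print(f"Ratio:                ~{hbm3e_gbs / ddr5_gbs:.1f}x")
```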
Nvidia's leadership in GPU HBM has been driven by AI's insatiable demand for memory bandwidth. GPU vendors co-design the silicon with HBM suppliers (Samsung, Micron, SK Hynix) to maximize bandwidth. Traditional supercomputer vendors typically focus on CPU-centric workloads where large DDR memory footprints matter more than raw bandwidth. We expect HBM to proliferate in GPU-based AI systems and some CPU architectures, but commodity servers will continue to balance cost and capacity with DDR memory. Ultimately, memory technology will spread where the economics make sense.
Blocks & Files: How is the world of supercomputing reacting to AI workloads such as training and inference?
Ken Claffey: The AI revolution has turned HPC services into AI factories. It is clear from clients that their software panorama is altering as their customers deploy increasingly AI based mostly purposes which is creating new challenges for the HPC infrastructure as they enhance the variety of GPUs of their clusters. This in flip impacts storage as AI purposes are GPU centric and create spiky, random I/O patterns, inflicting metadata to develop into 10–20 p.c of I/O. Each coaching and inference require sustained throughput: Nvidia recommends 0.5 GBps reads and 0.25 GBps writes per GPU for DGX B200 servers and as much as 4 GBps per GPU for imaginative and prescient workloads. Meaning a ten,000 GPU cluster wants 5 TBps learn and 2.5 TBps write bandwidth.
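As a quick sanity check of those figures, the aggregate requirement is simply the per-GPU rate multiplied by the GPU count; a minimal sketch using only the numbers quoted above:

```python
# Back-of-the-envelope storage bandwidth sizing from per-GPU guidance.
# Inputs are the per-GPU recommendations quoted above for DGX B200 servers.

GPUS = 10_000
READ_PER_GPU_GBPS = 0.5    # recommended sustained read per GPU
WRITE_PER_GPU_GBPS = 0.25  # recommended sustained write per GPU

read_tbps = GPUS * READ_PER_GPU_GBPS / 1000
write_tbps = GPUS * WRITE_PER_GPU_GBPS / 1000

print(f"Aggregate read:  {read_tbps:.1f} TBps")   # 5.0 TBps
print(f"Aggregate write: {write_tbps:.1f} TBps")  # 2.5 TBps
```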
To meet this demand, HPC centers are embracing parallel file systems and NVMe-first architectures. AI training still relies on high-throughput parallel file systems to feed GPUs and handle massive checkpointing, while inference workloads shift toward object stores and key-value semantics, requiring strong metadata performance and multi-tenancy. The rise of GPU accelerators has shifted I/O patterns from large sequential writes to highly random, small-file operations. Consequently:
- HPC facilities are upgrading networks to InfiniBand NDR and 400 Gb/s Ethernet and deploying NVMe-based storage servers to saturate GPUs.
- Vendors are adding GPU Direct and RDMA-based I/O paths to bypass CPU bottlenecks and reduce latency.
- AI and HPC teams increasingly treat data pipelines as production lines, emphasizing resilience and automation. VDURA's white paper highlights how GPU idle time and slow checkpointing waste money, prompting new storage architectures that minimize stalls (a toy cost model follows below).
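To illustrate why checkpoint stalls translate directly into money, here is a toy cost model; the GPU hourly cost, checkpoint size, and checkpoint frequency are illustrative assumptions, not VDURA or Nvidia figures:

```python
# Toy model of the cost of GPUs idling while a synchronous checkpoint drains.
# All inputs are illustrative assumptions, not figures from VDURA or Nvidia.

GPUS = 10_000
GPU_COST_PER_HOUR = 3.00        # assumed $/GPU-hour
CHECKPOINT_TB = 50.0            # assumed checkpoint size (TB)
WRITE_BW_TBPS = 2.5             # aggregate write bandwidth from the sizing above
CHECKPOINTS_PER_DAY = 24        # assumed hourly checkpointing

stall_seconds = CHECKPOINT_TB / WRITE_BW_TBPS          # time GPUs wait per checkpoint
idle_gpu_hours = GPUS * stall_seconds / 3600 * CHECKPOINTS_PER_DAY
daily_cost = idle_gpu_hours * GPU_COST_PER_HOUR

print(f"Stall per checkpoint: {stall_seconds:.0f} s")
print(f"Idle GPU-hours/day:   {idle_gpu_hours:.0f}")
print(f"Wasted spend/day:     ${daily_cost:,.0f}")
```

The point of the sketch is that the waste scales linearly with both cluster size and checkpoint frequency, which is why slow checkpointing becomes an economic problem rather than just an inconvenience.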
Blocks & Files: How has supercomputing and HPC storage evolved? What are the main threads?
Ken Claffey: HPC storage has evolved from proprietary, hardware-bound architectures to software-defined, scale-out systems designed for AI and GPU-driven workloads. Moreover, while HPC was very much designed around temporary, performant /scratch file systems, AI is more focused on sustained performance and a broader SLA that cares much more about operational reliability.
- From proprietary to software-defined: Early HPC relied on closed systems with HA pairs and dedicated RAID controllers. Modern platforms have shifted to SDS models aligned with hyperscaler designs: shared-nothing architectures that scale horizontally across commodity hardware with NVMe nodes and open supply chains.
- Flash & HDDs, not flash-only: The move from HDD to NVMe flash brought massive performance gains, but efficiency at scale now depends on using the full spectrum of media (SLC, TLC, and QLC flash plus CMR/SMR HDDs) to balance throughput, IOPS, endurance, and cost.
- Metadata and automation: AI's billions of small files make metadata an increasingly likely performance bottleneck and a growing share of the data stored; say 10-20 percent. VDURA's VeLO distributed metadata engine eliminates this bottleneck, supporting billions of operations with ultra-low latency.
- Operational reliability and resilience at scale: Legacy node-local RAID has been replaced by network-level erasure coding for greater resiliency to failures, increasing durability and availability. VDURA goes even further with multi-level erasure coding (MLEC), which achieves better availability and up to 12 nines of durability, ensuring continuous operation.
HPC storage has evolved into AI-ready, software-defined infrastructure: flash-first, media-aware, metadata-accelerated, and operationally resilient enough to keep pace with the fastest GPUs 24 by 7 by 365.
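As a simple illustration of the resilience argument, the arithmetic for a generic k data + m parity erasure-coded layout looks like this; the geometries shown are illustrative examples, not VDURA's actual MLEC parameters:

```python
# Generic k+m erasure coding arithmetic. The layouts below are illustrative
# examples only, not VDURA's actual MLEC geometry.

def ec_summary(k: int, m: int) -> None:
    """Print capacity overhead and fault tolerance for k data + m parity shards."""
    overhead = (k + m) / k          # raw capacity needed per unit of usable data
    efficiency = k / (k + m)        # usable fraction of raw capacity
    print(f"{k}+{m}: survives {m} simultaneous shard/node failures, "
          f"{overhead:.2f}x raw capacity ({efficiency:.0%} usable)")

ec_summary(8, 2)    # RAID-6-like protection, striped across the network
ec_summary(16, 4)   # wider stripe: same 80% efficiency, tolerates 4 failures
ec_summary(9, 3)    # tighter protection at 75% usable capacity
```

Spreading those shards across nodes rather than keeping RAID local to each node is what lets the system ride out whole-node failures without an outage.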
Blocks & Files: What are the main supercomputer storage systems and how do they differ?
Ken Claffey: Supercomputing storage has diverged along a clear line between legacy, hardware-bound systems and modern, software-defined architectures built for AI and data-intensive workloads.
The industry is moving on from hardware-defined "systems" (controller pairs, proprietary arrays) to software-defined storage (SDS) "platforms" that run on commodity NVMe and HDD media. SDS enables faster innovation, mixed-media tiering (SLC, TLC, QLC flash plus CMR/SMR HDD), metadata acceleration, and cloud-like scalability – the foundation of VDURA's architecture.
Blocks & Files: Why are there so many of them? Are they suited to different supercomputing workloads?
Ken Claffey: While the HPC ecosystem appears diverse, only a small group of file systems have been proven at production scale across thousands of environments. Many others remain research projects or niche deployments.
- Legacy systems vs. software-defined platforms: Legacy HPC file systems like Lustre or GPFS are hardware-tied and manually scaled. Modern parallel file systems such as VDURA's PanFS represent software-defined platforms that separate the control and data planes, align with hyperscaler-style shared-nothing architectures, and run on commodity NVMe and HDD supply chains.
- Projects vs. products: Open-source efforts (e.g., DAOS) push innovation but often remain project-grade, while commercial SDS platforms evolve, through long-term investment and continuous development, into hardened products that balance performance, manageability, and long-term support.
- Workload alignment: AI and HPC workloads vary widely; some stream multi-terabyte sequential data, others read billions of tiny files randomly. No single file system can optimize all cases, so purpose-built storage is replacing general-purpose designs like NAS- and SAN-based systems. Hybrid SDS platforms like VDURA mix flash and HDD tiers, handle metadata acceleration, offer nearly limitless linear performance scalability, and deliver the availability and durability today's AI factories demand.
There may be many names in HPC storage, but only a few truly operate at scale in production environments, and the clear direction is away from legacy hardware systems toward flexible, software-defined, purpose-built data platforms.
Blocks & Files: Why is it that DAOS has not become more popular?
Ken Claffey: DAOS is an open-source project. At this point, it's viewed more as a set of technologies than a finished product. It's now housed at HPE, and I expect they'll invest to make it a true product, much like I did with Lustre at ClusterStor. That will take several years of heavy investment, large-scale deployments, and operational maturity to take it from "project" to "product."
Blocks & Files: How might VDURA use DAOS? Could PanFS evolve to use DAOS concepts?
Ken Claffey: We see the key-value store (KVS) metadata approach as directionally correct, much as PanFS has long operated with its own built-in KVS. This same concept is now reflected in the VDURA Data Platform, where we've further advanced and scaled our metadata engine to meet the demands of modern AI and HPC workloads.
Blocks & Files: There are IOPS and throughput. Tell me why throughput matters for AI workloads.
Ken Claffey: IOPS (input/output operations per second) measures how many small 4 KiB operations a storage system can perform. It's a fine metric for transactional databases and VMs. But AI and HPC workloads stream large datasets and checkpoints. Focusing on IOPS can mislead: AI workloads are throughput-driven, measured in GBps or TBps, because they move large, sequential datasets. High bandwidth ensures that GPUs stay busy and that checkpointing doesn't stall training. Parallel file systems distribute data across many nodes to deliver this aggregate bandwidth. Without sufficient throughput, GPUs are starved and expensive compute cycles are wasted.
VDURA's V5000 system delivers >60 GBps per node and >2 TBps per rack. This ensures that AI pipelines are limited by model complexity, not storage. VDURA also provides up to 100 million IOPS per rack, so it handles metadata-heavy inference workloads as well. The lesson: throughput and IOPS both matter, but for AI training, throughput is king.
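Tying those rack-level numbers back to the per-GPU guidance quoted earlier, here is a rough sketch of how many GPUs one storage rack's read bandwidth could keep fed; illustrative arithmetic only, not a sizing recommendation:

```python
# Rough feed-rate arithmetic: how many GPUs a given aggregate read bandwidth
# can keep busy at the per-GPU rates quoted earlier. Illustrative only.

RACK_READ_TBPS = 2.0            # >2 TBps per rack, as quoted above
PER_GPU_READ_GBPS = 0.5         # DGX B200 guidance quoted earlier
PER_GPU_VISION_GBPS = 4.0       # vision-workload guidance quoted earlier

llm_gpus = RACK_READ_TBPS * 1000 / PER_GPU_READ_GBPS
vision_gpus = RACK_READ_TBPS * 1000 / PER_GPU_VISION_GBPS

print(f"LLM-style training: ~{llm_gpus:.0f} GPUs per storage rack")
print(f"Vision workloads:   ~{vision_gpus:.0f} GPUs per storage rack")
```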
Blocks & Files: Do parallel storage systems bring specific advantages to supercomputers that non-parallel (serial?) storage systems can't provide?
Ken Claffey: Absolutely. Non-parallel NAS systems like NetApp ONTAP rely on a small number of controllers handling I/O. As I previously pointed out, general purpose NAS can't deliver the throughput or resiliency required for AI. NetApp's AFX is their attempt at a parallel file system. Mainstream storage systems were designed for general purpose computing.
In a clear acknowledgement of what advanced AI computing requires, NetApp has stated that it needs a new type of product that is a parallel file system. They weren't prepared for the future and now they're trying to catch up.
Blocks & Files: Is GPU Direct a way of making non-parallel storage systems, like NetApp, effectively parallel?
Ken Claffey: No. If you're not parallel, you're limited to how fast the one path can go. Sure, GPU Direct can make that one path go faster, but that's not as scalable as a parallel file system that can go down many paths simultaneously, especially when those parallel paths are GPU Direct-enabled.
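A trivial way to see the scaling difference; the per-path bandwidths below are illustrative assumptions, not measured figures for any product:

```python
# Illustrative comparison: one accelerated path vs many parallel paths.
# Per-path bandwidths are assumptions, not measured figures.

single_path_gbps = 40           # assume one NAS controller path, GPU Direct enabled
parallel_paths = 32             # assume a parallel file system striping across 32 nodes
per_path_gbps = 25              # assume each individual path is slower

print(f"Single accelerated path: {single_path_gbps} GBps")
print(f"{parallel_paths} parallel paths:       {parallel_paths * per_path_gbps} GBps aggregate")
```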
Blocks & Files: Now that VDURA's PanFS supports GPU Direct, how else might VDURA adapt it to serve Nvidia GPU servers better? For example, KV Cache offload.
Ken Claffey: We're working on things in this area, stay tuned. ®



