IBM is the latest tech giant to unveil its own "AI supercomputer," this one composed of a bunch of virtual machines running inside IBM Cloud.

The system, called Vela, which the company claims has been online since May last year, is touted as IBM's first AI-optimized, cloud-native supercomputer, created with the aim of developing and training large-scale AI models.

Before anyone rushes off to sign up for access, IBM stated that the platform is currently reserved for use by the IBM Research community. In fact, Vela has become the company's "go-to environment" for researchers creating advanced AI capabilities since May 2022, including work on foundation models, it said.

IBM states that it chose this architecture because it gives the company greater flexibility to scale up as required, plus the ability to deploy similar infrastructure into any IBM Cloud datacenter around the globe.

But Vela is not running on just any old standard IBM Cloud node hardware; each node is a twin-socket system with 2nd Gen Xeon Scalable processors configured with 1.5TB of DRAM and four 3.2TB NVMe flash drives, plus eight 80GB Nvidia A100 GPUs, the latter linked by NVLink and NVSwitch.

This makes the Vela infrastructure closer to that of a high performance compute (HPC) site than typical cloud infrastructure, despite IBM's insistence that it was taking a different path because "traditional supercomputers weren't designed for AI."

It is also notable that IBM chose to use x86 processors rather than its own Power 10 chips, especially as these were touted by Big Blue as being ideally suited to memory-intensive workloads such as large-model AI inferencing.

The nodes are interconnected using multiple 100Gbps network interfaces arranged in a two-level Clos structure, which is designed so there are multiple paths for data, providing redundancy.
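To illustrate why a two-level Clos (leaf-spine) fabric provides that redundancy, here is a minimal sketch, not IBM's actual topology or node counts, showing that every pair of leaf switches is connected through every spine, so each spine contributes an independent path:

```python
from itertools import product

def build_clos(num_leaves: int, num_spines: int) -> set:
    """Edges of a two-level Clos fabric: every leaf links to every spine."""
    return {(f"leaf{l}", f"spine{s}")
            for l, s in product(range(num_leaves), range(num_spines))}

def disjoint_paths(edges: set, leaf_a: str, leaf_b: str) -> list:
    """Node-disjoint leaf -> spine -> leaf paths between two leaves."""
    spines_a = {s for l, s in edges if l == leaf_a}
    spines_b = {s for l, s in edges if l == leaf_b}
    return [(leaf_a, s, leaf_b) for s in sorted(spines_a & spines_b)]

# Hypothetical fabric: 4 leaves, 2 spines (illustrative numbers only).
edges = build_clos(num_leaves=4, num_spines=2)
paths = disjoint_paths(edges, "leaf0", "leaf3")
# Each spine yields an independent route, so traffic survives a spine failure.
```

The number of disjoint paths between any two leaves equals the number of spines, which is what lets traffic be rerouted around a failed switch or link.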

However, IBM explains in a blog post its reasons for opting for a cloud-native architecture, which center on cutting down the time required to build and deploy large-scale AI models as much as possible.

"Do we build our system on-premises, using the traditional supercomputing model, or do we build this system into the cloud, in essence building a supercomputer that is also a cloud?" the blog asks.

IBM claims that by adopting the latter approach, it has compromised somewhat on performance, but gained significantly on productivity. This comes down to the ability to configure all the necessary resources through software, as well as access to services available on the broader IBM Cloud, with the example of loading data sets into IBM's Cloud Object Store instead of having to build dedicated storage infrastructure.

Big Blue also said it opted to operate all the nodes in Vela as virtual machines rather than bare metal instances, as this made it simpler to provision and re-provision the infrastructure with the different software stacks required by different AI users.

"VMs would make it easy for our support team to flexibly scale AI clusters dynamically and shift resources between workloads of various kinds in a matter of minutes," IBM's blog explains.

But the company claims it found a way to optimize performance and reduce the virtualization overhead down to less than 5 percent, close to bare metal performance.

This included configuring the bare metal host for virtualization with support for Virtual Machine Extensions (VMX), single-root IO virtualization (SR-IOV), and huge pages, among other unspecified hardware and software configurations.
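As an illustrative sketch (not IBM's tooling), a Linux host's support for two of the features mentioned, VMX and huge pages, can be checked by parsing `/proc/cpuinfo` and `/proc/meminfo`; SR-IOV is usually verified per PCI device via sysfs (e.g. `sriov_totalvfs`), omitted here for brevity:

```python
def has_vmx(cpuinfo: str) -> bool:
    """True if any CPU 'flags' line advertises the vmx extension
    (Intel hardware virtualization support)."""
    return any("vmx" in line.split(":", 1)[1].split()
               for line in cpuinfo.splitlines()
               if line.startswith("flags"))

def hugepages_total(meminfo: str) -> int:
    """Number of pre-allocated huge pages reported by /proc/meminfo."""
    for line in meminfo.splitlines():
        if line.startswith("HugePages_Total:"):
            return int(line.split()[1])
    return 0

# Sample /proc contents for demonstration; on a real host you would read
# /proc/cpuinfo and /proc/meminfo directly.
sample_cpuinfo = "processor\t: 0\nflags\t\t: fpu vme vmx sse2\n"
sample_meminfo = "MemTotal: 1584000000 kB\nHugePages_Total:    1024\n"
```

Huge pages matter here because backing guest memory with large pages reduces TLB pressure, one of the common sources of virtualization overhead.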

Further details of the Vela infrastructure can be found on IBM's blog.

IBM is not the only company using the cloud to host an AI supercomputer. Last year, Microsoft unveiled its own platform using Azure infrastructure combined with Nvidia's GPU accelerators, networking kit, and its AI Enterprise software suite. This was expected to be available for Azure customers to access, but no timeframe was specified.

Other companies that have been building AI supercomputers, but following the traditional on-premises infrastructure route, include Meta and Tesla. ®
