AMD closed the performance gap with Nvidia's Blackwell accelerators with the launch of the MI355X this spring. Now the company just needs to overcome Nvidia's CUDA software advantage and make that performance more accessible to developers.
The release of AMD's ROCm 7.0 software platform this week is a step in that direction, promising major improvements in inference and training performance that benefit not only its latest chips but its older MI300-series parts as well. The so-called CUDA moat could be getting narrower.
ROCm, in case you're not familiar, is a suite of software libraries and development tools, including the HIP framework, that gives developers a low-level programming interface for running high-performance computing (HPC) and AI workloads on GPUs. The software stack is reminiscent in many ways of the CUDA runtime, but for AMD GPUs rather than Nvidia's.
Since the launch of the MI300X, its first truly AI-optimized graphics accelerator, back in 2023, AMD has extended support for new datatypes, improved compatibility with popular runtimes and frameworks, and introduced hardware-specific optimizations through its ROCm runtime.
ROCm 7 is arguably AMD's biggest update yet. Compared to ROCm 6, AMD says customers can expect a roughly 3.5x uplift in inference performance on the MI300X. Meanwhile, the company says it has managed to boost the effective floating-point performance achieved in model training by 3x.
AMD claims these software improvements combined give its latest and greatest GPU, the MI355X, a 1.3x edge over Nvidia's B200 in inference workloads when running DeepSeek R1 in SGLang. As usual, you should take all vendor performance claims with a grain of salt.
While the MI350X and MI355X are roughly on par with the B200 in terms of floating-point performance, achieving 9.2 and 10 petaFLOPS of dense FP4 to Nvidia's 9 petaFLOPS, the AMD parts boast 108 GB more HBM3e.
The MI355X's main competitor is actually Nvidia's B300, which packs 288 GB of HBM3e and manages 14 petaFLOPS of dense FP4 performance, which on paper could give it an edge in inference workloads.
Speaking of FP4 support, the MI350 series is AMD's first generation of GPUs to offer hardware acceleration for OCP's microscaling datatypes, which we looked at in more detail around OpenAI's gpt-oss launch last month.
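To make the microscaling idea concrete, here is a minimal sketch of MXFP4-style block quantization: each block of 32 values shares a single power-of-two scale, and each element is stored as one of the 16 FP4 (E2M1) code values. This illustrates the principle only; `quantize_block`, the scale selection, and the lack of packing are simplifying assumptions, not AMD's or OCP's actual implementation.

```python
import numpy as np

# The 16 representable E2M1 (FP4) values: +/- {0, 0.5, 1, 1.5, 2, 3, 4, 6}
FP4_POS = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
FP4_GRID = np.concatenate([-FP4_POS[::-1], FP4_POS])

def quantize_block(x):
    """Quantize a block of 32 floats to FP4 with one shared power-of-two scale."""
    # Pick a power-of-two scale so the block's largest magnitude maps near
    # FP4's maximum representable value (6.0); 1e-30 avoids log2(0).
    scale = 2.0 ** np.floor(np.log2(np.abs(x).max() / 6.0 + 1e-30))
    # Round each scaled element to the nearest FP4 grid point.
    idx = np.abs(x[:, None] / scale - FP4_GRID[None, :]).argmin(axis=1)
    return FP4_GRID[idx] * scale  # dequantized values

x = np.random.randn(32).astype(np.float32)
print(np.abs(x - quantize_block(x)).max())  # per-block quantization error
```

Because only one small scale is stored per 32 elements, the overhead on top of the 4-bit payloads is tiny, which is what makes the format attractive for weights.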
These smaller formats have major implications for inference and training performance, boosting throughput and cutting memory requirements by a factor of two to four. ROCm 7.0.0 extends broader support for these low-precision datatypes, with AMD saying its Quark quantization framework is now production-ready.
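The memory arithmetic behind that two-to-four-fold claim is straightforward. As a back-of-the-envelope sketch (weights only, for a hypothetical 70-billion-parameter model; real deployments also need room for scale factors, KV cache, and activations):

```python
# Approximate weight footprint at different precisions for an
# illustrative 70B-parameter model (weights only).
params = 70e9
bytes_per_param = {"FP16": 2.0, "FP8": 1.0, "FP4": 0.5}
for fmt, nbytes in bytes_per_param.items():
    print(f"{fmt}: {params * nbytes / 1e9:.0f} GB")
# FP16: 140 GB, FP8: 70 GB, FP4: 35 GB
```

Going from FP16 to FP4 quarters the weight footprint, 140 GB down to 35 GB in this example, which is where the headline savings come from.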
That's a big improvement over the situation with FP8 support, which trailed the MI300's release by the better part of a year.
Alongside the datatypes, ROCm 7.0.0 also introduces AMD's AI Tensor Engine, or AITER for short, which features specialized operators tuned for maximum GenAI performance.
For inference, AMD says AITER can boost MLA decode operations by 17x and MHA prefill ops by 14x. When applied to models like DeepSeek R1, the GPU slinger says AITER can boost throughput by more than 2x.
More importantly, AITER and the MXFP4 datatype have already been merged into popular inference serving engines like vLLM and SGLang. AMD tells us that enabling the feature is as simple as installing the dependencies and setting the appropriate environment variables.
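In practice that looks something like the sketch below. The `VLLM_ROCM_USE_AITER` variable appears in recent vLLM releases with ROCm support, but treat the exact flag names here as assumptions and check the ROCm 7 and vLLM release notes for your versions.

```shell
# Hedged sketch: toggling AITER kernels for vLLM on ROCm.
# VLLM_ROCM_USE_AITER routes supported ops through AITER; the
# authoritative flag names live in your vLLM version's docs.
export VLLM_ROCM_USE_AITER=1
# Then launch the server as usual, e.g.:
#   vllm serve deepseek-ai/DeepSeek-R1
echo "AITER enabled: $VLLM_ROCM_USE_AITER"
```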
Other improvements include support for the latest Ubuntu 24.04.3 LTS release as well as Rocky Linux 9, and KVM passthrough for those who want to add GPU acceleration to virtual machines.
ROCm 7 also adds native support for PyTorch 2.7 and 2.9, TensorFlow 2.19.1, and JAX 0.6.
Finally, for those deploying large quantities of Instinct accelerators in production, AMD is rolling out a pair of new dashboards designed to make managing large clusters of GPUs easier. AMD's Resource Manager provides detailed telemetry on the performance and utilization of the cluster, along with access controls and the ability to set project quotas so that one team doesn't end up hogging all the compute.
Alongside the Resource Manager, AMD is also rolling out an AI Workbench designed to streamline the process of training or fine-tuning popular foundation models.
ROCm 7.0 is available to download from AMD's support site, as well as in pre-baked container images on Docker Hub. ®