'A virtual DPU within a GPU': Could clever hardware hack be behind DeepSeek's groundbreaking AI efficiency?

A new approach called DualPipe appears to be the key to DeepSeek’s success
One expert describes it as an on-GPU virtual DPU that maximizes bandwidth efficiency
While DeepSeek has used Nvidia GPUs only, one wonders how AMD’s Instinct would fare

China’s DeepSeek AI chatbot has stunned the tech industry, representing a credible alternative to OpenAI’s ChatGPT at a fraction of the cost.

A recent paper revealed DeepSeek V3 was trained on a cluster of 2,048 Nvidia H800 GPUs, crippled versions of the H100 (we can only imagine how much more powerful it would be running on AMD Instinct accelerators!). It reportedly required 2.79 million GPU-hours for pretraining, fine-tuning on 14.8 trillion tokens, and cost, according to calculations made by The Next Platform, a mere $5.58 million.
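The reported figures can be sanity-checked with a little arithmetic. The sketch below is illustrative only: the GPU-hour count and cluster size come from the text above, while the $2 per GPU-hour rental rate is an assumption, chosen because it reproduces the $5.58 million figure. The wall-clock estimate further assumes that every GPU was busy for the entire run.

```python
# Back-of-the-envelope check of the reported training cost.
GPU_HOURS = 2.79e6            # reported above
NUM_GPUS = 2048               # reported cluster size
RATE_USD_PER_GPU_HOUR = 2.0   # ASSUMPTION: rental price, not stated in this article


def training_cost_usd(gpu_hours, rate_usd_per_gpu_hour):
    """Total cost if every GPU-hour is billed at a flat rate."""
    return gpu_hours * rate_usd_per_gpu_hour


def wall_clock_days(gpu_hours, num_gpus):
    """Elapsed days, ASSUMING all GPUs run in parallel for the whole job."""
    return gpu_hours / num_gpus / 24


cost = training_cost_usd(GPU_HOURS, RATE_USD_PER_GPU_HOUR)
days = wall_clock_days(GPU_HOURS, NUM_GPUS)

print(f"Estimated cost: ${cost / 1e6:.2f} million")
print(f"Estimated wall-clock time: about {days:.0f} days")
```

Under those assumptions the run works out to a little under two months on the full cluster.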

But exactly how DeepSeek’s developers managed this feat is likely down to a clever hack.

A virtual DPU on the GPU itself

First, some background. DeepSeek is an advanced Mixture-of-Experts (MoE) language model designed to optimize performance by selectively activating only the most relevant parts of its architecture for each task. The third version of the model, DeepSeek-V3, features a total of 671 billion parameters, with only 37 billion activated for any given token prediction. This selective activation massively reduces computational costs while maintaining high performance and accuracy, which you’ll see if you try it.
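To make selective activation concrete, here is a minimal sketch of generic top-k expert routing in plain Python. It is not DeepSeek’s gating code: the eight toy experts, the softmax gate and the choice of k=2 are all assumptions made for illustration. The point is simply that only the chosen experts run for a given input, in the same way that only about 37 of 671 billion parameters are active per token.

```python
import math
import random


def softmax(scores):
    """Convert raw gate scores into probabilities."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(gate_scores, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    Weights are renormalized so that they sum to 1 over the selected experts.
    """
    probs = softmax(gate_scores)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_forward(x, experts, gate_scores, k):
    """Run only the selected experts and mix their outputs by gate weight."""
    return sum(w * experts[i](x) for i, w in route_top_k(gate_scores, k))


# Hypothetical toy setup: 8 "experts", each of which just scales its input.
experts = [lambda x, s=s: s * x for s in range(1, 9)]
random.seed(0)
gate_scores = [random.gauss(0, 1) for _ in experts]

selected = route_top_k(gate_scores, k=2)
output = moe_forward(3.0, experts, gate_scores, k=2)

print("experts used:", [i for i, _ in selected], "of", len(experts))
print("active parameter share in DeepSeek-V3:", round(37 / 671, 3))
```

The other six experts are never called, which is where the compute saving comes from.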

It’s easy to be skeptical of DeepSeek and the claims made regarding its training, but the paper reveals some of the magic the developers came up with to make the most of the crippled hardware they had to work with. This includes the creation of the DualPipe algorithm for efficient pipeline parallelism.

According to the information published by DeepSeek, DualPipe overlaps forward and backward computation, reduces latency, and optimizes data movement across GPUs. By efficiently managing communication, it minimizes idle time (pipeline bubbles) and dynamically balances GPU compute cores (Streaming Multiprocessors) between computation and communication, preventing data transfer bottlenecks as the model scales.
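For a sense of what a pipeline bubble costs, the sketch below uses the textbook idle-time estimate for a simple synchronous pipeline schedule. This is a baseline for comparison, not DualPipe’s own schedule, and it assumes every stage takes the same time per micro-batch. The stage and micro-batch counts passed in are illustrative inputs.

```python
def bubble_fraction(num_stages, num_micro_batches):
    """Share of time each stage sits idle in a simple synchronous pipeline.

    ASSUMPTION: all stages take equal time per micro-batch, so the pipeline
    needs (num_micro_batches + num_stages - 1) time slots in total, of which
    (num_stages - 1) are spent filling and draining.
    """
    if num_stages < 1 or num_micro_batches < 1:
        raise ValueError("need at least one stage and one micro-batch")
    return (num_stages - 1) / (num_micro_batches + num_stages - 1)


# Illustrative inputs: 8 pipeline stages with a growing number of micro-batches.
for micro_batches in (8, 20, 80):
    idle = bubble_fraction(8, micro_batches)
    print(f"{micro_batches:>3} micro-batches: {idle:.1%} idle")
```

More micro-batches shrink the bubble but never remove it in this baseline, which is why schedules that fill the idle slots with useful work are attractive.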

A commenter on The Next Platform describes DualPipe as “essentially creating a virtual DPU on the GPU itself to handle all-to-all communication,” which highlights its role in optimizing data transfer efficiency.

The paper goes into further detail, “In order to ensure sufficient computational performance for DualPipe, we customize efficient cross-node all-to-all communication kernels (including dispatching and combining) to conserve the number of SMs dedicated to communication. The implementation of the kernels is co-designed with the MoE gating algorithm and the network topology of our cluster. To be specific, in our cluster, cross-node GPUs are fully interconnected with IB, and intra-node communications are handled via NVLink.”
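Why reserve a few SMs for communication at all? A toy timing model gives a rough picture. Everything in it is hypothetical: the SM counts and millisecond figures are placeholders rather than measurements, compute is assumed to slow down in proportion to the SMs it gives up, and communication time is assumed not to depend on how many SMs drive it.

```python
def step_time_serial(compute_ms, comm_ms):
    """Communication waits for computation to finish: the times add up."""
    return compute_ms + comm_ms


def step_time_overlapped(compute_ms, comm_ms, total_sms, comm_sms):
    """Communication runs on reserved SMs while the remaining SMs compute.

    ASSUMPTION (toy model): compute time scales inversely with the number of
    SMs left for it, and communication is fully hidden when it is shorter.
    """
    compute_sms = total_sms - comm_sms
    if compute_sms <= 0:
        raise ValueError("at least one SM must be left for computation")
    slowed_compute_ms = compute_ms * total_sms / compute_sms
    return max(slowed_compute_ms, comm_ms)


# Hypothetical numbers, for illustration only.
TOTAL_SMS = 132
COMM_SMS = 20
COMPUTE_MS = 10.0
COMM_MS = 8.0

serial = step_time_serial(COMPUTE_MS, COMM_MS)
overlapped = step_time_overlapped(COMPUTE_MS, COMM_MS, TOTAL_SMS, COMM_SMS)

print(f"serial:     {serial:.2f} ms per step")
print(f"overlapped: {overlapped:.2f} ms per step")
```

In this model, giving up some compute capacity is worthwhile as long as the slowdown costs less than the communication time it hides, which is one way to read the paper’s emphasis on conserving the number of SMs dedicated to communication.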

Example DualPipe scheduling for 8 PP ranks and 20 micro-batches in two directions. The micro-batches in the reverse direction are symmetric to those in the forward direction, so we omit their batch ID for illustration simplicity. Two cells enclosed by a shared black border have mutually overlapped computation and communication. (Image credit: DeepSeek)
