This summer, AI chip startup Groq raised $750 million at a valuation of $6.9 billion. Just three months later, Nvidia celebrated the holidays by dropping nearly three times that to license its technology and squirrel away its talent.
In the days that followed, the armchair AI gurus of the web speculated wildly as to how Nvidia could justify spending $20 billion to get Groq's tech and people.
Pundits believe Nvidia knows something we don't. Theories run the gamut from the deal signaling that Nvidia intends to ditch HBM for SRAM, to a play to secure more foundry capacity from Samsung, to an attempt to quash a potential competitor. Some hold water better than others, and we certainly have a few of our own.
What we know so far
Nvidia paid $20 billion to non-exclusively license Groq's intellectual property, which includes its language processing units (LPUs) and the accompanying software libraries.
Groq's LPUs form the foundation of its high-performance inference-as-a-service offering, which it will keep and continue to operate without interruption after the deal closes.
The arrangement is clearly engineered to avoid regulatory scrutiny. Nvidia isn't buying Groq, it's licensing its tech. Except… it's totally buying Groq.
How else to describe a deal that sees Groq's CEO Jonathan Ross and president Sunny Madra move to Nvidia, along with most of its engineering talent?
Sure, Groq is technically sticking around as an independent company with Simon Edwards at the helm as its new CEO, but with so much of its talent gone, it's hard to see how the chip startup survives long-term.
The argument that Nvidia just wiped a competitor off the board therefore holds up. Whether the move was worth $20 billion is another matter, given it could provoke an antitrust lawsuit.
It must be for the SRAM, right?
One prominent theory about Nvidia's motives is that Groq's LPUs use static random access memory (SRAM), which is orders of magnitude faster than the high-bandwidth memory (HBM) found in GPUs today.
A single HBM3e stack can achieve about 1 TB/s of memory bandwidth per module, and 8 TB/s per GPU today. The SRAM in Groq's LPUs can be 10 to 80 times faster.
Since large language model (LLM) inference is predominantly bound by memory bandwidth, Groq can achieve stupendously fast token generation rates. On Llama 3.3 70B, the benchmarkers at Artificial Analysis report that Groq's chips can churn out 350 tok/s. Performance is even better when running mixture-of-experts models like gpt-oss 120B, where the chips managed 465 tok/s.
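To see why bandwidth is the lever here, a back-of-envelope estimate helps. The sketch below uses the common rule of thumb that single-stream decode speed is capped by how fast a chip can stream the model's weights past its compute units once per generated token; the bandwidth figures plugged in are purely illustrative, not Groq's or Nvidia's actual numbers.

```python
# Back-of-envelope decode-speed estimate: a minimal sketch, assuming decode
# is bounded by streaming the model weights once per generated token.
# Numbers below are illustrative, not benchmark figures.

def max_tokens_per_second(mem_bandwidth_tbs: float,
                          active_params_billions: float,
                          bytes_per_param: float = 2.0) -> float:
    """Upper bound on tokens/s for one request, ignoring KV-cache traffic."""
    bytes_per_token = active_params_billions * 1e9 * bytes_per_param
    return (mem_bandwidth_tbs * 1e12) / bytes_per_token

# A GPU with 8 TB/s of HBM running a dense 70B model at 16-bit precision:
print(f"HBM GPU, 70B dense: ~{max_tokens_per_second(8, 70):.0f} tok/s ceiling")

# The same model with ten times the aggregate memory bandwidth (hypothetical
# figure) shows why SRAM-fed designs can post much higher generation rates:
print(f"SRAM fabric, 70B:   ~{max_tokens_per_second(80, 70):.0f} tok/s ceiling")
```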
We're also in the middle of a global memory shortage, and demand for HBM has never been higher. So we understand why some might look at this deal and assume Groq could help Nvidia weather the looming memory crunch.
The simplest answer is usually the right one – just not this time.
Sorry to have to tell you this, but there's nothing special about SRAM. It's in basically every modern processor, including Nvidia's chips.
SRAM also has a pretty glaring downside: it isn't exactly what you'd call area efficient. We're talking, at most, a few hundred megabytes per chip, compared to 36 GB for a 12-high HBM3e stack and a total of 288 GB per GPU.
Groq's LPUs have just 230 MB of SRAM each, which means you need hundreds or even thousands of them just to run a modest LLM. At 16-bit precision, you'd need 140 GB of memory to hold the model weights and an additional 40 GB for every 128,000-token sequence.
Groq needed 574 LPUs stitched together with a high-speed interconnect fabric to run Llama 70B.
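The capacity math is easy to check for yourself. The sketch below is a naive count-the-bytes estimate using the figures above; it ignores how Groq's compiler actually partitions a model across chips, so it won't line up exactly with the 574-LPU deployment.

```python
# Minimal capacity arithmetic, assuming the figures quoted above: 230 MB of
# SRAM per LPU, 140 GB of 16-bit weights for a 70B-parameter model, plus
# roughly 40 GB of KV cache per 128K-token sequence.

SRAM_PER_LPU_GB = 0.230
WEIGHTS_GB = 70e9 * 2 / 1e9      # 70B params * 2 bytes = 140 GB
KV_CACHE_GB = 40                 # per 128K-token sequence, per the article

lpus_for_weights = WEIGHTS_GB / SRAM_PER_LPU_GB
lpus_with_one_sequence = (WEIGHTS_GB + KV_CACHE_GB) / SRAM_PER_LPU_GB

print(f"LPUs just to hold the weights:        ~{lpus_for_weights:.0f}")
print(f"LPUs for weights + one 128K context:  ~{lpus_with_one_sequence:.0f}")
```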
You can get around this by building a bigger chip – each of Cerebras' WSE-3 wafers features more than 40 GB of SRAM on board – but those chips are the size of a dinner plate and consume 23 kilowatts. In any case, Groq hasn't gone that route.
Suffice it to say, if Nvidia wanted to make a chip that uses SRAM instead of HBM, it didn't need to buy Groq to do it.
Going with the data flow
So, what did Nvidia throw money at Groq for?
Our best guess is that it was really for Groq's "assembly line architecture." That's essentially a programmable dataflow design built with the express purpose of accelerating the linear algebra calculations performed during inference.
Most processors today use a von Neumann architecture: instructions are fetched from memory, decoded, executed, and the results are then written to a register or stored back in memory. Modern implementations add things like branch prediction, but the principles are largely the same.
Dataflow works on a different principle. Rather than a barrage of load-store operations, dataflow architectures essentially process data as it's streamed through the chip.
As Groq explains it, these data conveyor belts "move instructions and data between the chip's SIMD (single instruction/multiple data) function units."
"At each step of the assembly process, the function unit receives instructions via the conveyor belt. The instructions tell the function unit where it should go to get the input data (which conveyor belt), which function it should perform with that data, and where it should place the output data."
According to Groq, this architecture effectively eliminates the bottlenecks that bog down GPUs, since the LPU is never waiting for memory or compute to catch up.
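For the flavor of the idea, here's a toy sketch – emphatically not Groq's actual design – in which operands stream down "conveyor belts" (plain Python generators) and each function unit consumes its inputs as they arrive, with nothing staged in a shared memory between stages. All names are invented purely for illustration.

```python
from typing import Iterable, Iterator

def multiply_unit(weights: Iterable[float],
                  activations: Iterable[float]) -> Iterator[float]:
    """Function unit: multiplies whatever the two input belts deliver."""
    for w, a in zip(weights, activations):
        yield w * a

def accumulate_unit(products: Iterable[float]) -> float:
    """Downstream unit: sums the product stream into a dot product."""
    total = 0.0
    for p in products:
        total += p
    return total

# Data flows straight through the two units; nothing is loaded from or stored
# to memory between them, which is what the assembly-line analogy is getting at.
weights = iter([0.5, -1.0, 2.0, 0.25])
activations = iter([4.0, 3.0, 1.0, 8.0])
print(accumulate_unit(multiply_unit(weights, activations)))  # 2 - 3 + 2 + 2 = 3.0
```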
Groq can make this happen both within an LPU and between them, which is good news, as Groq's LPUs aren't that potent on their own. On paper, their BF16 performance is roughly on par with an RTX 3090, and their INT8 performance with an L40S. But remember, that's peak FLOPS under ideal conditions. In theory, dataflow architectures should be able to achieve better real-world performance for the same amount of power.
It's worth pointing out that dataflow architectures aren't limited to SRAM-centric designs. NextSilicon's dataflow architecture, for example, uses HBM. Groq opted for an SRAM-only design because it kept things simple, but there's no reason Nvidia couldn't build a dataflow accelerator based on Groq's IP using SRAM, HBM, or GDDR.
So, if dataflow is so much better, why isn't it more common? Because it's a royal pain to get right. But Groq has managed to make it work, at least for inference.
And, as Ai2's Tim Dettmers recently put it, chipmakers like Nvidia are quickly running out of levers they can pull to juice chip performance. Dataflow gives Nvidia a new avenue to exploit as it chases further speed, and the deal with Groq means Jensen Huang's company is in a better position to commercialize it.
An inference-optimized compute stack?
Groq also gives Nvidia an inference-optimized compute architecture, something it has been sorely lacking. Where it fits, though, is a bit of a mystery.
Most of Nvidia's "inference-optimized" chips, like the H200 or B300, aren't fundamentally different from their "mainstream" siblings. In fact, the only difference between the H100 and H200 was that the latter used faster, higher-capacity HBM3e, which just happens to benefit inference-heavy workloads.
As a reminder, LLM inference can be broken into two stages: the compute-heavy prefill stage, during which the prompt is processed, and the memory-bandwidth-intensive decode phase, during which the model generates output tokens.
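The reason the two stages stress different resources comes down to arithmetic intensity. The sketch below is a rough illustration under the simplifying assumption that the weights are read from memory once per forward pass: prefill amortizes that read over every prompt token, while decode pays it anew for each generated token.

```python
# Rough arithmetic-intensity comparison for a dense 70B model at 16-bit
# precision. Figures are illustrative only.

PARAMS = 70e9
BYTES_PER_PARAM = 2
FLOPS_PER_TOKEN = 2 * PARAMS            # ~2 FLOPs per parameter per token

def arithmetic_intensity(tokens_per_pass: int) -> float:
    """FLOPs performed per byte of weight traffic in one forward pass."""
    flops = FLOPS_PER_TOKEN * tokens_per_pass
    bytes_moved = PARAMS * BYTES_PER_PARAM
    return flops / bytes_moved

print(f"Prefill, 2,048-token prompt: {arithmetic_intensity(2048):,.0f} FLOPs/byte")
print(f"Decode, one token at a time: {arithmetic_intensity(1):,.0f} FLOPs/byte")
```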
That's changing with Nvidia's Rubin generation of chips in 2026. Announced back in September, the Rubin CPX is designed specifically to accelerate the compute-intensive prefill phase of the inference pipeline, freeing up its HBM-packed Vera Rubin superchips to handle decode.
This disaggregated architecture minimizes resource contention and helps improve utilization and throughput.
Groq's LPUs are optimized for inference by design, but they don't have enough SRAM to make a good decode accelerator. They could, however, be interesting as a speculative decoding part.
If you're not familiar, speculative decoding is a technique that uses a small "draft" model to predict the output of a larger one. When those predictions are correct, system performance can double or triple, driving down the cost per token.
These speculative draft models tend to be quite small, often weighing in at a few billion parameters at most, making Groq's current chip designs plausible for the role.
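In outline, the accept-or-reject loop looks something like the sketch below – a toy version, not any production implementation. The draft_model and target_model callables are placeholders for a small, fast model and the large model being served; real systems verify all of the draft tokens in a single batched forward pass of the big model.

```python
from typing import Callable, List

def speculative_step(prompt: List[int],
                     draft_model: Callable[[List[int]], List[int]],
                     target_model: Callable[[List[int], List[int]], List[int]],
                     k: int = 4) -> List[int]:
    """Draft k tokens cheaply, keep only the prefix the target model agrees with."""
    draft_tokens = draft_model(prompt)[:k]          # cheap guesses
    verified = target_model(prompt, draft_tokens)   # what the big model would emit
    accepted: List[int] = []
    for guess, truth in zip(draft_tokens, verified):
        if guess != truth:
            accepted.append(truth)   # first mismatch: take the target's token, stop
            break
        accepted.append(guess)       # match: the draft token comes essentially free
    return prompt + accepted

# Dummy stand-ins: the draft model guesses right three times, then diverges.
draft = lambda p: [1, 2, 3, 9]
target = lambda p, d: [1, 2, 3, 4]
print(speculative_step([0], draft, target))   # -> [0, 1, 2, 3, 4]
```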
Do we need a dedicated accelerator for speculative decoding? Sure, why not. Is it worth $20 billion? That depends on how you measure it. Compared with publicly traded companies whose entire valuation is around $20 billion, like HP Inc. or Figma, it might seem steep. But for Nvidia, $20 billion is a relatively affordable sum – it recorded $23 billion in cash flow from operations last quarter alone. And in the end, it means more chips and gear for Nvidia to sell.
What about foundry diversification?
Perhaps the least likely take we've seen is the suggestion that Groq somehow opens up more foundry capacity for Nvidia.
Groq currently uses GlobalFoundries to make its chips, and plans to build its next-gen parts on Samsung's 4 nm process tech. Nvidia, by comparison, does nearly all of its fabrication at TSMC and is heavily reliant on the Taiwanese giant's advanced packaging tech.
The problem with this theory is that it doesn't actually make any sense. It's not as if Nvidia can't go to Samsung to fab its chips. In fact, Nvidia has fabbed chips at Samsung before – the Korean giant made most of Nvidia's Ampere-generation products. Nvidia needed TSMC's advanced packaging tech for parts like the A100, but it doesn't need the Taiwanese company to make Rubin CPX. Samsung or Intel could probably do the job.
All of this takes time, though, and licensing Groq's IP and hiring its staff doesn't change that.
The reality is Nvidia may not do anything with Groq's current generation of LPUs. Jensen may just be playing the long game, as he's been known to do. ®