Facebook owner Meta is building the world’s largest AI supercomputer to power machine-learning research that, it claimed on Monday, will eventually bring the metaverse to life.
The new super – dubbed the AI Research SuperCluster, or RSC – will contain 16,000 Nvidia A100 GPUs and 4,000 AMD Epyc Rome 7742 processors. Each compute node is an Nvidia DGX A100 system containing eight GPUs and two Epyc microprocessors, making 2,000 nodes in all.
It’s expected to hit a peak performance of 5 exaFLOPS at mixed precision – FP16 and FP32 – and use a data-caching system that can feed in 16 terabytes per second of training information from 1EB of storage, we’re told.
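Those headline figures hang together arithmetically. Here's a back-of-the-envelope check – our sketch, not Meta's own accounting – assuming Nvidia's published dense FP16 Tensor Core rating of 312 TFLOPS per A100:

```python
# Sanity-check Meta's headline RSC numbers.
# Assumption: 312 TFLOPS is Nvidia's published dense FP16 Tensor Core
# rating for one A100; Meta's exact accounting may differ.
GPUS = 16_000
GPUS_PER_NODE = 8           # one DGX A100 holds eight A100 GPUs...
CPUS_PER_NODE = 2           # ...and two Epyc 7742 processors
A100_FP16_TFLOPS = 312      # dense FP16 Tensor Core throughput

nodes = GPUS // GPUS_PER_NODE
cpus = nodes * CPUS_PER_NODE
peak_eflops = GPUS * A100_FP16_TFLOPS / 1_000_000

print(f"nodes = {nodes}")                         # 2000
print(f"cpus  = {cpus}")                          # 4000
print(f"peak ≈ {peak_eflops:.2f} exaFLOPS")       # ≈ 4.99
```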
RSC is being built with the help of Penguin Computing, an HPC supplier based in California, which will provide the infrastructure and managed security.
“Meta has developed what we believe is the world’s fastest AI supercomputer,” CEO Mark Zuckerberg said in a statement to The Register.
“We’re calling it RSC for AI Research SuperCluster and it’ll be complete later this year. The experiences we’re building for the metaverse require enormous compute power (quintillions of operations / second) and RSC will enable new AI models that can learn from trillions of examples, understand hundreds of languages, and more.”
Nvidia confirmed the cluster is expected to be the largest customer installation of DGX A100 systems once it’s fully built and up and running by mid-2022. “RSC took just 18 months to go from an idea on paper to a working AI supercomputer,” Nvidia said.
At the moment, however, RSC is less flashy, delivering 1,895 PFLOPS of TF32 performance. It’s currently made up of 760 Nvidia DGX A100 systems containing a total of 1,520 AMD Epyc 7742 processors and 6,080 GPUs. Each GPU is connected via Nvidia’s Quantum InfiniBand fabric, which is capable of shuttling data back and forth at 200 gigabits per second.
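That TF32 figure lines up with Nvidia's spec sheet if structured sparsity is counted: the A100 is rated at 156 TFLOPS for dense TF32, or 312 TFLOPS with sparsity. A rough check – again our arithmetic, not Meta's:

```python
# Rough check of the 1,895 PFLOPS TF32 figure for the current build.
# Assumption: the per-GPU number is Nvidia's TF32 Tensor Core rating
# *with* structured sparsity (dense TF32 is 156 TFLOPS).
GPUS = 6_080
A100_TF32_SPARSE_TFLOPS = 312

pflops = GPUS * A100_TF32_SPARSE_TFLOPS / 1_000
print(f"TF32 peak ≈ {pflops:,.0f} PFLOPS")  # ≈ 1,897, close to the quoted 1,895
```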
The RSC can also store 175 petabytes on Pure Storage FlashArray hardware, 46 petabytes of cache storage, and 10 petabytes on Pure’s FlashBlade object storage equipment. This capacity is expected to grow, using more Pure products.
RSC is estimated to be 9X faster than Meta’s previous research cluster, which was made up of 22,000 of Nvidia’s older-generation V100 GPUs, and 20X faster than the systems Meta currently uses to run AI models in production. The older research cluster could run up to 35,000 training jobs a day.
“The benefit of this new design is that we are able to scale to many GPUs without performance drops,” a Meta spokesperson told The Register. “We expect to have a smaller number of training jobs running than our previous AI research infrastructure but each job would train larger models to fully utilize the design.”
Meta is focused on building self-supervised learning and transformer-based models. These architectures are straightforward to scale up and are getting increasingly complex; they can process multiple types of data, such as audio, text, and images, with a single model. RSC was designed with training models of more than a trillion parameters in mind, on datasets as large as an exabyte – equivalent, we’re told, to streaming an HD video for about 36,000 years.
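The video comparison checks out if you assume a typical HD stream of roughly 7 Mbit/s – a plausible 1080p bitrate, though the exact rate behind Meta's figure isn't stated:

```python
# Unpacking the "36,000 years of HD video" comparison.
# Assumption: an HD stream runs at roughly 7 Mbit/s; the exact
# bitrate Meta assumed isn't stated.
EXABYTE_BITS = 1e18 * 8           # one exabyte, in bits
HD_BITRATE = 7e6                  # bits per second
SECONDS_PER_YEAR = 365.25 * 24 * 3600

years = EXABYTE_BITS / HD_BITRATE / SECONDS_PER_YEAR
print(f"≈ {years:,.0f} years")    # ≈ 36,200
```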
“We hope RSC will help us build entirely new AI systems that can, for example, power real-time voice translations to large groups of people, each speaking a different language, so they can seamlessly collaborate on a research project or play an AR game together,” Meta’s Kevin Lee, technical program manager, and Shubho Sengupta, software engineer, explained in a blog post.
“Ultimately, the work done with RSC will pave the way toward building technologies for the next major computing platform — the metaverse, where AI-driven applications and products will play an important role.” ®