Meta Platforms Inc. is well on the way to building what it says will be the world’s fastest artificial intelligence-focused supercomputer to tackle new, advanced workloads involving natural language processing and computer vision.

The company revealed today that the AI Research SuperCluster, or RSC, though not yet complete, is already up and running and being used to train large AI models with billions of parameters.

Meta has long been an ambitious player in AI research, and the launch of its new supercomputer makes clear why. The company sees AI playing a fundamental role in the advancement of the so-called metaverse, a virtual world where Meta believes people will increasingly come together to socialize, work and play.

“We hope RSC will help us build entirely new AI systems that can, for example, power real-time voice translations to large groups of people, each speaking a different language, so they can seamlessly collaborate on a research project or play an AR game together,” Meta AI researchers Kevin Lee and Shubho Sengupta wrote in a blog post. “Ultimately, the work done with RSC will pave the way toward building technologies for the next major computing platform — the metaverse, where AI-driven applications and products will play an important role.”

Meta’s researchers explained that the company has in recent times made big strides in the area of “self-supervised learning,” in which algorithms learn from vast numbers of unlabeled examples. It has also led advances in “transformers,” which make it possible for AI models to reason more effectively by focusing on relevant parts of their input.
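For context, the “focusing” that transformers do is the attention operation. The sketch below, in PyTorch, is purely illustrative and not Meta’s code; the function name and toy dimensions are our own:

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    # Scores say how strongly each position should attend to the others.
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    weights = F.softmax(scores, dim=-1)   # normalize into attention weights
    return weights @ v                    # weighted mix of the values

# Toy input: one sequence of 4 tokens with 8-dimensional embeddings.
x = torch.randn(1, 4, 8)
out = scaled_dot_product_attention(x, x, x)  # self-attention: q = k = v
print(out.shape)  # torch.Size([1, 4, 8])
```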

To realize the full benefits of self-supervised learning and transformer-based models, Meta concluded, it would need to train increasingly complex and adaptable AI models, which means crunching vastly more data. Developing more advanced computer vision models, for instance, requires processing larger and longer videos with higher data sampling rates.

Meanwhile, speech recognition needs to work in the most challenging scenarios with lots of background noise, and natural language processing must understand different languages, accents and dialects. So Meta decided it needed a much more powerful computer than what’s presently available.

The RSC is nothing if not powerful, comprising a total of 6,080 of Nvidia Corp.’s latest A100 graphics processing units. Those GPUs are combined into multiple compute nodes that are interconnected with Nvidia’s high-performance Quantum 200 gigabit-per-second InfiniBand networking fabric. RSC also offers 175 petabytes of storage made from Pure Storage Inc.’s FlashArrays, plus 46 petabytes of cache storage from Penguin Computing Inc.’s Altus systems.

“Early benchmarks on RSC, compared with Meta’s legacy production and research infrastructure, have shown that it runs computer vision workflows up to 20 times faster, runs the Nvidia Collective Communication Library (NCCL) more than nine times faster, and trains large scale NLP models three times faster,” Meta AI’s researchers said. “That means a model with tens of billions of parameters can finish training in three weeks, compared with nine weeks before.”
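NCCL, for reference, is the library that coordinates communication among GPUs during distributed training; its core operation is the all-reduce, in which every GPU contributes a tensor and receives the combined result. A minimal sketch using PyTorch’s NCCL backend follows; it is illustrative only, not Meta’s benchmark code, and assumes it is launched with something like `torchrun --nproc_per_node=8 allreduce_demo.py`:

```python
import os
import torch
import torch.distributed as dist

if __name__ == "__main__":
    # torchrun sets LOCAL_RANK plus the rendezvous environment variables.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Each GPU contributes a tensor of ones; all-reduce sums them so
    # every rank ends up holding the world size.
    t = torch.ones(1, device=f"cuda:{local_rank}")
    dist.all_reduce(t, op=dist.ReduceOp.SUM)
    print(f"rank {dist.get_rank()}: {t.item()}")

    dist.destroy_process_group()
```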

However, Meta is aiming to make RSC vastly more powerful, with plans to connect a total of 16,000 GPUs by the middle of this year.

Besides focusing on speed and power, RSC has also been built with security in mind. Meta’s ambitions in AI will require it to use a lot of “real-world data” from its own production systems, the company explained, so it must take care to safeguard that information.

“RSC has been designed from the ground up with privacy and security in mind, so that Meta’s researchers can safely train models using encrypted user-generated data that is not decrypted until right before training,” Lee and Sengupta wrote.

Those protections include ensuring RSC is isolated from the public internet, with no direct inbound or outbound connections. At the same time, the entire path from Meta’s storage systems to the GPUs is encrypted, with data only being decrypted right before it’s used, at the GPU endpoint, in memory.
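Meta hasn’t published RSC’s implementation, but the decrypt-only-at-the-point-of-use pattern it describes can be sketched with the Python cryptography package’s Fernet API. Everything below is a hypothetical illustration, not Meta’s pipeline:

```python
from cryptography.fernet import Fernet

key = Fernet.generate_key()  # in practice, held by a separate key service
f = Fernet(key)

# The form in which a record sits in storage and crosses the network.
encrypted_record = f.encrypt(b"training example bytes")

def training_batches(encrypted_records):
    for rec in encrypted_records:
        # Decrypt in memory, immediately before the data is consumed;
        # the plaintext is never written back to disk.
        yield f.decrypt(rec)

for plaintext in training_batches([encrypted_record]):
    print(plaintext)  # b'training example bytes'
```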

Holger Mueller of Constellation Research Inc. told SiliconANGLE the race to dominate the metaverse is in full swing and that AI will be a key part of it.

“It’s no surprise that one of the key players with big ambitions for the metaverse, Meta, is building its first AI supercomputer for research purposes,” Mueller said. “The metaverse is still at its inception and a lot of research needs to happen for it to really take off, so Meta is taking some key first steps towards doing that.”

Meta added that when the build-out is complete, the planned expansion from 6,080 to 16,000 GPUs should boost RSC’s overall AI training performance by more than 2.5 times its current level.

“We expect such a step function change in compute capability to enable us not only to create more accurate AI models for our existing services, but also to enable completely new user experiences, especially in the metaverse,” the company explained. “Our long-term investments in self-supervised learning and in building next-generation AI infrastructure with RSC are helping us create the foundational technologies that will power the metaverse and advance the broader AI community as well.”

Photo: Meta
