Elon Musk Surprised Jensen Huang by Installing 100,000 Nvidia GPUs in Just 19 Days. It Normally Takes Years

  • This kind of deployment would typically require three years of planning followed by an additional year for integration and testing.

  • However, the team at AI startup xAI accomplished the remarkable feat in only 19 days.

In early 2023, Tesla CEO Elon Musk entered the artificial intelligence race by founding his own AI company, xAI. His goal was to compete with established players like OpenAI, Microsoft, and Google. To do this, he needed a supercomputer capable of delivering high performance. After launching initial versions of Grok, a competitor to ChatGPT, xAI revealed what it claimed to be the “most powerful AI training cluster in the world” in July. Featuring 100,000 Nvidia GPUs, the cluster is located in Memphis, Tennessee.

Recent developments regarding this project have emerged from a conversation Nvidia CEO Jensen Huang had with the hosts of the BG2 podcast. Huang shared that the xAI team managed to move from the concept stage to full integration of the 100,000 processing units in the Memphis cluster in just 19 days. This milestone was achieved alongside their first training task, which Musk proudly announced on X.

Assembling a Data Center in 19 Days

In the interview, Huang explained that the process involved assembling the GPUs and equipping the facility with a liquid cooling system and a power supply to ensure the chips could operate effectively. “There’s only one person in the world who could do that,” he said. He added that much of the achievement was possible because his teams collaborated with Musk’s AI company’s “extraordinary” software, networking, and infrastructure teams.

Huang also provided some intriguing data that can help understand the dimensions of the work involved. According to his calculations, setting up a 100,000 GPU supercomputer typically takes about four years. Three years are spent on planning, while the final year is dedicated to installing and testing the equipment to ensure everything functions correctly. Establishing a dedicated data center to handle high workloads is a significant challenge, which includes fixing bugs and optimizing performance.

The Nvidia CEO also said that integrating 100,000 H100 GPUs had “never been done before” and would likely not be replicated by another company for some time.

Additionally, it’s important to note that the xAI cluster is an infrastructure that uses remote direct memory access technology. This technology allows for fast and efficient data transfers, resulting in improved performance. A crucial aspect of this setup is its scalability, meaning it can be expanded over time, presumably with additional GPUs.

Image | Maurizio Pesce | Steve Jurvetson

Related | Nvidia Already Rules the AI Hardware Market. With the Launch of Its Own Open-Source LLM, It’s Now Going After GPT-4 and Llama

See all comments on https://www.xatakaon.com

SEE 0 Comment

Cover of Xataka On