In early 2023, Tesla CEO Elon Musk entered the artificial intelligence race by founding his own AI company, xAI, to compete with established players such as OpenAI, Microsoft, and Google. Doing so required a supercomputer capable of delivering enormous training compute. After launching initial versions of Grok, a competitor to ChatGPT, xAI unveiled in July what it claimed was the “most powerful AI training cluster in the world”: a facility in Memphis, Tennessee, housing 100,000 Nvidia GPUs.
Recent developments regarding the project emerged from a conversation Nvidia CEO Jensen Huang had with the hosts of the BG2 podcast. Huang said the xAI team went from concept to full integration of the 100,000 processing units in the Memphis cluster in just 19 days, culminating in the cluster’s first training run, which Musk proudly announced on X.
Assembling a Data Center in 19 Days
In the interview, Huang explained that the work involved assembling the GPUs and equipping the facility with liquid cooling and the power infrastructure needed to keep the chips running effectively. “There’s only one person in the world who could do that,” he said, adding that much of the achievement was possible because his teams collaborated with xAI’s “extraordinary” software, networking, and infrastructure teams.
Huang also offered some figures that help convey the scale of the work. By his estimate, setting up a 100,000-GPU supercomputer typically takes about four years: three for planning and the final year for installing and testing the equipment to ensure everything functions correctly. Standing up a dedicated data center for workloads of this intensity is a significant challenge in its own right, one that includes debugging the system and optimizing its performance.
The Nvidia CEO also said that integrating 100,000 H100 GPUs had “never been done before” and would likely not be replicated by another company for some time.
It’s also worth noting that the xAI cluster is built on remote direct memory access (RDMA) technology, which lets servers read and write one another’s memory over the network without going through the host CPU, enabling fast and efficient data transfers between GPUs. A crucial aspect of this setup is its scalability: it can be expanded over time, presumably with additional GPUs.
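To make that concrete, here is a minimal sketch of the general pattern, not xAI’s actual code: a distributed PyTorch job using the NCCL backend, the collective-communication layer that typically rides on RDMA-capable interconnects in clusters like this. The script name and tensor size are hypothetical.

```python
import os

import torch
import torch.distributed as dist


def main():
    # NCCL is the standard backend for multi-GPU training. On an
    # RDMA-capable fabric (e.g., InfiniBand), it moves tensors between
    # GPUs across nodes without staging them through host CPUs, which
    # is what makes these transfers fast and efficient at scale.
    dist.init_process_group(backend="nccl")

    # The launcher (torchrun) assigns each process a GPU on its node.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Each process holds its own gradients; all_reduce sums them across
    # every GPU in the job in a single collective call, then we average.
    grads = torch.randn(1024, device="cuda")  # hypothetical size
    dist.all_reduce(grads, op=dist.ReduceOp.SUM)
    grads /= dist.get_world_size()

    if dist.get_rank() == 0:
        print(f"averaged gradients across {dist.get_world_size()} GPUs")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Launched with, say, `torchrun --nnodes=2 --nproc_per_node=8 allreduce_sketch.py`, every process participates in one collective exchange; on an RDMA fabric, those transfers bypass the host CPUs entirely, which is the property the paragraph above describes.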
Image | Maurizio Pesce | Steve Jurvetson