Nvidia Stumbles Upon a New, Unexpected Issue: Blackwell AI Chips Are Reportedly Overheating

  • In August, a design flaw resulted in a lower-than-expected initial manufacturing yield for the B200 GPUs.

  • Now, another problem has popped up, with some customers reporting that Nvidia’s AI chip is overheating.

The B200 GPUs are proving troublesome for Nvidia. When the company unveiled the AI chip in March 2024, it was evident that it had a powerful product on its hands. After all, its specifications are impressive: 208 billion transistors, the state-of-the-art Blackwell architecture, and a maximum performance of 20 petaflops in FP4 operations when paired with liquid cooling. The B200 GPU also supports up to 192 GB of VRAM and offers 8 TB/s of memory bandwidth.

The AI industry eagerly awaited the B200 GPU’s release. However, the launch of the AI chip has been slower than expected. In fact, the first units have only been delivered to customers over the past few weeks. In August, Nvidia acknowledged that the manufacturing yield was below expectations. The company’s engineers needed to redesign some layers of the chip to address issues that delayed the initial deliveries.

“We executed a change to the Blackwell GPU mask to improve production yield,” Nvidia said in a statement. CEO Jensen Huang addressed the situation directly: “We had a design flaw in Blackwell, it was functional, but the design flaw caused the yield to be low. It was 100% Nvidia’s fault.” He also dismissed reports of tensions between Nvidia and TSMC as “fake news.”

Although Nvidia seems to have resolved the manufacturing yield issue, a second problem with the B200 GPU has recently emerged.

Some Nvidia Customers Report Overheating Issues with B200 GPUs

According to Reuters, some of Nvidia’s early customers with servers equipped with the B200 GPU have reported overheating when these machines are installed together in racks designed to hold up to 72 chips. Data centers commonly pack large numbers of chips into dense racks to save space and increase the computing power of the infrastructure. However, a major challenge in such installations is ensuring proper cooling for all components.

Nvidia has acknowledged the problem. The company has reportedly asked its suppliers to redesign the racks multiple times in an effort to improve the cooling system. In a bid to instill confidence among its customers, an Nvidia spokesperson told Reuters, “Nvidia is working with leading cloud service providers as an integral part of our engineering team and process. The engineering iterations are normal and expected.”

The bottom line is that Nvidia has made two missteps in a relatively short period, which is unusual for a company that typically executes smoothly. Nvidia is working closely with its suppliers and customers to address the cooling issues in Blackwell servers, so it will likely resolve them soon.

However, the exceptionally high demand for AI chips may be prompting Nvidia to rush its processes. More thorough and deliberate development, verification, and testing might have prevented both issues. Market pressures are real, but haste often leads to mistakes that established engineering processes and timelines are designed to prevent.

Image | EdTech Stanford University School of Medicine
