Nvidia Stumbles Upon a New, Unexpected Issue: Blackwell AI Chips Are Reportedly Overheating

In August, a design flaw resulted in a lower-than-expected initial manufacturing yield of the B200 GPUs.
Now, another problem has popped up, with some customers reporting that Nvidia’s AI chip is overheating.

November 19, 2024, 11:00 ET

Juan Carlos López

Senior Writer

The B200 GPUs are causing some challenges for Nvidia. When the company unveiled the AI chip in March 2024, it was evident that it had a powerful product on its hands. After all, its specifications are impressive: it features 208 billion transistors and the state-of-the-art Blackwell architecture, and it can reach a maximum performance of 20 petaflops in FP4 operations when paired with liquid cooling. The B200 GPU also supports up to 192 GB of VRAM and offers 8 TB/s of memory bandwidth.

The AI industry eagerly awaited the B200 GPU’s release. However, the launch of the AI chip has been slower than expected. In fact, the first units have only been delivered to customers over the past few weeks. In August, Nvidia acknowledged that the manufacturing yield was below expectations. The company’s engineers needed to redesign some layers of the chip to address issues that delayed the initial deliveries.

“We executed a change to the Blackwell GPU mask to improve production yield,” Nvidia said in a statement. CEO Jensen Huang addressed the situation directly: “We had a design flaw in Blackwell, it was functional, but the design flaw caused the yield to be low. It was 100% Nvidia’s fault.” He also dismissed reports of tensions between Nvidia and TSMC, calling them “fake news.”

Although Nvidia seems to have resolved the manufacturing yield issue, another problem related to the B200 GPU has recently emerged.

Some Nvidia Customers Report Overheating Issues with B200 GPUs

According to Reuters, some of Nvidia’s early customers with servers equipped with the B200 GPU have reported overheating issues when these machines are installed together in racks designed to hold up to 72 chips. In these installations, it’s common to use racks that house a large number of highly integrated chips to maximize space and enhance the power of the infrastructure. However, a major challenge in setting up these installations is ensuring proper cooling for all components.
