The Secret to DeepSeek’s Extreme Efficiency Is Out: It Bypassed Nvidia’s CUDA Standard

  • DeepSeek engineers used PTX to maximize the H800 GPUs’ performance.

  • One strategy was to use only 20 SMs of each GPU for inter-server communication.


Juan Carlos López

Senior Writer


The release of DeepSeek's V3 model as open source has been a blessing: the strategy its engineers devised to develop such an efficient AI model is gradually coming to light. Before continuing, it's essential to remember that DeepSeek claims to have trained the model using only 2,048 Nvidia H800 GPUs.

Some analysts say its infrastructure actually consists of 50,000 H100 GPUs purchased through intermediaries, though this remains conjecture. The H100 is more powerful than the H800, but it's entirely plausible that DeepSeek had to settle for the latter because U.S. government sanctions prevent Chinese companies from accessing the H100. Since November 2023, Nvidia has also been barred from shipping the H800 to Chinese customers.

One of the Keys to DeepSeek’s Success: PTX

Nvidia's GPUs aren't the only factor behind the company's rapid growth over the past five years. Its Compute Unified Device Architecture (CUDA) has played a crucial role. Most AI projects today rely on CUDA, which unifies the compilers and development tools programmers use to write software for Nvidia GPUs, so replacing it in an ongoing project is challenging.

Huawei, seeking a significant share of China's AI market, has developed its own computing architecture for neural networks as an alternative to CUDA. For now, though, CUDA dominates. Nvidia's platform provides a high-level language that gives programmers approachable access to GPU hardware. However, DeepSeek engineers bypassed CUDA and instead used Parallel Thread Execution (PTX).
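To make the contrast concrete, here's a minimal sketch of the high-level CUDA style, in which the programmer writes C++-like code and Nvidia's nvcc compiler handles the mapping onto the GPU. The kernel is purely illustrative and isn't taken from any real AI codebase:

```cuda
#include <cstdio>

// High-level CUDA C++: each thread scales one array element. The compiler,
// not the programmer, decides how this maps to machine instructions.
__global__ void scale(float* data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1024;
    float* d = nullptr;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));
    // Launch enough 256-thread blocks to cover all n elements.
    scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}
```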

DeepSeek engineers used PTX to maximize the performance of the H800 GPUs in their possession.

PTX, similar to assembly language, is the low-level language Nvidia recommends for developers who need to implement optimizations directly on its GPUs. Programming in PTX is more complex and time-consuming than using CUDA, but it allows developers to write more efficient code that makes better use of GPU resources.
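By way of contrast with the CUDA sketch above, here's a minimal sketch of what PTX-level work can look like, assuming the common pattern of embedding hand-written PTX in a CUDA C++ kernel through inline assembly. The instruction shown is illustrative; it isn't DeepSeek's code:

```cuda
#include <cstdio>

// The asm statement embeds a hand-written PTX instruction. In plain CUDA C++
// this would simply be `y = x + 1.0f`; writing the PTX directly gives the
// programmer instruction-level control over what the GPU executes.
__global__ void add_one(float* data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float x = data[i];
    float y;
    // add.f32: 32-bit float add. 0F3F800000 is the PTX hex literal for 1.0f.
    asm volatile("add.f32 %0, %1, 0F3F800000;" : "=f"(y) : "f"(x));
    data[i] = y;
}

int main() {
    const int n = 256;
    float* d = nullptr;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));
    add_one<<<1, n>>>(d);

    float h[n];
    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("h[0] = %f\n", h[0]);  // expected: 1.000000
    cudaFree(d);
    return 0;
}
```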

Presumably, DeepSeek engineers used PTX to maximize the H800 GPUs’ performance. One of their stratagems was using only 20 streaming multiprocessors (SMs) per GPU for server-to-server communication, leaving the remaining 112 SMs on each chip for computation. Essentially, Chinese engineers built DeepSeek from the ground up with such optimizations, largely explaining the AI model’s efficiency.
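How DeepSeek pins work to specific SMs isn't public, but one common way to approximate such a split is a persistent kernel that launches one block per SM and dedicates a fixed subset of blocks to communication. The sketch below is a conceptual illustration under that assumption, with hypothetical placeholder functions standing in for the real communication and compute loops:

```cuda
// Conceptual sketch only; not DeepSeek's actual code.
constexpr int kCommBlocks  = 20;  // reported: 20 SMs for inter-server traffic
constexpr int kTotalBlocks = 132; // the H800 exposes 132 SMs in total

__device__ void run_communication(int slot) {
    // Hypothetical placeholder: a real system would drive RDMA/NVLink
    // transfers here, e.g. polling queues of outgoing activation chunks.
    (void)slot;
}

__device__ void run_compute(int slot) {
    // Hypothetical placeholder: matrix-multiply or attention work.
    (void)slot;
}

__global__ void persistent_kernel() {
    if (blockIdx.x < kCommBlocks) {
        run_communication(blockIdx.x);          // 20 blocks ≈ 20 SMs
    } else {
        run_compute(blockIdx.x - kCommBlocks);  // remaining 112 blocks/SMs
    }
}

int main() {
    // With exactly one resident block per SM, the block-to-SM mapping stays
    // stable for the kernel's lifetime, which is what makes the reservation
    // behave like dedicating whole SMs to communication.
    persistent_kernel<<<kTotalBlocks, 256>>>();
    cudaDeviceSynchronize();
    return 0;
}
```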

DeepSeek’s programmers have achieved an engineering feat likely to influence how AI model developers approach their projects. It’s tangible proof that China has successfully adapted to the GPU shortage caused by U.S. sanctions.

Image | Nvidia

Related | Downloading and Installing DeepSeek on Your Computer: How to Use It Locally on Windows, macOS, and Linux
