Chinese AI startup DeepSeek says it has trained an AI model comparable to leading models from companies such as OpenAI, Meta, and Anthropic, but with an 11x reduction in the amount, and therefore the cost, of GPU compute. The claim has not yet been fully verified, but the surprising announcement suggests that while US sanctions have affected the availability of AI hardware in China, its scientists are working to extract the most performance from a limited amount of hardware and so blunt the impact of the restrictions on China's supply of AI chips. The company has open-sourced the model and its weights, so third-party tests should appear soon.
According to the paper, DeepSeek trained its DeepSeek-V3 Mixture-of-Experts (MoE) language model, with 671 billion parameters, in about two months on a cluster of 2,048 Nvidia H800 GPUs, roughly 2.8 million GPU-hours. For comparison, Meta needed about 11 times that compute (30.8 million GPU-hours) to train Llama 3, with 405 billion parameters, on a cluster of 16,384 H100 GPUs over 54 days.
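As a quick sanity check on those figures, a back-of-envelope calculation is enough; the 57-day duration used below is an assumption chosen to stand in for "about two months" and is purely illustrative:

```python
# Back-of-envelope check of the GPU-hour figures quoted above.
# The 57-day figure is an assumption used only for illustration.
deepseek_gpus, deepseek_days = 2_048, 57
deepseek_gpu_hours = deepseek_gpus * deepseek_days * 24
print(f"DeepSeek-V3: ~{deepseek_gpu_hours / 1e6:.1f}M GPU-hours")  # ~2.8M

llama_gpu_hours = 30.8e6  # Meta's reported total for Llama 3 405B
print(f"Ratio: ~{llama_gpu_hours / deepseek_gpu_hours:.0f}x")      # ~11x
```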
DeepSeek claims it significantly reduced the compute and memory demands typically required for models of this scale by using advanced pipeline algorithms, an optimized communication framework, and FP8 low-precision computation and communication.
The company used a cluster of 2,048 Nvidia H800 GPUs connected by NVLink for GPU-to-GPU communication within a node and by InfiniBand for node-to-node communication. In such a setup, intra-node communication is fairly fast, but node-to-node communication is not, so optimizations are the key to performance and efficiency. DeepSeek implemented dozens of optimization techniques to reduce the computing requirements of DeepSeek-V3, and several key technologies account for its impressive results.
DeepSeek used the DualPipe algorithm to overlap computation and communication phases within and across forward and backward micro-batches, reducing pipeline inefficiencies. In particular, dispatch (routing tokens to experts) and combine (aggregating results) operations were processed in parallel with computation using customized PTX (Parallel Thread Execution) instructions, which means writing specialized low-level code that interfaces with Nvidia CUDA GPUs and tunes their behavior. According to DeepSeek, the DualPipe algorithm minimized training bottlenecks, especially in the cross-node expert parallelism required by the MoE architecture, and this optimization allowed the cluster to process 14.8 trillion tokens during pre-training with near-zero communication overhead.
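The paper describes DualPipe rather than shipping it here, but the general idea of overlapping dispatch/combine traffic with computation can be sketched in PyTorch. The sketch below is purely illustrative and is not DeepSeek's implementation; the function names, shapes, and the use of `all_to_all_single` on a dedicated CUDA stream are all assumptions:

```python
import torch
import torch.distributed as dist

# Illustrative sketch only: overlapping expert dispatch/combine communication
# with computation on separate CUDA streams. This is NOT DeepSeek's DualPipe;
# names and shapes are hypothetical.

comm_stream = torch.cuda.Stream()     # stream dedicated to all-to-all traffic
compute_stream = torch.cuda.Stream()  # stream dedicated to GEMMs / attention

def moe_layer_step(tokens, routed_tokens, expert_fn, attn_fn, group):
    """Process one micro-batch: start the token dispatch, run
    communication-independent compute while the all-to-all is in flight,
    then run the experts and combine the results."""
    recv_buf = torch.empty_like(routed_tokens)

    with torch.cuda.stream(comm_stream):
        # Asynchronous all-to-all: send routed tokens to the ranks that host
        # their experts while the compute stream keeps working.
        dispatch_work = dist.all_to_all_single(
            recv_buf, routed_tokens, group=group, async_op=True)

    with torch.cuda.stream(compute_stream):
        # Work that does not depend on the dispatched tokens (e.g. attention
        # on local tokens) overlaps with the communication above.
        attn_out = attn_fn(tokens)

    dispatch_work.wait()              # ensure dispatched tokens have arrived
    expert_out = expert_fn(recv_buf)  # expert MLPs on the received tokens

    # Combine: return expert outputs to their source ranks.
    combined = torch.empty_like(expert_out)
    dist.all_to_all_single(combined, expert_out, group=group)
    return attn_out, combined
```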
In addition to implementing DualPipe, DeepSeek restricted each token to a maximum of four nodes, limiting how many nodes take part in communication. This reduced traffic and ensured that communication and computation could overlap effectively.
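The node-limiting idea can also be sketched. The routine below is an illustrative approximation rather than DeepSeek's routing code (the paper uses its own group-limited scoring): for each token it first keeps the `max_nodes` nodes with the highest aggregate router affinity, then selects the top-k experts only from those nodes. All names, shapes, and the affinity aggregation are assumptions:

```python
import torch

def node_limited_topk_routing(scores, experts_per_node, top_k=8, max_nodes=4):
    """Illustrative sketch of node-limited routing (not DeepSeek's code).

    scores: (num_tokens, num_experts) router affinities.
    experts_per_node: experts hosted on each node (num_experts must be a
    multiple of it, with experts laid out contiguously per node).
    """
    num_tokens, num_experts = scores.shape
    num_nodes = num_experts // experts_per_node

    # Aggregate each token's affinity per node (sum of its expert scores there).
    per_node = scores.view(num_tokens, num_nodes, experts_per_node).sum(dim=-1)

    # Keep only the best `max_nodes` nodes for each token.
    _, kept_nodes = per_node.topk(max_nodes, dim=-1)   # (num_tokens, max_nodes)
    node_mask = torch.zeros(num_tokens, num_nodes, dtype=torch.bool,
                            device=scores.device)
    node_mask.scatter_(1, kept_nodes, True)

    # Mask out experts on non-selected nodes, then take the top_k experts.
    expert_mask = node_mask.repeat_interleave(experts_per_node, dim=-1)
    masked_scores = scores.masked_fill(~expert_mask, float("-inf"))
    top_scores, top_experts = masked_scores.topk(top_k, dim=-1)
    return top_scores, top_experts
```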
A key element in reducing the compute and communication requirements was the adoption of low-precision training methods. DeepSeek employed an FP8 mixed-precision framework, enabling faster computation and lower memory usage without compromising numerical stability. Key operations, such as matrix multiplications, were performed in FP8, while sensitive components such as embedding and normalization layers were kept at higher precision (BF16 or FP32) to preserve accuracy. This approach consistently kept relative training-loss error below 0.25% and reduced memory requirements while maintaining robust accuracy.
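A rough sketch of what that split looks like in practice follows. It only simulates FP8 quantization of the matmul inputs (real FP8 training relies on hardware FP8 GEMM kernels and finer-grained scaling), so everything below, including the `SimulatedFP8Linear` class, is an illustrative assumption rather than DeepSeek's framework:

```python
import torch
import torch.nn as nn

# Illustrative sketch (not DeepSeek's framework): simulate FP8 mixed precision
# by quantizing GEMM inputs to float8_e4m3fn with a per-tensor scale, while
# sensitive layers (embedding, normalization) stay in BF16/FP32.
# Requires a PyTorch build with float8 dtypes.

FP8_MAX = 448.0  # largest magnitude representable in float8_e4m3fn

def quantize_to_fp8(x: torch.Tensor):
    """Scale a tensor into the FP8 range, cast it to FP8, and return it
    together with the scale needed to undo the quantization."""
    scale = FP8_MAX / x.abs().max().clamp(min=1e-12)
    x_fp8 = (x * scale).to(torch.float8_e4m3fn)
    return x_fp8, scale

class SimulatedFP8Linear(nn.Linear):
    """Linear layer whose matmul inputs are round-tripped through FP8."""
    def forward(self, x):
        x_fp8, x_scale = quantize_to_fp8(x)
        w_fp8, w_scale = quantize_to_fp8(self.weight)
        # Dequantize to BF16 for the actual matmul in this simulation.
        x_deq = x_fp8.to(torch.bfloat16) / x_scale
        w_deq = w_fp8.to(torch.bfloat16) / w_scale
        out = x_deq @ w_deq.t()
        if self.bias is not None:
            out = out + self.bias.to(torch.bfloat16)
        return out

# Sensitive components remain in higher precision, e.g.:
# embedding = nn.Embedding(vocab_size, dim).to(torch.bfloat16)
# norm = nn.LayerNorm(dim).to(torch.float32)
```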
In terms of performance, the company says its DeepSeek-V3 MoE language model is on par with or better than GPT-4x, Claude-3.5-Sonnet, and Llama-3.1, depending on the benchmark. Naturally, those claims will need to be confirmed by third-party benchmarks.
Although DeepSeek-V3 may lag frontier models such as GPT-4o and o3 in parameter count and reasoning capability, DeepSeek's achievement shows that an advanced MoE language model can be trained with relatively limited resources. Of course, this takes a great deal of optimization and low-level programming, but the results appear to be surprisingly good.
The DeepSeek team acknowledges that deploying the DeepSeek-V3 model requires advanced hardware as well as a deployment strategy that separates the prefilling and decoding stages, which may be out of reach for small companies with limited resources.
“While acknowledging its strong performance and cost-effectiveness, we also recognize that DeepSeek-V3 has some limitations, especially on the deployment,” the company’s paper reads. “Firstly, to ensure efficient inference, the recommended deployment unit for DeepSeek-V3 is relatively large, which might pose a burden for small-sized teams. Secondly, although our deployment strategy for DeepSeek-V3 has achieved an end-to-end generation speed of more than two times that of DeepSeek-V2, there still remains potential for further enhancement. Fortunately, these limitations are expected to be naturally addressed with the development of more advanced hardware.”