DualPipe and EPLB are two core technologies for large-scale AI model training, improving training speed and GPU utilization respectively, while profile-data lays out DeepSeek's end-to-end performance data from training through inference.
Three big releases announced at once! For the fourth drop of its "Open Source Week," DeepSeek has open-sourced its latest optimized parallelism strategies, comprising DualPipe, the Expert Parallel Load Balancer (EPLB), and end-to-end performance data (profile-data).
According to the announcement, DualPipe and EPLB are two core technologies for large-scale AI model training, focused respectively on distributed training efficiency and expert-parallel load balancing; both were designed for the V3/R1 models.
Specifically, DualPipe is a bidirectional pipeline parallelism algorithm that reduces pipeline "bubbles" (idle time) in distributed training through bidirectional pipeline scheduling and computation-communication overlap, letting training flow like an assembly line and improving GPU utilization.
For example, in traditional AI training, pipeline bubbles caused by GPUs waiting on data transfers can eat up more than 30% of the time. DualPipe gives training the ability to "cook while washing the dishes": computation and communication run at the same time, and the pipeline operates in both directions at once, as sketched below.
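The overlap principle itself can be illustrated in a few lines of PyTorch: a side CUDA stream handles the "communication" (here simulated by a device-to-host copy into pinned memory standing in for expert-parallel all-to-all traffic) while the default stream keeps computing the next micro-batch. This is a minimal sketch of the idea only, not DeepSeek's DualPipe implementation; the tensor names and sizes are placeholders, and a CUDA device is assumed.

```python
import torch

# Minimal sketch of computation-communication overlap (illustrative only).
assert torch.cuda.is_available(), "this sketch assumes a CUDA device"

comm_stream = torch.cuda.Stream()            # dedicated stream for "communication"
x = torch.randn(4096, 4096, device="cuda")   # current micro-batch activations (placeholder)
w = torch.randn(4096, 4096, device="cuda")   # a weight matrix (placeholder)

prev_output = torch.randn(4096, 4096, device="cuda")   # result of the previous micro-batch
pinned_buf = torch.empty(4096, 4096, pin_memory=True)  # pinned host buffer for async copy

# Let the side stream start only after prev_output has been produced.
comm_stream.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(comm_stream):
    # Stand-in for expert-parallel all-to-all / all-reduce traffic.
    pinned_buf.copy_(prev_output, non_blocking=True)

# Meanwhile, the default stream keeps computing the current micro-batch.
y = x @ w

# Wait for the "communication" to finish before anyone reuses its result.
torch.cuda.current_stream().wait_stream(comm_stream)
torch.cuda.synchronize()
print("overlapped compute and transfer:", y.shape, pinned_buf.shape)
```

DualPipe applies this same principle across pipeline ranks, interleaving forward and backward micro-batches from both ends of the pipeline so that communication for one micro-batch hides behind computation for another.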
It is worth mentioning that DualPipe was jointly developed by three individuals—Jiashi Li, Chengqi Deng, and Liang Wenfeng.
In a Mixture-of-Experts (MoE) model, uneven load across experts often drags GPU utilization below 60%. EPLB is an algorithm designed to solve this problem: it places replicas of heavily loaded experts on idle GPUs, much as a ride-hailing platform such as DiDi dispatches extra vehicles to high-demand areas during peak hours, thereby improving resource utilization.
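A simplified sketch of the idea (not EPLB's actual algorithm, which additionally handles hierarchical node-level grouping): given per-expert load estimates, give extra replicas to the hottest experts, then greedily pack the replicas onto GPUs so the total load stays as even as possible. All names and numbers below are illustrative.

```python
import heapq

def balance_experts(expert_load, num_gpus, num_replicas):
    """Greedy sketch of expert-parallel load balancing (illustrative, not EPLB)."""
    assert num_replicas >= len(expert_load), "need at least one slot per expert"

    # 1) Give extra replicas to the experts with the highest per-replica load.
    replicas = {e: 1 for e in range(len(expert_load))}
    for _ in range(num_replicas - len(expert_load)):
        hottest = max(replicas, key=lambda e: expert_load[e] / replicas[e])
        replicas[hottest] += 1

    # 2) Pack replicas onto GPUs: largest piece first, onto the least-loaded GPU.
    gpu_heap = [(0.0, g, []) for g in range(num_gpus)]  # (load, gpu_id, experts)
    heapq.heapify(gpu_heap)
    pieces = sorted(
        ((expert_load[e] / replicas[e], e) for e in replicas for _ in range(replicas[e])),
        reverse=True,
    )
    for load, expert in pieces:
        gpu_load, gpu_id, assigned = heapq.heappop(gpu_heap)
        assigned.append(expert)
        heapq.heappush(gpu_heap, (gpu_load + load, gpu_id, assigned))
    return sorted(gpu_heap, key=lambda t: t[1])

# Example: 8 experts with skewed load, 4 GPUs, 12 replica slots in total.
loads = [90, 40, 30, 20, 10, 5, 3, 2]
for gpu_load, gpu_id, experts in balance_experts(loads, num_gpus=4, num_replicas=12):
    print(f"GPU {gpu_id}: experts {experts}, load ≈ {gpu_load:.1f}")
```

Replicating the hottest expert splits its traffic across several GPUs, which is what pulls the per-GPU load spread back down.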
According to the reported tests, the spread in GPU load narrowed from 20%-30% to within 5%, and training speed roughly tripled.
The final big surprise is that DeepSeek has released its end-to-end performance data covering training through inference, essentially an "X-ray" of the AI training process that shows exactly how DeepSeek-AI meticulously optimizes computation and communication.
At the same time, the accompanying open-source profiling data covers the training, prefilling, and decoding stages; the traces can be loaded into a browser (for example via chrome://tracing) so developers can visually analyze computation-communication overlap efficiency.
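Readers who want to produce comparable traces for their own models can do so with the standard PyTorch profiler; the sketch below reproduces only the browser-viewing workflow, not DeepSeek's actual profiling scripts, and the toy model, step count, and file name are placeholders. A CUDA device is assumed.

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Toy training step standing in for a real model (placeholder, not DeepSeek's code).
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 1024)
).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
data = torch.randn(32, 1024, device="cuda")

# Record CPU and GPU activity for a few steps and export a Chrome-trace file.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(5):
        optimizer.zero_grad(set_to_none=True)
        loss = model(data).square().mean()
        loss.backward()
        optimizer.step()

prof.export_chrome_trace("train_trace.json")
print("open chrome://tracing and load train_trace.json")
```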
Editor: lambor