
Let AI "cook and wash the dishes at the same time"! Day four of DeepSeek's Open Source Week: optimized parallelism strategies, with Liang Wenfeng personally pitching in!

wallstreetcn ·  Feb 27 04:49

DualPipe and EPLB are two core technologies for large-scale AI model training that improve training speed and GPU utilization, respectively, while profile-data lays out end-to-end performance data covering DeepSeek's training and inference.

Three big releases in one go! For the fourth installment of its "Open Source Week," DeepSeek open-sourced its latest parallelism optimizations: DualPipe, the Expert Parallelism Load Balancer (EPLB), and end-to-end performance data (profile-data).

According to the announcement, DualPipe and EPLB are two core technologies for large-scale AI model training, focused respectively on distributed training efficiency and expert-parallel load balancing, and both were designed for V3/R1.

Specifically, DualPipe is a bidirectional pipeline parallelism algorithm that reduces pipeline "bubbles" (idle time) in distributed training through bidirectional pipeline scheduling and computation-communication overlap, keeping the training process flowing like an assembly line and raising GPU utilization.

In traditional AI training, for example, the "pipeline bubbles" caused by GPUs waiting on data transfers can take up more than 30% of total time. DualPipe gives training the ability to "cook while washing the dishes": the pipeline runs in both directions at once.
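To make the "cook while washing dishes" idea concrete, here is a minimal, hedged sketch of generic computation-communication overlap in PyTorch. It is not DeepSeek's DualPipe code: the tensor sizes are arbitrary, and a non-blocking host-to-device copy on a side CUDA stream merely stands in for the inter-GPU transfer of a micro-batch.

```python
# Hedged sketch: overlap a stand-in "communication" (an async host-to-device copy
# on a side CUDA stream) with computation on the default stream. Generic pattern,
# not DeepSeek's DualPipe implementation; all sizes are placeholders.
import torch

assert torch.cuda.is_available(), "this sketch needs a CUDA GPU"

device = torch.device("cuda")
comm_stream = torch.cuda.Stream()                    # side stream for "communication"
compute = torch.randn(4096, 4096, device=device)
host_buf = torch.randn(4096, 4096, pin_memory=True)  # pinned memory enables async copies

with torch.cuda.stream(comm_stream):
    # Stand-in for receiving the next micro-batch's activations from another GPU.
    incoming = host_buf.to(device, non_blocking=True)

# Meanwhile the default stream keeps doing useful work on the current micro-batch.
for _ in range(8):
    compute = compute @ compute.T / compute.norm()

# Only wait for the transfer at the point where its result is actually needed.
torch.cuda.current_stream().wait_stream(comm_stream)
result = compute + incoming
torch.cuda.synchronize()
print("overlapped compute and transfer; result norm:", result.norm().item())
```

The less time the GPU spends blocked at that final synchronization point, the smaller the bubble; DualPipe's bidirectional scheduling is reported to push this overlap further by feeding micro-batches into the pipeline from both ends.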

It is worth noting that DualPipe was developed by three people: Jiashi Li, Chengqi Deng, and Liang Wenfeng.

In a mixture-of-experts (MoE) model, uneven expert load often drags GPU utilization below 60%. EPLB is an algorithm designed to solve this problem: it places replicas of heavily loaded experts on idle GPUs, much as the ride-hailing platform DiDi dispatches more cars to high-demand areas at peak times, thereby improving resource utilization.

In actual tests, the load gap between GPUs narrowed from 20%-30% to within 5%, and training speed tripled.
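As an illustration only (not DeepSeek's EPLB algorithm), the following sketch shows the underlying idea with a simple greedy heuristic: give extra replicas to the hottest experts, then pack the replicas onto GPUs so the per-GPU load evens out. The expert loads, GPU count, and replica budget are invented numbers.

```python
# Hedged sketch of expert-parallel load balancing: replicate the most heavily
# loaded experts, then greedily pack replicas onto GPUs to even out the load.
# Illustration of the idea only, not DeepSeek's EPLB; all numbers are made up.
import heapq

def balance(expert_loads, num_gpus, num_replica_slots):
    # Give extra replicas to the hottest experts; each replica splits that expert's load.
    replicas = {e: 1 for e in range(len(expert_loads))}
    for _ in range(num_replica_slots):
        hottest = max(replicas, key=lambda e: expert_loads[e] / replicas[e])
        replicas[hottest] += 1

    # Greedy packing: always place the next-largest replica on the currently lightest GPU.
    pieces = sorted(
        ((expert_loads[e] / r, e) for e, r in replicas.items() for _ in range(r)),
        reverse=True,
    )
    gpus = [(0.0, gpu_id, []) for gpu_id in range(num_gpus)]  # (load, id, experts)
    heapq.heapify(gpus)
    for load, expert in pieces:
        gpu_load, gpu_id, assigned = heapq.heappop(gpus)
        heapq.heappush(gpus, (gpu_load + load, gpu_id, assigned + [expert]))

    for gpu_load, gpu_id, assigned in sorted(gpus, key=lambda g: g[1]):
        print(f"GPU {gpu_id}: load {gpu_load:.0f}, expert replicas {assigned}")

# Invented token counts routed to 8 experts, placed on 4 GPUs with 4 spare replica slots.
balance([900, 120, 80, 300, 60, 40, 500, 100], num_gpus=4, num_replica_slots=4)
```

Run as-is, the example assigns every GPU roughly 515-535 units of work out of an ideal 525, illustrating how replicating a few hot experts flattens the load curve.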

The final big surprise is that DeepSeek released end-to-end performance data covering both training and inference, essentially an "X-ray" of the AI training process that shows in detail how DeepSeek optimizes computation and communication.

Alongside it, the accompanying open-source performance analysis data provides complete traces for the training, prefilling, and decoding stages, letting developers visually inspect computation-communication overlap efficiency in a browser.
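For readers who want to produce a similar browser-viewable trace for their own model, here is a generic PyTorch Profiler sketch. It is not DeepSeek's profiling tooling; the toy model, input shapes, and output filename are placeholders.

```python
# Hedged sketch: capture a browser-viewable trace with the PyTorch Profiler.
# Generic profiling code, not DeepSeek's profile-data pipeline.
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 1024),
)
x = torch.randn(32, 1024)

activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    model, x = model.cuda(), x.cuda()
    activities.append(ProfilerActivity.CUDA)

with profile(activities=activities, record_shapes=True) as prof:
    for _ in range(10):
        model(x).sum().backward()

# Open trace.json at chrome://tracing (or edge://tracing) to see how kernels,
# memory copies, and any communication ops line up on the timeline.
prof.export_chrome_trace("trace.json")
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10))
```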

Editor: lambor
