Abstract
As machine learning models scale in size and complexity, their computational
requirements become a significant barrier. Mixture-of-Experts (MoE) models
alleviate this issue by selectively activating only the experts relevant to
each input. Even so, MoE models are hindered by the high communication overhead
of all-to-all operations, low GPU utilization under the synchronous
communication constraint, and the added complexity of heterogeneous GPU
environments.
This paper presents Aurora, which optimizes both model deployment and
all-to-all communication scheduling to address these challenges in MoE
inference. Aurora achieves minimal communication times by strategically
ordering token transmissions in all-to-all communications. It improves GPU
utilization by colocating experts from different models on the same device,
avoiding the limitations of synchronous all-to-all communication. We analyze
Aurora's optimization strategies theoretically across four common GPU cluster
settings: exclusive vs. colocated models on GPUs, and homogeneous vs.
heterogeneous GPUs. Aurora provides optimal solutions in three of these cases;
for the remaining, NP-hard scenario, it offers a polynomial-time solution
within 1.07x of the optimal.
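The abstract does not spell out Aurora's scheduling algorithm, but a toy sketch can illustrate why the order of token transmissions affects all-to-all completion time. The Python below is a hedged illustration, not the paper's method: it assumes each GPU sends and receives at most one transfer at a time, uniform link bandwidth, and made-up token sizes, and it compares a largest-first greedy ordering against the order in which transfers arrive.

# Toy sketch: why transmission order matters in an MoE all-to-all round.
# NOT Aurora's published algorithm; a generic greedy heuristic under assumed
# constraints (one transfer per send/receive port at a time, uniform bandwidth).

TOKEN_BYTES = 4096 * 2  # assumption: 4096-dim fp16 activations per token


def all_to_all_makespan(transfers, bandwidth_gbps, largest_first=True):
    """Return the round's completion time (seconds) for a given ordering.

    transfers: list of (src_gpu, dst_gpu, num_tokens) produced by expert
    routing. A transfer starts once both its sender's and its receiver's
    ports are free.
    """
    order = sorted(transfers, key=lambda t: -t[2]) if largest_first else transfers
    bytes_per_sec = bandwidth_gbps * 1e9 / 8
    send_free, recv_free = {}, {}  # earliest time each port is next free
    makespan = 0.0
    for src, dst, tokens in order:
        start = max(send_free.get(src, 0.0), recv_free.get(dst, 0.0))
        finish = start + tokens * TOKEN_BYTES / bytes_per_sec
        send_free[src] = recv_free[dst] = finish
        makespan = max(makespan, finish)
    return makespan


# Example: 3 GPUs exchanging uneven token batches after expert routing.
demo = [(0, 1, 30_000), (0, 2, 5_000), (1, 2, 20_000), (2, 0, 12_000)]
print(f"largest-first order: {all_to_all_makespan(demo, 100):.4f} s")
print(f"arrival order:       {all_to_all_makespan(demo, 100, False):.4f} s")

On this sample traffic the largest-first ordering finishes earlier because the small 0-to-2 transfer no longer occupies receiver 2 ahead of the large 1-to-2 transfer; large transfers go first and small ones fill the idle gaps.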
Aurora is the first approach to minimize MoE inference time via optimal model
deployment and communication scheduling across various scenarios. Evaluations
demonstrate that Aurora significantly accelerates inference, achieving speedups
of up to 2.38x in homogeneous clusters and 3.54x in heterogeneous environments.
Moreover, Aurora enhances GPU utilization by up to 1.5x compared to existing
methods.
BibTeX
@online{Li_2410.17043,
  TITLE = {Optimizing Mixture-of-Experts Inference Time Combining Model Deployment and Communication Scheduling},
  AUTHOR = {Li, Jialong and Tripathi, Shreyansh and Rastogi, Lakshay and Lei, Yiming and Pan, Rui and Xia, Yiting},
  LANGUAGE = {eng},
  URL = {https://arxiv.org/abs/2410.17043},
  EPRINT = {2410.17043},
  EPRINTTYPE = {arXiv},
  YEAR = {2024},
}
Endnote
%0 Report
%A Li, Jialong
%A Tripathi, Shreyansh
%A Rastogi, Lakshay
%A Lei, Yiming
%A Pan, Rui
%A Xia, Yiting
%+ Networks and Cloud Systems, MPI for Informatics, Max Planck Society; External Organizations; External Organizations; Networks and Cloud Systems, MPI for Informatics, Max Planck Society; External Organizations; Networks and Cloud Systems, MPI for Informatics, Max Planck Society
%T Optimizing Mixture-of-Experts Inference Time Combining Model Deployment and Communication Scheduling
%G eng
%U http://hdl.handle.net/21.11116/0000-0010-43DB-C
%U https://arxiv.org/abs/2410.17043
%D 2024
%8 22.10.2024
%K Computer Science, Learning, cs.LG; Computer Science, Networking and Internet Architecture, cs.NI