Abstract
As machine learning models scale in size and complexity, their computational
requirements become a significant barrier. Mixture-of-Experts (MoE) models
alleviate this issue by selectively activating only the experts relevant to
each input. Even so, MoE models are hindered by the high communication overhead
of all-to-all operations, low GPU utilization under the synchronous
communication constraint, and the added complexity of heterogeneous GPU
environments.
This paper presents Aurora, which optimizes both model deployment and
all-to-all communication scheduling to address these challenges in MoE
inference. Aurora achieves minimal communication times by strategically
ordering token transmissions in all-to-all communications. It improves GPU
utilization by colocating experts from different models on the same device,
avoiding the limitations of synchronous all-to-all communication. We analyze
Aurora's optimization strategies theoretically across four common GPU cluster
settings: exclusive vs. colocated models on GPUs, and homogeneous vs.
heterogeneous GPUs. Aurora provides optimal solutions in three of these cases;
for the remaining, NP-hard scenario, it offers a polynomial-time solution
within 1.07x of the optimal.
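The abstract does not spell out Aurora's scheduling algorithm, but a toy sketch can illustrate why the order of token transmissions affects all-to-all completion time. The Python below is a hedged illustration, not the paper's method: it assumes each GPU sends and receives at most one transfer at a time, uniform link bandwidth, and made-up token sizes, and it compares a largest-first greedy ordering against the order in which transfers arrive.

# Toy sketch: why transmission order matters in an MoE all-to-all round.
# NOT Aurora's published algorithm; a generic greedy heuristic under assumed
# constraints (one transfer per send/receive port at a time, uniform bandwidth).

TOKEN_BYTES = 4096 * 2  # assumption: 4096-dim fp16 activations per token


def all_to_all_makespan(transfers, bandwidth_gbps, largest_first=True):
    """Return the round's completion time (seconds) for a given ordering.

    transfers: list of (src_gpu, dst_gpu, num_tokens) produced by expert
    routing. A transfer starts once both its sender's and its receiver's
    ports are free.
    """
    order = sorted(transfers, key=lambda t: -t[2]) if largest_first else transfers
    bytes_per_sec = bandwidth_gbps * 1e9 / 8
    send_free, recv_free = {}, {}  # earliest time each port is next free
    makespan = 0.0
    for src, dst, tokens in order:
        start = max(send_free.get(src, 0.0), recv_free.get(dst, 0.0))
        finish = start + tokens * TOKEN_BYTES / bytes_per_sec
        send_free[src] = recv_free[dst] = finish
        makespan = max(makespan, finish)
    return makespan


# Example: 3 GPUs exchanging uneven token batches after expert routing.
demo = [(0, 1, 30_000), (0, 2, 5_000), (1, 2, 20_000), (2, 0, 12_000)]
print(f"largest-first order: {all_to_all_makespan(demo, 100):.4f} s")
print(f"arrival order:       {all_to_all_makespan(demo, 100, False):.4f} s")

On this sample traffic the largest-first ordering finishes earlier because the small 0-to-2 transfer no longer occupies receiver 2 ahead of the large 1-to-2 transfer; large transfers go first and small ones fill the idle gaps.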
Aurora is the first approach to minimize MoE inference time via optimal model
deployment and communication scheduling across various scenarios. Evaluations
demonstrate that Aurora significantly accelerates inference, achieving speedups
of up to 2.38x in homogeneous clusters and 3.54x in heterogeneous environments.
Moreover, Aurora enhances GPU utilization by up to 1.5x compared to existing
methods.
BibTeX
@online{Li_2410.17043,
  TITLE = {Optimizing Mixture-of-Experts Inference Time Combining Model Deployment and Communication Scheduling},
  AUTHOR = {Li, Jialong and Tripathi, Shreyansh and Rastogi, Lakshay and Lei, Yiming and Pan, Rui and Xia, Yiting},
  LANGUAGE = {eng},
  URL = {https://arxiv.org/abs/2410.17043},
  EPRINT = {2410.17043},
  EPRINTTYPE = {arXiv},
  YEAR = {2024},
}
Endnote
%0 Report
%A Li, Jialong
%A Tripathi, Shreyansh
%A Rastogi, Lakshay
%A Lei, Yiming
%A Pan, Rui
%A Xia, Yiting
%+ Networks and Cloud Systems, MPI for Informatics, Max Planck Society; External Organizations; External Organizations; Networks and Cloud Systems, MPI for Informatics, Max Planck Society; External Organizations; Networks and Cloud Systems, MPI for Informatics, Max Planck Society
%T Optimizing Mixture-of-Experts Inference Time Combining Model Deployment and Communication Scheduling
%G eng
%U http://hdl.handle.net/21.11116/0000-0010-43DB-C
%U https://arxiv.org/abs/2410.17043
%D 2024
%8 22.10.2024
%K Computer Science, Learning, cs.LG; Computer Science, Networking and Internet Architecture, cs.NI