
Ratatouille

Optimizing Network Performance for Large-Scale Distributed Machine Learning.

The Parameter Server (PS) framework is widely used to train machine learning (ML) models in parallel. It tackles the big-data problem by having worker nodes perform data-parallel computation while server nodes maintain the globally shared parameters. When training large models, worker nodes frequently pull parameters from server nodes and push updates back to them, resulting in high communication overhead. Our investigations showed that modern distributed ML applications can spend up to 5.7 times more time on communication than on computation.
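To make the source of this overhead concrete, below is a minimal sketch of a conventional PS worker iteration. The helper names (`ps_pull`, `ps_push`, `compute_gradients`) are illustrative placeholders, not APIs from this repository; the point is that every iteration ships the full parameter set over the network twice.

```python
import numpy as np

# Hypothetical stand-ins for a PS client API (names are illustrative only).
def ps_pull(keys):
    """Fetch the current values of the requested parameter blocks from the servers."""
    return {k: np.zeros(1024, dtype=np.float32) for k in keys}

def ps_push(grads):
    """Send dense gradient updates for the given parameter blocks to the servers."""
    pass

def train_step(batch, keys, compute_gradients):
    # Each iteration transfers the full parameter set twice over the network:
    # once on the pull and once on the push. For large models this dominates
    # the time spent on the actual data-parallel computation.
    params = ps_pull(keys)                    # pull fresh parameters from servers
    grads = compute_gradients(batch, params)  # local data-parallel computation
    ps_push(grads)                            # push dense updates back to servers
```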

To address this problem, we propose a novel communication layer for the PS framework based on self-imposed sparsity. First, we introduce an update-centric communication model that exchanges data between worker and server nodes via two operations, broadcast and push, so that only updates are transmitted over the network. Second, we develop a dynamic value-bounded filter that reduces network traffic by selectively dropping updates before transmission. With this filter, server and worker nodes only need to transmit a small portion of the updates during each broadcast or push operation. Theoretical analysis shows that our approach can significantly reduce network traffic and communication time while preserving convergence guarantees. Extensive performance evaluations show that our approach, PF, can speed up popular distributed ML applications by a factor of up to 4.3 compared with the conventional PS framework.
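The sketch below illustrates the general idea behind a value-bounded filter applied to a push operation: new updates are folded into a local residual, only entries whose accumulated magnitude exceeds a bound are sent as a sparse (index, value) payload, and the small remainder is kept locally for later rounds. The function name, the fixed `bound`, and the residual bookkeeping are assumptions for illustration, not the exact filter implemented in this repository (which adapts the bound dynamically).

```python
import numpy as np

def filtered_push(update, residual, bound):
    """Illustrative value-bounded filter (not this repo's exact implementation).

    Folds the new update into a local residual, transmits only the entries whose
    accumulated magnitude exceeds `bound`, and keeps the remaining small entries
    locally so no information is permanently discarded.
    """
    residual += update                   # accumulate the new update locally
    mask = np.abs(residual) > bound      # entries large enough to be worth sending
    indices = np.nonzero(mask)[0]
    values = residual[mask].copy()       # sparse payload actually sent over the network
    residual[mask] = 0.0                 # sent entries are cleared; the rest stay local
    return indices, values, residual

# Example: only a small fraction of a 1M-entry update crosses the bound,
# so the transmitted payload is a short (index, value) list instead of a dense vector.
rng = np.random.default_rng(0)
residual = np.zeros(1_000_000, dtype=np.float32)
update = rng.normal(scale=0.01, size=1_000_000).astype(np.float32)
idx, vals, residual = filtered_push(update, residual, bound=0.03)
print(f"transmitted {idx.size} of {update.size} entries")
```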
