Network Measurement in Distributed Deep Learning

Introduction

Distributed deep learning has become one of the most bandwidth-demanding applications in data centers. This Hackathon challenge aims to gain a deeper understanding of how distributed deep learning jobs use the network and to identify the bottlenecks. We focus on training rather than inference.

Popular deep learning frameworks such as TensorFlow, PyTorch, and MXNet support multiple distribution modes (data parallelism or model parallelism), multiple communication patterns (all-reduce or parameter servers), and a variety of models (ResNet, VGG, Transformer, BERT, etc.). Hackathon participants are asked to measure and analyze the network usage of at least one combination of the above options.

The final experiments will run on GPU machines with a traditional TCP/IP network (no RDMA). Early development, however, only requires CPU machines, which can run the popular DNN models with very small batch sizes and synthetic data. The training jobs will be provided and are based on public examples.
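As a rough illustration (not the provided training job), a CPU-only development run with synthetic data and a tiny batch size could look like the following PyTorch sketch; the model, shapes, and hyperparameters are placeholders.

```python
# A minimal sketch of an early, CPU-only development run: a standard
# torchvision model trained on synthetic data with a very small batch size.
# Assumes PyTorch/torchvision are installed; model and shapes are illustrative.
import torch
import torch.nn as nn
import torchvision.models as models

model = models.resnet18(num_classes=10)          # small, well-known model
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

batch_size = 2                                   # tiny batch: we only care about the traffic pattern, not accuracy
for step in range(5):
    images = torch.randn(batch_size, 3, 224, 224)     # synthetic inputs
    labels = torch.randint(0, 10, (batch_size,))      # synthetic labels
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    print(f"step {step}: loss={loss.item():.4f}")
```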

Hackathon Projects

You can choose to work on one of the following problems. In addition to the requirements below, your solution will also be evaluated on how generic it is: for example, a system that works across different frameworks, communication patterns, and models is preferred over a solution tailored to a specific job.

Problem 1 (Easy): Fine-grained network trace collection and visualization

In a distributed deep learning job, an iteration usually lasts hundreds of milliseconds. Within each iteration, GPU workers may send and receive up to thousands of tensors. Can you collect fine-grained network traces, such as when a tensor is ready to be sent but still queued in the local network stack, when it is actually put on the wire, and when the send finishes? Please visualize the traces in a form that helps other researchers understand the network characteristics of distributed training.
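As one possible starting point for the visualization, the per-tensor timestamps could be dumped in the Chrome trace-event JSON format, which opens directly in chrome://tracing or Perfetto. The sketch below assumes the events have already been collected into simple records; the record layout and the collection mechanism depend on the framework you instrument.

```python
# A minimal sketch of turning collected per-tensor timestamps into Chrome
# trace-event JSON (viewable in chrome://tracing or ui.perfetto.dev).
# The input record layout is an assumption.
import json

def to_chrome_trace(records, out_path="tensor_trace.json"):
    """records: list of dicts with keys
       name, worker, t_queued_us, t_sent_us, t_done_us (microseconds)."""
    events = []
    for r in records:
        # One "complete" event for time spent queued in the local network stack...
        events.append({"name": r["name"] + " (queued)", "ph": "X", "cat": "queue",
                       "pid": r["worker"], "tid": 0,
                       "ts": r["t_queued_us"], "dur": r["t_sent_us"] - r["t_queued_us"]})
        # ...and one for time actually spent on the wire.
        events.append({"name": r["name"] + " (wire)", "ph": "X", "cat": "network",
                       "pid": r["worker"], "tid": 0,
                       "ts": r["t_sent_us"], "dur": r["t_done_us"] - r["t_sent_us"]})
    with open(out_path, "w") as f:
        json.dump({"traceEvents": events}, f)

# Example with two hand-written records:
to_chrome_trace([
    {"name": "grad/conv1", "worker": 0, "t_queued_us": 0,  "t_sent_us": 150, "t_done_us": 900},
    {"name": "grad/fc",    "worker": 1, "t_queued_us": 80, "t_sent_us": 300, "t_done_us": 1200},
])
```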

A bonus point would be to develop a framework that can replay the traces you collect, without real GPUs or real jobs. This would be super helpful for future research and the community. If you are considering this, please also check out Problem 3.
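As a very rough sketch of what a replayer could look like, one could re-issue dummy sends of the recorded tensor sizes at the recorded time offsets over a plain TCP connection. The schedule format below is an assumption, and a faithful replayer would also honor inter-tensor dependencies (see Problem 3).

```python
# A rough sketch of replaying a recorded send schedule without GPUs or a real
# job: push dummy byte buffers of the recorded sizes at the recorded offsets.
# The (offset, size) schedule format is an assumption.
import socket
import time

def replay(schedule, host, port):
    """schedule: list of (offset_seconds, num_bytes) sorted by offset."""
    with socket.create_connection((host, port)) as sock:
        start = time.monotonic()
        for offset, size in schedule:
            delay = offset - (time.monotonic() - start)
            if delay > 0:
                time.sleep(delay)           # wait until the recorded send time
            sock.sendall(b"\x00" * size)    # dummy payload of the recorded size

# e.g. replay([(0.000, 4 * 1024 * 1024), (0.120, 512 * 1024)], "10.0.0.2", 9000)
```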

Problem 2 (Advanced): Identifying stragglers / network congestion

In a large-scale distributed job running on a data center network, stragglers and (sometimes transient) network congestion are very common. Can your measurement system identify which worker is the straggler, or between which two hosts a bandwidth problem is slowing down the whole job? Can you analyze the end-to-end impact on training speed, compared with a run free of stragglers and congestion?
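As a simple illustration of the analysis side, per-worker iteration finish times could be compared against the per-iteration median to flag consistently slow workers. The record layout and the threshold below are assumptions, and a full solution would also localize congestion to specific host pairs.

```python
# A minimal sketch of flagging stragglers from per-worker, per-iteration
# completion timestamps. The input layout and the threshold are assumptions.
from statistics import median

def find_stragglers(finish_times, factor=1.5):
    """finish_times: dict worker_id -> list of iteration finish times (seconds).
       Flags workers whose average lag behind the per-iteration median is much
       larger than the typical lag across all workers."""
    num_iters = min(len(v) for v in finish_times.values())
    lags = {w: [] for w in finish_times}
    for i in range(num_iters):
        med = median(finish_times[w][i] for w in finish_times)
        for w in finish_times:
            lags[w].append(finish_times[w][i] - med)
    avg_lag = {w: sum(l) / len(l) for w, l in lags.items()}
    typical = median(abs(v) for v in avg_lag.values())
    return [w for w, v in avg_lag.items() if v > factor * typical]

# Example: worker "w2" consistently finishes late.
print(find_stragglers({
    "w0": [0.50, 1.01, 1.52],
    "w1": [0.51, 1.02, 1.53],
    "w2": [0.65, 1.20, 1.74],
}))   # -> ['w2']
```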

A bonus point would be to develop mitigation mechanisms.

Problem 3 (Hard): Computation-communication co-measurement and analysis

The communication in distributed deep learning is part of the whole computation graph (a DAG): computation ops generate the tensor data to be sent, and the received tensor data is needed for further computation. Sometimes a communication op and a computation op can be (partially) overlapped. Can you measure both the computation and communication ops and find out the dependencies between them? Such dependencies are also important for replaying network traces (see Problem 1) with high fidelity.

Based on this co-analysis, you may be able to provide hints for computation graph optimization. For example, changing the execution order of the computation DAG may lead to a different training speed. Can you identify the critical path in the DAG? Can your system automatically determine whether a job is computation-bound (i.e., the critical path is dominated by GPU computation) or communication-bound? If one wants to apply tensor compression to save network bandwidth, can you find the most important tensors to compress?
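As a sketch of the critical-path part, once op durations and dependencies have been measured, a longest-path computation over the DAG (plus a communication-vs-computation breakdown along that path) answers the computation-bound/communication-bound question. The example graph below is purely illustrative.

```python
# A minimal sketch of critical-path analysis on a measured op DAG, using only
# the standard library. Durations, kinds, and edges are illustrative; real
# values would come from the co-measurement described above.
from functools import lru_cache

# op -> (duration_ms, kind); edges mean "a must finish before b starts"
ops = {"fw_conv": (4.0, "compute"), "bw_conv": (6.0, "compute"),
       "allreduce_conv": (9.0, "comm"), "update_conv": (1.0, "compute")}
deps = [("fw_conv", "bw_conv"), ("bw_conv", "allreduce_conv"),
        ("allreduce_conv", "update_conv")]

succ = {op: [] for op in ops}
for a, b in deps:
    succ[a].append(b)

@lru_cache(maxsize=None)
def longest_from(op):
    """Length of the longest (critical) path starting at `op`, and that path."""
    dur, _ = ops[op]
    if not succ[op]:
        return dur, [op]
    best_len, best_path = max(longest_from(s) for s in succ[op])
    return dur + best_len, [op] + best_path

total, path = max(longest_from(op) for op in ops)
comm_ms = sum(ops[op][0] for op in path if ops[op][1] == "comm")
print(f"critical path: {' -> '.join(path)} ({total:.1f} ms, "
      f"{'communication' if comm_ms > total / 2 else 'computation'}-bound)")
```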

The final bonus point would be to make this system generic -- can your solution work across different frameworks like TensorFlow, PyTorch and MXNet?