NCCL2 Survey - PaddlePaddle/Paddle GitHub Wiki

NCCL2 Survey

  • NCCL 1.0 support one machine multi devices, devices use PCIe、NVlink、GPU Direct P2P to communicate with each other.
  • NCCL 2.0 support multi machines,machines use Sockets(Ethernet) or InfiniBand with GPU Direct RDMA to communicate with each other。

Feature of NCCL2

  • Multi-gpu and multi-node communication collectives such as:
    • all-gather
    • all-reduce
    • broadcast
    • reduce
    • reduce-scatter
  • Automatic topology detection to determine optimal communication path Optimized to achieve high bandwidth over PCIe and NVLink high-speed interconnect
  • Support multi-threaded and multiprocess applications Multiple ring formations for high bus utilization
  • Support for InfiniBand verbs, RoCE and IP Socket internode communication

Advantage

  • Delivers over 90% multi-node scaling efficiency using up to eight GPU-accelerated servers
  • Performs automatic topology detection to determine optimal communication path
  • Optimized to achieve high bandwidth over PCIe and NVLink high-speed interconnect
  • Support for RDMA over Converged Ethernet (RoCE) network protocol for multi-node scaling

Problems

  • NCCL 2.0 is not open sourced, it provides a deb install file, install guide. It contains a header file and two lib:

    /usr/lib/x86_64-linux-gnu/libnccl_static.a
    /usr/lib/x86_64-linux-gnu/libnccl.so -> libnccl.so.2
    /usr/include/nccl.h
  • use CUDA8 or above

  • only support Ubuntu 14.04/Ubuntu 16.04/agnostic Operating system.

  • does not support auto scaling

  • does not support customized communication group

Usage

NCCL follows the MPI collectives API fairly closely. Before any collectives can be called, a communicator object must be initialized on each GPU.

On a single-process machine, all GPUs can be conveniently initialized using ncclCommInitAll. For multi-process applications (e.g., with MPI), ncclCommInitRank must be called for each GPU. Internally ncclCommInitRank invokes a synchronization among all GPUs, so these calls must be invoked in different host threads (or processes) for each GPU. A brief single-process example follows, for an MPI example see test/mpi/mpi_test.cu. For details about the API see nccl.h.

#include <nccl.h>

typedef struct {
  double* sendBuff;
  double* recvBuff;
  int size;
  cudaStream_t stream;
} PerThreadData;

int main(int argc, char* argv[])
{
  int nGPUs;
  cudaGetDeviceCount(&nGPUs);
  ncclComm_t* comms = (ncclComm_t*)malloc(sizeof(ncclComm_t)*nGPUs);
  ncclCommInitAll(comms, nGPUs); // initialize communicator
                                // One communicator per process

  PerThreadData* data;

  ... // Allocate data and issue work to each GPU's
      // perDevStream to populate the sendBuffs.

  for(int i=0; i<nGPUs; ++i) {
    cudaSetDevice(i); // Correct device must be set
                      // prior to each collective call.
    ncclAllReduce(data[i].sendBuff, data[i].recvBuff, size,
        ncclDouble, ncclSum, comms[i], data[i].stream);
  }

  ... // Issue work into data[*].stream to consume buffers, etc.
}

Platform that use NCCL

Pytorch

Pytorch distributes with nccl 2, wrap it as Python API

https://github.com/pytorch/pytorch/blob/master/torch/csrc/cuda/Module.cpp#L420 https://github.com/pytorch/pytorch/blob/master/torch/cuda/nccl.py

It can also compile with NCCL2 without any modifications. for distribute training, pytorch write a module named THD

Tensorflow

Tensorflow use NCCL 1 as contrib module, and also wrap it in Python API https://www.tensorflow.org/api_docs/python/tf/contrib/nccl

Uber horovod

horovod is based on Tensorflow and NCCL 2 to improve the performance of tensorflow distributed training.

NCCL 2 introduced the ability to run ring-allreduce across multiple machines, enabling us to take advantage of its many performance boosting optimizations.

refs

⚠️ **GitHub.com Fallback** ⚠️