NCCL2 Survey - PaddlePaddle/Paddle GitHub Wiki
- NCCL 1.0 support one machine multi devices, devices use PCIe、NVlink、GPU Direct P2P to communicate with each other.
- NCCL 2.0 support multi machines,machines use Sockets(Ethernet) or InfiniBand with GPU Direct RDMA to communicate with each other。
- Multi-gpu and multi-node communication collectives such as:
- all-gather
- all-reduce
- broadcast
- reduce
- reduce-scatter
- Automatic topology detection to determine optimal communication path Optimized to achieve high bandwidth over PCIe and NVLink high-speed interconnect
- Support multi-threaded and multiprocess applications Multiple ring formations for high bus utilization
- Support for InfiniBand verbs, RoCE and IP Socket internode communication
- Delivers over 90% multi-node scaling efficiency using up to eight GPU-accelerated servers
- Performs automatic topology detection to determine optimal communication path
- Optimized to achieve high bandwidth over PCIe and NVLink high-speed interconnect
- Support for RDMA over Converged Ethernet (RoCE) network protocol for multi-node scaling
-
NCCL 2.0 is not open sourced, it provides a deb install file, install guide. It contains a header file and two lib:
/usr/lib/x86_64-linux-gnu/libnccl_static.a /usr/lib/x86_64-linux-gnu/libnccl.so -> libnccl.so.2 /usr/include/nccl.h
-
use CUDA8 or above
-
only support Ubuntu 14.04/Ubuntu 16.04/agnostic Operating system.
-
does not support auto scaling
-
does not support customized communication group
NCCL follows the MPI collectives API fairly closely. Before any collectives can be called, a communicator
object must be initialized on each GPU.
On a single-process machine, all GPUs can be conveniently initialized using ncclCommInitAll
. For multi-process applications (e.g., with MPI), ncclCommInitRank must be called for each GPU. Internally ncclCommInitRank
invokes a synchronization among all GPUs, so these calls must be invoked in different host threads (or processes) for each GPU. A brief single-process example follows, for an MPI example see test/mpi/mpi_test.cu. For details about the API see nccl.h.
#include <nccl.h>
typedef struct {
double* sendBuff;
double* recvBuff;
int size;
cudaStream_t stream;
} PerThreadData;
int main(int argc, char* argv[])
{
int nGPUs;
cudaGetDeviceCount(&nGPUs);
ncclComm_t* comms = (ncclComm_t*)malloc(sizeof(ncclComm_t)*nGPUs);
ncclCommInitAll(comms, nGPUs); // initialize communicator
// One communicator per process
PerThreadData* data;
... // Allocate data and issue work to each GPU's
// perDevStream to populate the sendBuffs.
for(int i=0; i<nGPUs; ++i) {
cudaSetDevice(i); // Correct device must be set
// prior to each collective call.
ncclAllReduce(data[i].sendBuff, data[i].recvBuff, size,
ncclDouble, ncclSum, comms[i], data[i].stream);
}
... // Issue work into data[*].stream to consume buffers, etc.
}
Pytorch distributes with nccl 2, wrap it as Python API
https://github.com/pytorch/pytorch/blob/master/torch/csrc/cuda/Module.cpp#L420 https://github.com/pytorch/pytorch/blob/master/torch/cuda/nccl.py
It can also compile with NCCL2 without any modifications.
for distribute training, pytorch write a module named THD
Tensorflow use NCCL 1 as contrib module, and also wrap it in Python API https://www.tensorflow.org/api_docs/python/tf/contrib/nccl
horovod is based on Tensorflow and NCCL 2 to improve the performance of tensorflow distributed training.
NCCL 2 introduced the ability to run ring-allreduce across multiple machines, enabling us to take advantage of its many performance boosting optimizations.