Environment Variables - aws/aws-ofi-nccl GitHub Wiki

Runtime Configurations

Parameter	Description	Type	Accepted Value
`OFI_NCCL_USE_IPV6_TCP`	Allow using endpoints with IPv6 addressing format for TCP provider. Users can specify to use a preferred libfabric provider with `FI_PROVIDER` environment variable.	Boolean	0/1 (Default: 0)
`OFI_NCCL_EXCLUDE_TCP_IF`	List of interface names to be filtered out for TCP provider. Users can specify to use a preferred libfabric provider with `FI_PROVIDER` environment variable.	String	Comma-separated list of interface names (Default: "lo,docker0")
`OFI_NCCL_GDR_FLUSH_DISABLE`	Disable flush operation when using GPUDirect.	Boolean	0/1 (Default: 0)
`OFI_NCCL_DISABLE_DMABUF`	Disable DMABUF support.	Boolean	0/1 (Default: 0)
`OFI_NCCL_NIC_DUP_CONNS`	Set the number of NIC connections. This is used to increase hardware utilization. Applicable to P3dn when using fewer than 8 GPUs.	Integer	x, to set x connections. Only values greater than 0 override the default (Default: 0)
`OFI_NCCL_CUDA_FLUSH_ENABLE`	When using GPUDirect, use `cudaDeviceFlushGPUDirectRDMAWrites` to enforce data consistency at the receiving GPU. Requires CUDA 11.3 or later. Note that this function only provides a GPU memory fence and requires that data has already been delivered to GPU memory. Some networks and PCIe configurations require an additional network-level flush that is not provided by this option.	Boolean	0/1 (Default: 0)
`OFI_NCCL_CQ_READ_COUNT`	Adjust the maximum number of completion entries that will be read in a single Libfabric polling loop. In general, users should not have to adjust this value. An array of completion queue entry structures is created on the stack, so large (over 16-32) values of this parameter may cause stack overflows.	Integer	Default: 4
`OFI_NCCL_PROTOCOL`	Protocol to use for send/recv operations. Valid options are SENDRECV and RDMA. The default is a placeholder name, because protocol selection is based on rail configuration and system support. The SENDRECV implementation uses the Libfabric tagged send/recv interface and will give the best performance on hardware that implements tagged sends natively, and likely on most Libfabric implementations that include an eager send optimization for GPU buffers. The RDMA option implements a sender-managed receive queue using RDMA write operations and supports multi-rail channels per GPU. The RDMA protocol is likely to work better than SENDRECV on networks that do not have an eager optimization or that have multiple NICs per GPU.	String	Default: default
`OFI_NCCL_MIN_STRIPE_SIZE`	Adjust the maximum size of `RDMA` protocol messages that are assigned to multi-rail channels in round-robin mode. Messages larger than the threshold are multiplexed over all channels to increase network throughput. In general, users should not have to adjust this value. A very small threshold may cause `RDMA` protocol initialization to fail, because RDMA protocol control messages must not be multiplexed.	Integer	Default: 128KiB
`OFI_NCCL_NET_LATENCY`	Internode network latency in us reported to NCCL.	Float	Any non-negative float value. Default: 0.0
`OFI_NCCL_EAGER_MAX_SIZE`	Eager message size limit when using the RDMA protocol. Messages larger than this limit will always be sent using RDMA write instead of eagerly. By default, eager mode is enabled.	Integer	Default: 8192, indicating eager mode is enabled. Setting to any non-negative integer <= ROUND_ROBIN_THRESHOLD will limit eager mode to only messages with size less than OFI_NCCL_EAGER_MAX_SIZE. Setting to -1 will disable eager mode.
`OFI_NCCL_DISABLE_GDR_REQUIRED_CHECK`	Disable the check for required GDR support on EC2 instances. When this check is disabled, the plugin can be used without GDR support even on platforms that support GDR (P4d and later). By default, the plugin performs the check.	Boolean	0/1 (Default: 0)
`OFI_NCCL_MR_KEY_SIZE`	Specify the memory registration key size in bytes when using a libfabric provider that supports application-selected memory registration keys.	Integer	Default: 2
`OFI_NCCL_MR_CACHE_DISABLE`	Disable the MR cache. The MR cache is used to keep track of registered memory regions, so that calling regMr() on the same buffer (address and size), will quickly return a previously globally registered MR on that buffer, avoiding redundant (and expensive) registrations with the underlying device. Disabling the MR cache will make all calls to regMR() result in a registration with the device, so it may cause a significant performance degradation.	Boolean	0/1 (Default: 0)
`OFI_NCCL_DOMAIN_PER_THREAD`	By default, the plugin creates one Libfabric domain per process. On AWS Trainium instances, it creates one domain per thread instead. This variable can override the default behavior.	Integer	-1 (unset default): use the platform-specific configuration. 0: allocate one domain per process. 1: allocate one domain per thread. (Default: -1)
`OFI_NCCL_DISABLE_NATIVE_RDMA_CHECK`	On AWS platforms the plugin checks for native RDMA write support when using the RDMA protocol. This variable can disable this check to allow using the RDMA protocol even on platforms where native RDMA write is not supported (or cannot be verified to be supported).	Boolean	0/1 (Default: 0)
`OFI_NCCL_DISABLE_GDR_REQUIRED_CHECK`	Disable the check for required GDR support on AWS instances. When this check is disabled, the plugin can be used without GDR support even on platforms that support GDR (P4d and later).	Boolean	0/1 (Default: 0)
`OFI_NCCL_ERRORCHECK_MUTEX`	If non-zero, fail if a thread attempts to relock a mutex that it has already locked (used for debugging).	Boolean	Default: 1 if debugging is enabled, 0 otherwise
`OFI_NCCL_ENDPOINT_PER_COMM`	If zero, create a Libfabric endpoint per domain, shared across all communicators. If non-zero, create different endpoints for receive communicators connected to the same source endpoint, while using a shared completion queue.	Boolean	0/1 (Default: 0)
`OFI_NCCL_EARLY_COMPLETION`	If zero, disable receive request early completion. If non-zero, the receiver side marks the request complete immediately after the CTRL message send completes and does not wait for the RDMA write operation to complete. By default, the behavior depends on the provider's data progress model: if it is FI_PROGRESS_AUTO, early completion is enabled; otherwise, it is disabled.	Integer	Default: -1
`OFI_NCCL_FORCE_NUM_RAILS`	Number of rails that the RDMA transport should build. If the number of rails is more than the number of NICs, then the number of rails must be a multiple of the number of NICs. If set to 0, the RDMA transport will allow heterogeneous groupings of network interfaces across devices.	Integer	Any non-negative integer. Defaults to 0.
`OFI_NCCL_CQ_SIZE`	Completion queue length requested from Libfabric. Defaults to 12,288 entries based on a historical guess. Needs to be set to at least `num_communicators * (OFI_NCCL_RDMA_MAX_POSTED_EAGER_BUFFERS + (NCCL_MAX_REQUESTS * 2)) + OFI_NCCL_RDMA_MAX_POSTED_CONTROL_BUFFERS` to avoid CQ overflows or receiver-not-ready errors due to CQ exhaustion.	Integer	Default: 12288
`OFI_NCCL_SCHED_MAX_SMALL_RR_SIZE`	The RDMA transport round robin scheduler has two round robin counts, for small (likely control) and medium (likely data) messages. Small messages are always handled with a single transmission on the current small message round-robin rail. Medium messages are striped across multiple rails if their size is greater than OFI_NCCL_MIN_STRIPE_SIZE, otherwise they are also sent in one transmission on the current medium message round-robin rail. This parameter moves the threshold for the maximum "small" message size. Messages with the exact size as the threshold value are considered medium messages.	Integer	Default: 64
`OFI_NCCL_RDMA_MIN_POSTED_EAGER_BUFFERS`	Minimum eager rx buffers posted per endpoint. The plugin will attempt to post more eager rx buffers if the count dips below this threshold, allocating new eager rx buffers if needed.	Integer	Default: 64
`OFI_NCCL_RDMA_MAX_POSTED_EAGER_BUFFERS`	Maximum eager rx buffers posted per endpoint. The plugin will not attempt to post more eager rx buffers if we reach this threshold, returning available eager rx buffers to the free list if needed.	Integer	Default: 128
`OFI_NCCL_RDMA_MIN_POSTED_CONTROL_BUFFERS`	Minimum control rx buffers posted per endpoint. The plugin will attempt to post more control rx buffers if we dip below this threshold, allocating new control rx buffers if needed.	Integer	Default: 1920
`OFI_NCCL_RDMA_MAX_POSTED_CONTROL_BUFFERS`	Maximum control rx buffers posted per endpoint. The plugin will not attempt to post more control rx buffers if we reach this threshold, returning available control rx buffers to the free list if needed.	Integer	Default: 2048
`OFI_NCCL_RR_CTRL_MSG`	Whether to spread control messages across multiple rails in round-robin fashion or send them consistently on one rail.	Boolean	0/1 (Default: 1)
`OFI_NCCL_CM_NUM_RX_BUFFERS`	Number of RX buffers to post for the connection manager endpoint (for connection establishment). Posting more buffers will use more memory, but may make connection establishment complete more quickly, especially with large numbers of ranks.	Integer	Default: 32
`OFI_NCCL_DISABLE_CLOSE_MESSAGE`	The close message was intended to enable a future optimization to the plugin's eager protocol (not yet implemented) where sender will not wait to receive a control message from receiver before marking a send complete. Instead, sender waits for a close message when the communicator is closed, indicating it is safe to close the communicator resources. During testing of fault-tolerance (NCCL restart after abort), situations were found where the plugin hangs while waiting for a close message, specifically when some ranks enter an abort state (due to having inflight requests) and some don't. Until there is a long-term fix for this, the close message is disabled by default.	Boolean	0/1 (Default: 1)
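
The sizing rule given for `OFI_NCCL_CQ_SIZE` and the small/medium/striped thresholds described for `OFI_NCCL_SCHED_MAX_SMALL_RR_SIZE` and `OFI_NCCL_MIN_STRIPE_SIZE` can be worked through with a short Python sketch. This is only an illustration of the rules as stated in the table, not plugin code: the function names are hypothetical, the buffer defaults are copied from the table, and `nccl_max_requests` is an input you would have to take from your NCCL build.

```python
KIB = 1024


def min_cq_size(num_communicators, nccl_max_requests,
                max_posted_eager=128, max_posted_control=2048):
    """Lower bound for OFI_NCCL_CQ_SIZE, following the formula in the table.

    max_posted_eager   -> OFI_NCCL_RDMA_MAX_POSTED_EAGER_BUFFERS (table default)
    max_posted_control -> OFI_NCCL_RDMA_MAX_POSTED_CONTROL_BUFFERS (table default)
    nccl_max_requests  -> NCCL_MAX_REQUESTS; assumed to be supplied by the caller
    """
    per_communicator = max_posted_eager + nccl_max_requests * 2
    return num_communicators * per_communicator + max_posted_control


def classify_message(size, max_small_rr_size=64, min_stripe_size=128 * KIB):
    """Classify a message size the way the scheduler description reads.

    A size equal to the small threshold counts as medium; only sizes
    strictly greater than the stripe threshold are striped across rails.
    """
    if size < max_small_rr_size:
        return "small"
    if size > min_stripe_size:
        return "striped"
    return "medium"


if __name__ == "__main__":
    # Illustrative inputs only: 8 communicators, 32 outstanding requests.
    print(min_cq_size(num_communicators=8, nccl_max_requests=32))
    for size in (63, 64, 128 * KIB, 128 * KIB + 1):
        print(size, classify_message(size))
```

With these illustrative inputs the lower bound is 8 * (128 + 64) + 2048 = 3584, which is below the 12288 default; a larger communicator count can exceed the default.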

Note: Similar to NCCL or Libfabric, the plugin dynamically loads CUDA dependencies at runtime, specifically libcuda.so. Like NCCL and Libfabric, the plugin does not find CUDA libraries with the CUDA_HOME environment variable. dlopen() will use the LD_LIBRARY_PATH environment variable and then your system's default search path to find libcuda.so. We do this to match NCCL and Libfabric behaviors so that all three components find the same CUDA installation.
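
To check ahead of time whether the dynamic loader can resolve `libcuda.so` in a given environment, a small probe along the following lines may help. It is a sketch, assuming a Linux system where Python's `ctypes.CDLL` goes through `dlopen()`; the helper name `can_dlopen` is hypothetical, and the probe only reports whether the loader finds the library, not which copy the plugin will end up using.

```python
import ctypes


def can_dlopen(library_name):
    """Return True if the dynamic loader can open library_name.

    ctypes.CDLL raises OSError when the library cannot be found on the
    loader's search path (LD_LIBRARY_PATH, then the system defaults).
    """
    try:
        ctypes.CDLL(library_name)
    except OSError:
        return False
    return True


if __name__ == "__main__":
    print("libcuda.so loadable:", can_dlopen("libcuda.so"))
```

The loader typically reads `LD_LIBRARY_PATH` when the process starts, so set it in the launching shell or job script rather than from inside the running Python process.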

Deprecated OFI NCCL plugin Runtime Configurations

Parameter	Description	Deprecated Since	Type	Accepted Value
`OFI_NCCL_RDMA_MIN_POSTED_BOUNCE_BUFFERS`	Deprecated value for the minimum number of rx buffers posted to be used by both eager and control messages. RDMA transport plugin initialization will fail if set.	1.15.0	Integer	Default: -1 (unset default)
`OFI_NCCL_RDMA_MAX_POSTED_BOUNCE_BUFFERS`	Deprecated value for the maximum number of rx buffers posted to be used by both eager and control messages. RDMA transport plugin initialization will fail if set.	1.15.0	Integer	Default: -1 (unset default)
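
Because RDMA transport initialization fails when either deprecated variable is set, a launcher script can check for them before starting a job. The following is a minimal sketch; the helper name `find_deprecated` is hypothetical and not part of the plugin.

```python
import os

# Variable names taken from the deprecated table above.
DEPRECATED_VARS = (
    "OFI_NCCL_RDMA_MIN_POSTED_BOUNCE_BUFFERS",
    "OFI_NCCL_RDMA_MAX_POSTED_BOUNCE_BUFFERS",
)


def find_deprecated(env=None):
    """Return the deprecated plugin variables present in env.

    env defaults to the current process environment.
    """
    env = os.environ if env is None else env
    return [name for name in DEPRECATED_VARS if name in env]


if __name__ == "__main__":
    leftovers = find_deprecated()
    if leftovers:
        raise SystemExit("Unset before launching: " + ", ".join(leftovers))
```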

Libfabric Runtime Configurations for EFA

Parameter	Description
`FI_EFA_USE_HUGE_PAGE=0`	Set to 0 when Python's `os.fork()` causes `OSError: Cannot allocate memory` errors. These are typically caused by multi-process PyTorch data loaders. Disabling huge pages causes a minor performance hit, but it is needed to prevent fork failures caused by the operating system running out of huge pages.
`FI_EFA_FORK_SAFE=1`	Requests Libfabric to enable fork-safe support in legacy versions of the rdma-core library. Libfabric checks whether additional handling is required for fork safety, and does not introduce the additional overhead of setting `MADV_DONTFORK` for new versions of rdma-core (38.0 and later) and the Linux kernel that support copy-on-fork for pinned memory (5.13 and later). These new versions are always fork-safe, and additional support in userspace is not required. When legacy versions of the kernel and rdma-core are used, setting `FI_EFA_FORK_SAFE` to 1 disables the use of huge pages in Libfabric. To prevent data corruption, the EFA provider registers an atfork handler which will abort the process whenever it believes rdma-core is not fork-safe. NCCL applications heavily reuse their communication buffers and thus are not sensitive to increased memory registration costs. To prevent NCCL-based applications from being aborted when using fork(), plugin versions newer than v1.6 explicitly enable `FI_EFA_FORK_SAFE`, even in legacy environments where the overhead is high. Users running an older version may want to set it to 1 in their environment.
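
When a job is started from a Python launcher, both settings can be placed in the child's environment before any process that forks is created. The sketch below assumes that approach; the helper name `with_efa_fork_settings` is hypothetical, and values the user has already set are left untouched.

```python
import os


def with_efa_fork_settings(base_env=None):
    """Return a copy of base_env with the EFA fork-related settings applied.

    base_env defaults to the current process environment. Existing values
    are preserved so that an explicit user choice is not overridden.
    """
    env = dict(os.environ if base_env is None else base_env)
    env.setdefault("FI_EFA_FORK_SAFE", "1")
    env.setdefault("FI_EFA_USE_HUGE_PAGE", "0")
    return env


if __name__ == "__main__":
    env = with_efa_fork_settings()
    print(env["FI_EFA_FORK_SAFE"], env["FI_EFA_USE_HUGE_PAGE"])
```

The returned dictionary can be passed as the `env` argument of `subprocess.run` when launching the training command.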
