OFI_NCCL_USE_IPV6_TCP |
Allow using endpoints with IPv6 addressing format for the TCP provider. Users can select a preferred Libfabric provider with the `FI_PROVIDER` environment variable. |
Boolean |
0/1 (Default: 0) |
OFI_NCCL_EXCLUDE_TCP_IF |
List of interface names to be filtered out for the TCP provider. Users can select a preferred Libfabric provider with the `FI_PROVIDER` environment variable. |
String |
Comma-separated list of interface names (Default: "lo,docker0") |
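As an illustration, the two TCP-related variables above can be combined with `FI_PROVIDER` like this (a sketch; the interface names are the documented defaults, adjust them to your hosts):

```shell
# Select the Libfabric TCP provider and filter out interfaces
# that should not carry NCCL traffic (documented default list).
export FI_PROVIDER=tcp
export OFI_NCCL_EXCLUDE_TCP_IF="lo,docker0"
# Optionally allow IPv6 endpoints for the TCP provider.
export OFI_NCCL_USE_IPV6_TCP=1
```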
OFI_NCCL_GDR_FLUSH_DISABLE |
Disable flush operation when using GPUDirect. |
Boolean |
0/1 (Default: 0) |
OFI_NCCL_DISABLE_DMABUF |
Disable DMABUF support. |
Boolean |
0/1 (Default: 0) |
OFI_NCCL_NIC_DUP_CONNS |
Set the number of NIC connections. This is used to increase hardware utilization. Applicable to P3dn instances when using fewer than 8 GPUs. |
Integer |
x, to create x connections. Only values greater than 0 take effect (Default: 0) |
OFI_NCCL_CUDA_FLUSH_ENABLE |
When using GPUDirect, use cudaDeviceFlushGPUDirectRDMAWrites to
enforce data consistency at the receiving GPU. Requires CUDA 11.3 or
later. Note that this function only provides a GPU memory fence and
requires that data has already been delivered to GPU memory. Some
networks and PCIe configurations require an additional network-level
flush that is not provided by this option. |
Boolean |
0/1 (Default: 0) |
OFI_NCCL_CQ_READ_COUNT |
Adjust the maximum number of completion entries that will
be read in a single Libfabric polling loop. In general, users
should not have to adjust this value. An array of completion
queue entry structures is created on the stack, so large (over
16-32) values of this parameter may cause stack overflows. |
Integer |
Default: 4 |
OFI_NCCL_PROTOCOL |
Protocol to use for implementing send/recv operations.
Default is `SENDRECV`, which uses the Libfabric tagged send/recv
interface. This implementation will give the best performance
on hardware that implements tagged sends natively, and likely
most Libfabric implementations that include an eager send
optimization for GPU buffers. The other valid option is `RDMA`,
which implements a sender-managed receive queue using RDMA write
operations and supports multi-rail channels per GPU. The `RDMA`
protocol is likely to work better than `SENDRECV` on networks that
do not have an eager optimization or that have multiple NICs per
GPU. |
String |
Default: SENDRECV |
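For example, on a network without an eager optimization or with multiple NICs per GPU, the `RDMA` protocol described above can be selected explicitly (illustrative; `SENDRECV` remains the default):

```shell
# Use the sender-managed RDMA write protocol instead of the
# default tagged SENDRECV path (useful on multi-NIC-per-GPU
# platforms without an eager send optimization).
export OFI_NCCL_PROTOCOL=RDMA
```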
OFI_NCCL_MIN_STRIPE_SIZE |
Adjust the maximum size of `RDMA` protocol messages that are
assigned to multi-rail channels in round-robin mode. Messages larger
than the threshold are multiplexed over all channels to increase
network throughput. In general, users should not have to adjust this
value. A very small threshold may cause `RDMA` protocol
initialization to fail, since RDMA protocol control messages
must not be multiplexed. |
Integer |
Default: 128KiB |
OFI_NCCL_NET_LATENCY |
Internode network latency in microseconds reported to NCCL. |
Integer |
Any non-negative integer. Defaults to 0, unless the configured
platform sets a specific value.
|
OFI_NCCL_EAGER_MAX_SIZE |
Eager message size limit when using the RDMA protocol. Messages larger than
this limit are always sent using RDMA write instead of eagerly. By default, eager mode is enabled. |
Integer |
Default: 8192, indicating eager mode is enabled. Setting this to any non-negative integer <= ROUND_ROBIN_THRESHOLD limits eager mode to messages smaller than OFI_NCCL_EAGER_MAX_SIZE. Setting it to -1 disables eager mode.
|
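To make the limit concrete, eager sends could be restricted to small messages like this (the 4096-byte value is purely illustrative, not a recommendation):

```shell
# Restrict eager sends to messages smaller than 4 KiB
# (illustrative value); setting -1 instead would disable
# eager mode entirely.
export OFI_NCCL_EAGER_MAX_SIZE=4096
```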
OFI_NCCL_DISABLE_GDR_REQUIRED_CHECK |
Disable the check for required GDR support on EC2 instances. When this check
is disabled, the plugin can be used without GDR support even on platforms
that support GDR (P4d and later). By default, the plugin performs the check. |
Boolean |
0/1 (Default: 0) |
OFI_NCCL_MR_KEY_SIZE |
Specify the memory registration key size in bytes when using a libfabric provider
that supports application-selected memory registration keys. |
Integer |
Default: 2 |
OFI_NCCL_MR_CACHE_DISABLE |
Disable the MR cache. The MR cache keeps track of registered memory regions so
that calling regMr() on the same buffer (address and size) quickly returns a
previously registered MR for that buffer, avoiding redundant (and expensive)
registrations with the underlying device. Disabling the MR cache makes every
regMr() call result in a registration with the device, which may cause a
significant performance degradation. |
Boolean |
0/1 (Default: 0) |
OFI_NCCL_DOMAIN_PER_THREAD |
By default, the plugin creates one Libfabric domain per process. On AWS Trainium
instances, it creates one domain per thread instead. This variable can override the
default behavior. |
Integer |
Default: -1 (unset): use the platform-specific configuration.
0: Allocate one domain per process. 1: Allocate one domain per thread.
|
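The platform default can be overridden explicitly, for example to force one domain per thread (illustrative; leaving the variable unset keeps the platform-specific choice):

```shell
# Force one Libfabric domain per thread regardless of the
# platform default; 0 would force one domain per process.
export OFI_NCCL_DOMAIN_PER_THREAD=1
```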
OFI_NCCL_DISABLE_NATIVE_RDMA_CHECK |
On AWS platforms the plugin checks for native RDMA write support when using the
RDMA protocol. This variable can disable this check to allow using the RDMA protocol
even on platforms where native RDMA write is not supported (or cannot be verified
to be supported). |
Boolean |
0/1 (Default: 0) |
OFI_NCCL_RDMA_MIN_POSTED_BOUNCE_BUFFERS |
Minimum bounce buffers posted per endpoint. The plugin attempts to post more
bounce buffers if the posted count dips below this threshold, allocating new
bounce buffers if needed. |
Integer |
Default: 64 |
OFI_NCCL_RDMA_MAX_POSTED_BOUNCE_BUFFERS |
Maximum bounce buffers posted per endpoint. The plugin will not attempt to post
more bounce buffers if it reaches this threshold. |
Integer |
Default: 128 |
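The two bounce-buffer thresholds above can be tuned together, for example (the 32/256 values are purely illustrative; the documented defaults are 64 and 128):

```shell
# Illustrative tuning: keep between 32 and 256 bounce buffers
# posted per endpoint (defaults: min 64, max 128).
export OFI_NCCL_RDMA_MIN_POSTED_BOUNCE_BUFFERS=32
export OFI_NCCL_RDMA_MAX_POSTED_BOUNCE_BUFFERS=256
```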
OFI_NCCL_ERRORCHECK_MUTEX |
If non-zero, fail if a thread attempts to relock a mutex that it has already
locked (used for debug). |
Boolean |
Default: 1 if debugging is enabled, 0 otherwise |
OFI_NCCL_ENDPOINT_PER_COMM |
If zero, create a Libfabric endpoint per domain, shared across all communicators.
If non-zero, create different endpoints for receive communicators connected to the
same source endpoint, while using a shared completion queue. |
Boolean |
0/1 (Default: 0) |
OFI_NCCL_EARLY_COMPLETION |
If zero, disable receive request early completion.
If non-zero, the receiver marks a request complete immediately after the CTRL message send completes, without waiting for the RDMA write operation to complete. By default, the behavior depends on the provider's data progress model: with FI_PROGRESS_AUTO, early completion is enabled; otherwise it is disabled. |
Integer |
Default: -1 |