Kamil Iskra f44ac759fe NCCL 2.26.2-1
Profiler improvements
 * Add events for CUDA kernel start and end.
 * Allow network plugins to generate profiling events.
 * Enable profiling on a per-operation basis, rather than per-communicator.
 * Add support for graph capturing.

Add implicit launch order
 * Helps prevent deadlocks when multiple NCCL communicators are used per
   device, by implicitly ordering NCCL operations according to the host
   program order. Disabled by default; set NCCL_LAUNCH_ORDER_IMPLICIT=1 to
   enable.
 * Add a complementary mechanism to detect host threads racing to launch
   work to the same device. Enabled by default; set NCCL_LAUNCH_RACE_FATAL=0
   to disable. A usage sketch follows.
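
 A minimal C sketch of opting in from the application, assuming a POSIX
 environment and that the variables are set before the process makes its
 first NCCL call (only the two variable names come from this release; the
 helper below is purely illustrative):

     #include <stdlib.h>

     /* Enable implicit host-order launch sequencing (off by default). */
     static void enableImplicitLaunchOrder(void) {
       setenv("NCCL_LAUNCH_ORDER_IMPLICIT", "1", 1);
       /* The launch-race detector is on by default; uncomment to disable it: */
       /* setenv("NCCL_LAUNCH_RACE_FATAL", "0", 1); */
     }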

Optimize the PAT algorithm
 * Separate the computation and execution of PAT steps onto different warps,
   allowing up to 16 PAT steps to run in parallel, which significantly
   accelerates PAT and reduces its linear part.

Add support for setting QoS per communicator
 * Add a new trafficClass field to the communicator configuration, allowing
   the application to select a particular traffic class for a given
   communicator (see the sketch below). The meaning of the traffic class is
   network-specific and should be set in accordance with the network
   configuration.
 * For the IB/RoCE plugin, existing config variables such as NCCL_IB_SL
   and NCCL_IB_TC take precedence.
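
 A minimal C sketch of requesting a traffic class at communicator creation
 via the standard ncclCommInitRankConfig API; the trafficClass field name
 comes from this release, while the value 106 is only a placeholder whose
 meaning depends on the network configuration:

     #include <nccl.h>

     /* Create a communicator that uses a network-specific traffic class. */
     ncclResult_t initCommWithTrafficClass(ncclComm_t* comm, int nRanks,
                                           ncclUniqueId id, int myRank) {
       ncclConfig_t config = NCCL_CONFIG_INITIALIZER;
       config.trafficClass = 106;  /* placeholder; meaning is network-specific */
       return ncclCommInitRankConfig(comm, nRanks, id, myRank, &config);
     }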

Allow GPU Direct RDMA to be enabled specifically on C2C platforms
 * Disabled by default, set NCCL_NET_GDR_C2C=1 to enable.

Do not disable user buffer registration unless PXN is actually used
 * Only disable user buffer registration when a communicator has more than
   one rank per node on any node.

RAS subsystem improvements
 * Report operation counts separately for each collective operation type.
 * Provide details about missing communicator ranks and reliably
   distinguish ranks that are no longer a given communicator's members
   (now reported as NOCOMM) from those that failed to respond.

Add support for timestamps to NCCL diagnostic messages
 * On by default for WARN messages; NCCL_DEBUG_TIMESTAMP_LEVELS can be
   used to enable them for other debug levels as well.
 * The format can be changed using the NCCL_DEBUG_TIMESTAMP_FORMAT config
   variable.

Reduce the memory usage with NVLink SHARP (NVLS)
 * Potentially save hundreds of MBs of device memory by considering the
   multicast buffer size granularity separately from the address alignment.

Update performance tuning for recent Intel CPUs
 * Improve algorithm/protocol selection on recent CPUs such as Emerald
   Rapids and Sapphire Rapids.

Improve channel scheduling when mixing LL and Simple operations
 * Make LL operations account for 4x more traffic to ensure LL and Simple
   operations complete at the same time.

Refactor the plugin code
 * Clean up and harmonize the support code across the network, tuner,
   and profiler plugins.

Add support for comment lines (starting with #) in the nccl.conf file
 * Issue #1540. See the example below.
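
 For illustration, a nccl.conf could now look like the snippet below; the
 settings shown are placeholders, and only the '#' comment syntax is new in
 this release:

     # Lines starting with '#' are ignored (issue #1540).
     # NCCL_IB_TC=106
     NCCL_DEBUG=WARN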

Make user buffer registration problems print an INFO instead of a WARN.

Drop support for network plugin interface version 5.

Fix a race condition with split-shared communicators
 * NCCL could hang during connection setup if multiple communicators that
   share resources were grouped together.

Fix a performance regression when using NCCL_CROSS_NIC=1
 * NCCL would unnecessarily alternate rings, breaking the GPU-NIC
   associations.

Make GID index detection code more resilient
 * Dynamic GID detection code was giving up too soon if the
   detected index was not available (e.g., wasn't mapped to the
   container's sysfs).
 * Issues #1538, #1573.

Fix a race condition with non-blocking operations
 * Fix an issue when creating a non-blocking communicator after a
   non-blocking collective operation on another communicator.

Fix shared memory usage on recent Blackwell GPUs.
 * Issues NVIDIA/nccl-tests#287, NVIDIA/nccl-tests#291, #1637.

Fix an error with NIC fusion and IB SHARP when recreating communicators
 * Fixed by disabling the unloading of network plugins.

Make auto-merge failures in NIC fusion non-fatal
 * This could happen when trying to merge IB and RoCE devices.

Fixes to ncclCommAbort
 * Fix hangs due to the progress thread spinning indefinitely on network
   progress.
 * Reduce the abort time by up to two orders of magnitude.

Fix a crash when libnccl.so was dynamically unloaded
 * The RAS subsystem was missing a clean-up handler.

Fix a hang if the network plugin's test() call returns an error.

Fix a hang on heterogeneous architectures
 * Harmonize the tuning across ranks to prevent divergent tuning choices
   from causing a hang.

Fix double-free on failed ncclCommInitRank and ncclCommFinalize.

Fix a potential list traversal bug during a group launch of multiple
communicators
 * Issue #1599.

Unify the handling of NCCL configuration variables
 * Under rare circumstances, some variables specified in the config file
   could be ignored.

/*************************************************************************
 * Copyright (c) 2017-2022, NVIDIA CORPORATION. All rights reserved.
 *
 * See LICENSE.txt for license information
 ************************************************************************/

#ifndef NET_V9_H_
#define NET_V9_H_

#define NCCL_NET_MAX_DEVS_PER_NIC_V9 4

typedef struct {
  int ndevs;
  int devs[NCCL_NET_MAX_DEVS_PER_NIC_V9];
} ncclNetVDeviceProps_v9_t;

typedef struct {
  char* name;                       // Used mostly for logging.
  char* pciPath;                    // Path to the PCI device in /sys.
  uint64_t guid;                    // Unique identifier for the NIC chip. Important for
                                    // cards with multiple PCI functions (Physical or virtual).
  int ptrSupport;                   // [NCCL_PTR_HOST|NCCL_PTR_CUDA|NCCL_PTR_DMABUF]
  int regIsGlobal;                  // regMr is not tied to a particular comm
  int forceFlush;                   // Force a flush on receives
  int speed;                        // Port speed in Mbps.
  int port;                         // Port number.
  float latency;                    // Network latency
  int maxComms;                     // Maximum number of comms we can create
  int maxRecvs;                     // Maximum number of grouped receives.
  ncclNetDeviceType netDeviceType;  // Network offload type
  int netDeviceVersion;             // Version number for network offload
  ncclNetVDeviceProps_v9_t vProps;
  size_t maxP2pBytes;               // Max transfer size for point-to-point operations
  size_t maxCollBytes;              // Max transfer size for collective operations
} ncclNetProperties_v9_t;

typedef struct {
  // Name of the network (mainly for logs)
  const char* name;
  // Initialize the network.
  ncclResult_t (*init)(ncclDebugLogger_t logFunction);
  // Return the number of adapters.
  ncclResult_t (*devices)(int* ndev);
  // Get various device properties.
  ncclResult_t (*getProperties)(int dev, ncclNetProperties_v9_t* props);
  // Create a receiving object and provide a handle to connect to it. The
  // handle can be up to NCCL_NET_HANDLE_MAXSIZE bytes and will be exchanged
  // between ranks to create a connection.
  ncclResult_t (*listen)(int dev, void* handle, void** listenComm);
  // Connect to a handle and return a sending comm object for that peer.
  // This call must not block for the connection to be established, and instead
  // should return successfully with sendComm == NULL with the expectation that
  // it will be called again until sendComm != NULL.
  // If *sendDevComm points to a valid object, then NCCL is requesting device offload for this connection
  ncclResult_t (*connect)(int dev, void* handle, void** sendComm, ncclNetDeviceHandle_v9_t** sendDevComm);
  // Finalize connection establishment after remote peer has called connect.
  // This call must not block for the connection to be established, and instead
  // should return successfully with recvComm == NULL with the expectation that
  // it will be called again until recvComm != NULL.
  // If *recvDevComm points to a valid object, then NCCL is requesting device offload for this connection
  ncclResult_t (*accept)(void* listenComm, void** recvComm, ncclNetDeviceHandle_v9_t** recvDevComm);
  // Register/Deregister memory. Comm can be either a sendComm or a recvComm.
  // Type is either NCCL_PTR_HOST or NCCL_PTR_CUDA.
  ncclResult_t (*regMr)(void* comm, void* data, size_t size, int type, void** mhandle);
  /* DMA-BUF support */
  ncclResult_t (*regMrDmaBuf)(void* comm, void* data, size_t size, int type, uint64_t offset, int fd, void** mhandle);
  ncclResult_t (*deregMr)(void* comm, void* mhandle);
  // Asynchronous send to a peer.
  // May return request == NULL if the call cannot be performed (or would block)
  ncclResult_t (*isend)(void* sendComm, void* data, size_t size, int tag, void* mhandle, void** request);
  // Asynchronous recv from a peer.
  // May return request == NULL if the call cannot be performed (or would block)
  ncclResult_t (*irecv)(void* recvComm, int n, void** data, size_t* sizes, int* tags, void** mhandles, void** request);
  // Perform a flush/fence to make sure all data received with NCCL_PTR_CUDA is
  // visible to the GPU
  ncclResult_t (*iflush)(void* recvComm, int n, void** data, int* sizes, void** mhandles, void** request);
  // Test whether a request is complete. If size is not NULL, it returns the
  // number of bytes sent/received.
  ncclResult_t (*test)(void* request, int* done, int* sizes);
  // Close and free send/recv comm objects
  ncclResult_t (*closeSend)(void* sendComm);
  ncclResult_t (*closeRecv)(void* recvComm);
  ncclResult_t (*closeListen)(void* listenComm);
  // Copy the given mhandle to a dptr in a format usable by this plugin's device code
  ncclResult_t (*getDeviceMr)(void* comm, void* mhandle, void** dptr_mhandle);
  // Notify the plugin that a recv has completed by the device
  ncclResult_t (*irecvConsumed)(void* recvComm, int n, void* request);
  // Virtual NIC APIs. makeVDevice will create a virtual NIC given the specified properties, and tell the caller
  // what index this new vNIC exists at
  ncclResult_t (*makeVDevice)(int* d, ncclNetVDeviceProps_v9_t* props);
} ncclNet_v9_t;

typedef struct {
  void* mhandle;
  void* address;
  size_t size;
} ncclNetSGE_v9_t;

typedef struct {
  // Name of the collective network (mainly for logs)
  const char* name;
  // Initialize the collective network.
  ncclResult_t (*init)(ncclDebugLogger_t logFunction);
  // Return the number of adapters capable of doing collective operations.
  // If ndev returns 0, all other functions might be set to NULL.
  ncclResult_t (*devices)(int* ndev);
  // Get various device properties.
  ncclResult_t (*getProperties)(int dev, ncclNetProperties_v9_t* props);
  // Create a receiving object and provide a handle to connect to it. The
  // handle can be up to NCCL_NET_HANDLE_MAXSIZE bytes and will be exchanged
  // between ranks to create connections.
  ncclResult_t (*listen)(int dev, void* handle, void** listenComm);
  // Create a group for collective operations. handles have been created
  // using listen() above. rank indicates caller's rank in the collective network.
  ncclResult_t (*connect)(void* handles[], int nranks, int rank, void* listenComm, void** collComm);
  // Returns whether a reduction operation on a data type is supported.
  // 1 for supported, 0 otherwise.
  ncclResult_t (*reduceSupport)(ncclDataType_t dataType, ncclRedOp_t redOp, int* supported);
  // Register/Deregister memory. Type is either NCCL_PTR_HOST or NCCL_PTR_CUDA.
  ncclResult_t (*regMr)(void* collComm, void* data, size_t size, int type, void** mhandle);
  /* DMA-BUF support */
  ncclResult_t (*regMrDmaBuf)(void* collComm, void* data, size_t size, int type, uint64_t offset, int fd, void** mhandle);
  ncclResult_t (*deregMr)(void* collComm, void* mhandle);
  // Performs an asynchronous allreduce operation on the collective group.
  // May return request == NULL if the call cannot be performed (or would block).
  ncclResult_t (*iallreduce)(void* collComm, void* sendData, void* recvData, size_t count,
      ncclDataType_t dataType, ncclRedOp_t redOp, void* sendMhandle, void* recvMhandle, void** request);
  ncclResult_t (*iallgather)(void* collComm, void* sendData, int nRecvParts, ncclNetSGE_v9_t* recvParts,
      size_t bytesPerRank, size_t windowOffset, size_t windowBytes,
      void* sendMhandle, void** request);
  ncclResult_t (*ireducescatter)(void* collComm, int nSendParts, ncclNetSGE_v9_t* sendParts, void* recvData,
      size_t bytesPerRank, size_t windowOffset, size_t windowBytes,
      ncclDataType_t dataType, ncclRedOp_t redOp,
      void* recvMhandle, void** request);
  // Perform a flush/fence to make sure all data received with NCCL_PTR_CUDA is
  // visible to the GPU
  ncclResult_t (*iflush)(void* collComm, void* data, int size, void* mhandle, void** request);
  // Test whether a request is complete. If size is not NULL, it returns the
  // number of bytes sent/received.
  ncclResult_t (*test)(void* request, int* done, int* size);
  // Close and free collective comm objects
  ncclResult_t (*closeColl)(void* collComm);
  ncclResult_t (*closeListen)(void* listenComm);
  // Create a virtual NIC given the specified properties, which can be accessed at device index d
  ncclResult_t (*makeVDevice)(int* d, ncclNetVDeviceProps_v9_t* props);
} ncclCollNet_v9_t;
#endif // end include guard
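
For context, a network plugin implements the ncclNet_v9_t interface above by
filling in the function table and exporting it from its shared library. The
sketch below is a minimal, do-nothing skeleton; the exported symbol name
ncclNetPlugin_v9 and the definitions of ncclResult_t, ncclSuccess, and
ncclDebugLogger_t are assumed to come from the usual NCCL plugin headers and
are not part of this file:

    #include "net_v9.h"  /* assumes the surrounding NCCL plugin headers are on the include path */

    static ncclResult_t exampleInit(ncclDebugLogger_t logFunction) {
      (void)logFunction;  /* a real plugin would keep the logger for its own messages */
      return ncclSuccess;
    }

    static ncclResult_t exampleDevices(int* ndev) {
      *ndev = 0;          /* report no devices, so NCCL falls back to other transports */
      return ncclSuccess;
    }

    /* Members omitted from the initializer are NULL; a real plugin must
     * implement the full table (listen, connect, isend, irecv, test, ...). */
    ncclNet_v9_t ncclNetPlugin_v9 = {
      .name    = "example",
      .init    = exampleInit,
      .devices = exampleDevices,
    };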