2.23.4-1
Released 2024-09-17 14:41:17 +08:00
Add scalable init API
- Add new ncclCommInitRankScalable to allow for passing multiple
unique IDs to the init function.
- Spreads the load onto multiple bootstrap roots, allowing for
constant bootstrap time.
- Requires multiple ranks to create a unique ID, and the CPU-side
ID exchange code to call allgather[v] instead of broadcast.
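As a rough illustration of the exchange described above, the following sketch computes in plain C the per-rank counts and displacements that an allgatherv-style ID exchange would need when several ranks each contribute one ID. It does not call NCCL or MPI. The names `root_rank` and `build_allgatherv_layout` are hypothetical, and the even spacing of ID-creating ranks is an assumption made for the example, not something these notes specify.

```c
#include <assert.h>

/* Hypothetical policy (an assumption, not NCCL's rule): the i-th ID is
 * created by rank i * nranks / n_ids, which spreads creators evenly. */
static int root_rank(int i, int nranks, int n_ids) {
  assert(n_ids >= 1 && n_ids <= nranks && i >= 0 && i < n_ids);
  return (int)((long)i * nranks / n_ids);
}

/* Fill the per-rank counts and displacements that an allgatherv call
 * would need: creators contribute one ID, every other rank contributes
 * none. Returns the total number of IDs gathered. */
static int build_allgatherv_layout(int nranks, int n_ids,
                                   int *counts, int *displs) {
  for (int r = 0; r < nranks; r++) counts[r] = 0;
  for (int i = 0; i < n_ids; i++) counts[root_rank(i, nranks, n_ids)] = 1;

  int total = 0;
  for (int r = 0; r < nranks; r++) {
    displs[r] = total;   /* offset of rank r's contribution */
    total += counts[r];
  }
  return total;
}
```

With 8 ranks and 2 IDs under this assumed policy, ranks 0 and 4 each contribute one ID and every rank ends up with both, in rank order. That gathered array is what each rank would then pass to ncclCommInitRankScalable.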
Accelerate init bootstrap operations
- Reduce the number of calls to allgather.
- Allow roots to reply early to ranks when information is already
available.
- Add an option to use ncclNet instead of sockets to perform
bootstrap allgather operations.
Add PAT algorithms for Allgather and ReduceScatter
- Parallel Aggregated Trees, a variation of the Bruck algorithm.
- Logarithmic number of network steps for small sizes at scale.
- Only supports one rank per node at the moment.
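To make the "logarithmic number of network steps" claim concrete, here is a small simulation of the classic Bruck allgather, which PAT is described as a variation of. This is not PAT itself and not NCCL code. The function name and the `MAX_RANKS` limit are hypothetical, and each simulated rank stands in for one node.

```c
#include <assert.h>

#define MAX_RANKS 64

/* Simulate the classic Bruck allgather on n ranks and return the number
 * of communication steps. blocks[r][i] records which rank originally
 * owned the i-th block held by rank r. */
static int bruck_allgather_steps(int n) {
  static int blocks[MAX_RANKS][MAX_RANKS];
  assert(n >= 1 && n <= MAX_RANKS);

  for (int r = 0; r < n; r++) blocks[r][0] = r;  /* own block only */

  int held = 1;   /* blocks held per rank (identical on every rank) */
  int steps = 0;
  while (held < n) {
    /* Transfer at most what the receiver is still missing. */
    int send = held < n - held ? held : n - held;
    for (int r = 0; r < n; r++) {
      int src = (r + held) % n;  /* peer at distance `held` */
      /* Only indices below `held` are read, and only indices at or
       * above `held` are written, so updating in place is safe. */
      for (int j = 0; j < send; j++)
        blocks[r][held + j] = blocks[src][j];
    }
    held += send;
    steps++;
  }

  /* Every rank must now hold all n blocks, rotated by its own rank. */
  for (int r = 0; r < n; r++)
    for (int i = 0; i < n; i++)
      assert(blocks[r][i] == (r + i) % n);
  return steps;
}
```

The amount of data held by each rank doubles per step, so the step count grows as the base-2 logarithm of the rank count, rounded up, rather than linearly as in a ring.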
Add support for registered buffers for intra-node communication.
- Allow registered user buffers to be accessed directly intra-node
- Avoids extra copies in algorithms which permit it, saving
memory bandwidth and helping with compute overlap.
Add profiler plugin API
- New plugin API for profiling
- Supports various levels of profiling, with a hierarchy.
Asynchronous graph allocation
- Make calls to cudaMalloc and cudaMemcpy during graph allocation
asynchronous.
- Significantly speeds up graph capture.
Use fatal IB asynchronous events to stop network operation
- Avoids many other error messages
- Only fatal errors are affected; potentially transient errors
(e.g. port down) do not cause an immediate stop.
Set P2P level to PXB on AMD CPUs when using more than 2 GPUs per node
- P2P would cause a significant performance degradation when using
many GPUs, and therefore many interleaved data flows.
- Disable P2P through the CPU when we have 3+ GPUs per node; keep it
enabled when we only have 2 GPUs.
Improve the init logs to report the real NCCL function.
- Make the log report ncclCommInitRank or ncclCommSplit, rather than
the generic ncclCommInitRankFunc.
Add a parameter to set the location of the user configuration file.
- Add NCCL_CONF_FILE environment variable to set where the user's
configuration file resides.
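One possible way to use this from a program is sketched below: write a configuration file to a custom path and point NCCL_CONF_FILE at it before NCCL is initialized. The helper name `use_nccl_conf_file` is hypothetical, and the assumption that the variable must be set before the first NCCL call in the process is ours. The `NAME=value` line format follows the existing nccl.conf convention.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Write a small configuration file and point NCCL_CONF_FILE at it.
 * Returns 0 on success, -1 on failure. Intended to run before any
 * NCCL call in the process (an assumption of this sketch). */
static int use_nccl_conf_file(const char *path) {
  assert(path != NULL && strlen(path) > 0);

  FILE *f = fopen(path, "w");
  if (f == NULL) return -1;

  /* One NAME=value setting per line, as in /etc/nccl.conf. */
  int ok = fputs("NCCL_DEBUG=WARN\n", f) >= 0;
  if (fclose(f) != 0) ok = 0;
  if (!ok) return -1;

  /* Overwrite any previous value of the variable. */
  return setenv("NCCL_CONF_FILE", path, 1) == 0 ? 0 : -1;
}
```

Exporting the variable in the launching shell achieves the same thing without touching the program.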
Increase default IB timeout
- Increase IB timeout value from 18 to 20.
- Should help avoid fatal errors on large RoCE systems.
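For a sense of scale, the InfiniBand local ACK timeout is an exponent rather than a duration: the wait per attempt is 4.096 µs × 2^timeout. The helper below (a hypothetical name, not part of NCCL) converts the value to seconds.

```c
#include <assert.h>

/* InfiniBand local ACK timeout in seconds: 4.096 us * 2^timeout.
 * The field is 5 bits wide; 0 means "wait forever" and is rejected here. */
static double ib_ack_timeout_seconds(int timeout) {
  assert(timeout >= 1 && timeout <= 31);
  return 4.096e-6 * (double)(1UL << timeout);
}
```

Under this formula the old value of 18 corresponds to roughly 1.07 s per attempt and the new value of 20 to roughly 4.29 s, a fourfold increase. The value can still be overridden with the NCCL_IB_TIMEOUT environment variable.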
Add new check for nvidia peermem
- On Linux kernels 6.6+, /sys/kernel/mm/memory_peers is no longer
present; check for /sys/module/nvidia_peermem/version instead.
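A check of this kind can be sketched as a probe over a list of candidate paths. The code below is an illustration under assumptions, not NCCL's implementation: the function names are hypothetical, and only the two paths named above are probed.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

/* Return 1 if any path in the NULL-terminated list exists, else 0. */
static int any_path_exists(const char *const *paths) {
  assert(paths != NULL);
  for (size_t i = 0; paths[i] != NULL; i++)
    if (access(paths[i], F_OK) == 0) return 1;
  return 0;
}

/* Probe the legacy location first, then the module entry that remains
 * available on kernels 6.6+. */
static int peermem_detected(void) {
  static const char *const candidates[] = {
    "/sys/kernel/mm/memory_peers",
    "/sys/module/nvidia_peermem/version",
    NULL,
  };
  return any_path_exists(candidates);
}
```

Keeping the legacy path in the list lets the same check work on both older and newer kernels.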
Fix old performance regression when mixing small and large operations.
- Improves distribution of work on channels.
Fix crash when NUMA IDs are equal to -1.
- Can happen when a NIC is a virtual NIC, or when Linux doesn't
know which NUMA node a device is attached to.
- Issue NVIDIA/nccl-tests#233
Fix tree graph search when NCCL_CROSS_NIC is set to 1.
- Would force NCCL to use the balanced_tree pattern, thereby
disabling LL128 on platforms with 1 GPU+1 NIC per PCI switch.
- Would also try to use alternate rings even though it was not
needed.
Compiler tweaks and fixes
Fix stack smash
- PR #1325
Fixes for multi-node NVLink + IB operation
Coverity fixes and comments.