v2.23.4-1 - nccl - ZhengmaoYe

yezhengmao/nccl

v2.23.4-1 68b542363f
2.23.4-1

Ghost released this 2024-09-17 14:41:17 +08:00 | 17 commits to master since this release
Add scalable init API
- Add new ncclCommInitRankScalable to allow for passing multiple
  unique IDs to the init function.
- Spreads the load onto multiple bootstrap roots, allowing for
  constant bootstrap time.
- Requires multiple ranks to create a unique ID, and the CPU-side
  ID exchange code to call allgather[v] instead of broadcast.
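The load-spreading idea above can be illustrated with a small simulation. This is only a sketch: the contiguous block mapping from ranks to roots and the function name root_loads are assumptions made for illustration, and the notes do not say how NCCL actually assigns ranks to bootstrap roots.

```python
def root_loads(nranks, nroots):
    """Count how many ranks each bootstrap root would serve.

    Hypothetical mapping: ranks are split into contiguous blocks,
    one block per root. NCCL's real assignment may differ.
    """
    loads = [0] * nroots
    for rank in range(nranks):
        root = rank * nroots // nranks  # block index of this rank
        loads[root] += 1
    return loads


if __name__ == "__main__":
    # One root serves every rank, so its load grows with the job size.
    print(max(root_loads(1024, 1)), max(root_loads(4096, 1)))
    # Growing the number of roots with the job keeps the per-root load flat.
    print(max(root_loads(1024, 8)), max(root_loads(4096, 32)))
```

When the number of unique IDs (and therefore roots) grows in proportion to the number of ranks, the per-root load stays the same, which is what makes a constant bootstrap time possible.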
Accelerate init bootstrap operations
- Reduce the number of calls to allgather.
- Allow roots to reply early to ranks when information is already
  available.
- Add an option to use ncclNet instead of sockets to perform
  bootstrap allgather operations.
Add PAT algorithms for Allgather and ReduceScatter
- Parallel Aggregated Trees, a variation of the Bruck algorithm.
- Logarithmic number of network steps for small sizes at scale.
- Only supports one rank per node at the moment.
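For readers unfamiliar with the Bruck algorithm that PAT is derived from, the sketch below simulates a classic Bruck allgather in a single process. It is not NCCL's PAT implementation; it only shows why the number of communication steps grows logarithmically with the number of ranks. The function name bruck_allgather is made up for this example.

```python
def bruck_allgather(blocks):
    """Simulate a classic Bruck allgather over len(blocks) ranks.

    blocks[r] is the data initially owned by rank r.
    Returns (per-rank results, number of communication steps).
    """
    p = len(blocks)
    bufs = [[b] for b in blocks]  # rank r starts with only its own block
    steps = 0
    dist = 1
    while dist < p:
        # The last step may need fewer blocks when p is not a power of two.
        count = min(dist, p - dist)
        # Rank r receives the first `count` blocks held by rank (r + dist) % p.
        incoming = [bufs[(r + dist) % p][:count] for r in range(p)]
        for r in range(p):
            bufs[r].extend(incoming[r])
        dist *= 2
        steps += 1
    # Rank r now holds blocks r, r+1, ..., wrapping around; rotate to rank order.
    results = [buf[p - r:] + buf[:p - r] for r, buf in enumerate(bufs)]
    return results, steps


if __name__ == "__main__":
    results, steps = bruck_allgather(list("abcde"))
    print(results[3], steps)
```

With 5 ranks the exchange completes in 3 steps, and with 1024 ranks in 10, compared with the 4 and 1023 steps a ring would need.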
Add support for registered buffers for intra-node communication.
- Allow registered user buffers to be accessed directly intra-node.
- Avoids extra copies in algorithms which permit it, saving
  memory bandwidth and helping with compute overlap.
Add profiler plugin API
- New plugin API for profiling
- Supports various levels of profiling, with a hierarchy.
Asynchronous graph allocation
- Make calls to cudaMalloc and cudaMemcpy during graph allocation
  asynchronous.
- Significantly speeds up graph capture.
Use fatal IB asynchronous events to stop network operation
- Avoids many other error messages
- Only fatal errors are affected; potentially transient errors
  (e.g. port down) do not cause an immediate stop.
Set P2P level to PXB on AMD CPUs when using more than 2 GPUs per node
- P2P would cause a significant performance degradation when using
  many GPUs, and therefore many interleaved data flows.
- Disable P2P through the CPU when we have 3+ GPUs per node; keep it
  enabled when we only have 2 GPUs.
Improve the init logs to report the real NCCL function.
- Make the log report ncclCommInitRank or ncclCommSplit, rather than
  the generic ncclCommInitRankFunc.
Add a parameter to set the location of the user configuration file.
- Add NCCL_CONF_FILE environment variable to set where the user's
  configuration file resides.
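The lookup order can be sketched as follows. The fallback to ~/.nccl.conf is an assumption based on NCCL's documented default user configuration file, and the helper name user_conf_path is hypothetical; NCCL's own lookup is implemented in C and may differ in detail.

```python
import posixpath


def user_conf_path(env):
    """Return the path the user configuration file would be read from.

    `env` is a mapping of environment variables (for example os.environ).
    Assumption: without NCCL_CONF_FILE, the file is ~/.nccl.conf.
    """
    override = env.get("NCCL_CONF_FILE")
    if override:
        return override
    return posixpath.join(env.get("HOME", ""), ".nccl.conf")


if __name__ == "__main__":
    print(user_conf_path({"HOME": "/home/alice"}))
    print(user_conf_path({"HOME": "/home/alice",
                          "NCCL_CONF_FILE": "/shared/job42/nccl.conf"}))
```

Pointing NCCL_CONF_FILE at a per-job file is useful when several jobs run under the same account and need different settings.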
Increase default IB timeout
- Increase IB timeout value from 18 to 20.
- Should help avoid fatal errors on large RoCE systems.
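To put the two values in perspective: the IB timeout is an exponent, not a duration. Assuming the usual InfiniBand local ACK timeout encoding, which the NCCL documentation also gives for NCCL_IB_TIMEOUT, the duration is 4.096 µs × 2^timeout. The helper name below is made up for this example.

```python
def ib_timeout_seconds(exponent):
    """Convert an IB timeout exponent to seconds.

    Assumes the InfiniBand encoding: 4.096 microseconds * 2**exponent.
    """
    return 4.096e-6 * 2 ** exponent


if __name__ == "__main__":
    for value in (18, 20):
        print(value, round(ib_timeout_seconds(value), 2), "seconds")
```

Under that assumption, moving from 18 to 20 raises the timeout from about 1.07 s to about 4.29 s, a factor of four.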
Add new check for nvidia peermem
- On Linux kernels 6.6+, /sys/kernel/mm/memory_peers is no longer
  present; check for /sys/module/nvidia_peermem/version instead.
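A rough sketch of such a check, using only the two paths named above. NCCL performs this check in C, and the exact files it probes and the order in which it probes them may differ. The function name and the root parameter (which exists only to make the sketch testable) are assumptions.

```python
import os

# Paths taken from the release note, relative to the filesystem root.
LEGACY_PATH = "sys/kernel/mm/memory_peers"
MODULE_PATH = "sys/module/nvidia_peermem/version"


def peermem_available(root="/"):
    """Return True if either peermem indicator exists under `root`."""
    return any(
        os.path.exists(os.path.join(root, path))
        for path in (LEGACY_PATH, MODULE_PATH)
    )


if __name__ == "__main__":
    print("nvidia peermem detected:", peermem_available())
```

Accepting either path keeps the check working on kernels both older and newer than 6.6.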
Fix old performance regression when mixing small and large operations.
- Improves distribution of work on channels.
Fix crash when NUMA IDs are equal to -1.
- Can happen when a NIC is a virtual NIC, or when Linux doesn't
  know which NUMA node a device is attached to.
- Issue NVIDIA/nccl-tests#233
Fix tree graph search when NCCL_CROSS_NIC is set to 1.
- Would force NCCL to use the balanced_tree pattern, thereby
  disabling LL128 on platforms with 1 GPU+1 NIC per PCI switch.
- Would also try to use alternate rings even though it was not
  needed.
Compiler tweaks and fixes
- PR #1177
- PR #1228
Fix stack smash
- PR #1325
Fixes for multi-node NVLink + IB operation

Coverity fixes and comments.