2.24.3-1
Released 2025-01-07 18:01:15 +08:00
Network user buffer support for collectives
- Leverage user buffer registration to achieve zero-copy
inter-node communications for Ring, NVLS and Collnet
Add RAS subsystem
- Create a RAS thread keeping track of all NCCL communicators.
- Add a ncclras tool contacting the RAS thread and getting a
report.
Add fp8 support
- Add support for e5m2 and e4m3 8-bit floating point operations.
- Use Tree/PAT algorithms when possible for better numerical
stability.
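To make the two formats concrete, here is a minimal decoding sketch. The release notes do not spell out the bit layouts, so the code assumes the commonly used conventions: e4m3 with exponent bias 7, no infinities and a single NaN pattern per sign, and e5m2 with exponent bias 15 and IEEE-style infinities and NaNs. The function names are hypothetical and are not part of the NCCL API.

```c
#include <math.h>
#include <stdint.h>

/* Multiply x by 2^e using exact float multiplications (avoids linking libm). */
static float scale_pow2(float x, int e) {
    while (e > 0) { x *= 2.0f; e--; }
    while (e < 0) { x *= 0.5f; e++; }
    return x;
}

/*
 * Decode one 8-bit float with the given exponent/mantissa widths.
 * ieee_specials != 0: all-ones exponent encodes inf/NaN (assumed for e5m2).
 * ieee_specials == 0: only all-ones exponent AND mantissa is NaN (assumed for e4m3).
 */
static float fp8_decode(uint8_t v, int exp_bits, int man_bits, int ieee_specials) {
    int bias    = (1 << (exp_bits - 1)) - 1;
    int exp_max = (1 << exp_bits) - 1;
    int man_max = (1 << man_bits) - 1;
    int sign    = v >> 7;
    int exp     = (v >> man_bits) & exp_max;
    int man     = v & man_max;
    float mag;

    if (ieee_specials && exp == exp_max) {
        mag = man ? NAN : INFINITY;
    } else if (!ieee_specials && exp == exp_max && man == man_max) {
        mag = NAN;
    } else if (exp == 0) {
        /* Subnormal: no implicit leading one. */
        mag = scale_pow2((float)man, 1 - bias - man_bits);
    } else {
        /* Normal: restore the implicit leading one. */
        mag = scale_pow2((float)(man | (1 << man_bits)), exp - bias - man_bits);
    }
    return sign ? -mag : mag;
}

float fp8_e4m3_to_float(uint8_t v) { return fp8_decode(v, 4, 3, 0); }
float fp8_e5m2_to_float(uint8_t v) { return fp8_decode(v, 5, 2, 1); }
```

Under these assumptions, the byte 0x38 decodes to 1.0 in e4m3, and the largest finite values are 448 (e4m3) and 57344 (e5m2). With only 3 or 2 mantissa bits, rounding error in a reduction is large, which is consistent with the note above about preferring algorithms with better numerical stability.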
Add NIC fusion
- Add a NET API to ask the network plugin to fuse a set of
interfaces together.
- Fuse multiple NICs under the same PCI switch as a single,
larger NIC.
Socket connection failure retry
- Retry in case of socket connection failure (unreachable host)
- Avoid "Software caused connection abort" errors on retries
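The general retry pattern can be sketched as follows. This is not NCCL's implementation: every name here (`connect_with_retry`, `tcp_attempt`, `flaky_attempt`, the set of retryable errno values) is an assumption made for illustration. The sketch opens a fresh socket on every attempt, since POSIX leaves the state of a socket unspecified after a failed `connect()`.

```c
#include <errno.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical callback: one connection attempt, 0 on success, -1 with errno set. */
typedef int (*attempt_fn)(void *ctx);

/* Assumed set of transient errors worth retrying. */
static int is_retryable(int err) {
    return err == ECONNREFUSED || err == EHOSTUNREACH || err == ETIMEDOUT;
}

/* Call attempt() until it succeeds, hits a non-retryable error,
 * or max_retries retries have been used. Returns 0 or -1 (errno preserved). */
int connect_with_retry(attempt_fn attempt, void *ctx, int max_retries, long delay_ms) {
    for (int tries = 0; ; tries++) {
        if (attempt(ctx) == 0) return 0;
        int err = errno;
        if (!is_retryable(err) || tries >= max_retries) {
            errno = err;
            return -1;
        }
        struct timespec ts = { delay_ms / 1000, (delay_ms % 1000) * 1000000L };
        nanosleep(&ts, NULL);
    }
}

/* Real attempt: always create a new socket, close it again on failure. */
struct tcp_target { struct sockaddr_in addr; int fd; };

int tcp_attempt(void *ctx) {
    struct tcp_target *t = ctx;
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) return -1;
    if (connect(fd, (struct sockaddr *)&t->addr, sizeof t->addr) == 0) {
        t->fd = fd;
        return 0;
    }
    int err = errno;
    close(fd);          /* do not reuse a socket whose connect() failed */
    errno = err;
    return -1;
}

/* Stand-in attempt for demonstration: fails a fixed number of times, then succeeds. */
struct flaky { int failures_left; int calls; };

int flaky_attempt(void *ctx) {
    struct flaky *f = ctx;
    f->calls++;
    if (f->failures_left > 0) {
        f->failures_left--;
        errno = EHOSTUNREACH;
        return -1;
    }
    return 0;
}
```

A caller would fill in a `struct tcp_target` and call `connect_with_retry(tcp_attempt, &target, retries, delay_ms)`. The `flaky_attempt` stub only exists to exercise the loop without a network.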
QP connection failure retry
- Retry in case of IB QP connection failure during ibv_modify_qp.
NET API improvements
- Allow plugins to force a flush in case data and completion
ordering is not guaranteed.
- Indicate when completion is not needed (e.g. for the LL128
protocol), allowing plugins to skip generating a completion.
- Allow for full offload of allgather operations when using one
GPU per node.
NCCL_ALGO/NCCL_PROTO strict enforcement
- Extend NCCL_ALGO/NCCL_PROTO syntax to be able to specify
ALGO/PROTO filters for each collective operation.
- Strictly enforce the ALGO/PROTO filters: when the filtering
leaves no option, error out instead of falling back on the
ring algorithm.
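As an illustration, an application could set per-collective filters from C before creating its communicators. The exact filter grammar is not given in these notes; the string below assumes a semicolon-separated list of `collective:algorithm` entries, and the helper name `configure_nccl_filters` is hypothetical. Check the NCCL environment variable documentation for the authoritative syntax.

```c
#include <stdlib.h>

/*
 * Set NCCL_ALGO/NCCL_PROTO unless the user already exported them.
 * ASSUMPTION: "allreduce:tree;broadcast:ring" follows the extended
 * per-collective syntax; it is not taken from these release notes.
 * Returns 0 on success, -1 if setenv fails.
 */
int configure_nccl_filters(void) {
    /* Last argument 0: keep any value already present in the environment. */
    if (setenv("NCCL_ALGO", "allreduce:tree;broadcast:ring", 0) != 0) return -1;
    if (setenv("NCCL_PROTO", "simple", 0) != 0) return -1;
    return 0;
}
```

Since the filters are now strictly enforced, a combination that leaves no valid algorithm or protocol for a collective results in an error rather than a silent fall back to ring, so it is worth keeping such filters as narrow as intended and no narrower.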
Enable CUMEM host allocations
- Use cumem functions for host memory allocation by default.
Improved profiler plugin API
- Avoid dependencies with NCCL includes.
- Add information on whether the buffer is registered or not
Adjust PAT tuning
- Improve transition between PAT and ring at scale.
Fix hangs when running with different CPU architectures
- Detect when we use a mix of GPU architectures
- Ensure Algo/Proto decisions are made based on that unified
state.
Fix FD leak in UDS
- Fix a leak when mapping buffers intra-node with cumem IPCs.
Fix crash when mixing buffer registration and graph buffer registration.
- Separate local and graph registration to avoid crashes when we free
buffers.
Fix user buffer registration with dmabuf
- Make ncclSend/ncclRecv communication with buffer registration functional
on network plugins relying on dmabuf for buffer registration.
Fix crash in IB code caused by uninitialized fields.
Fix non-blocking ncclSend/ncclRecv
- Fix case where ncclSend/ncclRecv would return ncclSuccess in non-blocking
mode even though the operation was not enqueued onto the stream.
- Issue #1495
Various compiler tweaks and fixes
- PR #758
Fix typo in ncclTopoPrintGraph
- Issue #1468