2.21.5-1
2024-04-02 16:53:21 +08:00

Add support for IB SHARP 1PPN (one process per node) operation with user buffers.
Improve support for MNNVL (multi-node NVLink): add NVLS support and multi-clique support.
  - Detect the NVLS clique through NVML.
  - Exchange XML between peers in the same NVLS clique and fuse the XMLs
    before creating the topology graph.
  - Rework the bootstrap allgather algorithms to allow for large intra-node
    allgather operations (XML exchange).
Net/IB: add support for dynamic GID detection.
  - Automatically select a RoCEv2/IPv4 interface by default. Allow selecting
    IPv6, or even a specific network/mask, instead.
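A RoCEv2 GID that carries an IPv4 address uses the IPv4-mapped IPv6 form ::ffff:a.b.c.d, so "prefer RoCEv2/IPv4, optionally restricted to a network/mask" can be expressed as a filter over a device's GID table. The sketch below illustrates only that filtering logic. The type `sketch_gid`, the function names, and the assumption that the caller already knows which GID entries are RoCEv2 are all hypothetical. This is not NCCL's implementation or its configuration interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical 16-byte GID as stored in a RoCE GID table entry. */
typedef struct { uint8_t raw[16]; } sketch_gid;

/* IPv4-mapped IPv6 address: 80 zero bits, 16 one bits, then the IPv4 address. */
static bool gid_is_ipv4_mapped(const sketch_gid *g) {
  static const uint8_t prefix[12] = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0xff, 0xff};
  return memcmp(g->raw, prefix, sizeof prefix) == 0;
}

/* Embedded IPv4 address, converted to host byte order. */
static uint32_t gid_ipv4(const sketch_gid *g) {
  return ((uint32_t)g->raw[12] << 24) | ((uint32_t)g->raw[13] << 16) |
         ((uint32_t)g->raw[14] << 8) | (uint32_t)g->raw[15];
}

/* True if the GID is IPv4-mapped and its address lies in net/prefix_len. */
static bool gid_in_ipv4_subnet(const sketch_gid *g, uint32_t net, int prefix_len) {
  assert(prefix_len >= 0 && prefix_len <= 32);
  if (!gid_is_ipv4_mapped(g)) return false;
  /* Shifting a 32-bit value by 32 is undefined, so /0 is handled separately. */
  uint32_t mask = prefix_len == 0 ? 0 : 0xffffffffu << (32 - prefix_len);
  return (gid_ipv4(g) & mask) == (net & mask);
}

/* Hypothetical policy: index of the first RoCEv2 IPv4-mapped GID, optionally
 * limited to a subnet; -1 if no entry qualifies. is_rocev2[i] is assumed to
 * have been determined by the caller. */
static int select_gid(const sketch_gid *gids, const bool *is_rocev2, int n,
                      bool use_subnet, uint32_t net, int prefix_len) {
  for (int i = 0; i < n; i++) {
    if (!is_rocev2[i] || !gid_is_ipv4_mapped(&gids[i])) continue;
    if (use_subnet && !gid_in_ipv4_subnet(&gids[i], net, prefix_len)) continue;
    return i;
  }
  return -1;
}
```

With a table holding a link-local IPv6 GID, 10.0.0.5, and 192.168.1.7, the default policy picks 10.0.0.5. Restricting to 192.168.1.0/24 picks the third entry instead.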
Reduce NVLS memory usage.
  - Add stepSize as a property of a connection to allow different step sizes
    on different peers; set it to 128K for NVLink SHARP.
Improve tuner loading.
  - Search more paths, and be more consistent with the network device plugin.
  - Also search for tuner support inside the net plugin.
Improve tuner API.
  - Add a context to support multiple devices per process.
Add a magic number around the comm object to detect communicator corruption.
  - Add basic checks around communicators so that NCCL can report a problem
    when a communicator gets corrupted or a wrong comm pointer is passed to it.
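The general technique here is a pair of sentinel ("magic") values placed at the start and end of an object. Each entry point checks them before using the object, so an overrun that clobbers the trailing sentinel, or a pointer to some other kind of object, is reported instead of silently misbehaving. The minimal sketch below uses a hypothetical `sketch_comm` struct, hypothetical magic constants, and hypothetical helper names. NCCL's actual communicator layout and values are not shown in this note and may differ.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical sentinel values (arbitrary, chosen to be unlikely in real data). */
#define SKETCH_MAGIC_START 0x5ca1ab1e0badc0deULL
#define SKETCH_MAGIC_END   0xfeedfacecafebeefULL

typedef struct {
  uint64_t start_magic;  /* set at creation, checked on every call */
  int rank;
  int nranks;
  /* ... remaining communicator state would go here ... */
  uint64_t end_magic;    /* an overrun of the fields above clobbers this */
} sketch_comm;

typedef enum { COMM_OK = 0, COMM_INVALID = 1 } comm_status;

static sketch_comm *comm_create(int rank, int nranks) {
  assert(nranks > 0 && rank >= 0 && rank < nranks);
  sketch_comm *c = calloc(1, sizeof *c);
  if (c == NULL) return NULL;
  c->start_magic = SKETCH_MAGIC_START;
  c->end_magic = SKETCH_MAGIC_END;
  c->rank = rank;
  c->nranks = nranks;
  return c;
}

/* Run at the top of each public entry point before touching comm state.
 * It can only inspect readable memory; a wild pointer may still crash here. */
static comm_status comm_check(const sketch_comm *c) {
  if (c == NULL) return COMM_INVALID;
  if (c->start_magic != SKETCH_MAGIC_START || c->end_magic != SKETCH_MAGIC_END)
    return COMM_INVALID;
  return COMM_OK;
}

static void comm_destroy(sketch_comm *c) {
  if (c == NULL) return;
  /* Clear the sentinels so a stale copy of this pointer is less likely to pass. */
  c->start_magic = 0;
  c->end_magic = 0;
  free(c);
}
```

The check is cheap (two 64-bit compares), which is why it can run on every API call. It is a best-effort diagnostic rather than a guarantee: it detects corruption that happens to touch a sentinel, not every possible misuse.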
Fix net/IB error path (GitHub PR #1164).
Fix collnet rail mapping with split comm.
Fix packet reordering issue causing bootstrap mismatch.
  - Use a different tag in ncclTransportP2pSetup for the connectInfo
    exchange and the following barrier.
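Why a separate tag helps: if two consecutive exchanges with the same peer share one tag, and the transport can deliver them out of order, a receive for the first exchange may match the second exchange's message. The toy mailbox below matches each receive against the oldest queued message with the same tag. It shows a barrier message that overtakes connectInfo being mistaken for it under a shared tag, and the mismatch disappearing once the tags differ. The mailbox, function names, and tag values are hypothetical illustrations, not NCCL's bootstrap API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define MAX_MSGS 8
#define PAYLOAD 32

/* Hypothetical tag values for illustration only. */
enum { TAG_SHARED = 1, TAG_CONNECT = 2, TAG_BARRIER = 3 };

typedef struct { int tag; char data[PAYLOAD]; } sketch_msg;
typedef struct { sketch_msg q[MAX_MSGS]; int n; } mailbox;  /* arrival order */

/* Queue a message in the order it arrived from the peer. */
static bool mb_deliver(mailbox *mb, int tag, const char *data) {
  if (mb->n == MAX_MSGS) return false;
  mb->q[mb->n].tag = tag;
  snprintf(mb->q[mb->n].data, PAYLOAD, "%s", data);
  mb->n++;
  return true;
}

/* Remove the oldest message whose tag matches and copy its payload to out. */
static bool mb_recv(mailbox *mb, int tag, char out[PAYLOAD]) {
  for (int i = 0; i < mb->n; i++) {
    if (mb->q[i].tag != tag) continue;
    memcpy(out, mb->q[i].data, PAYLOAD);
    memmove(&mb->q[i], &mb->q[i + 1], (size_t)(mb->n - i - 1) * sizeof(sketch_msg));
    mb->n--;
    return true;
  }
  return false;
}
```

With a shared tag, delivering "barrier" before "connectInfo" makes the connectInfo receive return "barrier". With `TAG_CONNECT` and `TAG_BARRIER`, each receive gets the intended payload regardless of arrival order.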
Fix hang when crossNic is inconsistent between ranks.
Fix minCompCap/maxCompCap computation (GitHub issue #1184).
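The note does not say what was wrong with the computation. For reference, the two values are the smallest and largest GPU compute capability across the ranks of a communicator. The sketch below computes them, assuming the common major*10+minor encoding (8.0 becomes 80, 9.0 becomes 90). The helper names are hypothetical. Seeding the minimum from `INT_MAX` rather than 0 is the detail such code has to get right; this is not a claim about the specific bug fixed.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical encoding helper: compute capability major.minor as major*10+minor. */
static int comp_cap(int major, int minor) { return major * 10 + minor; }

/* Smallest and largest compute capability over all ranks. */
static void comp_cap_range(const int *caps, int nranks, int *min_cap, int *max_cap) {
  assert(nranks > 0);
  int lo = INT_MAX, hi = INT_MIN;  /* seeding lo with 0 would pin the minimum at 0 */
  for (int r = 0; r < nranks; r++) {
    if (caps[r] < lo) lo = caps[r];
    if (caps[r] > hi) hi = caps[r];
  }
  *min_cap = lo;
  *max_cap = hi;
}
```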