• 2.21.5-1

    Ghost released this 2024-04-02 16:53:21 +08:00 | 20 commits to master since this release

    Add support for IB SHARP 1PPN operation with user buffers.
    Improve support for MNNVL, add NVLS support and multi-clique support.

    • Detect the NVLS clique through NVML
    • Exchange XML between peers in the same NVLS clique and fuse XMLs
      before creating the topology graph.
    • Rework bootstrap allgather algorithms to allow for large allgather
      operations intra-node (XML exchange).
    Net/IB: add support for dynamic GID detection.
    • Automatically select the RoCEv2/IPv4 interface by default. Allow
      selecting IPv6 or even a specific network/mask.
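
The network/mask selection mentioned above comes down to a prefix match on addresses. The following is a minimal sketch of such a match for IPv4, not NCCL's implementation: the helper name `ipv4_in_subnet` is hypothetical, and the real code works on GID table entries rather than dotted-quad strings.

```c
#include <arpa/inet.h>
#include <stdint.h>

/* Hypothetical helper, for illustration only.
 * Returns 1 if addr lies inside net/prefix_len, 0 if it does not,
 * and -1 if an argument cannot be parsed. */
static int ipv4_in_subnet(const char *addr, const char *net, int prefix_len) {
    struct in_addr a, n;
    if (prefix_len < 0 || prefix_len > 32) return -1;
    if (inet_pton(AF_INET, addr, &a) != 1) return -1;
    if (inet_pton(AF_INET, net, &n) != 1) return -1;
    /* A /0 prefix matches everything; handle it separately to avoid
     * shifting a 32-bit value by 32 bits, which is undefined in C. */
    uint32_t mask = (prefix_len == 0) ? 0u : (0xFFFFFFFFu << (32 - prefix_len));
    /* Compare in host byte order so the mask applies to the leading bits. */
    return (ntohl(a.s_addr) & mask) == (ntohl(n.s_addr) & mask);
}
```

With this sketch, `ipv4_in_subnet("192.168.1.42", "192.168.1.0", 24)` matches, while an address in `192.168.2.0/24` does not.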
    Reduce NVLS memory usage.
    • Add stepSize as property of a connection to allow for different
      sizes on different peers; set it to 128K for NVLink SHARP.
    Improve tuner loading.
    • Look for more paths, be more consistent with the network device
      plugin.
    • Also search for tuner support inside the net plugin.
    Improve tuner API.
    • Add context to support multi-device per process.
    Add magic number around comm object to detect comm corruption.
    • Add basic checks around communicators so that a problem is
      reported when a communicator gets corrupted or a wrong comm pointer
      is passed to NCCL.
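
The magic-number technique described above can be sketched as follows. This is an illustration under stated assumptions, not NCCL's actual layout: `struct demo_comm`, `DEMO_COMM_MAGIC`, and both functions are hypothetical names, and the magic value is arbitrary.

```c
#include <stddef.h>
#include <stdint.h>

/* Arbitrary sentinel value chosen for this sketch (hypothetical). */
#define DEMO_COMM_MAGIC 0x0123456789ABCDEFULL

/* Hypothetical communicator: sentinels bracket the real fields so that
 * an overrun from either side, or a pointer to unrelated memory, is
 * likely to leave at least one sentinel with the wrong value. */
struct demo_comm {
    uint64_t start_magic;
    int rank;
    int n_ranks;
    uint64_t end_magic;
};

static void demo_comm_init(struct demo_comm *c, int rank, int n_ranks) {
    c->start_magic = DEMO_COMM_MAGIC;
    c->rank = rank;
    c->n_ranks = n_ranks;
    c->end_magic = DEMO_COMM_MAGIC;
}

/* Returns 0 if the object looks like a valid communicator, -1 otherwise. */
static int demo_comm_check(const struct demo_comm *c) {
    if (c == NULL) return -1;
    if (c->start_magic != DEMO_COMM_MAGIC) return -1;
    if (c->end_magic != DEMO_COMM_MAGIC) return -1;
    return 0;
}
```

A check like this is cheap enough to run at every API entry point, which lets the library return an error instead of acting on a corrupted or foreign pointer.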
    Fix net/IB error path. GitHub PR #1164
    Fix collnet rail mapping with split comm.
    Fix packet reordering issue causing bootstrap mismatch.
    • Use a different tag in ncclTransportP2pSetup for the connectInfo
      exchange and the following barrier.
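
Why distinct tags help can be seen in a toy model of tag-matched receives. This sketch is an assumption-laden illustration and does not reflect the bootstrap code: `struct demo_msg` and `demo_recv` are hypothetical, and messages are modeled as entries in an in-memory mailbox listed in arrival order.

```c
#include <stddef.h>

/* Hypothetical in-memory message used only for this sketch. */
struct demo_msg {
    int tag;       /* matching key supplied by the sender */
    int payload;   /* stand-in for the message contents */
    int consumed;  /* set once a receive has taken the message */
};

/* Take the first unconsumed message carrying the requested tag.
 * Returns its payload, or -1 if no such message is present. */
static int demo_recv(struct demo_msg *box, size_t n, int tag) {
    for (size_t i = 0; i < n; i++) {
        if (!box[i].consumed && box[i].tag == tag) {
            box[i].consumed = 1;
            return box[i].payload;
        }
    }
    return -1;
}
```

If the barrier message overtakes the connectInfo message and both carry the same tag, the connectInfo receive takes the barrier payload. With one tag per phase, each receive finds its own message regardless of arrival order.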
    Fix hang when crossNic is inconsistent between ranks.
    Fix minCompCap/maxCompCap computation. GitHub issue #1184