• 2.24.3-1

    Ghost released this 2025-01-07 18:01:15 +08:00 | 13 commits to master since this release

    Network user buffer support for collectives

    • Leverage user buffer registration to achieve zero-copy
      inter-node communications for Ring, NVLS and CollNet.

    Add RAS subsystem

    • Create a RAS thread keeping track of all NCCL communicators.
    • Add an ncclras tool that contacts the RAS thread and retrieves
      a report.

    Add fp8 support

    • Add support for e5m2 and e4m3 8-bit floating point operations.
    • Use Tree/PAT algorithms when possible for better numerical
      stability.
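
    To make the two formats concrete, here is a minimal decoding sketch.
    It assumes the common OCP FP8 bit layouts (e4m3: bias 7, no
    infinities, a single NaN pattern per sign; e5m2: bias 15, IEEE-style
    inf/NaN). The helper names are illustrative and are not part of
    NCCL.

```python
import math

def decode_fp8(byte, exp_bits, man_bits, bias, ieee_specials):
    """Decode one 8-bit pattern laid out as sign | exponent | mantissa."""
    assert 0 <= byte <= 0xFF and 1 + exp_bits + man_bits == 8
    sign = -1.0 if byte >> 7 else 1.0
    exp = (byte >> man_bits) & ((1 << exp_bits) - 1)
    man = byte & ((1 << man_bits) - 1)
    exp_max = (1 << exp_bits) - 1
    man_max = (1 << man_bits) - 1
    if ieee_specials and exp == exp_max:
        # e5m2 (assumed IEEE-like): all-ones exponent encodes inf or NaN.
        return sign * math.inf if man == 0 else math.nan
    if not ieee_specials and exp == exp_max and man == man_max:
        # e4m3 (assumed OCP layout): no infinities, all-ones pattern is NaN.
        return math.nan
    if exp == 0:
        # Subnormal values have no implicit leading 1.
        return sign * (man / (1 << man_bits)) * 2.0 ** (1 - bias)
    return sign * (1 + man / (1 << man_bits)) * 2.0 ** (exp - bias)

def decode_e4m3(byte):
    return decode_fp8(byte, exp_bits=4, man_bits=3, bias=7, ieee_specials=False)

def decode_e5m2(byte):
    return decode_fp8(byte, exp_bits=5, man_bits=2, bias=15, ieee_specials=True)

if __name__ == "__main__":
    print("e4m3 largest finite:", decode_e4m3(0x7E))
    print("e5m2 largest finite:", decode_e5m2(0x7B))
```

    With only three or two mantissa bits, every addition rounds
    coarsely, so the order in which partial sums are combined has a
    visible effect on the result.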

    Add NIC fusion

    • Add a NET API to ask the network plugin to fuse a set of
      interfaces together.
    • Fuse multiple NICs under the same PCI switch as a single,
      larger NIC.
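
    The grouping idea can be sketched as follows. This is an
    illustration only: the tuple layout, the device and switch names,
    and the assumption that a fused NIC advertises the sum of its
    members' bandwidth are all hypothetical and do not reflect the
    actual NET API.

```python
from collections import defaultdict

def fuse_by_pci_switch(nics):
    """Group NICs by PCI switch and present each group as one virtual NIC.

    nics: list of (name, pci_switch_id, gbps) tuples (hypothetical layout).
    """
    groups = defaultdict(list)
    for name, switch, gbps in nics:
        groups[switch].append((name, gbps))
    fused = []
    for switch, members in groups.items():
        fused.append({
            "name": "+".join(name for name, _ in members),
            "pci_switch": switch,
            # Assumption: the fused device exposes the aggregate bandwidth.
            "gbps": sum(gbps for _, gbps in members),
        })
    return fused

if __name__ == "__main__":
    example = [("nic0", "sw0", 200), ("nic1", "sw0", 200), ("nic2", "sw1", 200)]
    for virtual_nic in fuse_by_pci_switch(example):
        print(virtual_nic)
```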

    Socket connection failure retry

    • Retry in case of socket connection failure (unreachable host).
    • Avoid "Software caused connection abort" errors on retries.
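
    A generic connect-with-retry loop looks roughly like the sketch
    below. It is not NCCL's implementation: the function name, the set
    of retryable errno values, the attempt count and the backoff are
    assumptions chosen for illustration.

```python
import errno
import socket
import time

# Assumption: these transient errors are worth retrying.
RETRYABLE = {errno.EHOSTUNREACH, errno.ECONNREFUSED, errno.ETIMEDOUT}

def connect_with_retry(host, port, attempts=5, delay_s=0.1):
    """Hypothetical helper: retry connect() on transient failures."""
    last_error = None
    for attempt in range(attempts):
        # After a failed connect() the socket state is unspecified,
        # so each attempt uses a fresh socket.
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.connect((host, port))
            return sock
        except OSError as err:
            sock.close()
            if err.errno not in RETRYABLE:
                raise
            last_error = err
            # Linear backoff before the next attempt.
            time.sleep(delay_s * (attempt + 1))
    raise ConnectionError(f"gave up after {attempts} attempts") from last_error
```

    Non-retryable errors are re-raised immediately, so a
    misconfiguration fails fast while a transient failure gets another
    chance.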

    QP connection failure retry

    • Retry in case of IB QP connection failure during ibv_modify_qp.

    NET API improvements

    • Allow plugins to force a flush in case data and completion
      ordering is not guaranteed.
    • Indicate when completion is not needed (e.g. for the LL128
      protocol), allowing plugins to skip generating a completion.
    • Allow for full offload of allgather operations when using one
      GPU per node.

    NCCL_ALGO/NCCL_PROTO strict enforcement

    • Extend NCCL_ALGO/NCCL_PROTO syntax to be able to specify
      ALGO/PROTO filters for each collective operation.
    • Strictly enforce the ALGO/PROTO filters: when the filtering
      leaves no option, error out instead of falling back on the
      ring algorithm.
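
    The behavioral change can be sketched as below. The function, the
    dictionary-based filter representation and the algorithm names
    passed in are hypothetical; this shows neither the environment
    variable syntax nor NCCL's internal selection code.

```python
def select_algorithm(collective, supported, filters):
    """Pick an algorithm for a collective under strict filtering.

    supported: algorithm names in order of preference (hypothetical).
    filters:   dict mapping a collective name to the set of allowed
               algorithm names; collectives without an entry are unfiltered.
    """
    allowed = filters.get(collective)
    if allowed is None:
        return supported[0]
    candidates = [algo for algo in supported if algo in allowed]
    if not candidates:
        # Strict enforcement: fail instead of silently falling back to ring.
        raise ValueError(
            f"no algorithm left for {collective!r} after filtering: "
            f"supported={supported}, allowed={sorted(allowed)}"
        )
    return candidates[0]

if __name__ == "__main__":
    print(select_algorithm("allreduce", ["tree", "ring"], {"allreduce": {"ring"}}))
```

    Under the old behavior described above, the empty-candidate case
    would have returned the ring algorithm; here it is reported as an
    error so that a filter which cannot be honored is noticed.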

    Enable CUMEM host allocations

    • Use cumem functions for host memory allocation by default.

    Improved profiler plugin API

    • Avoid dependencies on NCCL includes.
    • Add information on whether the buffer is registered or not.

    Adjust PAT tuning

    • Improve transition between PAT and ring at scale.

    Fix hangs when running with different CPU architectures

    • Detect when we use a mix of GPU architectures.
    • Ensure Algo/Proto decisions are made based on that unified
      state.

    Fix FD leak in UDS

    • Fix a leak when mapping buffers intra-node with cumem IPCs.

    Fix crash when mixing buffer registration and graph buffer registration

    • Separate local and graph registration to avoid crashes when we free
      buffers.

    Fix user buffer registration with dmabuf

    • Make ncclSend/ncclRecv communication with buffer registration functional
      on network plugins relying on dmabuf for buffer registration.

    Fix crash in IB code caused by uninitialized fields

    Fix non-blocking ncclSend/ncclRecv

    • Fix case where ncclSend/ncclRecv would return ncclSuccess in non-blocking
      mode even though the operation was not enqueued onto the stream.
    • Issue #1495

    Various compiler tweaks and fixes

    Fix typo in ncclTopoPrintGraph
