• 2.22.3-1

    Ghost released this 2024-06-19 16:57:16 +08:00

    Rework core for NVIDIA Trusted Computing

    • Compress work structs so that they are shared between channels
    • Use the full kernel argument space permitted (4 KB) before
      falling back to the work fifo.
    • Rework the task preprocessing phase.
    • Use a separate abortDevFlag which is kept in sync with abortFlag
      using cudaMemcpy operations.
    • Rename src/include/align.h to src/include/bitops.h
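
The kernel-argument rule above can be pictured as a simple size check. The following is a minimal sketch, not NCCL's code: `choose_work_path`, `KERNEL_ARGS_MAX`, and the header/work split are hypothetical names introduced for illustration, and the only fact taken from the notes is the 4 KB limit.

```c
#include <stddef.h>

/* Limit on kernel argument space stated in the release notes (4k). */
#define KERNEL_ARGS_MAX 4096

/* Hypothetical labels for the two ways work can reach the kernel. */
enum work_path { WORK_IN_ARGS, WORK_IN_FIFO };

/* Assumption: work is passed inline while it fits next to a fixed header,
   and only spills to the work fifo once the 4 KB budget is exceeded. */
static enum work_path choose_work_path(size_t header_bytes, size_t work_bytes) {
  if (header_bytes + work_bytes <= KERNEL_ARGS_MAX) return WORK_IN_ARGS;
  return WORK_IN_FIFO;
}
```

For example, `choose_work_path(64, 4032)` selects the inline path, while one more byte of work selects the fifo.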

    Add lazy connection establishment for collective operations

    • Move buffer allocation and connection establishment to the first
      collective operation using that algorithm.
    • Speed up initialization and reduce memory usage.
    • Avoid allocating NVLS buffers if all calls are registered.
    • Compute algo/proto in ncclLaunchCollTasksInfo early on.
    • Connect peers in ncclCollPreconnectFunc if not connected already.
    • Also move shared buffer creation to the first send/recv call.
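
The lazy scheme described above amounts to "set up on first use, per algorithm". A minimal sketch of that pattern follows; the struct, enum, and function names are hypothetical and stand in for NCCL's internal state, which these notes do not show.

```c
#include <stdbool.h>

/* Hypothetical algorithm identifiers; not NCCL's real enum. */
enum { ALGO_RING = 0, ALGO_TREE = 1, ALGO_COUNT = 2 };

/* Hypothetical per-communicator bookkeeping. */
struct lazy_comm {
  bool connected[ALGO_COUNT]; /* has this algorithm been set up yet? */
  int setups;                 /* how many times setup actually ran */
};

/* Stand-in for buffer allocation and peer connection. */
static void setup_algo(struct lazy_comm *c, int algo) {
  c->connected[algo] = true;
  c->setups++;
}

/* Run a collective, paying the setup cost only on first use of `algo`. */
static void run_collective(struct lazy_comm *c, int algo) {
  if (!c->connected[algo]) setup_algo(c, algo);
  /* ... the collective itself would be launched here ... */
}
```

A communicator that only ever uses one algorithm never allocates buffers for the others, which is where the init-time and memory savings would come from.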

    Accelerate intra-node NVLink detection

    • Make each rank only detect NVLinks attached to its GPU.
    • Fuse XMLs to reconstruct the full NVLink topology

    Add init profiling to report time spent in different init phases.

    • Report timings of bootstrap, allgather, search, connect, etc.
    • Add new "PROFILE" category for NCCL_DEBUG_SUBSYS.
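
One way to turn the new category on from inside a program is to set the debug variables before NCCL is initialized. This is a sketch under assumptions: `enable_init_profiling` is a hypothetical helper, and it assumes the timings are emitted at the INFO debug level.

```c
#include <stdlib.h>

/* Set the NCCL debug variables so that init timings are reported.
   Assumption: PROFILE output is visible at the INFO level.
   Returns 0 on success, -1 if a variable could not be set. */
static int enable_init_profiling(void) {
  if (setenv("NCCL_DEBUG", "INFO", 1) != 0) return -1;
  if (setenv("NCCL_DEBUG_SUBSYS", "PROFILE", 1) != 0) return -1;
  return 0;
}
```

The same effect can be had by exporting both variables in the launching shell; the helper only matters when the environment cannot be controlled from outside.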

    Add support for PCI p2p on split PCI switches

    • Detect split PCI switches through a kernel module exposing
      switch information.
    • Update the topology XML and graph to add those inter-switch
      connections.

    Add cost estimation API

    • Add a new ncclGroupEndSimulate primitive to return the estimated
      time a group would take.

    Net/IB: Add separate traffic class for fifo messages

    • Add NCCL_IB_FIFO_TC to control the traffic class of fifo messages
      independently from NCCL_IB_TC.
      Merges PR #1194
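
The relationship between the two variables can be sketched as a lookup with a fallback. The helper names below are hypothetical, and two behaviours are assumptions rather than statements from these notes: that an unset NCCL_IB_FIFO_TC falls back to the NCCL_IB_TC value, and that NCCL_IB_TC itself defaults to 0.

```c
#include <stdlib.h>

/* Read an integer environment variable, or `fallback` if unset or empty. */
static int env_int(const char *name, int fallback) {
  const char *v = getenv(name);
  if (v == NULL || *v == '\0') return fallback;
  return (int)strtol(v, NULL, 0);
}

/* Traffic class for data messages (assumed default: 0). */
static int data_tc(void) { return env_int("NCCL_IB_TC", 0); }

/* Traffic class for fifo messages: its own setting if present,
   otherwise (assumed) the same class as data traffic. */
static int fifo_tc(void) { return env_int("NCCL_IB_FIFO_TC", data_tc()); }
```

With only NCCL_IB_TC=106 set, both helpers return 106; adding NCCL_IB_FIFO_TC=192 changes the fifo class alone.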

    Net/IB: Add support for IB router

    • Use flid instead of lid if subnets do not match
    • Warn if flid is 0
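
The two rules above can be read as a small address-selection function. The sketch below uses a simplified, hypothetical `ib_addr` struct and `pick_dlid` helper; it does not reproduce the verbs structures or NCCL's actual connection code.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical, simplified view of an IB port address. */
struct ib_addr {
  uint64_t subnet_prefix;
  uint16_t lid;  /* local identifier, valid within one subnet */
  uint16_t flid; /* identifier used when crossing an IB router */
};

/* Pick the destination identifier for a connection local -> remote. */
static uint16_t pick_dlid(const struct ib_addr *local,
                          const struct ib_addr *remote) {
  /* Same subnet: the plain lid is enough. */
  if (local->subnet_prefix == remote->subnet_prefix) return remote->lid;
  /* Different subnets: use the flid, and warn if it was never set. */
  if (remote->flid == 0)
    fprintf(stderr, "warning: subnets differ but remote flid is 0\n");
  return remote->flid;
}
```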

    Optimizations and fixes for device network offload (unpack)

    • Double the default number of channels
    • Cache netDeviceType
    • Fix save/increment head logic to enable Tree support.

    Support ncclGroupStart/End for ncclCommAbort/Destroy

    • Allow Abort/Destroy to be called within a group when managing
      multiple GPUs with a single process.

    Improve Tuner API

    • Provide the original cost table to the plugin so that the plugin
      can leave unknown or disabled algo/proto combinations untouched.
    • Remove nvlsSupport and collnetSupport.
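
To illustrate why receiving the original cost table helps, here is a minimal sketch of a plugin-side helper that expresses a preference without re-enabling anything the core disabled. The table dimensions, the `COST_IGNORE` sentinel, and `prefer_combo` are all hypothetical; the real tuner interface is defined in NCCL's plugin headers and is not reproduced here.

```c
#define N_ALGO  3
#define N_PROTO 2

/* Hypothetical sentinel marking a combination the core has disabled. */
#define COST_IGNORE (-1.0f)

/* Prefer (algo, proto) by giving it the lowest cost, but only if the core
   left that entry enabled. Returns 1 if the preference was applied. */
static int prefer_combo(float table[N_ALGO][N_PROTO], int algo, int proto) {
  if (algo < 0 || algo >= N_ALGO || proto < 0 || proto >= N_PROTO) return 0;
  if (table[algo][proto] == COST_IGNORE) return 0; /* leave it untouched */
  table[algo][proto] = 0.0f;
  return 1;
}
```

Every entry the helper does not touch keeps the cost computed by the core, so combinations the plugin knows nothing about behave as if no plugin were loaded.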

    Do not print version to stdout when using a debug file

    • Also print version from all processes with INFO debug level.
      Fixes issue #1271

    Fix clang warnings in NVTX headers

    • Update NVTX headers to the latest version
      Fixes issue #1270

    Disable port fusion in heterogeneous systems

    • Do not fuse ports if both multi-port and single-port NICs are detected.

    Fix NVLS graph search for dual NICs.

    • Fix NVLS graph search when we have more than one NIC per GPU.

    Fix crash with collnetDirect

    • Add separate graph search for collnetDirect, testing alltoall paths
      and working similarly to the NVLS search.

    Fix hang when nodes have different CPU types

    • Add the CPU type to the rank peer info.
    • Align all ranks on the CPU type after the first allgather.
    • Only use the aligned CPU type for all tuning operations.
      Fixes issue #1136
      Fixes issue #1184
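
The fix relies on every rank deriving the same CPU type from the same gathered data, so that tuning decisions cannot diverge. A minimal sketch of such an alignment step is shown below; the enum values, the `CPU_MIXED` outcome, and `align_cpu_type` are assumptions for illustration, since these notes do not say which value NCCL settles on when ranks disagree.

```c
/* Hypothetical CPU type codes. */
enum { CPU_UNKNOWN = 0, CPU_X86 = 1, CPU_ARM = 2, CPU_MIXED = 99 };

/* Given the CPU types collected from all ranks by the allgather, return
   one value that is identical on every rank: the common type if all ranks
   agree, otherwise an (assumed) "mixed" marker. */
static int align_cpu_type(const int *gathered, int nranks) {
  if (nranks <= 0) return CPU_UNKNOWN;
  int agreed = gathered[0];
  for (int i = 1; i < nranks; i++) {
    if (gathered[i] != agreed) return CPU_MIXED;
  }
  return agreed;
}
```

Because the result depends only on the gathered array, which is the same on all ranks, no rank can pick a different tuning path from its peers.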

    Fix performance of registered send/recv operations

    • Allow for single full size operations
    • Add INFO to confirm the registration of send/recv buffers.

    Move all sync ops to finalize stage

    • Ensure ncclCommDestroy is non-blocking if ncclCommFinalize has
      been called.

    Improve error reporting during SHM segment creation

    Improve support of various compilers
    Merges PR #1177
    Merges PR #1228

    Allow net and tuner plugins to be statically linked

    • Search for ncclNet or ncclTuner symbols in the main binary.
      Merges PR #979

    Clean up includes in plugin examples

    • Harmonize err.h and common.h usage.
    • Add mixed plugin with both net and tuner.