• 2.23.4-1

    Ghost released this 2024-09-17 14:41:17 +08:00 | 17 commits to master since this release

    Add scalable init API

    • Add new ncclCommInitRankScalable to allow for passing multiple
      unique IDs to the init function.
    • Spreads the load onto multiple bootstrap roots, allowing for
      constant bootstrap time.
    • Requires multiple ranks to create a unique ID, and the CPU-side
      ID exchange code to call allgather[v] instead of broadcast.
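
    As a rough illustration of the "multiple ranks create a unique ID"
    requirement, the sketch below picks which ranks would create an ID
    when nroots IDs are spread evenly over nranks ranks. The helper
    names and the even-spacing policy are assumptions made for this
    example; they are not part of NCCL, and other placements may work
    equally well.

```c
#include <assert.h>

/* Rank that creates unique ID number `id_index` when `nroots` IDs are
 * spread evenly over `nranks` ranks. Illustrative policy only; this
 * helper is hypothetical and not part of the NCCL API. */
int root_rank_for_id(int id_index, int nranks, int nroots) {
    assert(nroots > 0 && nroots <= nranks);
    assert(id_index >= 0 && id_index < nroots);
    return (int)((long long)id_index * nranks / nroots);
}

/* Index of the ID that `rank` should create, or -1 if this rank only
 * receives IDs through the application's allgather. */
int id_index_for_rank(int rank, int nranks, int nroots) {
    assert(rank >= 0 && rank < nranks);
    for (int i = 0; i < nroots; i++) {
        if (root_rank_for_id(i, nranks, nroots) == rank) return i;
    }
    return -1;
}
```

    With 8 ranks and 4 IDs, ranks 0, 2, 4 and 6 would each create one
    ID; the application would then allgather the 4 IDs so that every
    rank can pass the same array to ncclCommInitRankScalable.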

    Accelerate init bootstrap operations

    • Reduce the number of calls to allgather.
    • Allow roots to reply early to ranks when information is already
      available.
    • Add an option to use ncclNet instead of sockets to perform
      bootstrap allgather operations.

    Add PAT algorithms for Allgather and ReduceScatter

    • Parallel Aggregated Trees, a variation of the Bruck algorithm.
    • Logarithmic number of network steps for small sizes at scale.
    • Only supports one rank per node at the moment.
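
    To show where the logarithmic step count comes from, here is a
    small host-side simulation of the classic Bruck allgather. It is
    not NCCL's PAT implementation, only the textbook algorithm that
    PAT is described as a variation of; the function name and the
    single-integer "blocks" are assumptions made for this sketch.

```c
#include <assert.h>
#include <stdlib.h>

/* Simulates the classic Bruck allgather for `nranks` ranks, one block
 * per rank. On return, out[r * nranks + b] is the block id stored at
 * position b on rank r, so a correct run gives out[r * nranks + b] == b.
 * Returns the number of communication steps, or -1 if allocation fails. */
int bruck_allgather_sim(int nranks, int *out) {
    assert(nranks > 0 && out != NULL);
    size_t n = (size_t)nranks;
    int *buf = malloc(n * n * sizeof *buf);
    if (buf == NULL) return -1;

    /* Each rank starts with only its own block in slot 0. */
    for (size_t r = 0; r < n; r++) buf[r * n] = (int)r;

    size_t have = 1;   /* blocks currently held by every rank */
    int steps = 0;
    while (have < n) {
        /* Receive up to `have` blocks from the rank `have` positions away. */
        size_t ncopy = have < n - have ? have : n - have;
        for (size_t r = 0; r < n; r++) {
            size_t src = (r + have) % n;
            /* Reads touch slots [0, ncopy), writes touch slots >= have,
             * so updating in place does not disturb this step's inputs. */
            for (size_t j = 0; j < ncopy; j++)
                buf[r * n + have + j] = buf[src * n + j];
        }
        have += ncopy;
        steps++;
    }

    /* Slot i on rank r holds block (r + i) % n; rotate into rank order. */
    for (size_t r = 0; r < n; r++)
        for (size_t i = 0; i < n; i++)
            out[r * n + (r + i) % n] = buf[r * n + i];

    free(buf);
    return steps;
}
```

    For 8 ranks the simulation finishes in 3 steps, and for 5 ranks
    also in 3, matching ceil(log2(nranks)) rather than the nranks - 1
    steps of a ring.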

    Add support for registered buffers for intra-node communication.

    • Allow registered user buffers to be accessed directly intra-node.
    • Avoids extra copies in algorithms which permit it, saving
      memory bandwidth and helping with compute overlap.

    Add profiler plugin API

    • New plugin API for profiling.
    • Supports various levels of profiling, with a hierarchy.

    Asynchronous graph allocation

    • Make calls to cudaMalloc and cudaMemcpy during graph allocation
      asynchronous.
    • Significantly speeds up graph capture.

    Use fatal IB asynchronous events to stop network operation

    • Avoids many other error messages.
    • Only fatal errors are affected; potentially transient errors
      (e.g. port down) do not cause an immediate stop.

    Set P2P level to PXB on AMD CPUs when using more than 2 GPUs per node

    • P2P would cause a significant performance degradation when using
      many GPUs, and therefore many interleaved data flows.
    • Disable P2P through the CPU when we have 3+ GPUs per node; keep it
      enabled when we only have 2 GPUs.
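
    The rule above can be sketched as a small decision function. The
    enum below is hypothetical and only mirrors the names accepted by
    the NCCL_P2P_LEVEL setting (LOC, NVL, PIX, PXB, PHB, SYS); the
    choice of SYS as the level used with 2 GPUs is an assumption of
    this sketch, standing in for "P2P through the CPU stays enabled".

```c
#include <assert.h>

/* Hypothetical enum mirroring the NCCL_P2P_LEVEL names, ordered from
 * the closest path (LOC) to the most distant one (SYS). */
typedef enum { P2P_LOC, P2P_NVL, P2P_PIX, P2P_PXB, P2P_PHB, P2P_SYS } p2p_level_t;

/* Default P2P level on an AMD CPU platform as described in this
 * release: PXB with 3+ GPUs per node, otherwise an assumed SYS. */
p2p_level_t amd_default_p2p_level(int gpus_per_node) {
    assert(gpus_per_node >= 1);
    return gpus_per_node > 2 ? P2P_PXB : P2P_SYS;
}

/* P2P is used when the path between two GPUs is no more distant than
 * the configured level. */
int p2p_allowed(p2p_level_t path, p2p_level_t level) {
    return path <= level;
}
```

    Under these assumptions, a GPU pair whose path crosses the CPU
    (PHB) would use P2P on a 2-GPU node but not on a 4-GPU node.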

    Improve the init logs to report the real NCCL function.

    • Make the log report ncclCommInitRank or ncclCommSplit, rather than
      the generic ncclCommInitRankFunc.

    Add a parameter to set the location of the user configuration file.

    • Add NCCL_CONF_FILE environment variable to set where the user's
      configuration file resides.
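
    A minimal sketch of how such a lookup could work is shown below. It
    assumes the documented per-user default of ~/.nccl.conf when the
    variable is not set, and it treats an empty value as unset; both
    the function name and the empty-value handling are assumptions of
    this example, not a description of NCCL's internal code.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Writes the user configuration path into `out` (capacity `cap`).
 * Returns 0 on success, -1 if no path can be determined or the path
 * does not fit in the buffer. Hypothetical helper, not an NCCL API. */
int resolve_user_conf_path(char *out, size_t cap) {
    assert(out != NULL && cap > 0);
    const char *override = getenv("NCCL_CONF_FILE");
    int n;
    if (override != NULL && override[0] != '\0') {
        /* The environment variable takes precedence. */
        n = snprintf(out, cap, "%s", override);
    } else {
        /* Fall back to the per-user default under $HOME. */
        const char *home = getenv("HOME");
        if (home == NULL) return -1;
        n = snprintf(out, cap, "%s/.nccl.conf", home);
    }
    return (n >= 0 && (size_t)n < cap) ? 0 : -1;
}
```

    For example, launching a job with NCCL_CONF_FILE=/tmp/my-nccl.conf
    would make this helper return that path instead of the file in the
    user's home directory.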

    Increase default IB timeout

    • Increase IB timeout value from 18 to 20.
    • Should help avoid fatal errors on large RoCE systems.
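
    To put the two values in perspective: the InfiniBand local ACK
    timeout is encoded as an exponent, with the actual duration being
    4.096 microseconds times 2 to the power of the value. Assuming the
    NCCL setting maps directly onto that field, the helper below (a
    hypothetical name) converts a value to seconds.

```c
#include <assert.h>

/* Converts an InfiniBand timeout exponent to seconds using
 * 4.096 us * 2^value. Valid exponents are 1..31 (0 means "infinite"
 * in the verbs API and is not handled here). */
double ib_timeout_seconds(int value) {
    assert(value >= 1 && value <= 31);
    return 4.096e-6 * (double)(1ULL << value);
}
```

    Under that assumption, 18 corresponds to about 1.07 seconds per
    attempt and 20 to about 4.29 seconds, a fourfold increase.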

    Add new check for nvidia peermem

    • On Linux kernels 6.6+, /sys/kernel/mm/memory_peers is no longer
      present; check for /sys/module/nvidia_peermem/version instead.
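
    The fallback can be sketched as a check over a list of candidate
    paths. This is a simplified illustration: it tests the two paths
    named above exactly as written, and the helper names are
    assumptions made for the example rather than NCCL functions.

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

/* Returns 1 if at least one of the `count` paths exists, else 0. */
int any_path_exists(const char *const *paths, size_t count) {
    assert(paths != NULL);
    for (size_t i = 0; i < count; i++) {
        if (paths[i] != NULL && access(paths[i], F_OK) == 0) return 1;
    }
    return 0;
}

/* Checks the older location first, then the one present on Linux 6.6+. */
int nvidia_peermem_present(void) {
    static const char *const candidates[] = {
        "/sys/kernel/mm/memory_peers",         /* older kernels */
        "/sys/module/nvidia_peermem/version",  /* Linux 6.6 and later */
    };
    return any_path_exists(candidates,
                           sizeof candidates / sizeof candidates[0]);
}
```

    Checking both locations keeps the detection working on older
    kernels while covering the newer layout.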

    Fix old performance regression when mixing small and large operations.

    • Improves distribution of work on channels.

    Fix crash when NUMA IDs are equal to -1.

    • Can happen when a NIC is a virtual NIC, or when Linux doesn't
      know which NUMA node a device is attached to.
    • Issue NVIDIA/nccl-tests#233

    Fix tree graph search when NCCL_CROSS_NIC is set to 1.

    • Would force NCCL to use the balanced_tree pattern, thereby
      disabling LL128 on platforms with 1 GPU+1 NIC per PCI switch.
    • Would also try to use alternate rings even though it was not
      needed.

    Compiler tweaks and fixes

    Fix stack smash

    Fixes for multi-node NVLink + IB operation

    Coverity fixes and comments.
