Add scalable init API
* Add new ncclCommInitRankScalable to allow for passing multiple
unique IDs to the init function.
* Spreads the load onto multiple bootstrap roots, allowing for
constant bootstrap time.
* Requires multiple ranks to create a unique ID, and the CPU-side
ID exchange code to call allgather[v] instead of broadcast.
Accelerate init bootstrap operations
* Reduce the number of calls to allgather.
* Allow roots to reply early to ranks when information is already
available.
* Add an option to use ncclNet instead of sockets to perform
bootstrap allgather operations.
Add PAT algorithms for Allgather and ReduceScatter
* Parallel Aggregated Trees, variation of Bruck algorithm.
* Logarithmic number of network steps for small sizes at scale.
* Only supports one rank per node at the moment.
Add support for registered buffers for intra-node communication.
* Allow registered user buffers to be accessed directly intra-node
* Avoids extra copies in algorithms which permit it, saving
memory bandwidth and helping with compute overlap.
Add profiler plugin API
* New plugin API for profiling
* Supports various levels of profiling, with a hierarchy.
Asynchronous graph allocation
* Make calls to cudaMalloc and cudaMemcpy during graph allocation
asynchronous.
* Significantly speeds up graph capture.
Use fatal IB asynchronous events to stop network operation
* Avoids many other error messages
* Only fatal errors are affected; potentially transient errors
(e.g. port down) do not cause an immediate stop.
Set P2P level to PXB on AMD CPUs when using more than 2 GPUs per node
* P2P would cause a significant performance degradation when using
many GPUs, and therefore many interleaved data flows.
* Disable P2P through the CPU when we have 3+ GPUs per node; keep it
enabled when we only have 2 GPUs.
Improve the init logs to report the real NCCL function.
* Make the log report ncclCommInitRank or ncclCommSplit, rather than
the generic ncclCommInitRankFunc.
Add a parameter to set the location of the user configuration file.
* Add NCCL_CONF_FILE environment variable to set where the user's
configuration file resides.
Increase default IB timeout
* Increase IB timeout value from 18 to 20.
* Should help avoid fatal errors on large RoCE systems.
Add new check for nvidia peermem
* On linux kernels 6.6+, /sys/kernel/mm/memory_peers is no longer
present; check for /sys/module/nvidia_peermem/version instead.
Fix old performance regression when mixing small and large operations.
* Improves distribution of work on channels.
Fix crash when NUMA IDs are equal to -1.
* Can happen when a NIC is a virtual NIC, or when linux doesn't
know which NUMA node a device is attached to
* Issue NVIDIA/nccl-tests#233
Fix tree graph search when NCCL_CROSS_NIC is set to 1.
* Would force NCCL to use the balanced_tree pattern, thereby
disabling LL128 on platforms with 1 GPU+1 NIC per PCI switch.
* Would also try to use alternate rings even though it was not
needed.
Compiler tweaks and fixes
* PR #1177
* PR #1228
Fix stack smash
* PR #1325
Fixes for multi-node NVLink + IB operation
Coverity fixes and comments.
Add local user buffer registration for NVLink SHARP.
Add tuning plugin support.
Increase net API to v7 to allow for device-side packet reordering;
remove support for v4 plugins.
Add support for RoCE ECE.
Add support for C2C links.
Better detect SHM allocation failures to avoid crash with Bus Error.
Fix missing thread unlocks in bootstrap (Fixes#936).
Disable network flush by default on H100.
Move device code from src/collectives/device to src/device.
Add support for IB SHARP to NVLS (NVLink SHARP algorithm).
Add NVLS+Tree algorithm.
Add support for memory management using cuMem* functions.
Use all NICs for Send/Receive operations on systems with more than
one NIC per GPU (#804).
Add ncclCommSplit primitive, with resource sharing option in config.
Fix alltoallv hang (#788)
Increase number of channels on H100 when we're not limited by NVLink.
Improve error reporting in case of IB failure, printing local and
remote ID (#779).
Add build option to allow compilation against RDMA includes instead
of dynamically loading IB verbs symbols (#802).
Fix context creation for progress thread (#803).
NET/IB: add option to use multiple QPs in round-robin mode.
Fix tree performance issue when NVB is disabled on HCM topologies.
Add support for CUDA 12.0, drop Kepler (sm_35).
Support for H100 features.
Make socket code more robust and protected. Solves #555.
Improve performance on large CUDA graphs, reducing dependencies.
Reduce inter-socket bandwidth on AMD CPUs to favor better paths.
Various fixes to ncclCommAbort.
Make service thread polling resistant to EINTR.
Compile with profiling API by default.
Extend NVTX instrumentation with call arguments.
Add network communication through another GPU connected with NVLink
(PXN).
Add aggregation of messages coming from different local GPUs through
PXN and going to the same destination.
Add new v5 plugin API with grouped receives and tags.
Add compat for v4 plugins.
Add naming of NCCL threads to help debugging.
Fix NVLink detection and avoid data corruption when some NVLinks are
down.
Add support for Relaxed Ordering for IB.
Add profiling and timing infrastructure.
Add support for bfloat16.
Add ncclAvg reduction operation.
Improve performance for aggregated operations.
Improve performance for tree.
Improve network error reporting.
Add NCCL_NET parameter to force a specific network.
Add NCCL_IB_QPS_PER_CONNECTION parameter to split IB traffic onto multiple queue pairs.
Fix topology detection error in WSL2.
Fix proxy memory elements affinity (improve alltoall performance).
Fix graph search on cubemesh topologies.
Fix hang in cubemesh during NVB connections.
Optimization for Tree allreduce on A100.
Improve aggregation performance.
Use shared buffers for inter-node send/recv.
Add NVTX profiling hooks.
Accelerate alltoall connections by merging communication for all
channels.
Add support for one hop communication through NVLink, for faster
send/recv communication on cubemesh topologies like DGX-1.
Improve alltoall scheduling to better balance intra/inter node
communication.
Increase send/recv parallelism by 8x, each warp sending or
receiving to a different peer.
Net: move to v4.
Net: make flush operation asynchronous to accelerate alltoall.
Net: define maximum number of requests.
Fix hang when using LL128 protocol after 2^31 steps.
Fix#379 : topology injection failing when using less GPUs than
described in the XML.
Fix#394 : protocol mismatch causing hangs or crashes when using
one GPU per node.
Add support for network collectives.
Add support for XML topology dump/injection.
Add text values for GDR and P2P Levels, including "NVL".
Add speed detection for PCI, Infiniband and Ethernet cards.
Add CPU detection for ARM and AMD CPUs.
Add support for adaptive routing on Infiniband.
Change NET plugin API to v3 : merge PCI path and GPU pointer
capability into a single structure and add other properties.
Add LL128 Protocol.
Rewrite the topology detection and tree/ring creation (#179). Improve
tree performance by sending/receiving from different GPUs. Add
model-based tuning to switch between the different algorithms and
protocols.
Rework P2P/SHM detection in containers (#155, #248).
Detect duplicated devices and return an error (#231).
Add tuning for GCP
Added detection of IBM/Power NVLink bridge device.
Add NUMA support to PCI distance calculations.
Added NCCL_IGNORE_CPU_AFFINITY env var.
Fix memory leaks; GithubIssue#180
Compiler warning fix; GithubIssue#178
Replace non-standard variable length arrays. GithubIssue#171
Fix Tree+Shared Memory crash. GithubPR#185
Fix LL cleanup hang during long running DL jobs.
Fix NCCL_RINGS environment variable handling.
Added extra checks to catch repeat calls to ncclCommDestroy() GithubIssue#191
Improve bootstrap socket connection reliability at scale.
Fix hostname hashing issue. GithubIssue#187
Code cleanup to rename all non device files from *.cu to *.cc
Add tree algorithms for allreduce to improve performance at scale.
Add ncclCommAbort() and ncclCommGetAsyncError() to properly handle
network errors and be permit recover.
Detect initial CPU affinity and no longer escape it.
Add support for inter-node communication using sockets and InfiniBand/RoCE.
Improve latency.
Add support for aggregation.
Improve LL/regular tuning.
Remove tests as those are now at github.com/nvidia/nccl-tests .