261 Commits

Author SHA1 Message Date
Sylvain Jeaugey
e261ecea18 2.23.4-1
Add scalable init API
 * Add new ncclCommInitRankScalable to allow for passing multiple
   unique IDs to the init function.
 * Spreads the load onto multiple bootstrap roots, allowing for
   constant bootstrap time.
 * Requires multiple ranks to create a unique ID, and the CPU-side
   ID exchange code to call allgather[v] instead of broadcast.

Accelerate init bootstrap operations
 * Reduce the number of calls to allgather.
 * Allow roots to reply early to ranks when information is already
   available.
 * Add an option to use ncclNet instead of sockets to perform
   bootstrap allgather operations.

Add PAT algorithms for Allgather and ReduceScatter
 * Parallel Aggregated Trees, variation of Bruck algorithm.
 * Logarithmic number of network steps for small sizes at scale.
 * Only supports one rank per node at the moment.
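
The "logarithmic number of network steps" property can be illustrated with a small simulation of the classic Bruck allgather, which PAT is described above as a variation of. This is only a sketch of the textbook algorithm, not NCCL's PAT implementation; the function name `bruck_allgather_sim` and the `MAX_RANKS` limit are hypothetical and chosen for illustration.

```c
#include <assert.h>

#define MAX_RANKS 64

/* Simulates a Bruck-style allgather among n ranks (hypothetical helper).
 * Each block is identified by the rank that contributed it.
 * On return, out[r][j] is the block rank r holds in slot j.
 * Returns the number of communication steps that were needed. */
static int bruck_allgather_sim(int n, int out[MAX_RANKS][MAX_RANKS]) {
  static int buf[MAX_RANKS][MAX_RANKS];
  assert(n >= 1 && n <= MAX_RANKS);

  /* Every rank starts with only its own block, stored at index 0. */
  for (int r = 0; r < n; r++) buf[r][0] = r;

  int steps = 0;
  for (int have = 1; have < n; steps++) {
    /* Receive as many blocks as already held, capped by what is missing. */
    int count = (have < n - have) ? have : n - have;
    for (int r = 0; r < n; r++) {
      int src = (r + have) % n; /* peer at distance `have` */
      /* Reads touch indices below `have`, writes start at `have`,
       * so updating all ranks in place within one step is safe. */
      for (int i = 0; i < count; i++) buf[r][have + i] = buf[src][i];
    }
    have += count;
  }

  /* Rank r now holds blocks r, r+1, ... (mod n); rotate them into place. */
  for (int r = 0; r < n; r++)
    for (int i = 0; i < n; i++) out[r][(r + i) % n] = buf[r][i];
  return steps;
}
```

In this model the number of blocks held doubles at every step, so 8 ranks finish in 3 steps where a ring would need 7.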

Add support for registered buffers for intra-node communication.
 * Allow registered user buffers to be accessed directly intra-node
 * Avoids extra copies in algorithms which permit it, saving
   memory bandwidth and helping with compute overlap.

Add profiler plugin API
 * New plugin API for profiling
 * Supports various levels of profiling, with a hierarchy.

Asynchronous graph allocation
 * Make calls to cudaMalloc and cudaMemcpy during graph allocation
   asynchronous.
 * Significantly speeds up graph capture.

Use fatal IB asynchronous events to stop network operation
 * Avoids many other error messages
 * Only fatal errors are affected; potentially transient errors
   (e.g. port down) do not cause an immediate stop.

Set P2P level to PXB on AMD CPUs when using more than 2 GPUs per node
 * P2P would cause a significant performance degradation when using
   many GPUs, and therefore many interleaved data flows.
 * Disable P2P through the CPU when we have 3+ GPUs per node; keep it
   enabled when we only have 2 GPUs.

Improve the init logs to report the real NCCL function.
 * Make the log report ncclCommInitRank or ncclCommSplit, rather than
   the generic ncclCommInitRankFunc.

Add a parameter to set the location of the user configuration file.
 * Add NCCL_CONF_FILE environment variable to set where the user's
   configuration file resides.

Increase default IB timeout
 * Increase IB timeout value from 18 to 20.
 * Should help avoid fatal errors on large RoCE systems.
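
To get a feel for what moving from 18 to 20 means, the sketch below assumes the value is interpreted like the InfiniBand local ACK timeout field, that is 4.096 µs × 2^timeout. The entry above does not state the formula, so treat that as an assumption; the helper name `ib_ack_timeout_seconds` is hypothetical.

```c
#include <assert.h>

/* Converts an IB timeout exponent into seconds, ASSUMING the usual
 * InfiniBand encoding of 4.096 us * 2^timeout (hypothetical helper). */
static double ib_ack_timeout_seconds(int timeout) {
  assert(timeout >= 1 && timeout <= 31);
  return 4.096e-6 * (double)(1UL << timeout);
}
```

Under that assumption, 18 corresponds to roughly 1.07 s and 20 to roughly 4.29 s: each increment doubles the timeout, so the change multiplies it by four.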

Add new check for nvidia peermem
 * On linux kernels 6.6+, /sys/kernel/mm/memory_peers is no longer
   present; check for /sys/module/nvidia_peermem/version instead.

Fix old performance regression when mixing small and large operations.
 * Improves distribution of work on channels.

Fix crash when NUMA IDs are equal to -1.
 * Can happen when a NIC is a virtual NIC, or when linux doesn't
   know which NUMA node a device is attached to
 * Issue NVIDIA/nccl-tests#233

Fix tree graph search when NCCL_CROSS_NIC is set to 1.
 * Would force NCCL to use the balanced_tree pattern, thereby
   disabling LL128 on platforms with 1 GPU+1 NIC per PCI switch.
 * Would also try to use alternate rings even though it was not
   needed.

Compiler tweaks and fixes
 * PR #1177
 * PR #1228

Fix stack smash
 * PR #1325

Fixes for multi-node NVLink + IB operation

Coverity fixes and comments.
2024-09-10 23:57:16 -07:00
Sylvain Jeaugey
178b6b7590 2.22.3-1
Rework core for NVIDIA Trusted Computing
 * Compress work structs so that they are shared between channels
 * Utilize the full amount of kernel argument space permitted (4k)
   before resorting to work fifo.
 * Rework the task preprocessing phase.
 * Use a separate abortDevFlag which is kept in sync with abortFlag
   using cudaMemcpy operations.
 * Rename src/include/align.h to src/include/bitops.h

Add lazy connection establishment for collective operations
 * Move buffer allocation and connection establishment to the first
   collective operation using that algorithm.
 * Accelerate init time and reduce memory usage.
 * Avoid allocating NVLS buffers if all calls are registered.
 * Compute algo/proto in ncclLaunchCollTasksInfo early on.
 * Connect peers in ncclCollPreconnectFunc if not connected already.
 * Also move shared buffer creation to the first send/recv call.

Accelerate intra-node NVLink detection
 * Make each rank only detect NVLinks attached to its GPU.
 * Fuse XMLs to reconstruct the full NVLink topology

Add init profiling to report time spent in different init phases.
 * Report timings of bootstrap, allgather, search, connect, etc.
 * Add new "PROFILE" category for NCCL_DEBUG_SUBSYS.

Add support for PCI p2p on split PCI switches
 * Detect split PCI switches through a kernel module exposing
   switch information.
 * Update the topology XML and graph to add those inter-switch
   connections.

Add cost estimation API
 * Add a new ncclGroupEndSimulate primitive to return the estimated
   time a group would take.

Net/IB: Add separate traffic class for fifo messages
 * Add NCCL_IB_FIFO_TC to control the traffic class of fifo messages
   independently from NCCL_IB_TC.
   Merges PR #1194

Net/IB: Add support for IB router
 * Use flid instead of lid if subnets do not match
 * Warn if flid is 0

Optimizations and fixes for device network offload (unpack)
 * Double the default number of channels
 * Cache netDeviceType
 * Fix save/increment head logic to enable Tree support.

Support ncclGroupStart/End for ncclCommAbort/Destroy
 * Allow Abort/Destroy to be called within a group when managing
   multiple GPUs with a single process.

Improve Tuner API
 * Provide to the plugin the original cost table so that the plugin
   can leave unknown or disabled algo/proto combinations untouched.
 * Remove nvlsSupport and collnetSupport.

Do not print version to stdout when using a debug file
 * Also print version from all processes with INFO debug level.
   Fixes issue #1271

Fix clang warnings in NVTX headers
 * Update NVTX headers to the latest version
   Fixes issue #1270

Disable port fusion in heterogeneous systems
 * Do not fuse ports if a mix of multi-port and single port are detected.

Fix NVLS graphs search for dual NICs.
 * Fix NVLS graph search when we have more than one NIC per GPU.

Fix crash with collnetDirect
 * Add separate graph search for collnetDirect, testing alltoall paths
   and working similarly to the NVLS search.

Fix hang when nodes have different CPU types
 * Add the CPU type to the rank peer info.
 * Align all ranks on the CPU type after the first allgather.
 * Only use the aligned CPU type for all tuning operations.
   Fixes issue #1136
   Fixes issue #1184

Fix performance of registered send/recv operations
 * Allow for single full size operations
 * Add INFO to confirm the registration of send/recv buffers.

Move all sync ops to finalize stage
 * Ensure ncclCommDestroy is non-blocking if ncclCommFinalize has
   been called.

Improve error reporting during SHM segment creation

Improve support of various compilers
   Merges PR #1177
   Merges PR #1228

Allow net and tuner plugins to be statically linked
 * Search for ncclNet or ncclTuner symbols in the main binary.
   Merges PR #979

Plugin examples includes cleanup
 * Harmonize err.h and common.h usage.
 * Add mixed plugin with both net and tuner.
v2.22.3-1
2024-06-19 01:57:16 -07:00
Sylvain Jeaugey
529ee691c3 Add description for regIsGlobal in the NET API documentation 2024-06-14 01:57:43 -07:00
Sylvain Jeaugey
ab2b89c4c3 2.21.5-1
Add support for IB SHARP 1PPN operation with user buffers.
Improve support for MNNVL, add NVLS support and multi-clique support.
 * Detect the NVLS clique through NVML
 * Exchange XML between peers in the same NVLS clique and fuse XMLs
   before creating the topology graph.
 * Rework bootstrap allgather algorithms to allow for large allgather
   operations intra-node (XML exchange).
Net/IB: add support for dynamic GID detection.
 * Automatically select RoCEv2/IPv4 interface by default. Allow to
   select IPv6 or even the network/mask.
Reduce NVLS memory usage.
 * Add stepSize as property of a connection to allow for different
   sizes on different peers; set it to 128K for NVLink SHARP.
Improve tuner loading
 * Look for more paths, be more consistent with the network device
   plugin.
 * Also search for tuner support inside the net plugin.
Improve tuner API
 * Add context to support multi-device per process.
Add magic number around comm object to detect comm corruption.
 * Add some basic check around communicators so that we can report a
   problem when a communicator gets corrupted or a wrong comm pointer
   is passed to NCCL.
Fix net/IB error path. Github PR #1164
Fix collnet rail mapping with split comm.
Fix packet reordering issue causing bootstrap mismatch
 * Use a different tag in ncclTransportP2pSetup for the connectInfo
   exchange and the following barrier.
Fix hang when crossNic is inconsistent between ranks.
Fix minCompCap/maxCompCap computation. Github issue #1184
v2.21.5-1
2024-04-02 01:53:21 -07:00
jbachan
6dd51f15bf
Merge pull request #1217 from crazy-JiangDongHua/bugfix_undo_plan
Bug in plan enqueue logic where plans could be silently not launched for some communicators. Triggered when all of the following are true:
1. Multiple communicators per ncclGroup.
2. Communicators within a group have different plan counts.
3. Intra-process launch barrier disabled.
2024-03-18 10:12:26 -07:00
FrankJ
9ef920a77b [bugfix] Save undo plans in some cases 2024-03-12 00:00:16 +08:00
Sylvain Jeaugey
48bb7fec79 2.20.5-1
Fix UDS connection failure when using ncclCommSplit. Issue #1185
v2.20.5-1
2024-02-26 02:52:39 -08:00
Sylvain Jeaugey
b6475625fb 2.20.3-1
Add support for alternating rings, allow for cross-nic rings without
cross-rail communication.
Add support for user buffer registration for network send/recv.
Optimize aggregated operations to better utilize all channels.
Add flattening for BCM PCI gen5 switches.
Add support for inter-node NVLink communication
Add support for port fusion in NET/IB.
Add support for ReduceScatter and AllGather using Collnet.
Update net API to v8.
Fix hang during A2A connection.
v2.20.3-1
2024-02-13 04:22:38 -08:00
Sylvain Jeaugey
b6d7438d31 Merge remote-tracking branch 'origin/master' 2023-11-20 05:07:23 -08:00
David Addison
16b5be19f6
Merge pull request #1070 from Flamefire/fix-cpuid2
Fix use of CPUID overwriting registers in use
2023-11-18 11:05:42 -08:00
Alexander Grund
cece6415b0 Fix use of CPUID overwriting registers in use.
CPUID writes to EAX, EBX, ECX, and EDX, so the inline asm must state that.
Otherwise registers currently in use might get overwritten, which may
cause all kinds of failures like segfaults or wrong results.

Alternatively `__cpuid` can be used which avoids this and related issues.
So do that as suggested in the GCC issue https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112513
2023-11-14 12:38:02 +01:00
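
The `__cpuid` approach mentioned in the commit above can be sketched as follows. This is not the code from the patch: it is a minimal illustration that assumes GCC or Clang on x86, where `<cpuid.h>` provides the `__cpuid` macro; the helper name `cpu_vendor` is hypothetical. The macro declares all four output registers to the compiler, which is exactly what a hand-written inline asm with a missing output would fail to do.

```c
#include <assert.h>
#include <string.h>
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

/* Writes the CPU vendor string (up to 12 characters plus NUL) into
 * vendor, which must hold 13 bytes (hypothetical helper).
 * Returns 1 on x86, 0 on other architectures. */
static int cpu_vendor(char vendor[13]) {
  assert(vendor != NULL);
  memset(vendor, 0, 13);
#if defined(__x86_64__) || defined(__i386__)
  unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
  /* The macro lists EAX, EBX, ECX and EDX as outputs, so the compiler
   * will not keep live values in them across the instruction. */
  __cpuid(0, eax, ebx, ecx, edx);
  /* Leaf 0 returns the vendor string in EBX, EDX, ECX order. */
  memcpy(vendor + 0, &ebx, 4);
  memcpy(vendor + 4, &edx, 4);
  memcpy(vendor + 8, &ecx, 4);
  return 1;
#else
  return 0;
#endif
}
```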
Sylvain Jeaugey
88d44d777f 2.19.4-1
Split transport connect phase into multiple steps to avoid port
exhaustion when connecting alltoall at large scale. Defaults to 128
peers per round.
Fix memory leaks on CUDA graph capture.
Fix alltoallv crash on self-sendrecv.
Make topology detection more deterministic when PCI speeds are not
available (fix issue #1020).
Properly close shared memory in NVLS resources.
Revert proxy detach after 5 seconds.
Add option to print progress during transport connect.
Add option to set NCCL_DEBUG to INFO on first WARN.
v2.19.4-1
2023-11-13 10:36:12 -08:00
Sylvain Jeaugey
0e35f5d390 Merge tag 'v2.19.3-1' 2023-10-25 06:51:36 -07:00
Sylvain Jeaugey
0b083e5209 2.18.6-1 v2.18.6-1 2023-10-10 00:34:18 -07:00
Sylvain Jeaugey
8c6c595185 2.19.3-1
H800/H100 fixes and tuning.
Re-enable intra-process direct pointer buffer access when CUMEM is
enabled.
v2.19.3-1
2023-09-26 05:57:15 -07:00
Sylvain Jeaugey
3435178b6c Merge remote-tracking branch 'origin/master' into v2.19 2023-09-26 05:55:56 -07:00
Sylvain Jeaugey
f9c3dc251e 2.19.1-1
Add local user buffer registration for NVLink SHARP.
Add tuning plugin support.
Increase net API to v7 to allow for device-side packet reordering;
remove support for v4 plugins.
Add support for RoCE ECE.
Add support for C2C links.
Better detect SHM allocation failures to avoid crash with Bus Error.
Fix missing thread unlocks in bootstrap (Fixes #936).
Disable network flush by default on H100.
Move device code from src/collectives/device to src/device.
v2.19.1-1
2023-09-26 05:50:33 -07:00
Kaiming Ouyang
4365458757 Fix cudaMemcpyAsync bug
We are trying to use the copy result of first cudaMemcpyAsync in the
second cudaMemcpyAsync without sync in between. This patch fixes it
by allocating a CPU side array to cache device side addr so that we
can avoid this consecutive cuda mem copy.

Fixes #957
2023-09-20 05:51:14 -07:00
Sylvain Jeaugey
559b70f86c 2.18.5-1
Fix NVLS search (issue #931).
Increase max IB NICs to 32.
Fix inconsistent device ordering (issue #820).
Try to use different devices for different GPUs in systems with
more than one NIC per GPU.
v2.18.5-1
2023-08-23 06:32:36 -07:00
Sylvain Jeaugey
8ed014bae9 Fix inter-node NVLS graph search
We were passing a net ID instead of a gpu index, which could cause
crashes if those were unrelated (and they usually are).

Issue #931
2023-08-02 07:06:35 -07:00
Dmitrii Gabor
6e24ef4e1f Prevent WR index truncation in the InfiniBand transport plugin 2023-06-28 11:39:19 +02:00
Sylvain Jeaugey
ea38312273 2.18.3-1
Fix data corruption with Tree/LL128 on systems with 1GPU:1NIC.
Fix hang with Collnet on bfloat16 on systems with less than one NIC
per GPU.
Fix long initialization time.
Fix data corruption with Collnet when mixing multi-process and
multi-GPU per process.
Fix crash when shared memory creation fails.
Fix Avg operation with Collnet/Chain.
Fix performance of alltoall at scale with more than one NIC per GPU.
Fix performance for DGX H800.
Fix race condition in connection progress causing a crash.
Fix network flush with Collnet.
Fix performance of aggregated allGather/reduceScatter operations.
Fix PXN operation when CUDA_VISIBLE_DEVICES is set.
Fix NVTX3 compilation issues on Debian 10.
v2.18.3-1
2023-06-14 01:29:17 -07:00
Sylvain Jeaugey
d97a32fac8 2.18.1-1
Add support for IB SHARP to NVLS (NVLink SHARP algorithm).
Add NVLS+Tree algorithm.
Add support for memory management using cuMem* functions.
Use all NICs for Send/Receive operations on systems with more than
one NIC per GPU (#804).
Add ncclCommSplit primitive, with resource sharing option in config.
Fix alltoallv hang (#788)
Increase number of channels on H100 when we're not limited by NVLink.
Improve error reporting in case of IB failure, printing local and
remote ID (#779).
Add build option to allow compilation against RDMA includes instead
of dynamically loading IB verbs symbols (#802).
Fix context creation for progress thread (#803).
NET/IB: add option to use multiple QPs in round-robin mode.
Fix tree performance issue when NVB is disabled on HCM topologies.
v2.18.1-1
2023-04-18 03:58:25 -07:00
David Addison
9b7d5edbfc
Merge pull request #822 from KaimingOuyang/github/pytorch-hang-fix
Shutdown socket before close in ncclSocketClose()
2023-04-14 19:52:45 -07:00
Kaiming Ouyang
006b6bc7dc Add a comment to shutdown() in ncclSocketClose 2023-04-13 09:13:44 -07:00
Kaiming Ouyang
367e9b61c3 Shutdown socket before close in ncclSocketClose() 2023-04-13 09:11:52 -07:00
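
The shutdown-before-close pattern named in the commits above looks roughly like the sketch below. It is not the body of `ncclSocketClose()`; the helper name `socket_shutdown_and_close` is hypothetical, and the rationale in the comment (waking up blocked readers and the peer promptly) is an assumption about why the change helps, based on general POSIX socket behavior on Linux.

```c
#include <assert.h>
#include <sys/socket.h>
#include <unistd.h>

/* Shuts down both directions of a socket, then closes the descriptor
 * (hypothetical helper). Returns 0 on success, -1 if close() fails. */
static int socket_shutdown_and_close(int fd) {
  assert(fd >= 0);
  /* shutdown() acts on the connection rather than on the descriptor, so
   * the peer (and, on Linux, a thread blocked in recv() on this socket)
   * observes end-of-file. Errors such as ENOTCONN are ignored because
   * the descriptor is closed regardless. */
  (void)shutdown(fd, SHUT_RDWR);
  return close(fd);
}
```

With a connected pair of sockets, calling this on one end makes a subsequent `read()` on the other end return 0 (end-of-file) instead of blocking.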
Sylvain Jeaugey
5d3ab08b69 2.17.1-1
Add new NVLS algorithm for allreduce using NVLink SHARP (intra-node only).
Add new config options: cgaClusterSize, minCTAs, maxCTAs, netName.
Enable LL128 when we use PXN to close rings.
NVTX3 includes update.
Fix crash when one CollNet (SHARP) rail fails to initialize.
v2.17.1-1
2023-03-01 00:39:04 -08:00
Sylvain Jeaugey
f3d5166783 2.16.5-1
Add support for 400Gbit NDR network adapters (CX7)
Handle EINTR in socket poll() function
Add NCCL_PROGRESS_APPENDOP_FREQ to control op append overhead
Resource cleanup fixes
Fix double free in case of init failure
Fix crash in ncclCommAbort
Revert AMD speed commit
v2.16.5-1
2023-02-02 12:52:47 -08:00
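
"Handle EINTR in socket poll() function" in the 2.16.5-1 entry above refers to a common POSIX idiom: a signal can interrupt `poll()`, which then fails with `EINTR` even though nothing is wrong with the sockets. The sketch below shows the generic retry loop, not NCCL's actual code; the helper name `poll_retry_eintr` is hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <stddef.h>

/* Calls poll() and retries when it is interrupted by a signal
 * (hypothetical helper). Returns what poll() returns otherwise.
 * Note: each retry restarts with the full timeout_ms. */
static int poll_retry_eintr(struct pollfd* fds, nfds_t nfds, int timeout_ms) {
  assert(fds != NULL || nfds == 0);
  int ret;
  do {
    ret = poll(fds, nfds, timeout_ms);
  } while (ret < 0 && errno == EINTR);
  return ret;
}
```

A caller that needs a strict overall deadline would have to recompute the remaining timeout before each retry; this sketch deliberately keeps the loop minimal.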
Rashika Kheria
93840e7476 Fix maximum handle size for NCCL Net v4 API
NCCL Net v4 supports a maximum handle size of 64 bytes whereas the
ext-net example header files set it for NCCL Net v3. Since the
`aws-ofi-nccl` plugin plans to follow the example header files, fix it
here.

Signed-off-by: Rashika Kheria <rashika@amazon.com>
2023-01-18 13:31:57 +01:00
Sylvain Jeaugey
28189e2df8 2.16.2-1
Add support for CUDA 12.0, drop Kepler (sm_35).
Support for H100 features.
Make socket code more robust and protected. Solves #555.
Improve performance on large CUDA graphs, reducing dependencies.
Reduce inter-socket bandwidth on AMD CPUs to favor better paths.
Various fixes to ncclCommAbort.
Make service thread polling resistant to EINTR.
Compile with profiling API by default.
Extend NVTX instrumentation with call arguments.
v2.16.2-1
2022-11-30 02:31:59 -08:00
Sylvain Jeaugey
614b49f0de Fix google-fastsocket plugin build 2022-11-22 02:13:13 -08:00
Sylvain Jeaugey
55b1d8ab98 Add documentation for NCCL NET plugins
Also repurpose dummy plugin as example, including headers and
compat layers from v6 to v2.
2022-11-22 02:12:53 -08:00
Sylvain Jeaugey
2f4cb874ba Merge tag 'v2.15.5-1' 2022-10-25 01:15:22 -07:00
Sylvain Jeaugey
cb111f764a 2.15.5-1
Fix crash with CollnetChain on some node topologies
Fix hang when interleaving the capture of different graphs
Fix hang during init in multi-threaded mode
Fix potential data corruption with LL128 protocol on unaligned buffers.
Fix CPU usage during preconnect
Fixes double-free in the error path for ncclCommInitAll
Workaround hang on H100 with Ring/LL128 on 2 GPUs.
v2.15.5-1
2022-10-25 00:55:55 -07:00
Sylvain Jeaugey
d128d62238 Merge tag 'v2.15.1-1' 2022-10-07 11:00:26 -07:00
John Bachan
2401f4a918 Fixes a double-free in the error path of ncclCommInitAll.
Fixes https://github.com/NVIDIA/nccl/issues/726
2022-10-03 17:12:32 -07:00
Sylvain Jeaugey
da8152e57a 2.15.1-1
Add support for H100 (sm90).
Make sure NCCL kernels honor user stream priorities.
v2.15.1-1
2022-09-27 02:31:13 -07:00
Sylvain Jeaugey
99c28f2e75 Merge remote-tracking branch 'origin/master' 2022-09-27 02:24:41 -07:00
Cliff Woolley
78313a6d21 Use compatibility shim only with static cudart
Closes issue 658
2022-09-27 02:22:48 -07:00
Sylvain Jeaugey
ecab28a7c9 Fix potential deadlock during init in multi-thread mode.
Make sure all calls to cudaMalloc (including devCommSetup) are
made before the last bootstrapBarrier. That way, we avoid calls to
cudaMalloc being blocked by a NCCL kernel launched on another GPU by
another thread which completed init faster.

Resolve #623.
2022-09-26 02:13:10 -07:00
Jane Xu
f89fd4777d address review comments 2022-09-20 11:58:33 +02:00
Jane Xu
79fb0326ac Fix intermittent 11.6 builds: generate unique .cu file for each object file 2022-09-20 11:58:33 +02:00
Sylvain Jeaugey
c4e2aa6c79 2.14.3-1
Add support for improved fault tolerance: non-blocking mode, new
init function with config, and ncclCommFinalize function.
Reintroduce collnet+chain algorithm, alongside collnet+direct.
Add LL protocol for intra-node P2P (on by default) and network
communication (off by default).
Use network instead of shared memory when performance is better.
Fix: wait for CUDA graph destroy before destroying comm with linked
graph resources.
Remove aggressive polling during enqueue.
Fix DMABUF fallback on MOFED 5.4 and earlier.
v2.14.3-1
2022-08-18 02:53:17 -07:00
Ching-Hsiang Chu
e1d9b273b0 fix NCCL_DEBUG_FILE
Summary: NCCL_DEBUG_FILE does not work properly since the recent v2.13.4 updates (https://github.com/NVIDIA/nccl/pull/682) because it now sets `ncclDebugLevel` after parsing `NCCL_DEBUG_FILE`. This patch moves the parsing of `tempNcclDebugLevel` before the processing of `NCCL_DEBUG_FILE` to ensure `NCCL_DEBUG_FILE` is parsed only when `NCCL_DEBUG > NCCL_LOG_VERSION` (same as the previous behavior).

Differential Revision: D38415208

fbshipit-source-id: 5689bbb798e73efb9e8594557666987f07e89a30
2022-08-18 11:50:42 +02:00
Sylvain Jeaugey
19ab67d172 2.13.4-1
Optimize CUDA graph launch; avoid launching a CPU callback for
intra-node operations.
Simplify kernel common code to improve the latency of send/recv
operations.
Strengthen CUDA streams semantics.
Change NET API to v6, to add dmabuf support.
Add ncclGetLastError() function.
Add ncclRemoteError code and use it for remote network errors.
Support the use of a different NCCL_NET parameter per communicator.
Add support for SHM and P2P transfers using cudaMemcpy.
v2.13.4-1
2022-07-11 08:10:34 -07:00
Sylvain Jeaugey
7aa1c46fd5 2.12.12-1
Improve allreduce performance when we have more than one network interface per
GPU and we need to use PXN to close rings.
Add support for PCI Gen5 on 5.4 kernels.
Fix crash when setting NCCL_SET_THREAD_NAME.
Fix random crash in init due to uninitialized struct.
Fix hang on cubemesh topologies.
Add P2P_DIRECT_DISABLE parameter to disable direct access to pointers within a
process.
v2.12.12-1
2022-05-13 00:26:57 -07:00
Sylvain Jeaugey
9bfc1c6e35 Update Makefile to install static library.
Make sure `make install` also installs the static library.
Fixes #662
2022-04-08 14:00:43 +02:00
Sylvain Jeaugey
8133784b32 Merge remote-tracking branch 'origin/master' 2022-03-30 02:29:05 -07:00
Sylvain Jeaugey
353e8ba446 2.12.10-1
Fix bug with CollNet
Fix bug with zero-bytes send/recv operations
Fix NCCL_PARAM implementation to avoid taking a lock on every call
Fix bug when setting NCCL_IB_QPS_PER_CONNECTION to more than one.
Improve error reporting for network errors.
v2.12.10-1
2022-03-30 02:27:01 -07:00
Sylvain Jeaugey
2247152a8e Fix merging error 2022-03-30 02:14:32 -07:00