Optimization for Tree allreduce on A100.
Improve aggregation performance.
Use shared buffers for inter-node send/recv.
Add NVTX profiling hooks.
Accelerate alltoall connections by merging communication for all channels.
Add support for one hop communication through NVLink, for faster send/recv communication on cubemesh topologies like DGX-1.
Improve alltoall scheduling to better balance intra/inter node communication.
Increase send/recv parallelism by 8x, each warp sending or receiving to a different peer.
Net: move to v4.
Net: make flush operation asynchronous to accelerate alltoall.
Net: define maximum number of requests.
Fix hang when using LL128 protocol after 2^31 steps.
Fix #379 : topology injection failing when using less GPUs than described in the XML.
Fix #394 : protocol mismatch causing hangs or crashes when using one GPU per node.
31 lines
1.2 KiB
Plaintext
Source: nccl
Section: libs
Maintainer: cudatools <cudatools@nvidia.com>
Priority: optional
Build-depends: debhelper(>=9)
Standards-Version: 3.9.5

Package: libnccl${nccl:Major}
Section: libs
Architecture: ${pkg:Arch}
Depends: ${misc:Depends}, ${shlibs:Depends}
Description: NVIDIA Collective Communication Library (NCCL) Runtime
 NCCL (pronounced "Nickel") is a stand-alone library of standard collective
 communication routines for GPUs, implementing all-reduce, all-gather, reduce,
 broadcast, and reduce-scatter.
 It has been optimized to achieve high bandwidth on any platform using PCIe,
 NVLink, NVswitch, as well as networking using InfiniBand Verbs or TCP/IP
 sockets.

Package: libnccl-dev
Section: libdevel
Architecture: ${pkg:Arch}
Depends: ${misc:Depends}, ${shlibs:Depends}, libnccl${nccl:Major} (= ${binary:Version})
Description: NVIDIA Collective Communication Library (NCCL) Development Files
 NCCL (pronounced "Nickel") is a stand-alone library of standard collective
 communication routines for GPUs, implementing all-reduce, all-gather, reduce,
 broadcast, and reduce-scatter.
 It has been optimized to achieve high bandwidth on any platform using PCIe,
 NVLink, NVswitch, as well as networking using InfiniBand Verbs or TCP/IP
 sockets.