96 Commits

Author SHA1 Message Date
Sylvain Jeaugey
cdae05b277 Improve INFO message when external network is not found.
Fix #162
2018-12-04 12:10:58 -08:00
David Addison
5fe2618c0e Fixed some compilation errors when TRACE=1 is set 2018-11-29 14:12:14 -08:00
Sylvain Jeaugey
eed8218e17 Rework shared memory code to use SYSCHECK macros.
This is to handle EINTR/EAGAIN properly (issue #137), and also
make the code consistent with the rest.

Unfortunately posix_fallocate and mmap do not follow the classic
return code/errno pattern, so we need to write wrappers around those
functions.
2018-11-29 12:52:13 -08:00
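The commit above notes that posix_fallocate and mmap do not follow the usual "return -1 and set errno" convention. A minimal sketch of what such wrappers might look like follows. The names `fallocate_errno` and `mmap_errno` are hypothetical and are not taken from the NCCL sources; the only facts relied on are that posix_fallocate returns the error number directly and that mmap signals failure with MAP_FAILED.

```c
/* Expose posix_fallocate and MAP_ANONYMOUS under strict -std=c99/c11 on glibc. */
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>

/* Hypothetical wrapper: posix_fallocate() returns the error number itself
 * (0 on success) and does not set errno, so translate it into the classic
 * "-1 and errno" pattern. Retry if the call was interrupted. */
static int fallocate_errno(int fd, off_t offset, off_t len) {
  int err;
  do {
    err = posix_fallocate(fd, offset, len);
  } while (err == EINTR);
  if (err != 0) {
    errno = err;
    return -1;
  }
  return 0;
}

/* Hypothetical wrapper: mmap() reports failure as MAP_FAILED rather than
 * -1 or NULL. Return 0/-1 and hand the mapping back through `out`;
 * errno is already set by mmap() on failure. */
static int mmap_errno(void **out, size_t len, int prot, int flags,
                      int fd, off_t offset) {
  void *p = mmap(NULL, len, prot, flags, fd, offset);
  if (p == MAP_FAILED) return -1;
  *out = p;
  return 0;
}
```

With both calls normalized this way, a generic checking macro that tests for -1 and inspects errno can treat them like any other system call.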
Sylvain Jeaugey
302d538b73 Rework SYSCHECK macros to better handle retries.
SYSCHECKVAL was not retrying when a retry was needed. Since not all
calls are inside a loop, that means we could silently miss an
EINTR/EAGAIN return code.

Also rework the socket connection code and improve error reporting.
2018-11-29 12:52:13 -08:00
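The retry behaviour described in the commit above can be sketched as follows. `RETRY_SYSCALL` and `write_all` are illustrative names, not NCCL's actual SYSCHECK macros, and the sketch assumes a blocking file descriptor (on a non-blocking one, looping on EAGAIN like this would spin).

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical macro: evaluate `expr` again while it fails with EINTR or
 * EAGAIN, so a caller that is not itself inside a loop cannot silently
 * miss a transient error. Any other failure is reported on stderr. */
#define RETRY_SYSCALL(retval, expr, name)                          \
  do {                                                             \
    do {                                                           \
      (retval) = (expr);                                           \
    } while ((retval) == -1 && (errno == EINTR || errno == EAGAIN)); \
    if ((retval) == -1)                                            \
      fprintf(stderr, "%s failed: %s\n", (name), strerror(errno)); \
  } while (0)

/* Write exactly `len` bytes to `fd`, retrying transient errors and
 * continuing after short writes. Returns 0 on success, -1 on error. */
static int write_all(int fd, const void *buf, size_t len) {
  const char *p = (const char *)buf;
  size_t done = 0;
  while (done < len) {
    ssize_t n;
    RETRY_SYSCALL(n, write(fd, p + done, len - done), "write");
    if (n == -1) return -1;
    done += (size_t)n;
  }
  return 0;
}
```

Keeping the retry inside the macro means the check behaves the same whether or not the call site happens to sit in a loop, which is the gap the commit message describes.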
Sylvain Jeaugey
61b50a63ef Improve net API description 2018-11-26 16:24:31 -08:00
Sylvain Jeaugey
98adf2fe11 Make network isend/irecv non blocking 2018-11-26 16:24:31 -08:00
Sylvain Jeaugey
0d3a20f96d Add support for external network.
Dynamically load external network from libnccl-net.so.
Add init function in networks.
Move PCI scoring to net.cu, only ask transport to provide a path.
Simplify CUDA PCI path detection.
Add dummy external network
2018-11-26 16:24:31 -08:00
Alex Sergeev
d7a58cfa58 Generate host-hash for P2P and SHM based on $(readlink /proc/self/ns/uts) + $(readlink /proc/self/ns/mnt) (#156) 2018-11-19 17:39:44 -08:00
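A rough sketch of the idea in the commit above: read the targets of the two namespace symlinks and hash their concatenation, so that processes sharing both the UTS and mount namespaces get the same value. The function names are hypothetical, and FNV-1a is an illustrative choice of hash that is not necessarily what NCCL uses. The code is Linux-specific because it depends on /proc.

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Append the target of symlink `path` to buf at offset *len.
 * readlink() does not NUL-terminate, so the length is tracked explicitly. */
static int append_link(const char *path, char *buf, size_t cap, size_t *len) {
  if (*len >= cap) return -1;
  ssize_t n = readlink(path, buf + *len, cap - *len);
  if (n < 0) return -1;
  *len += (size_t)n;
  return 0;
}

/* 64-bit FNV-1a hash (illustrative; any stable hash would do here). */
static uint64_t fnv1a64(const char *data, size_t len) {
  uint64_t h = 14695981039346656037ULL;  /* FNV offset basis */
  for (size_t i = 0; i < len; i++) {
    h ^= (unsigned char)data[i];
    h *= 1099511628211ULL;               /* FNV prime */
  }
  return h;
}

/* Hypothetical host_hash(): hash of readlink(/proc/self/ns/uts) followed by
 * readlink(/proc/self/ns/mnt). Returns 0 on success, -1 if /proc is missing. */
static int host_hash(uint64_t *out) {
  char buf[512];
  size_t len = 0;
  if (append_link("/proc/self/ns/uts", buf, sizeof buf, &len) != 0) return -1;
  if (append_link("/proc/self/ns/mnt", buf, sizeof buf, &len) != 0) return -1;
  *out = fnv1a64(buf, len);
  return 0;
}
```

Two processes in different containers on the same machine would typically see different namespace identifiers and therefore different hashes, even if their hostnames match.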
Sylvain Jeaugey
3c6e25210b Generate nccl.h in build instead of src
Generating nccl.h in src makes source directories dirty after builds.
2018-11-09 14:00:41 -08:00
David Addison
b56650c7f5 2.3.7-1
Improved LL tuning for multi-node jobs.
Improved bootstrap for large job scaling.
Fixed a hang during bootstrap due to socket reuse.
Added operation name to the COLL INFO logging.
2018-10-24 14:44:59 -07:00
Sylvain Jeaugey
f93fe9bfd9 2.3.5-5
Add support for inter-node communication using sockets and InfiniBand/RoCE.
Improve latency.
Add support for aggregation.
Improve LL/regular tuning.
Remove tests as those are now at github.com/nvidia/nccl-tests.
2018-09-25 14:12:01 -07:00
Sylvain Jeaugey
29a1a916dc Add support for CUDA9 half semantics 2017-06-14 11:20:24 -07:00
Ilya Biryukov
8241cd7b6e Fix compilation error when compiling with 'clang -x cuda'.
Functions vFetch and vStore are not found by ADL with clang,
so they need to be declared before usage in ReduceCopy.
2017-03-16 12:01:11 +01:00
Nathan Luehr
8996811936 Only enable peer access for ring neighbors.
This enables support for systems with more than 9 GPUs attached to a single PCIe root complex.
2017-03-01 16:42:38 -08:00
Sylvain Jeaugey
c219a183d0 Fix copy/paste typo in error message 2017-03-01 16:42:38 -08:00
Sylvain Jeaugey
8e1d6f9b60 Fix crash in Reduce when non-root ranks have invalid recvbuff 2017-03-01 16:42:38 -08:00
Chad Whipkey
5eab428294 Qualify nullptr_t with std::. 2017-02-08 07:06:31 -08:00
Sylvain Jeaugey
2a974f5ca2 Fix 1.3.2 compilation 2016-12-08 09:11:43 -08:00
Sylvain Jeaugey
648e9fbb58 Adding missing file 2016-12-05 18:06:24 -08:00
Sylvain Jeaugey
34d27771c6 1.3.2 release
Broadcast tuning
Better checking of inputs
Copy/reduce code simplification
2016-12-01 15:17:50 -08:00
Sylvain Jeaugey
b2781d0501 Fix primitives function prototype 2016-10-13 10:32:42 -07:00
Sylvain Jeaugey
bf7d1514f7 NVML (libwrap) : import the needed definitions 2016-10-13 10:28:59 -07:00
Sylvain Jeaugey
8bb06c94be Improved allreduce segmentation for small sizes 2016-10-07 12:42:23 -07:00
Sylvain Jeaugey
cabd6848e4 Heavy code refactoring to remove a lot of code in collectives (~1000 lines).
Have all collectives use the same args, the same ring, and the same primitives for synchronization between threads with the same pattern.
2016-09-22 11:57:56 -07:00
Sylvain Jeaugey
e3dbc6110e Add profiling API 2016-09-22 11:56:51 -07:00
Sylvain Jeaugey
9ee6189bf9 Merge pull request #41 from jia-kai/master
Some minor fixes for compile/usage
2016-09-15 09:45:52 -07:00
Sylvain Jeaugey
75bad643bd Updated LICENCE.txt 2016-08-26 15:08:20 -07:00
jiakai
47b0797fe1 pass devlist as const int* rather than int* in ncclCommInitAll 2016-08-19 19:00:14 +08:00
Sylvain Jeaugey
428ec5b2a3 Merge remote-tracking branch 'github/master' into public 2016-07-25 10:53:01 -07:00
Nathan Luehr
55c42ad681 Fixed redundant contexts in multi-process apps
Change-Id: If787014450fd281304f0c7baf01d25963e40905d
2016-07-25 10:10:30 -07:00
Sylvain Jeaugey
e51e922924 Add a debug level to NCCL and CUDA versions at init 2016-06-16 17:04:41 -07:00
Sylvain Jeaugey
d5e507fc7f Only call the CUDA runtime. That may fix #27. 2016-06-07 16:27:51 -07:00
Sylvain Jeaugey
7edfc57228 Make NCCL collectives work on communicators with only one rank 2016-06-06 14:35:00 -07:00
Sylvain Jeaugey
acb93d1aed Removing unneeded includes 2016-06-02 17:33:43 -07:00
Sylvain Jeaugey
dba3ec9428 Fix random deadlock during ncclCommInitRank. 2016-04-19 10:47:27 -07:00
Nathan Luehr
5554a4c9f0 Fixed useRemoteRecv consistency issue.
Change-Id: Ib093a8dc3bb093eddc89dad81d3fffa53c03a6a2
Reviewed-on: http://git-master/r/1013543
Reviewed-by: Cliff Woolley <jwoolley@nvidia.com>
Tested-by: Przemek Tredak <ptredak@nvidia.com>
2016-02-18 13:45:42 -08:00
Nathan Luehr
9442285526 Fixed buffer overflow in ReduceOrCopy
Bug caused AllGathers and ReduceScatters of less than
8 bytes to fail in certain cases.

Change-Id: I33e1beb50805bfdb457ae16a90e3f91c1b283b9b
Reviewed-on: http://git-master/r/1011505
Reviewed-by: Przemek Tredak <ptredak@nvidia.com>
Tested-by: Przemek Tredak <ptredak@nvidia.com>
2016-02-12 15:13:56 -08:00
Nathan Luehr
caa40b8dd3 Libwrap checks for LIB.so.1 if LIB.so not found
Change-Id: I6f07f887f828cb2259dcfd496a2ad707db898cf5
Reviewed-on: http://git-master/r/1000162
Reviewed-by: Przemek Tredak <ptredak@nvidia.com>
Tested-by: Przemek Tredak <ptredak@nvidia.com>
2016-01-29 12:36:42 -08:00
Nathan Luehr
fe1a956715 Enabled support for char type to be unsigned.
GCC on POWER arch defines char type as unsigned.

Change-Id: Ic143cb058fe42414b1f6f1f45b02132c837726ae
Reviewed-on: http://git-master/r/999614
Reviewed-by: Przemek Tredak <ptredak@nvidia.com>
Tested-by: Przemek Tredak <ptredak@nvidia.com>
2016-01-28 13:38:18 -08:00
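The portability issue behind the commit above is that the C standard leaves the signedness of plain char to the implementation. A small sketch of how code can detect this and avoid depending on it follows. Both function names are hypothetical and are not taken from the NCCL sources.

```c
#include <limits.h>

/* CHAR_MIN is 0 when plain char is unsigned (as the commit reports for
 * GCC on POWER) and negative when it is signed (e.g. GCC on x86). */
static int char_is_signed(void) {
  return CHAR_MIN < 0;
}

/* Sum bytes as signed 8-bit values regardless of how plain char is
 * defined, by converting each element to signed char explicitly. */
static int sum_int8(const char *data, int n) {
  int sum = 0;
  for (int i = 0; i < n; i++) sum += (signed char)data[i];
  return sum;
}
```

Code that needs a specific interpretation of 8-bit data is generally safer naming `signed char` or `unsigned char` (or `int8_t`/`uint8_t`) than relying on plain char.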
Sylvain Jeaugey
c05312f151 Moved tests to separate dir and improved MPI test
test sources moved to test/ directory.
MPI test displays PASS/FAIL and returns code accordingly.

Change-Id: I058ebd1bd5202d8f38cc9787898b2480100c102b
Reviewed-on: http://git-master/r/936086
Reviewed-by: Przemek Tredak <ptredak@nvidia.com>
Tested-by: Przemek Tredak <ptredak@nvidia.com>
2016-01-28 12:56:36 -08:00
Nathan Luehr
5966316771 Added support for more than 8 GPUs.
Change-Id: Iaa1841036a7bfdad6ebec99fed0adcd2bbe6ffad
Reviewed-on: http://git-master/r/935459
Reviewed-by: Cliff Woolley <jwoolley@nvidia.com>
Tested-by: Przemek Tredak <ptredak@nvidia.com>
2016-01-21 13:00:21 -08:00
Nathan Luehr
130ee246e2 Fixed deadlock in back-to-back reduce_scatters.
Change-Id: I92d32b15e516a39710b676aee692ae9b70638937
Reviewed-on: http://git-master/r/935458
Reviewed-by: Przemek Tredak <ptredak@nvidia.com>
Tested-by: Przemek Tredak <ptredak@nvidia.com>
2016-01-21 10:36:03 -08:00
Nathan Luehr
651a6edc5c Fixed bug in MPI initialization. 2015-12-10 17:54:41 -08:00
Simon Layton
41ce4ca9fc Add int64 and uint64 types for all algorithms and tests 2015-12-04 13:28:36 -05:00
Nathan Luehr
27d32ac5d9 Fixed a race condition in reduce and broadcast. 2015-11-19 11:11:52 -08:00
Nathan Luehr
0673d5f44f Initial release. 2015-11-17 11:30:40 -08:00