Symmetric memory API and symmetric kernels
* Redesign from the ground up, enabling major latency and bandwidth improvements.
* Add new API calls to register user-allocated memory among communicator ranks into an NCCL window: ncclCommWindowRegister() and ncclCommWindowDeregister(). The calls currently support symmetric registration for P2P and NVLS, and require VMM memory buffers (i.e., CUMEM must be operational).
* Implement specialized kernels taking advantage of symmetrically registered memory, with performance gains expected particularly for small to medium message sizes.
* The kernels support 32-bit floating point types and smaller, and sum as the reduction operator, with no more than one collective operation per group.
* Floating point summation is always done in fp32 accumulators (with the exception of fp8 on NVLS, where it uses fp16 inside the switch). Thus, the accuracy with fp8 and fp16 data types should be much improved.
* This initial implementation supports non-network communicators only (P2P and NVLS transports).
* To explore this functionality, users need to use the new memory registration API calls with the NCCL_WIN_COLL_SYMMETRIC flag, and all ranks of a communicator must pass buffers at the same offset in the same registration when invoking a collective NCCL operation (see the sketch after this list).
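
A minimal sketch of the registration flow, under stated assumptions: the ncclCommWindowRegister(comm, buff, size, &win, winFlags) / ncclCommWindowDeregister(comm, win) prototypes are quoted from the 2.27 header from memory, and ncclMemAlloc() is used because it returns VMM-backed (cuMem) buffers, which symmetric registration requires. NCCLCHECK is a local helper, not an NCCL API.

```c
// Minimal sketch (not a drop-in recipe): every rank allocates a VMM-backed
// buffer, registers it symmetrically, and passes it at the same offset to a
// collective. Check nccl.h for the exact prototypes.
#include <nccl.h>

#define NCCLCHECK(cmd) do { ncclResult_t r = (cmd); if (r != ncclSuccess) return r; } while (0)

ncclResult_t symmetricAllReduce(ncclComm_t comm, size_t count, cudaStream_t stream) {
  float* buf = NULL;
  ncclWindow_t win;
  NCCLCHECK(ncclMemAlloc((void**)&buf, count * sizeof(float)));  // VMM/cuMem allocation
  NCCLCHECK(ncclCommWindowRegister(comm, buf, count * sizeof(float), &win, NCCL_WIN_COLL_SYMMETRIC));
  NCCLCHECK(ncclAllReduce(buf, buf, count, ncclFloat, ncclSum, comm, stream));  // same offset on all ranks
  NCCLCHECK(ncclCommWindowDeregister(comm, win));
  NCCLCHECK(ncclMemFree(buf));
  return ncclSuccess;
}
```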
Add support for DGX Spark.

Add support for DirectNIC (CX8) to the internal IB plugin.

Add a new ncclCommShrink() API call
* It is a non-collective call similar to ncclCommSplit(), which makes it possible to exclude some (possibly unresponsive) ranks from the parent communicator (a usage sketch follows).
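
For illustration, a minimal sketch of shrinking away unresponsive ranks. The prototype shown (argument order and the NCCL_SHRINK_* flags) is an assumption based on the 2.27 header; verify against nccl.h before use.

```c
// Minimal sketch: exclude ranks from an existing communicator. Unlike
// ncclCommSplit(), unresponsive ranks do not need to participate.
#include <nccl.h>

ncclResult_t dropDeadRanks(ncclComm_t parent, ncclComm_t* newcomm) {
  int excludeRanks[] = { 3, 7 };  // hypothetical ranks detected as unresponsive
  // NCCL_SHRINK_DEFAULT keeps the parent communicator usable;
  // NCCL_SHRINK_ABORT would first abort outstanding operations on it.
  return ncclCommShrink(parent, excludeRanks, 2, newcomm, /*config=*/NULL, NCCL_SHRINK_DEFAULT);
}
```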
Add support for loading multiple network plugins
* This enables the creation of generic containers that can work across a range of providers.
* Allow NCCL_NET_PLUGIN to accept a comma-separated list of plugins to load.

NVLink SHARP (NVLS) improvements
* Implement NVLS+IB SHARP support for AllGather and ReduceScatter with user buffer registration. This improves performance and reduces the number of CTAs needed to achieve peak bandwidth.
* Gracefully fall back by default to other transports if NVLS initialization fails (the old behavior of returning an error code from an NCCL call can be preserved by setting NCCL_NVLS_ENABLE=1).
* Decrease the NVLS channel count to 24 on Blackwell systems with multiple NVLink domains per communicator.
* Enable fine-tuning of NCCL behavior per communicator using new "ncclConfig_t" members "collnetEnable", "CTAPolicy", and "nvlsCTAs".

Profiler improvements
* Extend the init function by adding communicator name, comm id (hash), rank, number of ranks, number of nodes, and the NCCL log function to the argument list. This makes the name and the comm id available to all events in the communicator without explicitly passing them to each individual event. Add the communicator id and rank to the profiler trace filename. The communicator name can now be set via a new "ncclConfig_t" member, "commName" (see the sketch at the end of these notes).
* Improve the accuracy of the GPU kernel events by providing GPU-generated timestamps for the start and stop of every NCCL operation.
* Harmonize proxy events, removing overlaps between ProxyOp and ProxyStep states.
* Add support for network-defined event updates (through "recordEventState").
* Report the correct number of channels used by every collective/p2p operation (this used to be set to nMaxChannels for collectives and was absent for p2ps).
* Fix the logic on proxyCtrl Idle/Active events (Issue #1162).
* Fix an issue where the network proxy profiler could lose track of an event identifier (Issue #1682).
* Improve the backward compatibility with plugins older than v4.
* Ensure that the work counters are 0-initialized.
* Fix a potential race condition in the network profiler that could result in an event being linked to the wrong parent.

MNNVL improvements
* Increase to 16 the number of NICs used to communicate between MNNVL domains on GB200 systems, to optimize the performance of collective operations.
* Add support for more complex MNNVL topologies with up to 32 NICs per node.
* If the MNNVL fabric initialization was unsuccessful, NCCL will now fail by default, so as to avoid inadvertently falling back to a potentially much slower network transport. Such failures are typically due to misconfigured IMEX support on the system. To continue without MNNVL, restart the job with NCCL_MNNVL_ENABLE=0.
* Fix a potential hang in alltoall-like communication patterns at a scale of over 80 ranks.
* Make NCCL_P2P_DISABLE=1 imply NCCL_MNNVL_ENABLE=0 (so the latter no longer needs to be specified on MNNVL systems).
* Fix an initialization failure when NCCL_TOPO_FILE is used on MNNVL systems.
* Fix the graph search to exclude non-local NICs.
* Fix the SHM transport to use fabric handles on MNNVL systems.

NIC Fusion improvements
* Disable the creation of fused NICs for physical devices that haven't been merged.
* Flatten multiple ports to a single PCI device within the internal IB plugin and reparent dual-port NICs under the first PCI parent. If the parent is not a PCI switch, PCI devices for fused NICs won't be duplicated.
* Route traffic on GB200-CX8 systems through DirectNIC, not the host interface.

Improve support for platforms with C2C connectivity (e.g., GB200)
* Enable GPUDirect RDMA for the NICs by default.
* Add support for P2C (PXN over C2C) and the LL128 protocol.

Extend NCCL fault tolerance in multithreaded scenarios
* Support the creation of multiple nonblocking communicators within a single group and polling in parallel for their completion using multiple threads (one per communicator).

Enable ncclImplicitOrderLaunch for CUDA 12.9+
* This can potentially speed up NCCL_LAUNCH_ORDER_IMPLICIT.

Improve the netSocket transport latency and control
* Provide finer control over the size of the socket send/receive buffers, the task size, and the number of sockets that a single peer can open.
* Add support for the inlining of small messages behind the header when using multiple sockets per connection.

Improve the readability of the CPU affinity in the debug output
* Print it as a range string rather than a bitmask.

Fix a potential race condition in graph execution
* A contention could arise when mixing graph and non-graph execution.

Improve PXN connection code
* Avoid duplicate and unused connections.

RAS fixes
* Fix a memory corruption at job termination time in case of a previously failed initialization of a RAS socket connection.
* Fix a race condition leading to a crash when generating a RAS report during communicator initialization (Issues #1669, #1718).
* Fix a potential race condition when gathering data for a RAS status report.

Fix a potential memory corruption in ncclCommSplit()
* Memory could get corrupted when resource sharing was in use and the size of the NVLink domain in the new communicator was smaller than in the old one.

Fix asynchronous graph upload
* Fix a small memory leak.
* Fix oversynchronization.

Add a check for out-of-memory conditions in ncclMemAlloc()

Clean up the NCCL socket code
* accept() will now also retry if just reading the magic failed (Issue #1613).
* connect() will now also retry if poll() did not return a POLLOUT event (Issue #1618).
* Add error checking in a few instances (Issue #1539).
* Fix the loop condition in ncclFindInterfaceMatchSubnet() (Issue #1574).
* Clean up the debug output, downgrading WARN messages to INFO in non-critical cases, and printing the peer's address where relevant.

Switch NCCL_DEBUG_FILE to line buffering
* This should help avoid mixed-up partial output lines in multithreaded cases.

Other minor fixes
* Improve the checks for buffer overflows in the graph code (Issue #1585).
* Extend logging and state clearing to all four events in the internal IB plugin (Issue #1650).
* Fix the error path in case IB communication is not ready (Issue #1489).
* Add ECE logging for the IB fabric.
* Fix various minor issues in the graph module (Issue #1635).
* Clean up the debug output in the graph code, downgrading WARN messages to INFO in non-critical cases.
* Add a missing argument to a directSend() call (Issue #1628).
* Remove duplicate code in sendProxySetup() (Issue #1420).
* Fix the order of arguments of cudaDeviceCanAccessPeer() (Issue #1507).
* Fix compiler warnings with GCC 14.
* Fix a typo in a comment (Issue #1236).
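
The sketch below shows the new "commName" config member in use. The field name comes from the notes above; NCCL_CONFIG_INITIALIZER and ncclCommInitRankConfig() are pre-existing NCCL API, and the name and knob values are illustrative. The example profiler plugin listed after this sketch receives the name through its init callback (see exampleProfilerInit()).

```c
// Minimal sketch: naming a communicator so profiler traces can identify it.
#include <nccl.h>

ncclResult_t initNamedComm(ncclComm_t* comm, int nranks, ncclUniqueId id, int rank) {
  ncclConfig_t config = NCCL_CONFIG_INITIALIZER;
  config.commName = "training_comm";  // hypothetical name; appears in profiler events and trace filenames
  // Other per-communicator knobs added in this release (values illustrative):
  //   config.collnetEnable = 0;
  //   config.nvlsCTAs = 16;
  return ncclCommInitRankConfig(comm, nranks, id, rank, &config);
}
```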
/*************************************************************************
 * Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
 *
 * See LICENSE.txt for license information
 ************************************************************************/

#include <stdio.h>
#include <stdlib.h> // for getenv(), atoi(), calloc(), free()
#include <pthread.h>
#include <string.h>
#include <linux/limits.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <x86intrin.h>
#include "event.h"
#include "print_event.h"

#define __hidden __attribute__ ((visibility("hidden")))

static int initialized; // initialization counter for profiler
static double startTime; // profiler start time

static const int defaultEActivationMask = ncclProfileColl | ncclProfileP2p;
static const int defaultGroupPoolSize = 16;
static const int defaultCollPoolSize = 16;
static const int defaultP2pPoolSize = 1024;
static const int defaultProxyCtrlPoolSize = 16;
static const int defaultDetachPoolSize = 128;

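// Event object pools act as fixed-size ring buffers. Each pool keeps a
// monotonically increasing "index" counter (next slot to hand out) and a
// "base" counter (oldest slot still in flight); once index - base reaches the
// pool size, new events are dropped rather than overwriting live ones.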
static int groupPoolSize;
static int collPoolSize;
static int p2pPoolSize;
static int proxyCtrlPoolSize;
static int detachPoolSize;
static int detachPoolBase;
static int detachPoolIndex;
static int detachPoolDone;
static struct proxyOp* detachPool;

ncclDebugLogger_t logFn;
#define INFO(FLAGS, ...) logFn(NCCL_LOG_INFO, (FLAGS), __func__, __LINE__, __VA_ARGS__)

static double freq = -1;

// Estimate the TSC frequency (in cycles per microsecond) by timing a short
// busy loop against gettimeofday(), so that gettime() can convert raw
// __rdtsc() readings into microseconds.
__hidden void calibrate() {
  struct timeval tv;
  gettimeofday(&tv, NULL);
  uint64_t timeCycles = __rdtsc();
  double time = - tv.tv_sec*1e6 - tv.tv_usec;
  uint64_t total = 0ULL;
  for (int i = 0; i < 10000; i++) total += __rdtsc(); // spin to let wall time elapse
  gettimeofday(&tv, NULL);
  timeCycles = __rdtsc() - timeCycles;
  time += tv.tv_sec*1e6 + tv.tv_usec;
  freq = timeCycles / time;
}

// Wall-clock time in microseconds, derived from the calibrated TSC
__hidden double gettime(void) {
  return __rdtsc() / freq;
}

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pid_t pid;
static int* eActivationMaskPtr;

__hidden ncclResult_t exampleProfilerInit(void** context, int* eActivationMask, const char* commName, uint64_t commHash, int nNodes, int nranks, int rank, ncclDebugLogger_t logfn) {
  pthread_mutex_lock(&lock);
  if (__atomic_fetch_add(&initialized, 1, __ATOMIC_RELAXED) == 0) {
    // first thread initializes event mask, environment and detach pool
    const char* str;
    str = getenv("NCCL_PROFILE_EVENT_MASK");
    __atomic_store_n(eActivationMask, str ? atoi(str) : 0, __ATOMIC_RELAXED);

    str = getenv("NCCL_PROFILE_GROUP_POOL_SIZE");
    groupPoolSize = str ? atoi(str) : defaultGroupPoolSize;

    str = getenv("NCCL_PROFILE_COLL_POOL_SIZE");
    collPoolSize = str ? atoi(str) : defaultCollPoolSize;

    str = getenv("NCCL_PROFILE_P2P_POOL_SIZE");
    p2pPoolSize = str ? atoi(str) : defaultP2pPoolSize;

    str = getenv("NCCL_PROFILE_PROXY_CTRL_POOL_SIZE");
    proxyCtrlPoolSize = str ? atoi(str) : defaultProxyCtrlPoolSize;

    str = getenv("NCCL_PROFILE_PROXY_DETACH_POOL_SIZE");
    detachPoolSize = str ? atoi(str) : defaultDetachPoolSize;

    // detach pool is used to store PXN proxyOps and is shared among threads
    detachPool = (struct proxyOp *)calloc(detachPoolSize, sizeof(*detachPool));
    if (detachPool == NULL) {
      pthread_mutex_unlock(&lock);
      return ncclSystemError;
    }
    // Pid of the process initializing the profiler first.
    // This is compared against the pid of proxyOp events
    // to figure out if they have a parent event in this
    // process address space.
    pid = getpid();

    // calibrate and start timer
    calibrate();
    startTime = gettime();
  }
  pthread_mutex_unlock(&lock);

  // store pointer to activation mask globally
  eActivationMaskPtr = eActivationMask;

  // pre-allocate memory for event object pools in dedicated profiler context
  struct context* ctx = (struct context *)calloc(1, sizeof(*ctx));
  if (ctx == NULL) return ncclSystemError; // guard against allocation failure
  ctx->commName = commName;
  ctx->commHash = commHash;
  ctx->nranks = nranks;
  ctx->rank = rank;
  logFn = logfn;
  INFO(NCCL_INIT, "PROFILER/Plugin: init commName: %s commHash: %lu nranks: %d rank: %d", commName ? commName : "", commHash, nranks, rank);

  ctx->groupPool = (struct group *)calloc(groupPoolSize, sizeof(*ctx->groupPool));
  if (ctx->groupPool == NULL) goto fail;

  ctx->collPool = (struct collective *)calloc(collPoolSize, sizeof(*ctx->collPool));
  if (ctx->collPool == NULL) goto fail;

  ctx->p2pPool = (struct p2p *)calloc(p2pPoolSize, sizeof(*ctx->p2pPool));
  if (ctx->p2pPool == NULL) goto fail;

  ctx->proxyCtrlPool = (struct proxyCtrl *)calloc(proxyCtrlPoolSize, sizeof(*ctx->proxyCtrlPool));
  if (ctx->proxyCtrlPool == NULL) goto fail;

  // Print event pool sizes for debugging
  //fprintf(stdout, "Profiler: Group pool size (bytes): %lu\n", sizeof(struct group)*groupPoolSize);
  //fprintf(stdout, "Profiler: Coll pool size (bytes): %lu\n", sizeof(struct collective)*collPoolSize);
  //fprintf(stdout, "Profiler: P2p pool size (bytes): %lu\n", sizeof(struct p2p)*p2pPoolSize);
  //fprintf(stdout, "Profiler: Proxy pool size (bytes): %lu\n", sizeof(struct proxyCtrl)*proxyCtrlPoolSize);
  //fprintf(stdout, "Profiler: PXN pool size (bytes): %lu\n", sizeof(struct proxyOp)*detachPoolSize);

  *context = ctx;
  return ncclSuccess;

fail:
  // cleanup resources
  if (ctx->proxyCtrlPool) free(ctx->proxyCtrlPool);
  if (ctx->p2pPool) free(ctx->p2pPool);
  if (ctx->collPool) free(ctx->collPool);
  if (ctx->groupPool) free(ctx->groupPool);
  free(ctx);
  if (detachPool) free(detachPool);
  return ncclSystemError;
}

__hidden ncclResult_t exampleProfilerFinalize(void* context) {
  FILE* fh = NULL;
  char filename[PATH_MAX] = { 0 };
  struct context* ctx = (struct context *)context;
  const char* dump = getenv("NCCL_PROFILE_DUMP_FILE");
  if (dump) {
    snprintf(filename, sizeof(filename), "%s_%lu_%d.json", dump, ctx->commHash, ctx->rank);
    fh = fopen(filename, "w");
    if (fh) fprintf(fh, "[\n"); // guard against fopen() failure
  }
  INFO(NCCL_INIT, "PROFILER/Plugin: finalize commName: %s commHash: %lu nranks: %d rank: %d", ctx->commName ? ctx->commName : "", ctx->commHash, ctx->nranks, ctx->rank);

  // print last N groups/collectives/p2ps
  int start = (ctx->groupPoolIndex - groupPoolSize >= 0) ? ctx->groupPoolIndex - groupPoolSize : 0;
  int end = ctx->groupPoolIndex;
  for (int i = start; i < end; i++) {
    printEvent(fh, &ctx->groupPool[i%groupPoolSize]);
  }

  start = (ctx->proxyCtrlPoolIndex - proxyCtrlPoolSize >= 0) ? ctx->proxyCtrlPoolIndex - proxyCtrlPoolSize : 0;
  end = ctx->proxyCtrlPoolIndex;
  for (int i = start; i < end; i++) {
    printEvent(fh, &ctx->proxyCtrlPool[i%proxyCtrlPoolSize]);
  }

  free(ctx->groupPool);
  free(ctx->collPool);
  free(ctx->p2pPool);
  free(ctx->proxyCtrlPool);
  free(ctx);

  // last thread cleans up shared detach pool
  if (__atomic_sub_fetch(&initialized, 1, __ATOMIC_RELAXED) == 0) {
    start = (detachPoolIndex - detachPoolSize >= 0) ? detachPoolIndex - detachPoolSize : 0;
    end = detachPoolIndex;
    for (int i = start; i < end; i++) {
      printEvent(fh, &detachPool[i%detachPoolSize]);
    }
    free(detachPool);
  }

  if (fh) {
    fprintf(fh, "{}]\n");
    fclose(fh);
  }

  return ncclSuccess;
}

__hidden void updateEvent(void* handle);

__hidden ncclResult_t exampleProfilerStartEvent(void* context, void** eHandle, ncclProfilerEventDescr_t* eDescr) {
  *eHandle = NULL;
  struct context* ctx = (struct context *)context;
  if (eDescr->type == ncclProfileGroup) {
    struct group* event;
    int groupId = __atomic_fetch_add(&ctx->groupPoolIndex, 1, __ATOMIC_RELAXED);
    if ((groupId - __atomic_load_n(&ctx->groupPoolBase, __ATOMIC_RELAXED)) < groupPoolSize) {
      // if there are available group events grab one
      event = &ctx->groupPool[groupId%groupPoolSize];
      while (!taskEventQueueEmpty(event)) {
        struct taskEventBase* base = taskEventQueueDequeue(event);
        if (base->type == ncclProfileColl) {
          struct collective* c = (struct collective *)base;
          // reset event proxyOps & proxySteps
          memset(c->nProxyOps, 0, sizeof(int)*MAX_CHANNELS);
          // release collective events in the group and return them to the collective pool
          __atomic_fetch_add(&ctx->collPoolBase, 1, __ATOMIC_RELAXED);
        } else if (base->type == ncclProfileP2p) {
          struct p2p* p = (struct p2p *)base;
          // reset event proxyOp and proxySteps
          memset(&p->op, 0, sizeof(struct proxyOp)*MAX_CHANNELS);
          // release p2p events in the group and return them to the p2p pool
          __atomic_fetch_add(&ctx->p2pPoolBase, 1, __ATOMIC_RELAXED);
        }
      }
    } else {
      // else drop this event
      __atomic_fetch_sub(&ctx->groupPoolIndex, 1, __ATOMIC_RELAXED);
      return ncclSuccess;
    }
    event->type = ncclProfileGroup;
    event->ctx = ctx;
    event->groupId = groupId;
    event->startTs = gettime() - startTime;
    *eHandle = event;
    debugEvent(event, "GroupStart");
  } else if (eDescr->type == ncclProfileColl) {
    // the parent might be null if we run out of events
    struct group* parent = (struct group *)eDescr->parentObj;
    if (parent == NULL) return ncclSuccess;

    struct collective* event;
    int collId = __atomic_fetch_add(&ctx->collPoolIndex, 1, __ATOMIC_RELAXED);
    if ((collId - __atomic_load_n(&ctx->collPoolBase, __ATOMIC_RELAXED)) < collPoolSize) {
      // if there are available collective events grab one
      event = &ctx->collPool[collId%collPoolSize];
    } else {
      // else drop this event
      __atomic_fetch_sub(&ctx->collPoolIndex, 1, __ATOMIC_RELAXED);
      return ncclSuccess;
    }

    event->base.type = ncclProfileColl;
    event->base.rank = eDescr->rank;
    event->base.func = eDescr->coll.func;
    event->base.startTs = gettime() - startTime;
    event->base.parent = parent;
    event->seqNumber = eDescr->coll.seqNumber;
    event->sendBuff = eDescr->coll.sendBuff;
    event->recvBuff = eDescr->coll.recvBuff;
    event->count = eDescr->coll.count;
    event->root = eDescr->coll.root;
    event->datatype = eDescr->coll.datatype;
    event->nChannels = eDescr->coll.nChannels;
    event->nWarps = eDescr->coll.nWarps;
    event->algo = eDescr->coll.algo;
    event->proto = eDescr->coll.proto;
    *eHandle = event;
    taskEventQueueEnqueue(parent, (struct taskEventBase *)event);
    // increment the group ref counter so the event will stay open
    __atomic_fetch_add(&parent->refCount, 1, __ATOMIC_RELAXED);
    debugEvent(event, "CollStart");
  } else if (eDescr->type == ncclProfileP2p) {
    // the parent might be null if we run out of events
    struct group* parent = (struct group *)eDescr->parentObj;
    if (parent == NULL) return ncclSuccess;

    struct p2p* event;
    int p2pId = __atomic_fetch_add(&ctx->p2pPoolIndex, 1, __ATOMIC_RELAXED);
    if ((p2pId - __atomic_load_n(&ctx->p2pPoolBase, __ATOMIC_RELAXED)) < p2pPoolSize) {
      // if there are available p2p events grab one
      event = &ctx->p2pPool[p2pId%p2pPoolSize];
    } else {
      // else drop this event
      __atomic_fetch_sub(&ctx->p2pPoolIndex, 1, __ATOMIC_RELAXED);
      return ncclSuccess;
    }

    event->base.type = ncclProfileP2p;
    event->base.rank = eDescr->rank;
    event->base.func = eDescr->p2p.func;
    event->base.next = parent->eventHead;
    event->base.startTs = gettime() - startTime;
    event->base.parent = parent;
    event->buff = eDescr->p2p.buff;
    event->count = eDescr->p2p.count;
    event->datatype = eDescr->p2p.datatype;
    event->peer = eDescr->p2p.peer;
    event->nChannels = eDescr->p2p.nChannels;
    *eHandle = event;
    taskEventQueueEnqueue(parent, (struct taskEventBase *)event);
    // increment the group ref counter so the event will stay open
    __atomic_fetch_add(&parent->refCount, 1, __ATOMIC_RELAXED);
    debugEvent(event, "P2pStart");
  } else if (eDescr->type == ncclProfileProxyCtrl) {
    int proxyCtrlId = __atomic_fetch_add(&ctx->proxyCtrlPoolIndex, 1, __ATOMIC_RELAXED);
    struct proxyCtrl* event = &ctx->proxyCtrlPool[proxyCtrlId%proxyCtrlPoolSize];
    event->type = ncclProfileProxyCtrl;
    event->ctx = ctx;
    event->startTs = gettime() - startTime;
    *eHandle = event;
  } else if (eDescr->type == ncclProfileProxyOp) {
    // the eventBase might be null if we run out of events
    struct taskEventBase* eventBase = (struct taskEventBase *)eDescr->parentObj;
    if (eventBase == NULL) return ncclSuccess;

    if (eDescr->proxyOp.pid != pid) {
      // PXN captured proxyOp events
      struct proxyOp* event;
      int detachId = __atomic_fetch_add(&detachPoolIndex, 1, __ATOMIC_RELAXED);
      if ((detachId - detachPoolBase) < detachPoolSize) {
        // if there are available detached proxyOp events grab one
        event = &detachPool[detachId%detachPoolSize];
      } else {
        // else drop this event
        __atomic_fetch_sub(&detachPoolIndex, 1, __ATOMIC_RELAXED);
        return ncclSuccess;
      }

      event->type = ncclProfileProxyOp;
      event->channelId = eDescr->proxyOp.channelId;
      event->pid = eDescr->proxyOp.pid;
      event->rank = eDescr->rank;
      event->peer = eDescr->proxyOp.peer;
      event->nSteps = eDescr->proxyOp.nSteps;
      event->chunkSize = eDescr->proxyOp.chunkSize;
      event->isSend = eDescr->proxyOp.isSend;
      event->startTs = gettime() - startTime;
      event->parent = NULL;
      event->stepCount = 0;
      *eHandle = event;
      debugEvent(event, "PxnProxyOpStart");
      return ncclSuccess;
    }

    if (eventBase->type == ncclProfileColl) {
      struct collective* parent = (struct collective *)eDescr->parentObj;
      int channelId = eDescr->proxyOp.channelId;
      struct proxyOp* event = &parent->op[channelId][parent->nProxyOps[channelId]++];

      event->type = ncclProfileProxyOp;
      event->channelId = channelId;
      event->pid = eDescr->proxyOp.pid;
      event->rank = eDescr->rank;
      event->peer = eDescr->proxyOp.peer;
      event->nSteps = eDescr->proxyOp.nSteps;
      event->chunkSize = eDescr->proxyOp.chunkSize;
      event->isSend = eDescr->proxyOp.isSend;
      event->parent = eventBase;
      event->startTs = gettime() - startTime;
      event->stepCount = 0;
      *eHandle = event;
      __atomic_fetch_add(&parent->base.refCount, 1, __ATOMIC_RELAXED);
      debugEvent(event, "ProxyOpStart");
    } else { // ncclProfileP2p
      struct p2p* parent = (struct p2p *)eDescr->parentObj;
      int channelId = eDescr->proxyOp.channelId;
      struct proxyOp* event = &parent->op[channelId];

      event->type = ncclProfileProxyOp;
      event->channelId = channelId;
      event->pid = eDescr->proxyOp.pid;
      event->rank = eDescr->rank;
      event->peer = eDescr->proxyOp.peer;
      event->nSteps = eDescr->proxyOp.nSteps;
      event->chunkSize = eDescr->proxyOp.chunkSize;
      event->isSend = eDescr->proxyOp.isSend;
      event->parent = eventBase;
      event->startTs = gettime() - startTime;
      event->stepCount = 0;
      *eHandle = event;
      __atomic_fetch_add(&parent->base.refCount, 1, __ATOMIC_RELAXED);
      debugEvent(event, "ProxyOpStart");
    }
  } else if (eDescr->type == ncclProfileProxyStep) {
    // the parent might be null if we run out of events
    struct proxyOp* parent = (struct proxyOp *)eDescr->parentObj;
    if (parent == NULL) return ncclSuccess;

    int s = parent->stepCount++ % MAX_STEPS;
    struct proxyStep* event = &parent->step[s];
    event->type = ncclProfileProxyStep;
    event->state = 0;
    event->step = eDescr->proxyStep.step;
    event->parent = parent;
    event->isSend = parent->isSend;
    event->startTs = gettime() - startTime;
    event->nNetEvents = 0;
    *eHandle = event;
    debugEvent(event, "ProxyStepStart");
  } else if (eDescr->type == ncclProfileKernelCh) {
    struct taskEventBase* eventBase = (struct taskEventBase *)eDescr->parentObj;
    if (eventBase == NULL) return ncclSuccess;
    if (eventBase->type == ncclProfileColl) {
      struct collective* parent = (struct collective *)eDescr->parentObj;
      struct kernelCh* event = &parent->kernel[eDescr->kernelCh.channelId];
      event->type = ncclProfileKernelCh;
      event->channelId = eDescr->kernelCh.channelId;
      event->startGpuClk = eDescr->kernelCh.pTimer;
      event->parent = eventBase;
      event->startTs = gettime() - startTime;
      *eHandle = event;
      __atomic_fetch_add(&parent->base.refCount, 1, __ATOMIC_RELAXED);
      debugEvent(event, "KernelChStart");
    } else { // ncclProfileP2p
      struct p2p* parent = (struct p2p *)eDescr->parentObj;
      struct kernelCh* event = &parent->kernel[eDescr->kernelCh.channelId];
      event->type = ncclProfileKernelCh;
      event->channelId = eDescr->kernelCh.channelId;
      event->startGpuClk = eDescr->kernelCh.pTimer;
      event->parent = eventBase;
      event->startTs = gettime() - startTime;
      *eHandle = event;
      __atomic_fetch_add(&parent->base.refCount, 1, __ATOMIC_RELAXED);
      debugEvent(event, "KernelChStart");
    }
  } else if (eDescr->type == ncclProfileNetPlugin) {
    struct proxyStep* parent = (struct proxyStep *)eDescr->parentObj;
    if (parent == NULL) return ncclSuccess;

    int64_t pluginId = eDescr->netPlugin.id;
    int64_t type = pluginId & NCCL_PROFILER_NET_TYPE_MASK;
    int64_t ver = pluginId & NCCL_PROFILER_NET_VER_MASK;
    if (type == NCCL_PROFILER_NET_TYPE_IB) {
      if (ver == 1) {
        ncclProfilerNetIbDescr_v1_t* descr = (ncclProfilerNetIbDescr_v1_t *)eDescr->netPlugin.data;
        struct netPlugin* event = parent->net + __atomic_fetch_add(&parent->nNetEvents, 1, __ATOMIC_RELAXED);
        event->type = ncclProfileNetPlugin;
        event->pluginType = type;
        event->pluginVer = ver;
        if (descr->type == ncclProfileQp) {
          event->pluginEvent = ncclProfileQp;
          event->qp.device = descr->qp.device;
          event->qp.wr_id = descr->qp.wr_id;
          event->qp.opcode = descr->qp.opcode;
          event->qp.qpNum = descr->qp.qpNum;
          event->qp.length = descr->qp.length;
        }
        event->startTs = gettime() - startTime;
        *eHandle = event;
        debugEvent(event, "NetPluginStart");
      }
    } else if (type == NCCL_PROFILER_NET_TYPE_SOCK) {
      if (ver == 1) {
        ncclProfilerNetSockDescr_v1_t* descr = (ncclProfilerNetSockDescr_v1_t *)eDescr->netPlugin.data;
        struct netPlugin* event = parent->net + __atomic_fetch_add(&parent->nNetEvents, 1, __ATOMIC_RELAXED);
        event->type = ncclProfileNetPlugin;
        event->pluginType = type;
        event->pluginVer = ver;
        if (descr->type == ncclProfileSocket) {
          event->pluginEvent = ncclProfileSocket;
          event->sock.fd = descr->sock.fd;
          event->sock.op = descr->sock.op;
          event->sock.length = descr->sock.length;
        }
        event->startTs = gettime() - startTime;
        *eHandle = event;
        debugEvent(event, "NetPluginStart");
      }
    }
  }
  return ncclSuccess;
}
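
// updateEvent() closes an event and propagates completion upward: when the
// last child of a collective/p2p (and, in turn, of a group) drops the
// refCount to zero, the parent is closed recursively and its pool slot is
// recycled by advancing the corresponding pool's base counter.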
void updateEvent(void* handle) {
  uint8_t type = *(uint8_t *)handle;
  if (type == ncclProfileGroup) {
    struct group* event = (struct group *)handle;
    if (__atomic_sub_fetch(&event->refCount, 1, __ATOMIC_RELAXED) == 0) {
      event->stopTs = gettime() - startTime;
      // return group event to the pool
      __atomic_fetch_add(&event->ctx->groupPoolBase, 1, __ATOMIC_RELAXED);
    }
    debugEvent(event, "GroupStop");
  } else if (type == ncclProfileColl) {
    struct collective* event = (struct collective *)handle;
    if (__atomic_sub_fetch(&event->base.refCount, 1, __ATOMIC_RELAXED) == 0) {
      event->base.stopTs = gettime() - startTime;
      debugEvent(event, "CollStop");
      updateEvent(event->base.parent);
      return;
    }
    debugEvent(event, "CollStop");
  } else if (type == ncclProfileP2p) {
    struct p2p* event = (struct p2p *)handle;
    if (__atomic_sub_fetch(&event->base.refCount, 1, __ATOMIC_RELAXED) == 0) {
      event->base.stopTs = gettime() - startTime;
      debugEvent(event, "P2pStop");
      updateEvent(event->base.parent);
      return;
    }
    debugEvent(event, "P2pStop");
  } else if (type == ncclProfileProxyOp) {
    struct proxyOp* event = (struct proxyOp *)handle;
    event->stopTs = gettime() - startTime;
    if (event->pid != pid) {
      // only for proxyOps that don't have a parent collective/p2p (i.e., PXN)
      int done = __atomic_add_fetch(&detachPoolDone, 1, __ATOMIC_RELAXED);
      if (done == detachPoolSize) {
        // reset the event completed (done) counter
        __atomic_store_n(&detachPoolDone, 0, __ATOMIC_RELAXED);
        // update the base pointer to the top of the pool
        int index = __atomic_load_n(&detachPoolIndex, __ATOMIC_RELAXED);
        __atomic_store_n(&detachPoolBase, index, __ATOMIC_RELAXED);
      }
      debugEvent(event, "ProxyOpStop");
      return;
    }
    updateEvent(event->parent);
    debugEvent(event, "ProxyOpStop");
  } else if (type == ncclProfileProxyStep) {
    struct proxyStep* event = (struct proxyStep *)handle;
    event->stopTs = gettime() - startTime;
    debugEvent(event, "ProxyStepStop");
  } else if (type == ncclProfileProxyCtrl) {
    struct proxyCtrl* event = (struct proxyCtrl *)handle;
    event->stopTs = gettime() - startTime;
    debugEvent(event, "ProxyCtrlStop");
  } else if (type == ncclProfileKernelCh) {
    struct kernelCh* event = (struct kernelCh *)handle;
    event->stopTs = gettime() - startTime;
    updateEvent(event->parent);
    debugEvent(event, "KernelChStop");
  } else if (type == ncclProfileNetPlugin) {
    struct netPlugin* event = (struct netPlugin *)handle;
    event->stopTs = gettime() - startTime;
    debugEvent(event, "NetPluginStop");
  }
}

__hidden ncclResult_t exampleProfilerStopEvent(void* eHandle) {
  // the event handle might be null if we run out of events
  if (eHandle == NULL) return ncclSuccess;

  uint8_t type = *(uint8_t *)eHandle;
  if (type == ncclProfileGroup) {
    // stopping the group event in NCCL core does not mean the group has
    // completed. It means the group was submitted/enqueued, so we need to
    // keep the event open
    struct group* event = (struct group *)eHandle;
    event->stopTs = gettime() - startTime;
    return ncclSuccess;
  } else if (type == ncclProfileColl) {
    // stopping the collective event in NCCL core does not mean the collective
    // has completed. It means the collective was submitted/enqueued, so we
    // need to keep the event open
    struct collective* event = (struct collective *)eHandle;
    event->base.stopTs = gettime() - startTime;
    return ncclSuccess;
  } else if (type == ncclProfileP2p) {
    // stopping the p2p event in NCCL core does not mean the p2p has
    // completed. It means the p2p was submitted/enqueued, so we need to
    // keep the event open
    struct p2p* event = (struct p2p *)eHandle;
    event->base.stopTs = gettime() - startTime;
    return ncclSuccess;
  }

  updateEvent(eHandle);
  return ncclSuccess;
}

__hidden ncclResult_t exampleProfilerRecordEventState(void* eHandle, ncclProfilerEventState_t eState, ncclProfilerEventStateArgs_t* eStateArgs) {
  // the event handle might be null if we run out of events
  if (eHandle == NULL) return ncclSuccess;

  uint8_t type = *(uint8_t *)eHandle;
  if (type == ncclProfileProxyOp) {
    struct proxyOp* event = (struct proxyOp *)eHandle;
    if (eState == ncclProfilerProxyOpInProgress_v4) {
      event->progrTs = gettime() - startTime;
    }
  } else if (type == ncclProfileProxyStep) {
    struct proxyStep* event = (struct proxyStep *)eHandle;
    struct proxyOp* parent = event->parent;
    switch (eState) {
      case ncclProfilerProxyStepSendGPUWait:
        event->timestamp[PROXY_STEP_SEND_GPU_WAIT] = gettime() - startTime;
        break;
      case ncclProfilerProxyStepSendPeerWait_v4:
        // do not update step event if in SendPeerWait
        if (event->state == ncclProfilerProxyStepSendPeerWait_v4) break;
        event->timestamp[PROXY_STEP_SEND_PEER_WAIT] = gettime() - startTime;
        event->state = ncclProfilerProxyStepSendPeerWait_v4;
        break;
      case ncclProfilerProxyStepSendWait:
        event->timestamp[PROXY_STEP_SEND_WAIT] = gettime() - startTime;
        parent->transSize += eStateArgs->proxyStep.transSize;
        break;
      case ncclProfilerProxyStepRecvWait:
        event->timestamp[PROXY_STEP_RECV_WAIT] = gettime() - startTime;
        break;
      case ncclProfilerProxyStepRecvFlushWait:
        event->timestamp[PROXY_STEP_RECV_FLUSH_WAIT] = gettime() - startTime;
        parent->transSize += eStateArgs->proxyStep.transSize;
        break;
      case ncclProfilerProxyStepRecvGPUWait:
        event->timestamp[PROXY_STEP_RECV_GPU_WAIT] = gettime() - startTime;
        break;
      default: // ignore states this plugin does not track
        break;
    }
  } else if (type == ncclProfileProxyCtrl) {
    struct proxyCtrl* event = (struct proxyCtrl *)eHandle;
    if (eState == ncclProfilerProxyCtrlAppendEnd) {
      event->appended = eStateArgs->proxyCtrl.appendedProxyOps;
    }
    event->state = eState;
  } else if (type == ncclProfileKernelCh) {
    struct kernelCh* event = (struct kernelCh *)eHandle;
    if (eState == ncclProfilerKernelChStop) {
      event->stopGpuClk = eStateArgs->kernelCh.pTimer;
    }
  }
  debugEvent(eHandle, "RecordEventState");
  return ncclSuccess;
}

ncclProfiler_t ncclProfiler_v4 = {
  "Example-profiler",
  exampleProfilerInit,
  exampleProfilerStartEvent,
  exampleProfilerStopEvent,
  exampleProfilerRecordEventState,
  exampleProfilerFinalize,
};

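// Optional control hooks (not part of the ncclProfiler_v4 interface): a tool
// linked with this plugin can toggle event collection at runtime by rewriting
// the activation mask that NCCL core consults before starting events.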
int exampleProfilerStart(int eActivationMask) {
  if (__atomic_load_n(&initialized, __ATOMIC_RELAXED)) {
    __atomic_store_n(eActivationMaskPtr, eActivationMask, __ATOMIC_RELAXED);
  }
  return ncclSuccess;
}

int exampleProfilerStop(void) {
  if (__atomic_load_n(&initialized, __ATOMIC_RELAXED)) {
    __atomic_store_n(eActivationMaskPtr, 0, __ATOMIC_RELAXED);
  }
  return ncclSuccess;
}
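
// Usage note (an assumption about the surrounding example project, not part of
// the plugin itself): build this file together with its companions into a
// shared library, e.g.
//   gcc -fPIC -shared -I<nccl>/include plugin.c print_event.c -o libnccl-profiler-example.so
// and point NCCL at it with NCCL_PROFILER_PLUGIN=/path/to/libnccl-profiler-example.so.
// NCCL_PROFILE_EVENT_MASK selects which event types to record and
// NCCL_PROFILE_DUMP_FILE sets the trace file prefix, as read above.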