diff --git a/ext-profiler/README.md b/ext-profiler/README.md
new file mode 100644
index 0000000..7ef44b2
--- /dev/null
+++ b/ext-profiler/README.md
@@ -0,0 +1,318 @@
+# NCCL Profiler Plugin Documentation
+
+This page describes the NCCL Profiler plugin API and how to implement a profiler plugin for NCCL.
+
+# Overview
+
+To allow NCCL to better integrate with DL frameworks, NCCL v2.23 introduced a profiler plugin
+interface. Any NCCL user can write profiler plugins to extract performance data from NCCL and
+use it for debugging and analysis.
+
+Like other plugins (e.g., the network plugin), profiler plugins come as a shared library
+called `libnccl-profiler.so`. That shared library contains one or more implementations of the
+NCCL PROFILER API, in the form of versioned structs, filled with pointers to all required
+functions.
+
+# Plugin architecture
+
+## Plugin name and supporting multiple profiler plugins
+
+When NCCL is initialized, it will look for a `libnccl-profiler.so` library and dynamically load
+it, then look for symbols inside the library.
+
+The `NCCL_PROFILER_PLUGIN` environment variable allows multiple plugins to coexist. If set, NCCL
+will look for a library named `libnccl-profiler-${NCCL_PROFILER_PLUGIN}.so`. It is therefore
+advised to name the library following that pattern, with a symlink pointing `libnccl-profiler.so`
+to `libnccl-profiler-${NCCL_PROFILER_PLUGIN}.so`. That way, if there are multiple plugins in the
+path, setting `NCCL_PROFILER_PLUGIN` will allow users to select the right plugin. Alternatively,
+the user can also set `NCCL_PROFILER_PLUGIN` to the pathname of the `libnccl-profiler.so` library.
+
+## Struct versioning
+
+Once a library is found, NCCL will look for a symbol named `ncclProfiler_vX`, with `X` increasing
+over time. The versioning ensures that the plugin and the NCCL core are compatible.
+
+Plugins are encouraged to provide several of these symbols, implementing multiple versions of the
+NCCL PROFILER API, so that a single plugin binary can support a wide range of NCCL versions.
+
+Conversely, to ease transitions, NCCL can support multiple plugin versions: it looks for the latest
+`ncclProfiler` struct version first, but also looks for older ones so that older plugins
+still work.
+
+## Headers management
+
+To help users build plugins effortlessly, plugins should copy the `ncclProfiler_vX` definitions
+they support to their internal includes. An example is shown in `ext-profiler/example`, where we
+keep all headers in the `nccl/` directory and provide thin layers to implement old versions on top
+of newer ones.
+
+The `nccl/` directory is populated with `profiler_vX.h` files extracting all relevant definitions
+from old API versions. It also provides error codes in `err.h`.
+
+# API (v2)
+
+Below is the main `ncclProfiler_v2` struct. Each function is explained in later sections.
+
+```
+typedef struct {
+  const char* name;
+
+  // init - initialize the profiler plugin
+  // Input
+  //  - context        : opaque profiler context object for separating profiler behavior across comms
+  // Output
+  //  - eActivationMask: bitmask of active events set by the plugin
+  ncclResult_t (*init)(void** context, int* eActivationMask);
+
+  // startEvent - initialize and start a new event for the supplied event descriptor inside the event set
+  // Input
+  //  - context: opaque profiler context object
+  //  - eDescr : pointer to ncclProfilerEventDescr_t object
+  // Output
+  //  - eHandle: return event handle for supplied event descriptor object
+  ncclResult_t (*startEvent)(void* context, void** eHandle, ncclProfilerEventDescr_v2_t* eDescr);
+
+  // stopEvent - stop/finalize an event inside an event set
+  // Input
+  //  - eHandle: handle to event object
+  ncclResult_t (*stopEvent)(void* eHandle);
+
+  // recordEventState - record event state transitions and event attribute updates
+  // Input
+  //  - eHandle   : handle to event object created through startEvent
+  //  - eStateArgs: optional argument used to capture event attribute updates associated with the state transition
+  //  - eState    : event state transition
+  ncclResult_t (*recordEventState)(void* eHandle, ncclProfilerEventState_v2_t eState, ncclProfilerEventStateArgs_v2_t* eStateArgs);
+
+  // finalize - finalize the profiler plugin
+  // Input
+  //  - context: opaque profiler context object
+  ncclResult_t (*finalize)(void* context);
+} ncclProfiler_v2_t;
+```
+
+## Error codes
+
+As a rule of thumb, profiler-generated errors should not be propagated to NCCL or alter its normal
+functioning. Nevertheless, the profiler interface returns NCCL error codes, in case any need for
+them arises in the future. For now, any profiler interface call should only return `ncclSuccess`.
+The only exception is `init`, which can return an error so that NCCL can disable the plugin.
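To make the table of function pointers concrete, below is a minimal do-nothing sketch of a v2 plugin. The `example*` function names, the context layout, and the simplified stand-in type definitions are illustrative only; a real plugin would include the `profiler_v2.h` header (as in `ext-profiler/example/nccl/`) instead of redefining the NCCL types.

```c
#include <stdint.h>
#include <stdlib.h>

/* Simplified stand-ins for the NCCL profiler types; a real plugin includes
 * nccl/profiler_v2.h instead of redefining these. */
typedef enum { ncclSuccess = 0, ncclInternalError = 3 } ncclResult_t;
typedef struct { uint8_t type; void* parentObj; int rank; } ncclProfilerEventDescr_v2_t;
typedef int ncclProfilerEventState_v2_t;
typedef union { int unused; } ncclProfilerEventStateArgs_v2_t;

typedef struct {
  const char* name;
  ncclResult_t (*init)(void** context, int* eActivationMask);
  ncclResult_t (*startEvent)(void* context, void** eHandle, ncclProfilerEventDescr_v2_t* eDescr);
  ncclResult_t (*stopEvent)(void* eHandle);
  ncclResult_t (*recordEventState)(void* eHandle, ncclProfilerEventState_v2_t eState,
                                   ncclProfilerEventStateArgs_v2_t* eStateArgs);
  ncclResult_t (*finalize)(void* context);
} ncclProfiler_v2_t;

struct exampleContext { int liveEvents; };

static ncclResult_t exampleInit(void** context, int* eActivationMask) {
  struct exampleContext* ctx = calloc(1, sizeof(*ctx));
  if (ctx == NULL) return ncclInternalError; /* a failing init makes NCCL disable the plugin */
  *eActivationMask = 1 | 2;                  /* e.g., enable group and collective events */
  *context = ctx;
  return ncclSuccess;
}

static ncclResult_t exampleStartEvent(void* context, void** eHandle,
                                      ncclProfilerEventDescr_v2_t* eDescr) {
  (void)eDescr;                              /* a real plugin inspects eDescr->type here */
  ((struct exampleContext*)context)->liveEvents++;
  *eHandle = context;                        /* a real plugin allocates a per-event object */
  return ncclSuccess;                        /* after init, always return ncclSuccess */
}

static ncclResult_t exampleStopEvent(void* eHandle) {
  ((struct exampleContext*)eHandle)->liveEvents--;
  return ncclSuccess;
}

static ncclResult_t exampleRecordEventState(void* eHandle, ncclProfilerEventState_v2_t eState,
                                            ncclProfilerEventStateArgs_v2_t* eStateArgs) {
  (void)eHandle; (void)eState; (void)eStateArgs; /* record state transitions here */
  return ncclSuccess;
}

static ncclResult_t exampleFinalize(void* context) {
  free(context);
  return ncclSuccess;
}

/* The exported, versioned symbol NCCL looks for after loading the library. */
ncclProfiler_v2_t ncclProfiler_v2 = {
  "Example", exampleInit, exampleStartEvent, exampleStopEvent,
  exampleRecordEventState, exampleFinalize,
};
```

Built as a shared library (e.g., `gcc -shared -fPIC -o libnccl-profiler.so plugin.c`), the exported `ncclProfiler_v2` symbol is what NCCL resolves through the mechanism described above.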
+ +## Operation overview + +NCCL will call the `init` function first for every new communicator that is initialized. The profiler +returns an opaque context handle that is used to isolate profiler instances across communicators. +Similarly, NCCL will call `finalize` to destroy the profiler context, thus freeing resources. + +The NCCL core code is instrumented with calls to `startEvent`, `stopEvent` and `recordEventState`. +These are used to start, stop and update events in the profiler, respectively. + +## API Functions + +### Initialization + +#### name + +The `name` field should point to a character string with the name of the profiler plugin. This will +be used for all logging, especially when `NCCL_DEBUG=INFO` is set. + +#### init + +As soon as NCCL finds the plugin and the correct ncclProfiler symbol, it calls its `init` function. +This allows the plugin to initialize its internal context, used during profiling of NCCL events. +If the `init` function does not return `ncclSuccess`, NCCL disables the plugin. + +#### finalize + +When the profiler is no longer needed, a call to `finalize` destroys the profiler context and frees +up resources. + +### Profiling + +#### startEvent + +When NCCL needs to start profiling a new event it calls `startEvent`. `startEvent` takes the profiler +context, previously created by `init`, an event descriptor of type `ncclProfilerEventDescr_t` and +returns an opaque profiler event handle that can be passed to other profiler functions, as discussed +later in the document. + + +The event descriptor contains all the event metadata. Every event type has its own descriptor. Below +is the `ncclProfilerEventDescr_t` struct. + +``` +typedef struct { + uint8_t type; // event type (e.g., ncclProfileGroup, ncclProfileColl, ...) 
+ void* parentObj; // pointer to parent event used to expose the event hierarchy to the profiler + int rank; // rank that generated the event + union { + struct { // collective events metadata + const char* name; // string containing name of the communicator + uint64_t commHash; // unique hash/id for the communicator + uint64_t seqNumber; // sequence number of this collective operation in the communicator + const char* func; // string containing name of the collective + void const* sendBuff; // address of send buffer + void* recvBuff; // address of recv buffer + size_t count; // data count + int root; // root rank + const char* datatype; // string containing the name of the datatype + size_t trafficBytes; // number of transfer bytes + uint8_t nMaxChannels; // max number of channels for this collective + uint8_t nWarps; // number of GPU warps for this collective + const char* algo; // string containing name of the algorithm for this collective + const char* proto; // string containing name of the protocol for this collective + } coll; + + struct { // point-to-point events metadata + const char* name; + uint64_t commHash; + const char* func; + void* buff; + const char* datatype; + size_t count; + int peer; // peer rank for this point-to-point + } p2p; + + struct { // proxyOp events metadata + pid_t pid; // process id that generated the associated `ncclProxyOp` object + uint8_t channelId; // id of the channel used by the associated `ncclProxyOp` object + int peer; // peer rank + int nSteps; // number of network transfers/steps required by the `ncclProxyOp` + int chunkSize; // chunk size for this `ncclProxyOp` + int isSend; // set to 1 for sends and 0 for recvs + } proxyOp; + + struct { // proxyStep events metadata + int step; // individual step in `ncclProxyOp` + } proxyStep; + }; +} ncclProfilerEventDescr_v2_t; +``` + +NCCL defines the following events: `ncclProfileGroup`, `ncclProfileColl`, `ncclProfileP2p`, +`ncclProfileProxyOp`, `ncclProfileProxyStep`, and 
`ncclProfileProxyCtrl`.
+
+#### stopEvent
+
+`stopEvent` takes the event handle returned by `startEvent` to stop the event. After the event
+has been stopped the handle can no longer be used with other profiler calls. Using the event
+handle after `stopEvent` is undefined behavior.
+
+#### recordEventState
+
+Some events can only be started and stopped; for example, `ncclProfileGroup`, `ncclProfileColl`
+and `ncclProfileP2p` cannot be updated through calls to `recordEventState`.
+
+`ncclProfileProxyOp`, `ncclProfileProxyStep` and `ncclProfileProxyCtrl` can be updated through
+calls to `recordEventState`.
+
+The state of proxy-generated events can be updated, along with event attributes, using
+`recordEventState`. These events can go through several states during their lifecycle.
+The list of supported states for the proxy-defined events is reported below.
+
+```
+typedef enum {
+  // ncclProfileProxyOp event states
+  ncclProfilerProxyOpSendPosted,        // state marks the posting of send buffer to GPU for given network transfer/step
+  ncclProfilerProxyOpSendRemFifoWait,   // state marks the waiting of CTS credits from peer rank
+  ncclProfilerProxyOpSendTransmitted,   // state marks the sending of network transfer/step to peer rank
+  ncclProfilerProxyOpSendDone,          // state marks the ending of network transfer/step
+  ncclProfilerProxyOpRecvPosted,        // state marks the posting of recv to network for given network transfer/step
+  ncclProfilerProxyOpRecvReceived,      // state marks the recving of network transfer/step from peer rank
+  ncclProfilerProxyOpRecvTransmitted,   // state marks the ending of the network transfer/step
+  ncclProfilerProxyOpRecvDone,          // state marks the consuming of data from GPU
+
+  // ncclProfileProxyStep event states
+  ncclProfilerProxyStepSendGPUWait,     // state marks the waiting of send data from GPU for given network transfer/step
+  ncclProfilerProxyStepSendWait,        // state marks the waiting of send data from network for given network transfer/step
+  
ncclProfilerProxyStepRecvWait,        // state marks the waiting of recv data from network for given network transfer/step
+  ncclProfilerProxyStepRecvFlushWait,   // state marks the waiting of recv data flush to GPU for given network transfer/step
+  ncclProfilerProxyStepRecvGPUWait,     // state marks the waiting of recv data consumption from GPU for given network transfer/step
+
+  // ncclProfileProxyCtrl event states
+  ncclProfilerProxyCtrlIdle,            // state marks proxy progress thread idle
+  ncclProfilerProxyCtrlActive,          // state marks proxy progress thread active
+  ncclProfilerProxyCtrlSleep,           // state marks proxy progress thread sleeping
+  ncclProfilerProxyCtrlWakeup,          // state marks proxy progress thread waking up
+  ncclProfilerProxyCtrlAppend,          // state marks append of new network work item begin
+  ncclProfilerProxyCtrlAppendEnd,       // state marks append of new network work item end
+} ncclProfilerEventState_v2_t;
+```
+
+`ncclProfileProxyOp` events are generated by the proxy progress thread while it is processing
+network requests for the GPU kernel. ProxyOp events are generated for every active channel and
+provide a summary of the activity of the proxy progress thread for that channel.
+
+`ncclProfileProxyStep` events are also generated by the proxy progress thread while it is processing
+network requests for the GPU kernel. ProxyStep events describe the individual network transfers in
+the channel and thus provide a finer-grained view than ProxyOp events.
+
+`ncclProfileProxyCtrl` events are generated by the proxy progress thread while it is not processing
+network requests for the GPU kernel. This includes everything else that the proxy thread might be
+doing, including appending new `ncclProxyOp` objects to the list of work elements to process.
+
+State transitions for the events described above can also come with event attribute updates. For this
+reason the profiler defines the `ncclProfilerEventStateArgs_t` struct, reported below.
+
+```
+typedef union {
+  struct {                // attributes to update for ncclProfileProxyOp events
+    size_t transSize;     // data transferred thus far
+    int steps;            // network transfer/steps processed thus far
+  } proxyOp;
+
+  struct {                // attributes to update for ncclProfileProxyCtrl
+    int appendedProxyOps; // number of appended proxy ops thus far
+  } proxyCtrl;
+} ncclProfilerEventStateArgs_v2_t;
+```
+
+The example profiler in `ext-profiler/example` contains details on how to capture and use the events above.
+
+### Event hierarchy
+
+NCCL core events (reported above) are organized into a hierarchy as reported below:
+
+```
+Group event
+   |
+   +- Collective event
+   |     |
+   |     +- ProxyOp event
+   |           |
+   |           +- ProxyStep event
+   |
+   +- Point-to-point event
+         |
+         +- ProxyOp event
+               |
+               +- ProxyStep event
+
+ProxyCtrl event
+```
+
+# Profiler instrumentation and logging
+
+## Profiling of collective and p2p operations
+
+The NCCL code is instrumented with profiler callbacks at different levels to capture the start/stop of groups,
+collective and point-to-point operations, as well as proxy progress activity. Due to the asynchronous nature
+of NCCL operations, events associated with collective and point-to-point operations are not easy to delimit
+precisely. For example, without proxy and/or kernel activity it is impossible for the profiler to
+figure out when a collective operation completes. Therefore, `stopEvent` for collectives simply indicates to
+the profiler that the collective has been enqueued. The profiler can leverage proxy event information, if
+those events are enabled, to estimate when the collective ends. In this case, the profiler can look at the `stopEvent`
+call of the last `ncclProfileProxyOp` event to mark the completion of the associated collective event. This
+can be achieved by reference counting the collective event and letting calls to `startEvent` and `stopEvent`
+increment and decrement the reference counter, respectively.
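That reference-counting scheme can be sketched as follows. The `collEvent`, `collStart`, `proxyOpStop`, etc. names are hypothetical, and a deterministic stand-in clock replaces a real timestamp source (e.g., `clock_gettime`); this is an illustration of the approach, not the example profiler's actual implementation.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical per-collective event record kept by the profiler. */
struct collEvent {
  int refCount;    /* 1 for the collective itself + 1 per live ProxyOp child */
  uint64_t stopTs; /* estimated completion time, set when refCount reaches 0 */
};

/* Stand-in monotonic clock; a real profiler would use e.g. clock_gettime(). */
static uint64_t nowUs(void) { static uint64_t t; return ++t; }

/* Called from the profiler's startEvent for a collective descriptor. */
struct collEvent* collStart(void) {
  struct collEvent* e = calloc(1, sizeof(*e));
  if (e != NULL) e->refCount = 1; /* reference held until NCCL calls stopEvent */
  return e;
}

/* Called from startEvent for a ProxyOp whose parentObj is this collective. */
void proxyOpStart(struct collEvent* parent) { parent->refCount++; }

static void collRelease(struct collEvent* e) {
  if (--e->refCount == 0) e->stopTs = nowUs(); /* last reference dropped */
}

/* Called from stopEvent for a ProxyOp child. */
void proxyOpStop(struct collEvent* parent) { collRelease(parent); }

/* stopEvent for the collective only marks that it has been enqueued;
 * the event stays alive until its last ProxyOp child is stopped. */
void collStop(struct collEvent* e) { collRelease(e); }
```

The collective's `stopEvent` only drops one reference; the completion timestamp is taken when the last `ncclProfileProxyOp` child releases its reference.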
+
+## PXN
+
+PXN causes some proxy operations to be processed by a remote proxy thread that differs from the one that
+generated the operation. When this happens, the event hierarchy reported above breaks. Because the
+profiler can use the hierarchy information, provided by NCCL in the event descriptor, to dereference the
+parent event during `startEvent`, this is only safe if the remote proxy thread runs in the same address space
+as the proxy thread that originated the operation. To prevent a profiler instance in the remote proxy address
+space from dereferencing a pointer that belongs to another address space, the event descriptor includes the
+PID of the originator. The profiler plugin needs to check that the originator PID matches the local PID before
+dereferencing the parent event.
diff --git a/ext-profiler/example/README.md b/ext-profiler/example/README.md
new file mode 100644
index 0000000..d98e58f
--- /dev/null
+++ b/ext-profiler/example/README.md
@@ -0,0 +1,239 @@
+# NCCL Example Profiler Plugin Usage
+
+This page describes how to use the NCCL example profiler plugin.
+
+# Overview
+
+The example profiler plugin implements the NCCL profiler plugin API introduced in NCCL v2.23. The API
+defines a set of events and data structures that NCCL uses to share event information with profiler
+plugins. The user can control which events are instrumented by NCCL, and when traces collected by the
+profiler should be dumped, through environment variables, as described in the rest of the document.
+The user can also control other profiler parameters that alter its behavior. For example, users can
+change the size of the event window the profiler keeps track of.
+
+## Building the profiler plugin
+
+To use the example plugin, just type `make`. You will need the include directory of an NCCL build
+present. You can override `NCCL_HOME` to point to the NCCL installation on your system.
+
+## Using the profiler plugin
+
+1. 
Add the directory of this profiler plugin to your `LD_LIBRARY_PATH` or set `NCCL_PROFILER_PLUGIN`,
+   as documented in `ext-profiler/README.md`.
+
+2. Set the `NCCL_PROFILE_EVENT_MASK` bitmask to specify the NCCL events you want to instrument. By
+   default, all collectives and send/recv operations will be traced. For more details about the event
+   representation used by the profiler refer to `ext-profiler/README.md`.
+
+   As an example, setting:
+
+   `NCCL_PROFILE_EVENT_MASK` to 1 (`ncclProfileGroup`) | 2 (`ncclProfileColl`) | 8 (`ncclProfileProxyOp`)
+
+   enables the profiling of the group, the collective and the proxy op events. The same events can be
+   expressed more concisely by setting `NCCL_PROFILE_EVENT_MASK` to 8 (`ncclProfileProxyOp`). Indeed,
+   NCCL also captures all the events above the requested one in the event hierarchy. The advantage
+   is that the profiler can easily correlate events that belong to the same NCCL operation and present
+   them accordingly.
+
+3. Set `NCCL_PROFILE_DUMP_FILE` to the name of the dump file for the collected traces. A file named
+   `${NCCL_PROFILE_DUMP_FILE}-hostname-tid.txt` is created. Profiler traces are saved using the Chrome
+   trace event format (more precisely, using asynchronous events).
+
+4. If you set the dump file variable, open chrome://tracing in your Chromium-based browser's address
+   bar and load the created dump file to visualize the traces.
+
+# Changing the profiler memory pool sizes
+
+The example profiler uses separate memory pools for different types of events. The size of these memory
+pools (i.e., the number of events) determines how many events the profiler can keep track of at the
+same time. When NCCL requests a new event (e.g., a collective event) to profile a `ncclAllReduce`
+operation, by calling `startEvent`, the profiler searches the collective pool for a free event. If it
+finds one, it marks it as in use and returns the handle to NCCL.
If the pool is exhausted, the
+profiler returns `NULL` to NCCL and ignores all subsequent NCCL profiler calls for the `NULL` event
+handle. When the `ncclAllReduce` has been processed, NCCL calls `stopEvent` with the previously returned
+event handle. The profiler has a total of 5 memory pools.
+
+The group, collective and p2p pools contain objects for the corresponding events. The `ProxyCtrl` pool
+contains objects for `ProxyCtrl` events and the `ProxyDetach` pool contains objects for `ProxyOp` events
+generated by remote proxies. The list of pools and their default sizes is reported below:
+
+- `NCCL_PROFILE_GROUP_POOL_SIZE` (16)
+- `NCCL_PROFILE_COLL_POOL_SIZE` (16)
+- `NCCL_PROFILE_P2P_POOL_SIZE` (1024)
+- `NCCL_PROFILE_PROXY_CTRL_POOL_SIZE` (16)
+- `NCCL_PROFILE_PROXY_DETACH_POOL_SIZE` (128)
+
+Remote proxy operations are generated when PXN is in use. Refer to this article for more information
+about PXN and how it works:
+https://developer.nvidia.com/blog/doubling-all2all-performance-with-nvidia-collective-communication-library-2-12/
+
+# Reported events
+
+The example profiler generates traces in JSON format.
An example of trace is reported below: + +``` +[ +{"name": "Group", "cat": "GROUP", "ph": "b", "id": 0, "pid": 4157654, "tid": 1, "ts": 764234.611328, "args": {"groupId": 0}}, +{"name": "AllReduce", "cat": "COLL", "ph": "b", "id": 0, "pid": 4157654, "tid": 1, "ts": 764237.294922, "args": {"SeqNum": 0, "CommHash": 673864846479792718, "Rank": 1, "Count": 32768, "Datatype": "ncclFloat32", "Algorithm": "RING", "Protocol": "LL", "nMaxChannels": 2}}, +{"name": "Recv", "cat": "PROXY", "ph": "b", "id": 0, "pid": 4157654, "tid": 1, "ts": 768464.936523, "args": {"Channel": 0, "Peer": 0, "Steps": 14, "ChunkSize": 32768, "transSize": 229376, "POSTED": {"step": 14, "ts": 772020.300781}, "RECEIVED": {"step": 14, "ts": 772196.049805}, "TRANSMITTED": {"step": 14, "ts": 772197.326172}, "DONE": {"step": 14, "ts": 772201.538086}}}, +{"name": "RecvBufferWait", "cat": "NET", "ph": "b", "id": 0, "pid": 4157654, "tid": 1, "ts": 768465.158203, "args": {"Step": 0}}, +{"name": "RecvBufferWait", "cat": "NET", "ph": "e", "id": 0, "pid": 4157654, "tid": 1, "ts": 768477.924805}, +{"name": "RecvWait", "cat": "NET", "ph": "b", "id": 0, "pid": 4157654, "tid": 1, "ts": 768477.924805, "args": {"Step": 0}}, +{"name": "RecvWait", "cat": "NET", "ph": "e", "id": 0, "pid": 4157654, "tid": 1, "ts": 768547.197266}, +{"name": "RecvFlushWait", "cat": "NET", "ph": "b", "id": 0, "pid": 4157654, "tid": 1, "ts": 768547.197266, "args": {"Step": 0}}, +{"name": "RecvFlushWait", "cat": "NET", "ph": "e", "id": 0, "pid": 4157654, "tid": 1, "ts": 768564.174805}, +{"name": "RecvGpuWait", "cat": "NET", "ph": "b", "id": 0, "pid": 4157654, "tid": 1, "ts": 768564.174805, "args": {"Step": 0}}, +{"name": "RecvGpuWait", "cat": "NET", "ph": "e", "id": 0, "pid": 4157654, "tid": 1, "ts": 768568.276367}, +{"name": "RecvBufferWait", "cat": "NET", "ph": "b", "id": 1, "pid": 4157654, "tid": 1, "ts": 768503.604492, "args": {"Step": 1}}, +{"name": "RecvBufferWait", "cat": "NET", "ph": "e", "id": 1, "pid": 4157654, "tid": 1, "ts": 
768504.549805}, +{"name": "RecvWait", "cat": "NET", "ph": "b", "id": 1, "pid": 4157654, "tid": 1, "ts": 768504.549805, "args": {"Step": 1}}, +{"name": "RecvWait", "cat": "NET", "ph": "e", "id": 1, "pid": 4157654, "tid": 1, "ts": 769994.490234}, +{"name": "RecvFlushWait", "cat": "NET", "ph": "b", "id": 1, "pid": 4157654, "tid": 1, "ts": 769994.490234, "args": {"Step": 1}}, +{"name": "RecvFlushWait", "cat": "NET", "ph": "e", "id": 1, "pid": 4157654, "tid": 1, "ts": 769995.012695}, +{"name": "RecvGpuWait", "cat": "NET", "ph": "b", "id": 1, "pid": 4157654, "tid": 1, "ts": 769995.012695, "args": {"Step": 1}}, +{"name": "RecvGpuWait", "cat": "NET", "ph": "e", "id": 1, "pid": 4157654, "tid": 1, "ts": 770006.914062}, +{"name": "RecvBufferWait", "cat": "NET", "ph": "b", "id": 2, "pid": 4157654, "tid": 1, "ts": 768506.941406, "args": {"Step": 2}}, +{"name": "RecvBufferWait", "cat": "NET", "ph": "e", "id": 2, "pid": 4157654, "tid": 1, "ts": 768507.435547}, +{"name": "RecvWait", "cat": "NET", "ph": "b", "id": 2, "pid": 4157654, "tid": 1, "ts": 768507.435547, "args": {"Step": 2}}, +{"name": "RecvWait", "cat": "NET", "ph": "e", "id": 2, "pid": 4157654, "tid": 1, "ts": 771452.536133}, +{"name": "RecvFlushWait", "cat": "NET", "ph": "b", "id": 2, "pid": 4157654, "tid": 1, "ts": 771452.536133, "args": {"Step": 2}}, +{"name": "RecvFlushWait", "cat": "NET", "ph": "e", "id": 2, "pid": 4157654, "tid": 1, "ts": 771453.060547}, +{"name": "RecvGpuWait", "cat": "NET", "ph": "b", "id": 2, "pid": 4157654, "tid": 1, "ts": 771453.060547, "args": {"Step": 2}}, +{"name": "RecvGpuWait", "cat": "NET", "ph": "e", "id": 2, "pid": 4157654, "tid": 1, "ts": 771468.458008}, +{"name": "RecvBufferWait", "cat": "NET", "ph": "b", "id": 3, "pid": 4157654, "tid": 1, "ts": 768509.484375, "args": {"Step": 3}}, +{"name": "RecvBufferWait", "cat": "NET", "ph": "e", "id": 3, "pid": 4157654, "tid": 1, "ts": 768510.250000}, +{"name": "RecvWait", "cat": "NET", "ph": "b", "id": 3, "pid": 4157654, "tid": 1, "ts": 
768510.250000, "args": {"Step": 3}}, +{"name": "RecvWait", "cat": "NET", "ph": "e", "id": 3, "pid": 4157654, "tid": 1, "ts": 771904.499023}, +{"name": "RecvFlushWait", "cat": "NET", "ph": "b", "id": 3, "pid": 4157654, "tid": 1, "ts": 771904.499023, "args": {"Step": 3}}, +{"name": "RecvFlushWait", "cat": "NET", "ph": "e", "id": 3, "pid": 4157654, "tid": 1, "ts": 771904.991211}, +{"name": "RecvGpuWait", "cat": "NET", "ph": "b", "id": 3, "pid": 4157654, "tid": 1, "ts": 771904.991211, "args": {"Step": 3}}, +{"name": "RecvGpuWait", "cat": "NET", "ph": "e", "id": 3, "pid": 4157654, "tid": 1, "ts": 771910.500000}, +{"name": "Send", "cat": "PROXY", "ph": "b", "id": 1, "pid": 4157654, "tid": 1, "ts": 768482.878906, "args": {"Channel": 0, "Peer": 2, "Steps": 14, "ChunkSize": 32768, "transSize": 229376, "POSTED": {"step": 14, "ts": 771995.675781}, "REM_FIFO_WAIT": {"step": 14, "ts": 772190.692383}, "TRANSMITTED": {"step": 14, "ts": 772191.516602}, "DONE": {"step": 14, "ts": 772208.473633}}}, +{"name": "SendBufferWait", "cat": "NET", "ph": "b", "id": 14, "pid": 4157654, "tid": 1, "ts": 768483.019531, "args": {"Step": 0}}, +{"name": "SendBufferWait", "cat": "NET", "ph": "e", "id": 14, "pid": 4157654, "tid": 1, "ts": 768483.300781}, +{"name": "SendGpuWait", "cat": "NET", "ph": "b", "id": 14, "pid": 4157654, "tid": 1, "ts": 768483.300781, "args": {"Step": 0}}, +{"name": "SendGpuWait", "cat": "NET", "ph": "e", "id": 14, "pid": 4157654, "tid": 1, "ts": 769594.615234}, +{"name": "SendWait", "cat": "NET", "ph": "b", "id": 14, "pid": 4157654, "tid": 1, "ts": 769594.615234, "args": {"Step": 0}}, +{"name": "SendWait", "cat": "NET", "ph": "e", "id": 14, "pid": 4157654, "tid": 1, "ts": 769618.889648}, +{"name": "SendBufferWait", "cat": "NET", "ph": "b", "id": 15, "pid": 4157654, "tid": 1, "ts": 768505.083008, "args": {"Step": 1}}, +{"name": "SendBufferWait", "cat": "NET", "ph": "e", "id": 15, "pid": 4157654, "tid": 1, "ts": 768505.163086}, +{"name": "SendGpuWait", "cat": "NET", "ph": "b", 
"id": 15, "pid": 4157654, "tid": 1, "ts": 768505.163086, "args": {"Step": 1}}, +{"name": "SendGpuWait", "cat": "NET", "ph": "e", "id": 15, "pid": 4157654, "tid": 1, "ts": 769610.555664}, +{"name": "SendWait", "cat": "NET", "ph": "b", "id": 15, "pid": 4157654, "tid": 1, "ts": 769610.555664, "args": {"Step": 1}}, +{"name": "SendWait", "cat": "NET", "ph": "e", "id": 15, "pid": 4157654, "tid": 1, "ts": 769622.517578}, +{"name": "SendBufferWait", "cat": "NET", "ph": "b", "id": 16, "pid": 4157654, "tid": 1, "ts": 768507.937500, "args": {"Step": 2}}, +{"name": "SendBufferWait", "cat": "NET", "ph": "e", "id": 16, "pid": 4157654, "tid": 1, "ts": 768508.017578}, +{"name": "SendGpuWait", "cat": "NET", "ph": "b", "id": 16, "pid": 4157654, "tid": 1, "ts": 768508.017578, "args": {"Step": 2}}, +{"name": "SendGpuWait", "cat": "NET", "ph": "e", "id": 16, "pid": 4157654, "tid": 1, "ts": 770002.129883}, +{"name": "SendWait", "cat": "NET", "ph": "b", "id": 16, "pid": 4157654, "tid": 1, "ts": 770002.129883, "args": {"Step": 2}}, +{"name": "SendWait", "cat": "NET", "ph": "e", "id": 16, "pid": 4157654, "tid": 1, "ts": 770013.848633}, +{"name": "SendBufferWait", "cat": "NET", "ph": "b", "id": 17, "pid": 4157654, "tid": 1, "ts": 768510.742188, "args": {"Step": 3}}, +{"name": "SendBufferWait", "cat": "NET", "ph": "e", "id": 17, "pid": 4157654, "tid": 1, "ts": 768510.822266}, +{"name": "SendGpuWait", "cat": "NET", "ph": "b", "id": 17, "pid": 4157654, "tid": 1, "ts": 768510.822266, "args": {"Step": 3}}, +{"name": "SendGpuWait", "cat": "NET", "ph": "e", "id": 17, "pid": 4157654, "tid": 1, "ts": 771461.563477}, +{"name": "SendWait", "cat": "NET", "ph": "b", "id": 17, "pid": 4157654, "tid": 1, "ts": 771461.563477, "args": {"Step": 3}}, +{"name": "SendWait", "cat": "NET", "ph": "e", "id": 17, "pid": 4157654, "tid": 1, "ts": 771469.171875}, + ... 
[ trace truncated for brevity ]
+{"name": "AllReduce", "cat": "COLL", "ph": "e", "id": 0, "pid": 4157654, "tid": 1, "ts": 772209.317383},
+{"name": "Group", "cat": "GROUP", "ph": "e", "id": 0, "pid": 4157654, "tid": 1, "ts": 772209.418945},
+{}]
+```
+
+Details about the fields used in the trace can be found at this link:
+https://docs.google.com/document/d/1CvAClvFfyA5R-PhYUmn5OOQtYMH4h6I0nSsKchNAySU/preview?tab=t.0#heading=h.yr4qxyxotyw
+
+The trace above is obtained by running a `ncclAllReduce` operation on 8 GPUs, communicating with each other through
+the network interface. The `Group` event encloses all traces that are related to the single `ncclAllReduce` call.
+(Note that for single collective invocations, where there are no explicit group calls, NCCL creates a group with only
+one collective, and this is what is presented in the traces above.)
+
+The `AllReduce` event encloses traces for the proxy operations associated with the `ncclAllReduce` operation. The `args`
+field in the traces contains NCCL-specific information (aside from the chrome trace event format).
+
+## AllReduce trace
+
+The `AllReduce` entry presents information about the `ncclAllReduce` operation. It contains the following info in the args field:
+
+- seqNum      : sequential number of the collective in the communicator (every collective type has its own sequence number in the communicator)
+- commHash    : communicator unique identifier
+- rank        : NCCL rank for the ncclAllReduce
+- datatype    : NCCL datatype
+- algorithm   : algorithm used to process the ncclAllReduce
+- protocol    : protocol used to process the ncclAllReduce
+- nMaxChannels: max number of channels used to process the ncclAllReduce
+
+If the proxy events are not active (e.g., the `ncclAllReduce` is intranode) the end timestamp will match the time
+consumed by the CPU to launch the collective. For more details refer to `ext-profiler/README.md`, section `Profiling
+of collective and p2p operations`.
+
+### Proxy Send
+
+The `Send` entry presents information about the `ProxyOp` processing in the progress thread. It contains the following
+info in the args field:
+
+- Channel      : id of the channel used by this proxy operation to send data to the peer
+- Peer         : peer rank
+- Steps        : number of network steps required to transfer transSize bytes to the peer
+- ChunkSize    : chunk size used by NCCL to pipeline data through the proxy thread
+- transSize    : bytes transferred across the channel by this proxy operation
+- POSTED       : struct containing the number of buffer posts to the GPU and the time stamp for the last post
+- REM_FIFO_WAIT: struct containing the number of remote buffer waits and the time stamp for the last wait
+- TRANSMITTED  : struct containing the number of network sends and the time stamp of the last send
+- DONE         : struct containing the number of network sends completed and the time stamp of the last send completed
+
+In case of a network problem, POSTED, REM_FIFO_WAIT, TRANSMITTED and DONE might all have partially updated steps,
+which can help identify at which point the network problem occurred.
+
+The Proxy Send trace gives a summary of the proxy progress thread activity for the channel. If more details are
+needed, these can be obtained by enabling the proxy step event (`ncclProfileProxyStep`), in which case the trace
+entries below are also reported by the profiler.
+
+#### Proxy SendBufferWait
+
+Presents, for every network step, the time the CPU proxy spends waiting for the channel staging buffer to become available.
+
+#### Proxy SendGPUWait
+
+Presents, for every network step, the time the CPU proxy spends waiting for the GPU to provide the data in the staging
+buffer.
+
+#### Proxy SendWait
+
+Presents, for every network step, the time the CPU proxy spends waiting for the `isend` to complete.
+
+### Proxy Recv
+
+The `Recv` entry presents information about the `ProxyOp` processing in the progress thread. It contains the following
It contains the following +info in the args field: + +- Channel : id of the channel used by this proxy operation to recv data from the peer +- Peer : peer rank +- Steps : number of network steps required to transfer transSize bytes from the peer +- ChunkSize : chunk size used by NCCL to pipeline data through the proxy thread +- transSize : bytes transferred across the channel by this proxy operation +- POSTED : struct containing the number of recvs posted and the time stamp for the last recv posted +- RECEIVED : struct containing the number of recvs completed and the time stamp for the last recv completed +- TRANSMITTED: struct containing the number of recvs flushed to the GPU memory and the time stamp for the last recv flushed +- DONE : struct containing the number of flush completed and the time stamp for the last flush completed + +The Proxy Recv trace gives a summary of the proxy progress thread activity for the channel. If more details are +needed, these can be obtained by enabling the proxy step event (`ncclProfileProxyStep`). In which case the trace +entries below are also reported by the profiler. + + +#### Proxy RecvBufferWait + +Presents, for every network step, the time the CPU proxy spends waiting for the staging buffer for the channel to +become available. + +#### Proxy RecvWait + +Presents, for every network step, the time the CPU proxy spends waiting for a posted `irecv` to complete + +#### Proxy RecvFlushWait + +Presents, for every network step, the time the CPU proxy spends waitng for the recv data to be flushed to the GPU + +#### Proxy RecvGPUWait + +Presents, for every network step, the time the CPU proxy spends waiting for the GPU to consume the recv data