Add profiler documentation
Add the following files: - ext-profiler/README.md: plugin writed documentation - ext-profiler/example/README.md: example plugin user documentation
This commit is contained in:
parent
dcdc67c40b
commit
d7ccab8b7e
318
ext-profiler/README.md
Normal file
318
ext-profiler/README.md
Normal file
@ -0,0 +1,318 @@
|
|||||||
|
# NCCL Profiler Plugin Documentation
|
||||||
|
|
||||||
|
This page describes the NCCL Profiler plugin API and how to implement a profiler plugin for NCCL.
|
||||||
|
|
||||||
|
# Overview
|
||||||
|
|
||||||
|
To allow NCCL to better integrate with DL frameworks, NCCL v2.23 introduced a profiler plugin
|
||||||
|
interface. Any NCCL user can write profiler plugins to extract performance data from NCCL and
|
||||||
|
use it for debugging and analysis.
|
||||||
|
|
||||||
|
Similarly to other plugins (e.g., network plugin), the profiler plugins come as a shared library
|
||||||
|
called `libnccl-profiler.so`. That shared library contains one or more implementations of the
|
||||||
|
NCCL PROFILER API, in the form of versioned structs, filled with pointers to all required
|
||||||
|
functions.
|
||||||
|
|
||||||
|
# Plugin architecture
|
||||||
|
|
||||||
|
## Plugin name and supporting multiple profiler plugins
|
||||||
|
|
||||||
|
When NCCL is initialized, it will look for a `libnccl-profiler.so` library and dynamically load
|
||||||
|
it, then look for symbols inside the library.
|
||||||
|
|
||||||
|
The `NCCL_PROFILER_PLUGIN` environment variable allows multiple plugins to coexist. If set, NCCL
|
||||||
|
will look for a library with a name of `libnccl-profiler-${NCCL_PROFILER_PLUGIN}.so`. It is therefore
|
||||||
|
advised to name the library following that pattern, with a symlink pointing `libnccl-profiler.so`
|
||||||
|
to `libnccl-profiler-${NCCL_PROFILER_PLUGIN}.so`. That way, if there are multiple plugins in the
|
||||||
|
path, setting `NCCL_PROFILER_PLUGIN` will allow users to select the right plugin. Alternatively,
|
||||||
|
the user can also set `NCCL_PROFILER_PLUGIN` to the pathname of the `libnccl-profiler.so` library.
|
||||||
|
|
||||||
|
## Struct versioning
|
||||||
|
|
||||||
|
Once a library is found, NCCL will look for a symbol named `ncclProfiler_vX`, with `X` increasing
|
||||||
|
over time. The versioning ensures that the plugin and the NCCL core are compatible.
|
||||||
|
|
||||||
|
Plugins are encouraged to provide multiple of those symbols, implementing multiple versions of the
|
||||||
|
NCCL PROFILER API, so that the same plugin can be compiled and support a wide range of NCCL versions.
|
||||||
|
|
||||||
|
Conversely, and to ease transition, NCCL can choose to support different plugin versions, looking
|
||||||
|
for the latest ncclProfiler struct version, but also looking for older ones so that older plugins
|
||||||
|
would still work.
|
||||||
|
|
||||||
|
## Headers management
|
||||||
|
|
||||||
|
To help users build plugins effortlessly, plugins should copy the `ncclProfiler_vX` definitions
|
||||||
|
they support to their internal includes. An example is shown in `ext-profiler/example` where we
|
||||||
|
keep all headers in the `nccl/` directory and provide thin layers to implement old version on top
|
||||||
|
of newer ones.
|
||||||
|
|
||||||
|
The `nccl/` directory is populated with `profiler_vX.h` files extracting all relevant definitions
|
||||||
|
from old API versions. It also provides error codes in `err.h`.
|
||||||
|
|
||||||
|
# API (v2)
|
||||||
|
|
||||||
|
Below is the main `ncclProfiler_v2` struct. Each function is explained in later sections.
|
||||||
|
|
||||||
|
```
|
||||||
|
typedef struct {
|
||||||
|
const char* name;
|
||||||
|
|
||||||
|
// init - initialize the profiler plugin
|
||||||
|
// Input
|
||||||
|
// - context : opaque profiler context object for separating profiler behavior across comms
|
||||||
|
// Output
|
||||||
|
// - eActivationMask: bitmask of active events set by the plugin
|
||||||
|
ncclResult_t (*init)(void** context, int* eActivationMask);
|
||||||
|
|
||||||
|
// startEvent - initialize and start a new event for the supplied event descriptor inside the eventset
|
||||||
|
// Input
|
||||||
|
// - context: opaque profiler context object
|
||||||
|
// - eDescr : pointer to ncclProfilerEventDescr_t object
|
||||||
|
// Output
|
||||||
|
// - eHandle: return event handle for supplied event descriptor object
|
||||||
|
ncclResult_t (*startEvent)(void* context, void** eHandle, ncclProfilerEventDescr_v2_t* eDescr);
|
||||||
|
|
||||||
|
// stopEvent - stop/finalize an event inside and event set
|
||||||
|
// Input
|
||||||
|
// - eHandle: handle to event object
|
||||||
|
ncclResult_t (*stopEvent)(void* eHandle);
|
||||||
|
|
||||||
|
// recordEventState - record event state transitions and event attribute updates
|
||||||
|
// Input
|
||||||
|
// - eHandle : handle to event object created through startEvent
|
||||||
|
// - eStateArgs: optional argument used to capture event attribute updates associated with the state transition
|
||||||
|
// - eState : event state transition
|
||||||
|
ncclResult_t (*recordEventState)(void* eHandle, ncclProfilerEventState_v2_t eState, ncclProfilerEventStateArgs_v2_t* eStateArgs);
|
||||||
|
|
||||||
|
// finalize - finalize the profiler plugin
|
||||||
|
// Input
|
||||||
|
// - context: opaque profiler context object
|
||||||
|
ncclResult_t (*finalize)(void* context);
|
||||||
|
} ncclProfiler_v2_t;
|
||||||
|
```
|
||||||
|
|
||||||
|
## Error codes
|
||||||
|
|
||||||
|
As rule of thumb, profiler generated errors should not be propagated to NCCL and alter its normal
|
||||||
|
functioning. Nevertheless, the profiler interface returns NCCL error codes, in case any need for
|
||||||
|
them arises in the future. For now, any profiler interface call should only return `ncclSuccess`.
|
||||||
|
The only exception is `init` that can return an error so that NCCL can disable the plugin.
|
||||||
|
|
||||||
|
## Operation overview
|
||||||
|
|
||||||
|
NCCL will call the `init` function first for every new communicator that is initialized. The profiler
|
||||||
|
returns an opaque context handle that is used to isolate profiler instances across communicators.
|
||||||
|
Similarly, NCCL will call `finalize` to destroy the profiler context, thus freeing resources.
|
||||||
|
|
||||||
|
The NCCL core code is instrumented with calls to `startEvent`, `stopEvent` and `recordEventState`.
|
||||||
|
These are used to start, stop and update events in the profiler, respectively.
|
||||||
|
|
||||||
|
## API Functions
|
||||||
|
|
||||||
|
### Initialization
|
||||||
|
|
||||||
|
#### name
|
||||||
|
|
||||||
|
The `name` field should point to a character string with the name of the profiler plugin. This will
|
||||||
|
be used for all logging, especially when `NCCL_DEBUG=INFO` is set.
|
||||||
|
|
||||||
|
#### init
|
||||||
|
|
||||||
|
As soon as NCCL finds the plugin and the correct ncclProfiler symbol, it calls its `init` function.
|
||||||
|
This allows the plugin to initialize its internal context, used during profiling of NCCL events.
|
||||||
|
If the `init` function does not return `ncclSuccess`, NCCL disables the plugin.
|
||||||
|
|
||||||
|
#### finalize
|
||||||
|
|
||||||
|
When the profiler is no longer needed, a call to `finalize` destroys the profiler context and frees
|
||||||
|
up resources.
|
||||||
|
|
||||||
|
### Profiling
|
||||||
|
|
||||||
|
#### startEvent
|
||||||
|
|
||||||
|
When NCCL needs to start profiling a new event it calls `startEvent`. `startEvent` takes the profiler
|
||||||
|
context, previously created by `init`, an event descriptor of type `ncclProfilerEventDescr_t` and
|
||||||
|
returns an opaque profiler event handle that can be passed to other profiler functions, as discussed
|
||||||
|
later in the document.
|
||||||
|
|
||||||
|
|
||||||
|
The event descriptor contains all the event metadata. Every event type has its own descriptor. Below
|
||||||
|
is the `ncclProfilerEventDescr_t` struct.
|
||||||
|
|
||||||
|
```
|
||||||
|
typedef struct {
|
||||||
|
uint8_t type; // event type (e.g., ncclProfileGroup, ncclProfileColl, ...)
|
||||||
|
void* parentObj; // pointer to parent event used to expose the event hierarchy to the profiler
|
||||||
|
int rank; // rank that generated the event
|
||||||
|
union {
|
||||||
|
struct { // collective events metadata
|
||||||
|
const char* name; // string containing name of the communicator
|
||||||
|
uint64_t commHash; // unique hash/id for the communicator
|
||||||
|
uint64_t seqNumber; // sequence number of this collective operation in the communicator
|
||||||
|
const char* func; // string containing name of the collective
|
||||||
|
void const* sendBuff; // address of send buffer
|
||||||
|
void* recvBuff; // address of recv buffer
|
||||||
|
size_t count; // data count
|
||||||
|
int root; // root rank
|
||||||
|
const char* datatype; // string containing the name of the datatype
|
||||||
|
size_t trafficBytes; // number of transfer bytes
|
||||||
|
uint8_t nMaxChannels; // max number of channels for this collective
|
||||||
|
uint8_t nWarps; // number of GPU warps for this collective
|
||||||
|
const char* algo; // string containing name of the algorithm for this collective
|
||||||
|
const char* proto; // string containing name of the protocol for this collective
|
||||||
|
} coll;
|
||||||
|
|
||||||
|
struct { // point-to-point events metadata
|
||||||
|
const char* name;
|
||||||
|
uint64_t commHash;
|
||||||
|
const char* func;
|
||||||
|
void* buff;
|
||||||
|
const char* datatype;
|
||||||
|
size_t count;
|
||||||
|
int peer; // peer rank for this point-to-point
|
||||||
|
} p2p;
|
||||||
|
|
||||||
|
struct { // proxyOp events metadata
|
||||||
|
pid_t pid; // process id that generated the associated `ncclProxyOp` object
|
||||||
|
uint8_t channelId; // id of the channel used by the associated `ncclProxyOp` object
|
||||||
|
int peer; // peer rank
|
||||||
|
int nSteps; // number of network transfers/steps required by the `ncclProxyOp`
|
||||||
|
int chunkSize; // chunk size for this `ncclProxyOp`
|
||||||
|
int isSend; // set to 1 for sends and 0 for recvs
|
||||||
|
} proxyOp;
|
||||||
|
|
||||||
|
struct { // proxyStep events metadata
|
||||||
|
int step; // individual step in `ncclProxyOp`
|
||||||
|
} proxyStep;
|
||||||
|
};
|
||||||
|
} ncclProfilerEventDescr_v2_t;
|
||||||
|
```
|
||||||
|
|
||||||
|
NCCL defines the following events: `ncclProfileGroup`, `ncclProfileColl`, `ncclProfileP2p`,
|
||||||
|
`ncclProfileProxyOp`, `ncclProfileProxyStep`, and `ncclProfileProxyCtrl`.
|
||||||
|
|
||||||
|
#### stopEvent
|
||||||
|
|
||||||
|
`stopEvent` takes the event handle returned by `startEvent` to stop the event. After the event
|
||||||
|
has been stopped the handle can no longer be used with other profiler calls. Using the event
|
||||||
|
handle after `eventStop` is undefined behavior.
|
||||||
|
|
||||||
|
#### recordEventState
|
||||||
|
|
||||||
|
Some events can only be started and stopped. For example, `ncclProfileGroup`, `ncclProfileColl`,
|
||||||
|
`ncclProfileP2p`, cannot be updated through calls to `recordEventState`.
|
||||||
|
|
||||||
|
`ncclProfileProxyOp`, `ncclProfileProxyStep` and `ncclProfileProxyCtrl` can be updated through
|
||||||
|
calls to `recordEventState`.
|
||||||
|
|
||||||
|
The state of proxy generated events can be updated, along with event attributes, using
|
||||||
|
`recordEventState`. These events can go through several states during their lifecycle.
|
||||||
|
The list of supported states for the proxy-defined events is reported below.
|
||||||
|
|
||||||
|
```
|
||||||
|
typedef enum {
|
||||||
|
// ncclProfileProxyOp event states
|
||||||
|
ncclProfilerProxyOpSendPosted, // state marks the posting of send buffer to GPU for given network transfer/step
|
||||||
|
ncclProfilerProxyOpSendRemFifoWait, // state marks the waiting of CTS credits from peer rank
|
||||||
|
ncclProfilerProxyOpSendTransmitted, // state marks the sending of network transfer/step to peer rank
|
||||||
|
ncclProfilerProxyOpSendDone, // state marks the ending of network transfer/step
|
||||||
|
ncclProfilerProxyOpRecvPosted, // state marks the posting of recv to network for given network transfer/step
|
||||||
|
ncclProfilerProxyOpRecvReceived, // state marks the recving of network transfer/step from peer rank
|
||||||
|
ncclProfilerProxyOpRecvTransmitted, // state marks the ending of the network transfer/step
|
||||||
|
ncclProfilerProxyOpRecvDone, // state marks the consuming of data from GPU
|
||||||
|
|
||||||
|
// ncclProfileProxyStep event states
|
||||||
|
ncclProfilerProxyStepSendGPUWait, // state marks the waiting of send data from GPU for given network transfer/step
|
||||||
|
ncclProfilerProxyStepSendWait, // state marks the waiting of send data from network for given network transfer/step
|
||||||
|
ncclProfilerProxyStepRecvWait, // state marks the waiting of recv data from network for given network transfer/step
|
||||||
|
ncclProfilerProxyStepRecvFlushWait, // state marks the waiting of recv data flush to GPU for given network transfer/step
|
||||||
|
ncclProfilerProxyStepRecvGPUWait, // state marks the waiting of recv data consumption from GPU for given network transfer/step
|
||||||
|
|
||||||
|
// ncclProfileProxyCtrl event states
|
||||||
|
ncclProfilerProxyCtrlIdle, // state marks proxy progress thread idle
|
||||||
|
ncclProfilerProxyCtrlActive, // state marks proxy progress thread active
|
||||||
|
ncclProfilerProxyCtrlSleep, // state marks proxy progress thread sleeping
|
||||||
|
ncclProfilerProxyCtrlWakeup, // state marks proxy progress thread waking up
|
||||||
|
ncclProfilerProxyCtrlAppend, // state marks append of new network work item begin
|
||||||
|
ncclProfilerProxyCtrlAppendEnd, // state marks append of new network work item end
|
||||||
|
} ncclProfilerEventState_v2_t;
|
||||||
|
```
|
||||||
|
|
||||||
|
`ncclProfileProxyOp` events are generated by the proxy progress thread while it is processing
|
||||||
|
network requests for the GPU kernel. ProxyOp events are generated for every active channel and
|
||||||
|
provide a summary of the activity of the proxy progress thread for that channel.
|
||||||
|
|
||||||
|
`ncclProfileProxyStep` events are generated by the proxy progress thread while it is processing
|
||||||
|
network requests for the GPU kernel. ProxyStep events describe individual network transfer in
|
||||||
|
the channel. Thus, they provide a more fine-grained view w.r.t. ProxyOp events.
|
||||||
|
|
||||||
|
`ncclProfileProxyCtrl` events are generated by the proxy progress thread while it is not processing
|
||||||
|
network requests for the GPU kernel. This includes everything else that the proxy thread might be
|
||||||
|
doing, including appending new `ncclProxyOp` objects to the list of work elements to process.
|
||||||
|
|
||||||
|
State transitions for the events described can also come with event attribute updates. For this
|
||||||
|
reason the profiler defines the `ncclProfilerEventStateArgs_t` struct, reported below.
|
||||||
|
|
||||||
|
```
|
||||||
|
typedef union {
|
||||||
|
struct { // attributes to update for ncclProfileProxyOp events
|
||||||
|
size_t transSize; // data transferred thus far
|
||||||
|
int steps; // network transfer/steps processed thus far
|
||||||
|
} proxyOp;
|
||||||
|
|
||||||
|
struct { // attributes to update for ncclProfileProxyCtrl
|
||||||
|
int appendedProxyOps; // number of appended proxy ops thus far
|
||||||
|
} proxyCtrl;
|
||||||
|
} ncclProfilerEventStateArgs_v2_t;
|
||||||
|
```
|
||||||
|
|
||||||
|
The example profiler in `ext-profiler/example` contains details on how to capture and use the events above.
|
||||||
|
|
||||||
|
### Event hierarchy
|
||||||
|
|
||||||
|
NCCL core events (reported above) are organized into a hierarchy as reported below:
|
||||||
|
|
||||||
|
```
|
||||||
|
Group event
|
||||||
|
|
|
||||||
|
+- Collective event
|
||||||
|
| |
|
||||||
|
| +- ProxyOp event
|
||||||
|
| |
|
||||||
|
| +- ProxyStep event
|
||||||
|
|
|
||||||
|
+- Point-to-point event
|
||||||
|
|
|
||||||
|
+- ProxyOp event
|
||||||
|
|
|
||||||
|
+- ProxyStep event
|
||||||
|
|
||||||
|
ProxyCtrl event
|
||||||
|
```
|
||||||
|
|
||||||
|
# Profiler instrumentation and logging
|
||||||
|
|
||||||
|
## Profiling of collective and p2p operations
|
||||||
|
|
||||||
|
The NCCL code is instrumented with profiler callbacks at different levels to capture start/stop of groups,
|
||||||
|
collective and point-to-point operations, as well as proxy progress activity. Due to the asynchronous nature
|
||||||
|
of NCCL operations, events associated to collective and point-to-point operations are not easy to delimit
|
||||||
|
precisely. For example, without both proxy and/or kernel activity it is impossible for the profiler to
|
||||||
|
figure out when a collective operation completes. Therefore, `stopEvent` for collectives simply indicates to
|
||||||
|
the profiler that the collective has been enqueued. The profiler can leverage proxy event information, if
|
||||||
|
these are enabled, to estimate when the collective ends. In this case, the profiler can look at the `stopEvent`
|
||||||
|
call of the last `ncclProfileProxyOp` event to mark the completion of the associated collective event. This
|
||||||
|
can be achieved by reference counting the collective event and letting calls to `startEvent` and `stopEvent`
|
||||||
|
increment and decrement the reference counter, respectively.
|
||||||
|
|
||||||
|
## PXN
|
||||||
|
|
||||||
|
PXN causes some proxy operations to be processed in a remote proxy thread that differs from the one that
|
||||||
|
generated the operation. When this happens, the event hierarchy reported above breaks. Because the
|
||||||
|
profiler can use the hierarchy information, provided by NCCL in the event descriptor, to dereference the
|
||||||
|
parent event during `startEvent`, the remote proxy thread must be in the same address space of the proxy
|
||||||
|
thread originating the operation. To avoid the profiler instance in the remote proxy address space to
|
||||||
|
dereference a pointer from another address space the event descriptor includes the PID of the originator.
|
||||||
|
The profiler plugin needs to check that the originator PID matches the local PID before dereferencing the
|
||||||
|
parent event.
|
239
ext-profiler/example/README.md
Normal file
239
ext-profiler/example/README.md
Normal file
@ -0,0 +1,239 @@
|
|||||||
|
# NCCL Example Profiler Plugin Usage
|
||||||
|
|
||||||
|
This page describes how to use the NCCL example profiler plugin
|
||||||
|
|
||||||
|
# Overview
|
||||||
|
|
||||||
|
The example profiler plugin implements the NCCL profiler plugin API introduced in NCCL v2.23. The API
|
||||||
|
defines a set of events and data structures that NCCL uses to share event information with profiler
|
||||||
|
plugins. The user can control what events are instrumented by NCCL and when traces collected by the
|
||||||
|
profiler should be dumped through environment variables, as described in the rest of the document.
|
||||||
|
The user can also control other profiler parameters that alter its behavior. For example, users can
|
||||||
|
change the size of the event window the profiler keeps track of.
|
||||||
|
|
||||||
|
## Building the profiler plugin
|
||||||
|
|
||||||
|
To use the example plugin, just type `make`. You will need a NCCL build's include directory present.
|
||||||
|
You can override `NCCL_HOME` to where the NCCL installation is on your system.
|
||||||
|
|
||||||
|
## Using the profiler plugin
|
||||||
|
|
||||||
|
1. Add the directory of this profiler plugin to your `LD_LIBRARY_PATH` or set the `NCCL_PROFILER_PLUGIN`,
|
||||||
|
as documented in `ext-profiler/README.md`.
|
||||||
|
|
||||||
|
2. Set `NCCL_PROFILE_EVENT_MASK` bitmask to specify the NCCL events you want to instrument. By
|
||||||
|
default, all collectives and send/recv operations will be traced. For more details about the event
|
||||||
|
representation used by the profiler refer to `ext-profiler/README.md`.
|
||||||
|
|
||||||
|
As an example, setting:
|
||||||
|
|
||||||
|
`NCCL_PROFILE_EVENT_MASK` to 1 (`ncclProfileGroup`) | 2 (`ncclProfileColl`) | 8 (`ncclProfileProxyOp`)
|
||||||
|
|
||||||
|
enables the profiling of the group, the collective and the proxy op events. The same events can be
|
||||||
|
expressed more concisely by setting `NCCL_PROFILE_EVENT_MASK` to 8 (`ncclProfileProxyOp`). Indeed,
|
||||||
|
in NCCL all the events above (in the event hierarchy) the one requested are also captured. The advantage
|
||||||
|
is that the profiler can easily correlate events that belong to the same NCCL operation and present
|
||||||
|
them accordingly.
|
||||||
|
|
||||||
|
3. Set `NCCL_PROFILE_DUMP_FILE` to the name of the dump file for the collected traces. A file named
|
||||||
|
${NCCL_PROFILE_DUMP_FILE}-hostname-tid.txt is created. Profiler traces are saved using the chrome
|
||||||
|
event format (more precisely, using asynchronous events).
|
||||||
|
|
||||||
|
4. If you set the dump file variable, type chrome://tracing on your chromium browser search bar and
|
||||||
|
open the created dump file to visualize the traces.
|
||||||
|
|
||||||
|
# Changing the profiler memory pool sizes
|
||||||
|
|
||||||
|
The example profiler uses separate memory pools for different types of events. The size of these memory
|
||||||
|
pools (i.e., the # events) determines the number of events that the profiler can keep track of at the
|
||||||
|
same time. When NCCL requests a new event (e.g., collective event) to profile a `ncclAllReduce`
|
||||||
|
operation, by calling `startEvent`, the profiler searches in the collective pool for a free event. If it
|
||||||
|
finds one, it marks it as in use and returns the handle to NCCL. If the pool is completely used the
|
||||||
|
profiler returns `NULL` to NCCL and ignores all the following NCCL profiler calls for the `NULL` event
|
||||||
|
handle. When the `ncclAllReduce` has been processed, NCCL calls `stopEvent` with the previosly returned
|
||||||
|
event handle. The profiler has a total of 5 memory pools.
|
||||||
|
|
||||||
|
The group, collective and p2p pools contain objects for the corresponding events. The `ProxyCtrl` pool
|
||||||
|
contains objects for `ProxyCtrl` events and the `ProxyDetach` pool contains objects for `ProxyOp` events
|
||||||
|
generated by remote proxies. A list of pools and their size is reported below:
|
||||||
|
|
||||||
|
- `NCCL_PROFILE_GROUP_POOL_SIZE` (16)
|
||||||
|
- `NCCL_PROFILE_COLL_POOL_SIZE` (16)
|
||||||
|
- `NCCL_PROFILE_P2P_POOL_SIZE` (1024)
|
||||||
|
- `NCCL_PROFILE_PROXY_CTRL_POOL_SIZE` (16)
|
||||||
|
- `NCCL_PROFILE_PROXY_DETACH_POOL_SIZE` (128)
|
||||||
|
|
||||||
|
Remote proxy operations are generated when PXN is in use. Refer to this article for more information
|
||||||
|
about PXN and how it works:
|
||||||
|
https://developer.nvidia.com/blog/doubling-all2all-performance-with-nvidia-collective-communication-library-2-12/
|
||||||
|
|
||||||
|
# Reported events
|
||||||
|
|
||||||
|
The example profiler generates traces using the json format. An example of trace is reported below:
|
||||||
|
|
||||||
|
```
|
||||||
|
[
|
||||||
|
{"name": "Group", "cat": "GROUP", "ph": "b", "id": 0, "pid": 4157654, "tid": 1, "ts": 764234.611328, "args": {"groupId": 0}},
|
||||||
|
{"name": "AllReduce", "cat": "COLL", "ph": "b", "id": 0, "pid": 4157654, "tid": 1, "ts": 764237.294922, "args": {"SeqNum": 0, "CommHash": 673864846479792718, "Rank": 1, "Count": 32768, "Datatype": "ncclFloat32", "Algorithm": "RING", "Protocol": "LL", "nMaxChannels": 2}},
|
||||||
|
{"name": "Recv", "cat": "PROXY", "ph": "b", "id": 0, "pid": 4157654, "tid": 1, "ts": 768464.936523, "args": {"Channel": 0, "Peer": 0, "Steps": 14, "ChunkSize": 32768, "transSize": 229376, "POSTED": {"step": 14, "ts": 772020.300781}, "RECEIVED": {"step": 14, "ts": 772196.049805}, "TRANSMITTED": {"step": 14, "ts": 772197.326172}, "DONE": {"step": 14, "ts": 772201.538086}}},
|
||||||
|
{"name": "RecvBufferWait", "cat": "NET", "ph": "b", "id": 0, "pid": 4157654, "tid": 1, "ts": 768465.158203, "args": {"Step": 0}},
|
||||||
|
{"name": "RecvBufferWait", "cat": "NET", "ph": "e", "id": 0, "pid": 4157654, "tid": 1, "ts": 768477.924805},
|
||||||
|
{"name": "RecvWait", "cat": "NET", "ph": "b", "id": 0, "pid": 4157654, "tid": 1, "ts": 768477.924805, "args": {"Step": 0}},
|
||||||
|
{"name": "RecvWait", "cat": "NET", "ph": "e", "id": 0, "pid": 4157654, "tid": 1, "ts": 768547.197266},
|
||||||
|
{"name": "RecvFlushWait", "cat": "NET", "ph": "b", "id": 0, "pid": 4157654, "tid": 1, "ts": 768547.197266, "args": {"Step": 0}},
|
||||||
|
{"name": "RecvFlushWait", "cat": "NET", "ph": "e", "id": 0, "pid": 4157654, "tid": 1, "ts": 768564.174805},
|
||||||
|
{"name": "RecvGpuWait", "cat": "NET", "ph": "b", "id": 0, "pid": 4157654, "tid": 1, "ts": 768564.174805, "args": {"Step": 0}},
|
||||||
|
{"name": "RecvGpuWait", "cat": "NET", "ph": "e", "id": 0, "pid": 4157654, "tid": 1, "ts": 768568.276367},
|
||||||
|
{"name": "RecvBufferWait", "cat": "NET", "ph": "b", "id": 1, "pid": 4157654, "tid": 1, "ts": 768503.604492, "args": {"Step": 1}},
|
||||||
|
{"name": "RecvBufferWait", "cat": "NET", "ph": "e", "id": 1, "pid": 4157654, "tid": 1, "ts": 768504.549805},
|
||||||
|
{"name": "RecvWait", "cat": "NET", "ph": "b", "id": 1, "pid": 4157654, "tid": 1, "ts": 768504.549805, "args": {"Step": 1}},
|
||||||
|
{"name": "RecvWait", "cat": "NET", "ph": "e", "id": 1, "pid": 4157654, "tid": 1, "ts": 769994.490234},
|
||||||
|
{"name": "RecvFlushWait", "cat": "NET", "ph": "b", "id": 1, "pid": 4157654, "tid": 1, "ts": 769994.490234, "args": {"Step": 1}},
|
||||||
|
{"name": "RecvFlushWait", "cat": "NET", "ph": "e", "id": 1, "pid": 4157654, "tid": 1, "ts": 769995.012695},
|
||||||
|
{"name": "RecvGpuWait", "cat": "NET", "ph": "b", "id": 1, "pid": 4157654, "tid": 1, "ts": 769995.012695, "args": {"Step": 1}},
|
||||||
|
{"name": "RecvGpuWait", "cat": "NET", "ph": "e", "id": 1, "pid": 4157654, "tid": 1, "ts": 770006.914062},
|
||||||
|
{"name": "RecvBufferWait", "cat": "NET", "ph": "b", "id": 2, "pid": 4157654, "tid": 1, "ts": 768506.941406, "args": {"Step": 2}},
|
||||||
|
{"name": "RecvBufferWait", "cat": "NET", "ph": "e", "id": 2, "pid": 4157654, "tid": 1, "ts": 768507.435547},
|
||||||
|
{"name": "RecvWait", "cat": "NET", "ph": "b", "id": 2, "pid": 4157654, "tid": 1, "ts": 768507.435547, "args": {"Step": 2}},
|
||||||
|
{"name": "RecvWait", "cat": "NET", "ph": "e", "id": 2, "pid": 4157654, "tid": 1, "ts": 771452.536133},
|
||||||
|
{"name": "RecvFlushWait", "cat": "NET", "ph": "b", "id": 2, "pid": 4157654, "tid": 1, "ts": 771452.536133, "args": {"Step": 2}},
|
||||||
|
{"name": "RecvFlushWait", "cat": "NET", "ph": "e", "id": 2, "pid": 4157654, "tid": 1, "ts": 771453.060547},
|
||||||
|
{"name": "RecvGpuWait", "cat": "NET", "ph": "b", "id": 2, "pid": 4157654, "tid": 1, "ts": 771453.060547, "args": {"Step": 2}},
|
||||||
|
{"name": "RecvGpuWait", "cat": "NET", "ph": "e", "id": 2, "pid": 4157654, "tid": 1, "ts": 771468.458008},
|
||||||
|
{"name": "RecvBufferWait", "cat": "NET", "ph": "b", "id": 3, "pid": 4157654, "tid": 1, "ts": 768509.484375, "args": {"Step": 3}},
|
||||||
|
{"name": "RecvBufferWait", "cat": "NET", "ph": "e", "id": 3, "pid": 4157654, "tid": 1, "ts": 768510.250000},
|
||||||
|
{"name": "RecvWait", "cat": "NET", "ph": "b", "id": 3, "pid": 4157654, "tid": 1, "ts": 768510.250000, "args": {"Step": 3}},
|
||||||
|
{"name": "RecvWait", "cat": "NET", "ph": "e", "id": 3, "pid": 4157654, "tid": 1, "ts": 771904.499023},
|
||||||
|
{"name": "RecvFlushWait", "cat": "NET", "ph": "b", "id": 3, "pid": 4157654, "tid": 1, "ts": 771904.499023, "args": {"Step": 3}},
|
||||||
|
{"name": "RecvFlushWait", "cat": "NET", "ph": "e", "id": 3, "pid": 4157654, "tid": 1, "ts": 771904.991211},
|
||||||
|
{"name": "RecvGpuWait", "cat": "NET", "ph": "b", "id": 3, "pid": 4157654, "tid": 1, "ts": 771904.991211, "args": {"Step": 3}},
|
||||||
|
{"name": "RecvGpuWait", "cat": "NET", "ph": "e", "id": 3, "pid": 4157654, "tid": 1, "ts": 771910.500000},
|
||||||
|
{"name": "Send", "cat": "PROXY", "ph": "b", "id": 1, "pid": 4157654, "tid": 1, "ts": 768482.878906, "args": {"Channel": 0, "Peer": 2, "Steps": 14, "ChunkSize": 32768, "transSize": 229376, "POSTED": {"step": 14, "ts": 771995.675781}, "REM_FIFO_WAIT": {"step": 14, "ts": 772190.692383}, "TRANSMITTED": {"step": 14, "ts": 772191.516602}, "DONE": {"step": 14, "ts": 772208.473633}}},
|
||||||
|
{"name": "SendBufferWait", "cat": "NET", "ph": "b", "id": 14, "pid": 4157654, "tid": 1, "ts": 768483.019531, "args": {"Step": 0}},
|
||||||
|
{"name": "SendBufferWait", "cat": "NET", "ph": "e", "id": 14, "pid": 4157654, "tid": 1, "ts": 768483.300781},
|
||||||
|
{"name": "SendGpuWait", "cat": "NET", "ph": "b", "id": 14, "pid": 4157654, "tid": 1, "ts": 768483.300781, "args": {"Step": 0}},
|
||||||
|
{"name": "SendGpuWait", "cat": "NET", "ph": "e", "id": 14, "pid": 4157654, "tid": 1, "ts": 769594.615234},
|
||||||
|
{"name": "SendWait", "cat": "NET", "ph": "b", "id": 14, "pid": 4157654, "tid": 1, "ts": 769594.615234, "args": {"Step": 0}},
|
||||||
|
{"name": "SendWait", "cat": "NET", "ph": "e", "id": 14, "pid": 4157654, "tid": 1, "ts": 769618.889648},
|
||||||
|
{"name": "SendBufferWait", "cat": "NET", "ph": "b", "id": 15, "pid": 4157654, "tid": 1, "ts": 768505.083008, "args": {"Step": 1}},
|
||||||
|
{"name": "SendBufferWait", "cat": "NET", "ph": "e", "id": 15, "pid": 4157654, "tid": 1, "ts": 768505.163086},
|
||||||
|
{"name": "SendGpuWait", "cat": "NET", "ph": "b", "id": 15, "pid": 4157654, "tid": 1, "ts": 768505.163086, "args": {"Step": 1}},
|
||||||
|
{"name": "SendGpuWait", "cat": "NET", "ph": "e", "id": 15, "pid": 4157654, "tid": 1, "ts": 769610.555664},
|
||||||
|
{"name": "SendWait", "cat": "NET", "ph": "b", "id": 15, "pid": 4157654, "tid": 1, "ts": 769610.555664, "args": {"Step": 1}},
|
||||||
|
{"name": "SendWait", "cat": "NET", "ph": "e", "id": 15, "pid": 4157654, "tid": 1, "ts": 769622.517578},
|
||||||
|
{"name": "SendBufferWait", "cat": "NET", "ph": "b", "id": 16, "pid": 4157654, "tid": 1, "ts": 768507.937500, "args": {"Step": 2}},
|
||||||
|
{"name": "SendBufferWait", "cat": "NET", "ph": "e", "id": 16, "pid": 4157654, "tid": 1, "ts": 768508.017578},
|
||||||
|
{"name": "SendGpuWait", "cat": "NET", "ph": "b", "id": 16, "pid": 4157654, "tid": 1, "ts": 768508.017578, "args": {"Step": 2}},
|
||||||
|
{"name": "SendGpuWait", "cat": "NET", "ph": "e", "id": 16, "pid": 4157654, "tid": 1, "ts": 770002.129883},
|
||||||
|
{"name": "SendWait", "cat": "NET", "ph": "b", "id": 16, "pid": 4157654, "tid": 1, "ts": 770002.129883, "args": {"Step": 2}},
|
||||||
|
{"name": "SendWait", "cat": "NET", "ph": "e", "id": 16, "pid": 4157654, "tid": 1, "ts": 770013.848633},
|
||||||
|
{"name": "SendBufferWait", "cat": "NET", "ph": "b", "id": 17, "pid": 4157654, "tid": 1, "ts": 768510.742188, "args": {"Step": 3}},
|
||||||
|
{"name": "SendBufferWait", "cat": "NET", "ph": "e", "id": 17, "pid": 4157654, "tid": 1, "ts": 768510.822266},
|
||||||
|
{"name": "SendGpuWait", "cat": "NET", "ph": "b", "id": 17, "pid": 4157654, "tid": 1, "ts": 768510.822266, "args": {"Step": 3}},
|
||||||
|
{"name": "SendGpuWait", "cat": "NET", "ph": "e", "id": 17, "pid": 4157654, "tid": 1, "ts": 771461.563477},
|
||||||
|
{"name": "SendWait", "cat": "NET", "ph": "b", "id": 17, "pid": 4157654, "tid": 1, "ts": 771461.563477, "args": {"Step": 3}},
|
||||||
|
{"name": "SendWait", "cat": "NET", "ph": "e", "id": 17, "pid": 4157654, "tid": 1, "ts": 771469.171875},
|
||||||
|
... [ trace truncated for brevity ]
|
||||||
|
{"name": "AllReduce", "cat": "COLL", "ph": "e", "id": 0, "pid": 4157654, "tid": 1, "ts": 772209.317383},
|
||||||
|
{"name": "Group", "cat": "GROUP", "ph": "e", "id": 0, "pid": 4157654, "tid": 1, "ts": 772209.418945},
|
||||||
|
{}]
|
||||||
|
```
|
||||||
|
|
||||||
|
Details about the fields used in the trace can be found at this link:
|
||||||
|
https://docs.google.com/document/d/1CvAClvFfyA5R-PhYUmn5OOQtYMH4h6I0nSsKchNAySU/preview?tab=t.0#heading=h.yr4qxyxotyw
|
||||||
|
|
||||||
|
The trace above is obtained by running a `ncclAllReduce` operation on 8 GPUs, communicating with each other through
|
||||||
|
the network interface. The `Group` event encloses all traces that are related to the single `ncclAllReduce` call.
|
||||||
|
(Note that for single collective invocations, where there are no explicit group calls, NCCL creates a group with only
|
||||||
|
one collective and this is what is presented in the traces above).
|
||||||
|
|
||||||
|
|
||||||
|
The `AllReduce` event encloses traces for the proxy operation associated to the `ncclAllReduce` operation. The `args`
|
||||||
|
field in the traces contains NCCL specific information (aside from the chrome trace event format).
|
||||||
|
|
||||||
|
## AllReduce trace
|
||||||
|
|
||||||
|
The `AllReduce` entry presents information about the `ncclAllReduce` operation. It contains the following info in the args field:
|
||||||
|
|
||||||
|
- seqNum : sequential number of the collective in the communicator (every collective type has its own sequence number in the communicator)
|
||||||
|
- commHash : communicator unique identifier
|
||||||
|
- rank : NCCL rank for the ncclAllReduce
|
||||||
|
- datatype : NCCL datatype
|
||||||
|
- algorithm : algorithm used to process the ncclAllReduce
|
||||||
|
- protocol : protocol used to process the ncclAllReduce
|
||||||
|
- nMaxChannels: max number of channels used to process the ncclAllReduce
|
||||||
|
|
||||||
|
If the proxy events are not active (e.g., the `ncclAllReduce` is intranode) the end timestamp will match the time
|
||||||
|
consumed by the CPU to launch the collective. For more details refer to `ext-profiler/README.md`, section `Profiling
|
||||||
|
of collective and p2p operations`.
|
||||||
|
|
||||||
|
### Proxy Send
|
||||||
|
The `Send` entry presents information about the `ProxyOp` processing in the progress thread. It contains the following
|
||||||
|
info in the args field:
|
||||||
|
|
||||||
|
- Channel : id of the channel used by this proxy operation to send data to the peer
|
||||||
|
- Peer : peer rank
|
||||||
|
- Steps : number of network steps required to transfer transSize bytes to the peer
|
||||||
|
- ChunkSize : chunk size used by NCCL to pipeline data through the proxy thread
|
||||||
|
- transSize : bytes transferred across the channel by this proxy operation
|
||||||
|
- POSTED : struct containing the number of buffer posts to the GPU and the time stamp for the last post
|
||||||
|
- REM_FIFO_WAIT: struct containing the number of remote buffer waits and the time stamp for the last wait
|
||||||
|
- TRANSMITTED : struct containing the number of network sends and the time stamp of the last send
|
||||||
|
- DONE : struct containing the number of network sends completed and the time stamp of the last send completed
|
||||||
|
|
||||||
|
In case of a network problem the POSTED, REM_FIFO_WAIT, TRANSMITTED and DONE might all have partially updated steps,
|
||||||
|
which could help identify at which point the network problem occurred.
|
||||||
|
|
||||||
|
The Proxy send trace gives a summary of the proxy progress thread activity for the channel. If more details are
|
||||||
|
needed, these can be obtained by enabling the proxy step event (`ncclProfileProxyStep`). In which case the trace
|
||||||
|
entries below are also reported by the profiler.
|
||||||
|
|
||||||
|
#### Proxy SendBufferWait
|
||||||
|
|
||||||
|
Presents, for every network step, the time the CPU proxy spends waiting for the channel staging buffer to become available.
|
||||||
|
|
||||||
|
#### Proxy SendGPUWait
|
||||||
|
|
||||||
|
Presents, for every network step, the time the CPU proxy spends waiting for the GPU to provide the data in the staging
|
||||||
|
buffer.
|
||||||
|
|
||||||
|
#### Proxy SendWait
|
||||||
|
|
||||||
|
Presents, for every network step, the time the CPU proxy spends waiting for the `isend` to complete
|
||||||
|
|
||||||
|
### Proxy Recv
|
||||||
|
|
||||||
|
The `Recv` entry presents information about the `ProxyOp` processing in the progress thread. It contains the following
|
||||||
|
info in the args field:
|
||||||
|
|
||||||
|
- Channel : id of the channel used by this proxy operation to recv data from the peer
|
||||||
|
- Peer : peer rank
|
||||||
|
- Steps : number of network steps required to transfer transSize bytes from the peer
|
||||||
|
- ChunkSize : chunk size used by NCCL to pipeline data through the proxy thread
|
||||||
|
- transSize : bytes transferred across the channel by this proxy operation
|
||||||
|
- POSTED : struct containing the number of recvs posted and the time stamp for the last recv posted
|
||||||
|
- RECEIVED : struct containing the number of recvs completed and the time stamp for the last recv completed
|
||||||
|
- TRANSMITTED: struct containing the number of recvs flushed to the GPU memory and the time stamp for the last recv flushed
|
||||||
|
- DONE : struct containing the number of flush completed and the time stamp for the last flush completed
|
||||||
|
|
||||||
|
The Proxy Recv trace gives a summary of the proxy progress thread activity for the channel. If more details are
|
||||||
|
needed, these can be obtained by enabling the proxy step event (`ncclProfileProxyStep`). In which case the trace
|
||||||
|
entries below are also reported by the profiler.
|
||||||
|
|
||||||
|
|
||||||
|
#### Proxy RecvBufferWait
|
||||||
|
|
||||||
|
Presents, for every network step, the time the CPU proxy spends waiting for the staging buffer for the channel to
|
||||||
|
become available.
|
||||||
|
|
||||||
|
#### Proxy RecvWait
|
||||||
|
|
||||||
|
Presents, for every network step, the time the CPU proxy spends waiting for a posted `irecv` to complete
|
||||||
|
|
||||||
|
#### Proxy RecvFlushWait
|
||||||
|
|
||||||
|
Presents, for every network step, the time the CPU proxy spends waitng for the recv data to be flushed to the GPU
|
||||||
|
|
||||||
|
#### Proxy RecvGPUWait
|
||||||
|
|
||||||
|
Presents, for every network step, the time the CPU proxy spends waiting for the GPU to consume the recv data
|
Loading…
x
Reference in New Issue
Block a user