Add profiler documentation

Add the following files:
- ext-profiler/README.md: plugin writer documentation
- ext-profiler/example/README.md: example plugin user documentation

parent dcdc67c40b
commit d7ccab8b7e

318 ext-profiler/README.md Normal file
@ -0,0 +1,318 @@

# NCCL Profiler Plugin Documentation

This page describes the NCCL Profiler plugin API and how to implement a profiler plugin for NCCL.

# Overview

To allow NCCL to better integrate with DL frameworks, NCCL v2.23 introduced a profiler plugin
interface. Any NCCL user can write profiler plugins to extract performance data from NCCL and
use it for debugging and analysis.

Like other plugins (e.g., the network plugin), a profiler plugin comes as a shared library
called `libnccl-profiler.so`. That shared library contains one or more implementations of the
NCCL PROFILER API, in the form of versioned structs filled with pointers to all required
functions.

# Plugin architecture

## Plugin name and supporting multiple profiler plugins

When NCCL is initialized, it looks for a `libnccl-profiler.so` library, dynamically loads
it, and then looks for symbols inside the library.

The `NCCL_PROFILER_PLUGIN` environment variable allows multiple plugins to coexist. If set, NCCL
looks for a library named `libnccl-profiler-${NCCL_PROFILER_PLUGIN}.so`. It is therefore
advised to name the library following that pattern, with a symlink pointing `libnccl-profiler.so`
to `libnccl-profiler-${NCCL_PROFILER_PLUGIN}.so`. That way, if there are multiple plugins in the
path, setting `NCCL_PROFILER_PLUGIN` lets users select the right plugin. Alternatively,
the user can set `NCCL_PROFILER_PLUGIN` to the pathname of the `libnccl-profiler.so` library.

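The naming convention above can be illustrated with a small helper. This is only a sketch of the convention, not NCCL's loader code: the function name `exampleProfilerLibName` is hypothetical, and treating any value that contains a `/` as a pathname is an assumption made for this example.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: derives the library name a user would expect NCCL to
 * load for a given NCCL_PROFILER_PLUGIN value. Returns snprintf's result. */
static int exampleProfilerLibName(const char* pluginEnv, char* out, size_t outLen) {
  /* Variable unset or empty: default library name. */
  if (pluginEnv == NULL || pluginEnv[0] == '\0')
    return snprintf(out, outLen, "libnccl-profiler.so");
  /* Assumption for this sketch: a slash means the value is a pathname. */
  if (strchr(pluginEnv, '/') != NULL)
    return snprintf(out, outLen, "%s", pluginEnv);
  /* Otherwise the value selects libnccl-profiler-<value>.so. */
  return snprintf(out, outLen, "libnccl-profiler-%s.so", pluginEnv);
}
```

For instance, passing `getenv("NCCL_PROFILER_PLUGIN")` with the variable set to `example` yields `libnccl-profiler-example.so`.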
## Struct versioning

Once a library is found, NCCL looks for a symbol named `ncclProfiler_vX`, with `X` increasing
over time. The versioning ensures that the plugin and the NCCL core are compatible.

Plugins are encouraged to provide several of those symbols, implementing multiple versions of the
NCCL PROFILER API, so that the same plugin can be compiled and support a wide range of NCCL versions.

Conversely, and to ease transition, NCCL can choose to support different plugin versions, looking
for the latest `ncclProfiler` struct version, but also looking for older ones so that older plugins
still work.

## Headers management

To help users build plugins effortlessly, plugins should copy the `ncclProfiler_vX` definitions
they support to their internal includes. An example is shown in `ext-profiler/example`, where we
keep all headers in the `nccl/` directory and provide thin layers to implement old versions on top
of newer ones.

The `nccl/` directory is populated with `profiler_vX.h` files extracting all relevant definitions
from old API versions. It also provides error codes in `err.h`.

# API (v2)

Below is the main `ncclProfiler_v2` struct. Each function is explained in later sections.

```
typedef struct {
  const char* name;

  // init - initialize the profiler plugin
  // Input
  //  - context        : opaque profiler context object for separating profiler behavior across comms
  // Output
  //  - eActivationMask: bitmask of active events set by the plugin
  ncclResult_t (*init)(void** context, int* eActivationMask);

  // startEvent - initialize and start a new event for the supplied event descriptor inside the eventset
  // Input
  //  - context: opaque profiler context object
  //  - eDescr : pointer to ncclProfilerEventDescr_t object
  // Output
  //  - eHandle: return event handle for supplied event descriptor object
  ncclResult_t (*startEvent)(void* context, void** eHandle, ncclProfilerEventDescr_v2_t* eDescr);

  // stopEvent - stop/finalize an event inside an event set
  // Input
  //  - eHandle: handle to event object
  ncclResult_t (*stopEvent)(void* eHandle);

  // recordEventState - record event state transitions and event attribute updates
  // Input
  //  - eHandle   : handle to event object created through startEvent
  //  - eStateArgs: optional argument used to capture event attribute updates associated with the state transition
  //  - eState    : event state transition
  ncclResult_t (*recordEventState)(void* eHandle, ncclProfilerEventState_v2_t eState, ncclProfilerEventStateArgs_v2_t* eStateArgs);

  // finalize - finalize the profiler plugin
  // Input
  //  - context: opaque profiler context object
  ncclResult_t (*finalize)(void* context);
} ncclProfiler_v2_t;
```

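A minimal plugin skeleton built around this struct might look like the sketch below. It is not the code of the example plugin: all `example*` names are hypothetical, and the type declarations at the top are local stand-ins so that the snippet compiles on its own (a real plugin would take them from the headers copied from NCCL). The activation mask uses the values 1 (`ncclProfileGroup`) and 2 (`ncclProfileColl`) quoted in the example plugin README.

```c
#include <stdlib.h>

/* Stand-in declarations (assumptions for this sketch, not the NCCL headers). */
typedef enum { ncclSuccess = 0, ncclSystemError = 2 } ncclResult_t;
typedef struct ncclProfilerEventDescr_v2 ncclProfilerEventDescr_v2_t;         /* opaque here */
typedef union ncclProfilerEventStateArgs_v2 ncclProfilerEventStateArgs_v2_t;  /* opaque here */
typedef int ncclProfilerEventState_v2_t;

typedef struct {
  const char* name;
  ncclResult_t (*init)(void** context, int* eActivationMask);
  ncclResult_t (*startEvent)(void* context, void** eHandle, ncclProfilerEventDescr_v2_t* eDescr);
  ncclResult_t (*stopEvent)(void* eHandle);
  ncclResult_t (*recordEventState)(void* eHandle, ncclProfilerEventState_v2_t eState,
                                   ncclProfilerEventStateArgs_v2_t* eStateArgs);
  ncclResult_t (*finalize)(void* context);
} ncclProfiler_v2_t;

/* Hypothetical plugin state: one context per communicator, one object per event. */
typedef struct { int nActive; } exampleContext;
typedef struct { exampleContext* ctx; } exampleEvent;

static ncclResult_t exampleInit(void** context, int* eActivationMask) {
  exampleContext* ctx = calloc(1, sizeof(*ctx));
  if (ctx == NULL) return ncclSystemError;  /* init is the one call allowed to fail */
  *context = ctx;
  *eActivationMask = 1 | 2;                 /* ncclProfileGroup | ncclProfileColl */
  return ncclSuccess;
}

static ncclResult_t exampleStartEvent(void* context, void** eHandle,
                                      ncclProfilerEventDescr_v2_t* eDescr) {
  (void)eDescr;                             /* a real plugin would read the metadata */
  exampleEvent* ev = malloc(sizeof(*ev));
  *eHandle = ev;                            /* a NULL handle means "event not tracked" */
  if (ev != NULL) { ev->ctx = context; ev->ctx->nActive++; }
  return ncclSuccess;                       /* never propagate profiler errors to NCCL */
}

static ncclResult_t exampleStopEvent(void* eHandle) {
  exampleEvent* ev = eHandle;
  if (ev != NULL) { ev->ctx->nActive--; free(ev); }
  return ncclSuccess;
}

static ncclResult_t exampleRecordEventState(void* eHandle, ncclProfilerEventState_v2_t eState,
                                            ncclProfilerEventStateArgs_v2_t* eStateArgs) {
  (void)eHandle; (void)eState; (void)eStateArgs;  /* nothing recorded in this sketch */
  return ncclSuccess;
}

static ncclResult_t exampleFinalize(void* context) {
  free(context);
  return ncclSuccess;
}

/* Symbol that NCCL looks up in the shared library. */
ncclProfiler_v2_t ncclProfiler_v2 = {
  .name = "Example",
  .init = exampleInit,
  .startEvent = exampleStartEvent,
  .stopEvent = exampleStopEvent,
  .recordEventState = exampleRecordEventState,
  .finalize = exampleFinalize,
};
```

Built as a shared library named `libnccl-profiler.so`, a file like this exposes the `ncclProfiler_v2` symbol described in the struct versioning section.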
## Error codes

As a rule of thumb, errors generated by the profiler should not be propagated to NCCL or alter its
normal functioning. Nevertheless, the profiler interface returns NCCL error codes, in case any need for
them arises in the future. For now, any profiler interface call should only return `ncclSuccess`.
The only exception is `init`, which can return an error so that NCCL can disable the plugin.

## Operation overview

NCCL calls the `init` function first for every new communicator that is initialized. The profiler
returns an opaque context handle that is used to isolate profiler instances across communicators.
Similarly, NCCL calls `finalize` to destroy the profiler context, thus freeing resources.

The NCCL core code is instrumented with calls to `startEvent`, `stopEvent` and `recordEventState`.
These are used to start, stop and update events in the profiler, respectively.

## API Functions

### Initialization

#### name

The `name` field should point to a character string with the name of the profiler plugin. It is
used for all logging, especially when `NCCL_DEBUG=INFO` is set.

#### init

As soon as NCCL finds the plugin and the correct `ncclProfiler` symbol, it calls its `init` function.
This allows the plugin to initialize its internal context, used during profiling of NCCL events.
If the `init` function does not return `ncclSuccess`, NCCL disables the plugin.

#### finalize

When the profiler is no longer needed, a call to `finalize` destroys the profiler context and frees
up resources.

### Profiling

#### startEvent

When NCCL needs to start profiling a new event, it calls `startEvent`. `startEvent` takes the profiler
context previously created by `init` and an event descriptor of type `ncclProfilerEventDescr_t`
(`ncclProfilerEventDescr_v2_t` in this API version), and returns an opaque profiler event handle that
can be passed to other profiler functions, as discussed later in the document.

The event descriptor contains all the event metadata. Every event type has its own descriptor. Below
is the `ncclProfilerEventDescr_t` struct.

```
typedef struct {
  uint8_t type;             // event type (e.g., ncclProfileGroup, ncclProfileColl, ...)
  void* parentObj;          // pointer to parent event used to expose the event hierarchy to the profiler
  int rank;                 // rank that generated the event
  union {
    struct {                // collective events metadata
      const char* name;     // string containing name of the communicator
      uint64_t commHash;    // unique hash/id for the communicator
      uint64_t seqNumber;   // sequence number of this collective operation in the communicator
      const char* func;     // string containing name of the collective
      void const* sendBuff; // address of send buffer
      void* recvBuff;       // address of recv buffer
      size_t count;         // data count
      int root;             // root rank
      const char* datatype; // string containing the name of the datatype
      size_t trafficBytes;  // number of transfer bytes
      uint8_t nMaxChannels; // max number of channels for this collective
      uint8_t nWarps;       // number of GPU warps for this collective
      const char* algo;     // string containing name of the algorithm for this collective
      const char* proto;    // string containing name of the protocol for this collective
    } coll;

    struct {                // point-to-point events metadata
      const char* name;
      uint64_t commHash;
      const char* func;
      void* buff;
      const char* datatype;
      size_t count;
      int peer;             // peer rank for this point-to-point
    } p2p;

    struct {                // proxyOp events metadata
      pid_t pid;            // process id that generated the associated `ncclProxyOp` object
      uint8_t channelId;    // id of the channel used by the associated `ncclProxyOp` object
      int peer;             // peer rank
      int nSteps;           // number of network transfers/steps required by the `ncclProxyOp`
      int chunkSize;        // chunk size for this `ncclProxyOp`
      int isSend;           // set to 1 for sends and 0 for recvs
    } proxyOp;

    struct {                // proxyStep events metadata
      int step;             // individual step in `ncclProxyOp`
    } proxyStep;
  };
} ncclProfilerEventDescr_v2_t;
```

NCCL defines the following events: `ncclProfileGroup`, `ncclProfileColl`, `ncclProfileP2p`,
`ncclProfileProxyOp`, `ncclProfileProxyStep`, and `ncclProfileProxyCtrl`.

#### stopEvent

`stopEvent` takes the event handle returned by `startEvent` to stop the event. After the event
has been stopped, the handle can no longer be used with other profiler calls. Using the event
handle after `stopEvent` is undefined behavior.

#### recordEventState

Some events can only be started and stopped. For example, `ncclProfileGroup`, `ncclProfileColl`
and `ncclProfileP2p` events cannot be updated through calls to `recordEventState`.

`ncclProfileProxyOp`, `ncclProfileProxyStep` and `ncclProfileProxyCtrl` can be updated through
calls to `recordEventState`.

The state of proxy-generated events can be updated, along with event attributes, using
`recordEventState`. These events can go through several states during their lifecycle.
The list of supported states for the proxy-defined events is reported below.

```
typedef enum {
  // ncclProfileProxyOp event states
  ncclProfilerProxyOpSendPosted,        // state marks the posting of send buffer to GPU for given network transfer/step
  ncclProfilerProxyOpSendRemFifoWait,   // state marks the waiting of CTS credits from peer rank
  ncclProfilerProxyOpSendTransmitted,   // state marks the sending of network transfer/step to peer rank
  ncclProfilerProxyOpSendDone,          // state marks the ending of network transfer/step
  ncclProfilerProxyOpRecvPosted,        // state marks the posting of recv to network for given network transfer/step
  ncclProfilerProxyOpRecvReceived,      // state marks the recving of network transfer/step from peer rank
  ncclProfilerProxyOpRecvTransmitted,   // state marks the ending of the network transfer/step
  ncclProfilerProxyOpRecvDone,          // state marks the consuming of data from GPU

  // ncclProfileProxyStep event states
  ncclProfilerProxyStepSendGPUWait,     // state marks the waiting of send data from GPU for given network transfer/step
  ncclProfilerProxyStepSendWait,        // state marks the waiting of send data from network for given network transfer/step
  ncclProfilerProxyStepRecvWait,        // state marks the waiting of recv data from network for given network transfer/step
  ncclProfilerProxyStepRecvFlushWait,   // state marks the waiting of recv data flush to GPU for given network transfer/step
  ncclProfilerProxyStepRecvGPUWait,     // state marks the waiting of recv data consumption from GPU for given network transfer/step

  // ncclProfileProxyCtrl event states
  ncclProfilerProxyCtrlIdle,            // state marks proxy progress thread idle
  ncclProfilerProxyCtrlActive,          // state marks proxy progress thread active
  ncclProfilerProxyCtrlSleep,           // state marks proxy progress thread sleeping
  ncclProfilerProxyCtrlWakeup,          // state marks proxy progress thread waking up
  ncclProfilerProxyCtrlAppend,          // state marks append of new network work item begin
  ncclProfilerProxyCtrlAppendEnd,       // state marks append of new network work item end
} ncclProfilerEventState_v2_t;
```

`ncclProfileProxyOp` events are generated by the proxy progress thread while it is processing
network requests for the GPU kernel. ProxyOp events are generated for every active channel and
provide a summary of the activity of the proxy progress thread for that channel.

`ncclProfileProxyStep` events are generated by the proxy progress thread while it is processing
network requests for the GPU kernel. ProxyStep events describe individual network transfers in
the channel. Thus, they provide a more fine-grained view than ProxyOp events.

`ncclProfileProxyCtrl` events are generated by the proxy progress thread while it is not processing
network requests for the GPU kernel. This includes everything else that the proxy thread might be
doing, including appending new `ncclProxyOp` objects to the list of work elements to process.

State transitions for the events described can also come with event attribute updates. For this
reason the profiler defines the `ncclProfilerEventStateArgs_t` struct
(`ncclProfilerEventStateArgs_v2_t` in this API version), reported below.

```
typedef union {
  struct {                // attributes to update for ncclProfileProxyOp events
    size_t transSize;     // data transferred thus far
    int steps;            // network transfer/steps processed thus far
  } proxyOp;

  struct {                // attributes to update for ncclProfileProxyCtrl
    int appendedProxyOps; // number of appended proxy ops thus far
  } proxyCtrl;
} ncclProfilerEventStateArgs_v2_t;
```

The example profiler in `ext-profiler/example` contains details on how to capture and use the events above.

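As a rough illustration, a `recordEventState` implementation could store the latest state and copy the attributes that come with it. The sketch below re-declares the enum and union locally so that it compiles on its own. The `exampleProxyEvent` type is hypothetical, and deciding which union member to read from the range of the state value is an assumption of this sketch, not something the API text specifies.

```c
#include <stddef.h>

typedef enum { ncclSuccess = 0 } ncclResult_t;  /* stand-in for the NCCL error type */

typedef enum {  /* copied from the listing above, comments omitted */
  ncclProfilerProxyOpSendPosted, ncclProfilerProxyOpSendRemFifoWait,
  ncclProfilerProxyOpSendTransmitted, ncclProfilerProxyOpSendDone,
  ncclProfilerProxyOpRecvPosted, ncclProfilerProxyOpRecvReceived,
  ncclProfilerProxyOpRecvTransmitted, ncclProfilerProxyOpRecvDone,
  ncclProfilerProxyStepSendGPUWait, ncclProfilerProxyStepSendWait,
  ncclProfilerProxyStepRecvWait, ncclProfilerProxyStepRecvFlushWait,
  ncclProfilerProxyStepRecvGPUWait,
  ncclProfilerProxyCtrlIdle, ncclProfilerProxyCtrlActive, ncclProfilerProxyCtrlSleep,
  ncclProfilerProxyCtrlWakeup, ncclProfilerProxyCtrlAppend, ncclProfilerProxyCtrlAppendEnd,
} ncclProfilerEventState_v2_t;

typedef union {
  struct { size_t transSize; int steps; } proxyOp;
  struct { int appendedProxyOps; } proxyCtrl;
} ncclProfilerEventStateArgs_v2_t;

/* Hypothetical plugin-side object returned as eHandle by startEvent. */
typedef struct {
  ncclProfilerEventState_v2_t lastState;
  size_t transSize;
  int steps;
  int appendedProxyOps;
} exampleProxyEvent;

static ncclResult_t exampleRecordEventState(void* eHandle, ncclProfilerEventState_v2_t eState,
                                            ncclProfilerEventStateArgs_v2_t* eStateArgs) {
  exampleProxyEvent* ev = eHandle;
  if (ev == NULL) return ncclSuccess;          /* event was not tracked */
  ev->lastState = eState;
  if (eStateArgs == NULL) return ncclSuccess;  /* the argument is optional */
  if (eState <= ncclProfilerProxyOpRecvDone) {
    /* Assumption: ProxyOp states carry the proxyOp attributes. */
    ev->transSize = eStateArgs->proxyOp.transSize;
    ev->steps = eStateArgs->proxyOp.steps;
  } else if (eState >= ncclProfilerProxyCtrlIdle) {
    /* Assumption: ProxyCtrl states carry the proxyCtrl attributes. */
    ev->appendedProxyOps = eStateArgs->proxyCtrl.appendedProxyOps;
  }
  return ncclSuccess;
}
```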
### Event hierarchy

NCCL core events (reported above) are organized into a hierarchy as reported below:

```
Group event
   |
   +- Collective event
   |  |
   |  +- ProxyOp event
   |     |
   |     +- ProxyStep event
   |
   +- Point-to-point event
      |
      +- ProxyOp event
         |
         +- ProxyStep event

ProxyCtrl event
```

# Profiler instrumentation and logging

## Profiling of collective and p2p operations

The NCCL code is instrumented with profiler callbacks at different levels to capture the start and stop
of groups, collective and point-to-point operations, as well as proxy progress activity. Because NCCL
operations are asynchronous, events associated with collective and point-to-point operations are not easy
to delimit precisely. For example, without proxy and/or kernel activity it is impossible for the profiler
to figure out when a collective operation completes. Therefore, `stopEvent` for collectives simply indicates
to the profiler that the collective has been enqueued. If proxy events are enabled, the profiler can use
them to estimate when the collective ends: it can take the `stopEvent` call of the last `ncclProfileProxyOp`
event as the completion of the associated collective event. This can be achieved by reference counting the
collective event and letting calls to `startEvent` and `stopEvent` increment and decrement the reference
counter, respectively.

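The reference counting scheme can be sketched as follows. This is an illustration under stated assumptions, not the example plugin's code: the `example*` types and functions are hypothetical, C11 atomics are used because the proxy progress thread and the application thread may touch the same event, and the sketch assumes that every child ProxyOp event is started before the last reference on the collective is dropped.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical collective event: one reference for the collective itself
 * plus one for every ProxyOp child that is still running. */
typedef struct {
  atomic_int refCount;
  bool completed;       /* a real plugin would record the end timestamp instead */
} exampleCollEvent;

typedef struct {
  exampleCollEvent* parent;
} exampleProxyOpEvent;

/* Drops one reference; the caller that drops the last one completes the event. */
static void exampleCollRelease(exampleCollEvent* coll) {
  if (atomic_fetch_sub(&coll->refCount, 1) == 1) coll->completed = true;
}

/* startEvent for the collective: takes the collective's own reference. */
static void exampleCollStart(exampleCollEvent* coll) {
  atomic_init(&coll->refCount, 1);
  coll->completed = false;
}

/* stopEvent for the collective: only means "enqueued", so just drop one reference. */
static void exampleCollStop(exampleCollEvent* coll) {
  exampleCollRelease(coll);
}

/* startEvent for a ProxyOp child: increments the parent's counter. */
static void exampleProxyOpStart(exampleProxyOpEvent* op, exampleCollEvent* parent) {
  op->parent = parent;
  atomic_fetch_add(&parent->refCount, 1);
}

/* stopEvent for a ProxyOp child: decrements the parent's counter. */
static void exampleProxyOpStop(exampleProxyOpEvent* op) {
  exampleCollRelease(op->parent);
}
```

With this layout the collective is marked complete by whichever `stopEvent` call arrives last, which is typically that of the final ProxyOp event.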
## PXN

PXN causes some proxy operations to be processed in a remote proxy thread that differs from the one that
generated the operation. When this happens, the event hierarchy reported above breaks. The profiler can
use the hierarchy information, provided by NCCL in the event descriptor, to dereference the parent event
during `startEvent`, but this is only valid when the remote proxy thread is in the same address space as
the proxy thread originating the operation. To prevent the profiler instance in the remote proxy address
space from dereferencing a pointer from another address space, the event descriptor includes the PID of
the originator. The profiler plugin needs to check that the originator PID matches the local PID before
dereferencing the parent event.
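
The PID check can be sketched as below. The descriptor type is a trimmed, hypothetical stand-in that keeps only the two fields involved (`parentObj` and `proxyOp.pid`), so its layout does not match the real `ncclProfilerEventDescr_v2_t`, and `exampleLocalParent` is a name invented for this example.

```c
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Trimmed stand-in for the event descriptor (assumption: only the fields
 * needed for the check are kept). */
typedef struct {
  void* parentObj;          /* pointer to the parent event, valid in the originator's process */
  struct { pid_t pid; } proxyOp;
} exampleProxyOpDescr;

/* Returns the parent event if it lives in this process, NULL otherwise. */
static void* exampleLocalParent(const exampleProxyOpDescr* descr) {
  if (descr->proxyOp.pid != getpid()) {
    /* The operation was generated by another process (PXN): parentObj must
     * not be dereferenced here. */
    return NULL;
  }
  return descr->parentObj;
}
```

A plugin that gets `NULL` back can track the ProxyOp event as detached from any parent, which is the role of the `ProxyDetach` pool in the example plugin.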

239 ext-profiler/example/README.md Normal file
@ -0,0 +1,239 @@

# NCCL Example Profiler Plugin Usage

This page describes how to use the NCCL example profiler plugin.

# Overview

The example profiler plugin implements the NCCL profiler plugin API introduced in NCCL v2.23. The API
defines a set of events and data structures that NCCL uses to share event information with profiler
plugins. Through environment variables, the user can control which events are instrumented by NCCL and
where the traces collected by the profiler are dumped, as described in the rest of the document.
The user can also control other profiler parameters that alter its behavior. For example, users can
change the size of the event window the profiler keeps track of.

## Building the profiler plugin

To build the example plugin, just type `make`. You will need the include directory of an NCCL build.
You can override `NCCL_HOME` to point to the NCCL installation on your system.

## Using the profiler plugin

1. Add the directory of this profiler plugin to your `LD_LIBRARY_PATH` or set the `NCCL_PROFILER_PLUGIN`
   environment variable, as documented in `ext-profiler/README.md`.

2. Set the `NCCL_PROFILE_EVENT_MASK` bitmask to specify the NCCL events you want to instrument. By
   default, all collectives and send/recv operations will be traced. For more details about the event
   representation used by the profiler refer to `ext-profiler/README.md`.

   As an example, setting:

   `NCCL_PROFILE_EVENT_MASK` to 1 (`ncclProfileGroup`) | 2 (`ncclProfileColl`) | 8 (`ncclProfileProxyOp`)

   enables the profiling of the group, the collective and the proxy op events. The same events can be
   expressed more concisely by setting `NCCL_PROFILE_EVENT_MASK` to 8 (`ncclProfileProxyOp`). Indeed,
   in NCCL all the events above the requested one in the event hierarchy are also captured. The advantage
   is that the profiler can easily correlate events that belong to the same NCCL operation and present
   them accordingly.

3. Set `NCCL_PROFILE_DUMP_FILE` to the name of the dump file for the collected traces. A file named
   `${NCCL_PROFILE_DUMP_FILE}-hostname-tid.txt` is created. Profiler traces are saved using the chrome
   event format (more precisely, using asynchronous events).

4. If you set the dump file variable, type `chrome://tracing` in the address bar of your Chromium
   browser and open the created dump file to visualize the traces.

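The mask arithmetic from step 2 can be checked with a few lines of C. This is only a sketch: the enum repeats the three event values quoted above, `exampleParseMask` is a hypothetical helper, and accepting hexadecimal input through `strtol` with base 0 is a choice made for this example rather than documented plugin behavior.

```c
#include <stdlib.h>

/* Event values quoted in step 2 above. */
enum {
  ncclProfileGroup   = 1,
  ncclProfileColl    = 2,
  ncclProfileProxyOp = 8,
};

/* Hypothetical helper: converts the text of NCCL_PROFILE_EVENT_MASK into a
 * bitmask, or returns `fallback` when the variable is unset or empty. */
static int exampleParseMask(const char* text, int fallback) {
  if (text == NULL || text[0] == '\0') return fallback;
  return (int)strtol(text, NULL, 0);  /* base 0: decimal, octal or 0x-prefixed hex */
}

/* Returns non-zero when `eventType` is explicitly requested in `mask`. */
static int exampleEventRequested(int mask, int eventType) {
  return (mask & eventType) != 0;
}
```

For the first example in step 2, the three events combine to `1 | 2 | 8 = 11`, so `NCCL_PROFILE_EVENT_MASK=11` requests all of them explicitly.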
# Changing the profiler memory pool sizes

The example profiler uses separate memory pools for different types of events. The size of these memory
pools (i.e., the number of events) determines the number of events that the profiler can keep track of at
the same time. When NCCL requests a new event (e.g., a collective event) to profile a `ncclAllReduce`
operation, by calling `startEvent`, the profiler searches the collective pool for a free event. If it
finds one, it marks it as in use and returns the handle to NCCL. If the pool is completely used, the
profiler returns `NULL` to NCCL and ignores all the following NCCL profiler calls for the `NULL` event
handle. When the `ncclAllReduce` has been processed, NCCL calls `stopEvent` with the previously returned
event handle. The profiler has a total of 5 memory pools.

The group, collective and p2p pools contain objects for the corresponding events. The `ProxyCtrl` pool
contains objects for `ProxyCtrl` events and the `ProxyDetach` pool contains objects for `ProxyOp` events
generated by remote proxies. The list of pools and their default sizes is reported below:

- `NCCL_PROFILE_GROUP_POOL_SIZE` (16)
- `NCCL_PROFILE_COLL_POOL_SIZE` (16)
- `NCCL_PROFILE_P2P_POOL_SIZE` (1024)
- `NCCL_PROFILE_PROXY_CTRL_POOL_SIZE` (16)
- `NCCL_PROFILE_PROXY_DETACH_POOL_SIZE` (128)

Remote proxy operations are generated when PXN is in use. Refer to this article for more information
about PXN and how it works:
https://developer.nvidia.com/blog/doubling-all2all-performance-with-nvidia-collective-communication-library-2-12/

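The pool behavior described above (find a free slot, mark it in use, return `NULL` when the pool is exhausted) can be sketched as follows. The types and function names are hypothetical, the pool size is fixed at compile time for brevity, and the linear search is a simplification that is not claimed to match how the example plugin manages its pools.

```c
#include <stddef.h>

#define EXAMPLE_POOL_SIZE 16  /* same value as the NCCL_PROFILE_COLL_POOL_SIZE default */

/* Hypothetical pool element and pool. */
typedef struct { int inUse; } examplePoolEvent;
typedef struct { examplePoolEvent slots[EXAMPLE_POOL_SIZE]; } examplePool;

/* Called from startEvent: returns a free event, or NULL when the pool is full.
 * A NULL handle tells the plugin to ignore later calls for this event. */
static examplePoolEvent* examplePoolAcquire(examplePool* pool) {
  for (size_t i = 0; i < EXAMPLE_POOL_SIZE; i++) {
    if (!pool->slots[i].inUse) {
      pool->slots[i].inUse = 1;
      return &pool->slots[i];
    }
  }
  return NULL;
}

/* Called from stopEvent: makes the slot available again. */
static void examplePoolRelease(examplePoolEvent* ev) {
  if (ev != NULL) ev->inUse = 0;
}
```

A larger pool lets the profiler track more events at the same time, at the cost of more memory per communicator.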
# Reported events

The example profiler generates traces in JSON format. An example trace is reported below:

```
[
{"name": "Group", "cat": "GROUP", "ph": "b", "id": 0, "pid": 4157654, "tid": 1, "ts": 764234.611328, "args": {"groupId": 0}},
{"name": "AllReduce", "cat": "COLL", "ph": "b", "id": 0, "pid": 4157654, "tid": 1, "ts": 764237.294922, "args": {"SeqNum": 0, "CommHash": 673864846479792718, "Rank": 1, "Count": 32768, "Datatype": "ncclFloat32", "Algorithm": "RING", "Protocol": "LL", "nMaxChannels": 2}},
{"name": "Recv", "cat": "PROXY", "ph": "b", "id": 0, "pid": 4157654, "tid": 1, "ts": 768464.936523, "args": {"Channel": 0, "Peer": 0, "Steps": 14, "ChunkSize": 32768, "transSize": 229376, "POSTED": {"step": 14, "ts": 772020.300781}, "RECEIVED": {"step": 14, "ts": 772196.049805}, "TRANSMITTED": {"step": 14, "ts": 772197.326172}, "DONE": {"step": 14, "ts": 772201.538086}}},
{"name": "RecvBufferWait", "cat": "NET", "ph": "b", "id": 0, "pid": 4157654, "tid": 1, "ts": 768465.158203, "args": {"Step": 0}},
{"name": "RecvBufferWait", "cat": "NET", "ph": "e", "id": 0, "pid": 4157654, "tid": 1, "ts": 768477.924805},
{"name": "RecvWait", "cat": "NET", "ph": "b", "id": 0, "pid": 4157654, "tid": 1, "ts": 768477.924805, "args": {"Step": 0}},
{"name": "RecvWait", "cat": "NET", "ph": "e", "id": 0, "pid": 4157654, "tid": 1, "ts": 768547.197266},
{"name": "RecvFlushWait", "cat": "NET", "ph": "b", "id": 0, "pid": 4157654, "tid": 1, "ts": 768547.197266, "args": {"Step": 0}},
{"name": "RecvFlushWait", "cat": "NET", "ph": "e", "id": 0, "pid": 4157654, "tid": 1, "ts": 768564.174805},
{"name": "RecvGpuWait", "cat": "NET", "ph": "b", "id": 0, "pid": 4157654, "tid": 1, "ts": 768564.174805, "args": {"Step": 0}},
{"name": "RecvGpuWait", "cat": "NET", "ph": "e", "id": 0, "pid": 4157654, "tid": 1, "ts": 768568.276367},
{"name": "RecvBufferWait", "cat": "NET", "ph": "b", "id": 1, "pid": 4157654, "tid": 1, "ts": 768503.604492, "args": {"Step": 1}},
{"name": "RecvBufferWait", "cat": "NET", "ph": "e", "id": 1, "pid": 4157654, "tid": 1, "ts": 768504.549805},
{"name": "RecvWait", "cat": "NET", "ph": "b", "id": 1, "pid": 4157654, "tid": 1, "ts": 768504.549805, "args": {"Step": 1}},
{"name": "RecvWait", "cat": "NET", "ph": "e", "id": 1, "pid": 4157654, "tid": 1, "ts": 769994.490234},
{"name": "RecvFlushWait", "cat": "NET", "ph": "b", "id": 1, "pid": 4157654, "tid": 1, "ts": 769994.490234, "args": {"Step": 1}},
{"name": "RecvFlushWait", "cat": "NET", "ph": "e", "id": 1, "pid": 4157654, "tid": 1, "ts": 769995.012695},
{"name": "RecvGpuWait", "cat": "NET", "ph": "b", "id": 1, "pid": 4157654, "tid": 1, "ts": 769995.012695, "args": {"Step": 1}},
{"name": "RecvGpuWait", "cat": "NET", "ph": "e", "id": 1, "pid": 4157654, "tid": 1, "ts": 770006.914062},
{"name": "RecvBufferWait", "cat": "NET", "ph": "b", "id": 2, "pid": 4157654, "tid": 1, "ts": 768506.941406, "args": {"Step": 2}},
{"name": "RecvBufferWait", "cat": "NET", "ph": "e", "id": 2, "pid": 4157654, "tid": 1, "ts": 768507.435547},
{"name": "RecvWait", "cat": "NET", "ph": "b", "id": 2, "pid": 4157654, "tid": 1, "ts": 768507.435547, "args": {"Step": 2}},
{"name": "RecvWait", "cat": "NET", "ph": "e", "id": 2, "pid": 4157654, "tid": 1, "ts": 771452.536133},
{"name": "RecvFlushWait", "cat": "NET", "ph": "b", "id": 2, "pid": 4157654, "tid": 1, "ts": 771452.536133, "args": {"Step": 2}},
{"name": "RecvFlushWait", "cat": "NET", "ph": "e", "id": 2, "pid": 4157654, "tid": 1, "ts": 771453.060547},
{"name": "RecvGpuWait", "cat": "NET", "ph": "b", "id": 2, "pid": 4157654, "tid": 1, "ts": 771453.060547, "args": {"Step": 2}},
{"name": "RecvGpuWait", "cat": "NET", "ph": "e", "id": 2, "pid": 4157654, "tid": 1, "ts": 771468.458008},
{"name": "RecvBufferWait", "cat": "NET", "ph": "b", "id": 3, "pid": 4157654, "tid": 1, "ts": 768509.484375, "args": {"Step": 3}},
{"name": "RecvBufferWait", "cat": "NET", "ph": "e", "id": 3, "pid": 4157654, "tid": 1, "ts": 768510.250000},
{"name": "RecvWait", "cat": "NET", "ph": "b", "id": 3, "pid": 4157654, "tid": 1, "ts": 768510.250000, "args": {"Step": 3}},
{"name": "RecvWait", "cat": "NET", "ph": "e", "id": 3, "pid": 4157654, "tid": 1, "ts": 771904.499023},
{"name": "RecvFlushWait", "cat": "NET", "ph": "b", "id": 3, "pid": 4157654, "tid": 1, "ts": 771904.499023, "args": {"Step": 3}},
{"name": "RecvFlushWait", "cat": "NET", "ph": "e", "id": 3, "pid": 4157654, "tid": 1, "ts": 771904.991211},
{"name": "RecvGpuWait", "cat": "NET", "ph": "b", "id": 3, "pid": 4157654, "tid": 1, "ts": 771904.991211, "args": {"Step": 3}},
{"name": "RecvGpuWait", "cat": "NET", "ph": "e", "id": 3, "pid": 4157654, "tid": 1, "ts": 771910.500000},
{"name": "Send", "cat": "PROXY", "ph": "b", "id": 1, "pid": 4157654, "tid": 1, "ts": 768482.878906, "args": {"Channel": 0, "Peer": 2, "Steps": 14, "ChunkSize": 32768, "transSize": 229376, "POSTED": {"step": 14, "ts": 771995.675781}, "REM_FIFO_WAIT": {"step": 14, "ts": 772190.692383}, "TRANSMITTED": {"step": 14, "ts": 772191.516602}, "DONE": {"step": 14, "ts": 772208.473633}}},
{"name": "SendBufferWait", "cat": "NET", "ph": "b", "id": 14, "pid": 4157654, "tid": 1, "ts": 768483.019531, "args": {"Step": 0}},
{"name": "SendBufferWait", "cat": "NET", "ph": "e", "id": 14, "pid": 4157654, "tid": 1, "ts": 768483.300781},
{"name": "SendGpuWait", "cat": "NET", "ph": "b", "id": 14, "pid": 4157654, "tid": 1, "ts": 768483.300781, "args": {"Step": 0}},
{"name": "SendGpuWait", "cat": "NET", "ph": "e", "id": 14, "pid": 4157654, "tid": 1, "ts": 769594.615234},
{"name": "SendWait", "cat": "NET", "ph": "b", "id": 14, "pid": 4157654, "tid": 1, "ts": 769594.615234, "args": {"Step": 0}},
{"name": "SendWait", "cat": "NET", "ph": "e", "id": 14, "pid": 4157654, "tid": 1, "ts": 769618.889648},
{"name": "SendBufferWait", "cat": "NET", "ph": "b", "id": 15, "pid": 4157654, "tid": 1, "ts": 768505.083008, "args": {"Step": 1}},
{"name": "SendBufferWait", "cat": "NET", "ph": "e", "id": 15, "pid": 4157654, "tid": 1, "ts": 768505.163086},
{"name": "SendGpuWait", "cat": "NET", "ph": "b", "id": 15, "pid": 4157654, "tid": 1, "ts": 768505.163086, "args": {"Step": 1}},
{"name": "SendGpuWait", "cat": "NET", "ph": "e", "id": 15, "pid": 4157654, "tid": 1, "ts": 769610.555664},
{"name": "SendWait", "cat": "NET", "ph": "b", "id": 15, "pid": 4157654, "tid": 1, "ts": 769610.555664, "args": {"Step": 1}},
{"name": "SendWait", "cat": "NET", "ph": "e", "id": 15, "pid": 4157654, "tid": 1, "ts": 769622.517578},
{"name": "SendBufferWait", "cat": "NET", "ph": "b", "id": 16, "pid": 4157654, "tid": 1, "ts": 768507.937500, "args": {"Step": 2}},
{"name": "SendBufferWait", "cat": "NET", "ph": "e", "id": 16, "pid": 4157654, "tid": 1, "ts": 768508.017578},
{"name": "SendGpuWait", "cat": "NET", "ph": "b", "id": 16, "pid": 4157654, "tid": 1, "ts": 768508.017578, "args": {"Step": 2}},
{"name": "SendGpuWait", "cat": "NET", "ph": "e", "id": 16, "pid": 4157654, "tid": 1, "ts": 770002.129883},
{"name": "SendWait", "cat": "NET", "ph": "b", "id": 16, "pid": 4157654, "tid": 1, "ts": 770002.129883, "args": {"Step": 2}},
{"name": "SendWait", "cat": "NET", "ph": "e", "id": 16, "pid": 4157654, "tid": 1, "ts": 770013.848633},
{"name": "SendBufferWait", "cat": "NET", "ph": "b", "id": 17, "pid": 4157654, "tid": 1, "ts": 768510.742188, "args": {"Step": 3}},
{"name": "SendBufferWait", "cat": "NET", "ph": "e", "id": 17, "pid": 4157654, "tid": 1, "ts": 768510.822266},
{"name": "SendGpuWait", "cat": "NET", "ph": "b", "id": 17, "pid": 4157654, "tid": 1, "ts": 768510.822266, "args": {"Step": 3}},
{"name": "SendGpuWait", "cat": "NET", "ph": "e", "id": 17, "pid": 4157654, "tid": 1, "ts": 771461.563477},
{"name": "SendWait", "cat": "NET", "ph": "b", "id": 17, "pid": 4157654, "tid": 1, "ts": 771461.563477, "args": {"Step": 3}},
{"name": "SendWait", "cat": "NET", "ph": "e", "id": 17, "pid": 4157654, "tid": 1, "ts": 771469.171875},
... [ trace truncated for brevity ]
{"name": "AllReduce", "cat": "COLL", "ph": "e", "id": 0, "pid": 4157654, "tid": 1, "ts": 772209.317383},
{"name": "Group", "cat": "GROUP", "ph": "e", "id": 0, "pid": 4157654, "tid": 1, "ts": 772209.418945},
{}]
```

Details about the fields used in the trace can be found at this link:
https://docs.google.com/document/d/1CvAClvFfyA5R-PhYUmn5OOQtYMH4h6I0nSsKchNAySU/preview?tab=t.0#heading=h.yr4qxyxotyw

The trace above is obtained by running a `ncclAllReduce` operation on 8 GPUs, communicating with each other through
the network interface. The `Group` event encloses all traces that are related to the single `ncclAllReduce` call.
(Note that for single collective invocations, where there are no explicit group calls, NCCL creates a group with only
one collective, and this is what is presented in the traces above.)

The `AllReduce` event encloses traces for the proxy operation associated with the `ncclAllReduce` operation. The `args`
field in the traces contains NCCL-specific information (in addition to the fields of the chrome trace event format).

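To make the line format concrete, the sketch below formats one asynchronous trace entry without an `args` field, like the `"ph": "e"` lines in the trace above. The function `exampleFormatEvent` is hypothetical and is not taken from the example plugin; the field order and the six-decimal timestamp simply mirror the trace shown.

```c
#include <stdio.h>

/* Hypothetical helper: writes one chrome trace async entry into `buf`.
 * `phase` is 'b' for the beginning of an event and 'e' for its end.
 * Returns snprintf's result. */
static int exampleFormatEvent(char* buf, size_t len, const char* name, const char* cat,
                              char phase, int id, int pid, int tid, double ts) {
  return snprintf(buf, len,
                  "{\"name\": \"%s\", \"cat\": \"%s\", \"ph\": \"%c\", "
                  "\"id\": %d, \"pid\": %d, \"tid\": %d, \"ts\": %f}",
                  name, cat, phase, id, pid, tid, ts);
}
```

Entries that share the same `cat` and `id` are paired by the trace viewer, which is how a `"ph": "b"` line and its matching `"ph": "e"` line form one interval.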
## AllReduce trace
|
||||
|
||||
The `AllReduce` entry presents information about the `ncclAllReduce` operation. It contains the following info in the args field:
|
||||
|
||||
- seqNum : sequential number of the collective in the communicator (every collective type has its own sequence number in the communicator)
|
||||
- commHash : communicator unique identifier
|
||||
- rank : NCCL rank for the ncclAllReduce
|
||||
- datatype : NCCL datatype
|
||||
- algorithm : algorithm used to process the ncclAllReduce
|
||||
- protocol : protocol used to process the ncclAllReduce
|
||||
- nMaxChannels: max number of channels used to process the ncclAllReduce
|
||||
|
||||
If the proxy events are not active (e.g., the `ncclAllReduce` is intranode) the end timestamp will match the time
|
||||
consumed by the CPU to launch the collective. For more details refer to `ext-profiler/README.md`, section `Profiling
|
||||
of collective and p2p operations`.

### Proxy Send

The `Send` entry presents information about the `ProxyOp` processing in the progress thread. It contains the following
info in the `args` field:

- Channel : id of the channel used by this proxy operation to send data to the peer
- Peer : peer rank
- Steps : number of network steps required to transfer transSize bytes to the peer
- ChunkSize : chunk size used by NCCL to pipeline data through the proxy thread
- transSize : bytes transferred across the channel by this proxy operation
- POSTED : struct containing the number of buffer posts to the GPU and the time stamp for the last post
- REM_FIFO_WAIT: struct containing the number of remote buffer waits and the time stamp for the last wait
- TRANSMITTED : struct containing the number of network sends and the time stamp of the last send
- DONE : struct containing the number of network sends completed and the time stamp of the last send completed

In case of a network problem, the POSTED, REM_FIFO_WAIT, TRANSMITTED and DONE structs might each report only part of
the steps, which can help identify the point at which the network problem occurred.

The Proxy Send trace gives a summary of the proxy progress thread activity for the channel. If more details are
needed, they can be obtained by enabling the proxy step event (`ncclProfileProxyStep`), in which case the trace
entries described in the following subsections are also reported by the profiler.
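
The partially updated counters mentioned above can be checked mechanically by comparing each state's count against `Steps`. The sketch below is only illustrative: it assumes each state struct stores its counter in a field named `step`, which this section does not show, and the sample `args` values are made up. The helper name `first_incomplete_state` is hypothetical.

```python
# Send-side states, in the order they are listed in the args field above.
SEND_STATES = ["POSTED", "REM_FIFO_WAIT", "TRANSMITTED", "DONE"]


def first_incomplete_state(args, states=SEND_STATES, counter_field="step"):
    """Return (state, count) for the first state that saw fewer than args["Steps"] steps.

    Returns None when every state reached the expected step count.
    Assumption: the per-state counter lives in `counter_field`.
    """
    expected = args["Steps"]
    for state in states:
        count = args[state][counter_field]
        if count < expected:
            return state, count
    return None


# Made-up args of a Send entry where the network stopped making progress.
stalled = {
    "Channel": 0, "Peer": 1, "Steps": 14, "ChunkSize": 65536, "transSize": 917504,
    "POSTED": {"step": 14, "ts": 10.0},
    "REM_FIFO_WAIT": {"step": 14, "ts": 11.0},
    "TRANSMITTED": {"step": 9, "ts": 12.0},
    "DONE": {"step": 8, "ts": 12.5},
}

print(first_incomplete_state(stalled))
```

For the sample above the first short state is `TRANSMITTED`, suggesting that the proxy stopped issuing network sends after step 9. The same check could be applied to a `Recv` entry by passing its own list of states.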

#### Proxy SendBufferWait

Presents, for every network step, the time the CPU proxy spends waiting for the channel staging buffer to become available.

#### Proxy SendGPUWait

Presents, for every network step, the time the CPU proxy spends waiting for the GPU to provide the data in the staging
buffer.

#### Proxy SendWait

Presents, for every network step, the time the CPU proxy spends waiting for the `isend` to complete.

### Proxy Recv

The `Recv` entry presents information about the `ProxyOp` processing in the progress thread. It contains the following
info in the `args` field:

- Channel : id of the channel used by this proxy operation to receive data from the peer
- Peer : peer rank
- Steps : number of network steps required to transfer transSize bytes from the peer
- ChunkSize : chunk size used by NCCL to pipeline data through the proxy thread
- transSize : bytes transferred across the channel by this proxy operation
- POSTED : struct containing the number of recvs posted and the time stamp for the last recv posted
- RECEIVED : struct containing the number of recvs completed and the time stamp for the last recv completed
- TRANSMITTED: struct containing the number of recvs flushed to the GPU memory and the time stamp for the last recv flushed
- DONE : struct containing the number of flushes completed and the time stamp for the last flush completed

The Proxy Recv trace gives a summary of the proxy progress thread activity for the channel. If more details are
needed, they can be obtained by enabling the proxy step event (`ncclProfileProxyStep`), in which case the trace
entries described in the following subsections are also reported by the profiler.

#### Proxy RecvBufferWait

Presents, for every network step, the time the CPU proxy spends waiting for the staging buffer for the channel to
become available.

#### Proxy RecvWait

Presents, for every network step, the time the CPU proxy spends waiting for a posted `irecv` to complete.

#### Proxy RecvFlushWait

Presents, for every network step, the time the CPU proxy spends waiting for the recv data to be flushed to the GPU.

#### Proxy RecvGPUWait

Presents, for every network step, the time the CPU proxy spends waiting for the GPU to consume the recv data.