Improvements for GB200 systems
* Optimize the network performance by alternating the direction of the rings and the NIC to GPU assignment across communicators to limit unnecessary sharing.
* Fix the detection of C2C links in case GPU Direct RDMA is disabled between a GPU and a NIC.
* Fix PXN support on MNNVL systems, where NCCL would try (and fail) to share regular host memory across multiple nodes.
* Fix P2C (PXN over C2C), which is now preferred over regular PXN. This support is currently preliminary and is disabled by default; use NCCL_PXN_C2C=1 to enable.
Further reduce the overheads of CUDA graph capturing, which increased in NCCL 2.26.2 for large graphs.
Optimize the network performance on DGX B200 systems by adjusting the bandwidths provided to the graph search algorithm.
Enable fp8 reductions in symmetric kernels on Blackwell with CUDA 12.8.
Restore the plugin name handling logic to make it possible to specify a path to the plugin (Issue #1732).
Restore the ability to change NCCL_COLLNET_ENABLE during execution (Issue #1741).
Add an example tuner plugin with CSV-based overrides.
Remove an x86 dependency from the example profiler.
NCCL Example Tuner Plugin
This example plugin demonstrates a CSV file-based tuning approach: it selectively overrides tuning parameters based on all tuning inputs, without recompiling.
Features
- File-based Configuration: Read tuning parameters from a CSV configuration file
- Size-based Tuning: Specify different configurations based on message size ranges
- Dimension-aware Tuning: Match configurations based on number of nodes and ranks
- Optional Channels Configuration: Set specific channel counts or use -1 to keep NCCL's default
- Environment Variable Support: Specify config file location via NCCL_TUNER_CONFIG_FILE
- Fallback Behavior: Gracefully handles missing config files and invalid entries
Building
make
This will create libnccl-tuner-example.so that can be loaded by NCCL.
Configuration File Format
The configuration file uses CSV (Comma-Separated Values) format with one configuration per line:
collective_type,min_bytes,max_bytes,algorithm,protocol,channels,nNodes,nRanks,numPipeOps,regBuff
Parameters
- collective_type: The collective operation type
  - broadcast, reduce, allgather, reducescatter, allreduce
- min_bytes/max_bytes: The message size range (in bytes) for which this config applies
  - Use 0 for minimum and 4294967295 for maximum (covers all sizes)
- algorithm: The NCCL algorithm to use
  - tree, ring, collnet_direct, collnet_chain, nvls, nvls_tree, pat
- protocol: The NCCL protocol to use
  - ll, ll128, simple
- channels: Number of channels (SMs) to use
  - Use a positive integer to specify exact channel count
  - Use -1 to keep NCCL's default channel selection
- nNodes: Number of nodes to match
  - Use a positive integer to match specific node count
  - Use -1 to match any number of nodes
- nRanks: Number of ranks to match
  - Use a positive integer to match specific rank count
  - Use -1 to match any number of ranks
- numPipeOps: Number of pipeline operations to match (optional)
  - Use a positive integer to match specific pipeline operation count
  - Use -1 to match any number of pipeline operations
  - If omitted, configuration will match any numPipeOps value
- regBuff: Whether user buffer can be registered (optional)
  - Use 0 to match only non-registered buffers
  - Use 1 to match only registered buffers
  - Use -1 to match either registered or non-registered buffers
  - If omitted, configuration will match any regBuff value
Example Configuration
# Single-node, small allreduce: use tree algorithm, registered buffers only
allreduce,0,65536,tree,simple,2,1,-1,-1,1
# 4-node, 32-rank setup: medium allreduce, single pipeline op, non-registered buffers
allreduce,65537,1048576,ring,simple,4,4,32,1,0
# Any topology: large allreduce with LL128, multiple pipeline ops, any buffer type
allreduce,1048577,4294967295,ring,ll128,-1,-1,-1,4,-1
# Single-node broadcast: prefer tree, any pipeOps, any buffer type (8-field, backward-compatible form)
broadcast,0,32768,tree,simple,-1,1,-1
# Large broadcast, any topology: optimized for non-registered buffers, single pipeline op
broadcast,32769,4294967295,ring,simple,2,-1,-1,1,0
Comments start with # and empty lines are ignored. The CSV format makes it easy to edit configurations in spreadsheet applications like Excel, Google Sheets, or LibreOffice Calc.
Backward Compatibility
Configurations without the numPipeOps and/or regBuff parameters are fully supported:
- 8 fields: matches any numPipeOps and regBuff values
- 9 fields: matches any regBuff value
- 10 fields: full parameter specification
This ensures existing configuration files continue to work without modification.
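The field-count rules above can be sketched in Python. This is an illustrative parser, not the plugin's actual C implementation: the name parse_tuner_config, the dictionary layout, and the choice to silently skip malformed lines are all assumptions made for the example.

```python
FIELD_NAMES = ["collective_type", "min_bytes", "max_bytes", "algorithm", "protocol",
               "channels", "nNodes", "nRanks", "numPipeOps", "regBuff"]
INT_FIELDS = set(FIELD_NAMES) - {"collective_type", "algorithm", "protocol"}


def parse_tuner_config(text):
    """Parse CSV tuner config text into a list of dicts (hypothetical helper)."""
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # comments and empty lines are ignored
        fields = line.split(",")
        if not 8 <= len(fields) <= 10:
            continue  # assumption: lines with an unexpected field count are skipped
        # Omitted numPipeOps/regBuff behave like the -1 wildcard.
        fields += ["-1"] * (10 - len(fields))
        try:
            entry = {name: (int(value) if name in INT_FIELDS else value)
                     for name, value in zip(FIELD_NAMES, fields)}
        except ValueError:
            continue  # assumption: non-numeric values in numeric columns are skipped
        entries.append(entry)
    return entries


if __name__ == "__main__":
    sample = "# 8-field entry\nbroadcast,0,32768,tree,simple,-1,1,-1\n"
    print(parse_tuner_config(sample))
```

Padding the short forms with -1 keeps the later matching step uniform, since every entry then carries all ten fields.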
Usage
Method 1: Default Config File
Place your configuration in nccl_tuner.conf in the current working directory.
Method 2: Environment Variable
Set the NCCL_TUNER_CONFIG_FILE environment variable to specify the config file path:
export NCCL_TUNER_CONFIG_FILE=/path/to/your/tuner.conf
export LD_LIBRARY_PATH=/path/to/plugin:$LD_LIBRARY_PATH
mpirun -np 4 your_nccl_application
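The two methods amount to a simple lookup order, sketched below in Python. The function name resolve_config_path is hypothetical. The sketch assumes the environment variable takes precedence over the default file and that an empty value is treated as unset; the plugin itself may handle these cases differently.

```python
import os

DEFAULT_CONFIG = "nccl_tuner.conf"  # Method 1: file in the current working directory


def resolve_config_path(env=None):
    """Return the config path, preferring NCCL_TUNER_CONFIG_FILE (assumed order)."""
    env = os.environ if env is None else env
    # Assumption: an empty variable is treated the same as an unset one.
    return env.get("NCCL_TUNER_CONFIG_FILE") or DEFAULT_CONFIG


if __name__ == "__main__":
    path = resolve_config_path()
    status = "found" if os.path.isfile(path) else "missing (plugin would fall back)"
    print(f"{path}: {status}")
```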
Editing Configuration Files
Generating Configuration Files from Raw Data
A Python script, optimize_config.py, is provided to generate valid CSV configs from raw data.
Spreadsheet Tips:
- Use column headers: collective_type,min_bytes,max_bytes,algorithm,protocol,channels,nNodes,nRanks,numPipeOps,regBuff
- Save as CSV format (not Excel format) for the plugin to read
- Use data validation to prevent typos in algorithm/protocol names
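As a complement to spreadsheet-side validation, an exported file can be checked before use. The Python sketch below is not part of the plugin and is unrelated to optimize_config.py. The name check_config_text is hypothetical, and skipping a header row that starts with collective_type is an assumption made for files exported with column headers.

```python
VALID_COLLECTIVES = {"broadcast", "reduce", "allgather", "reducescatter", "allreduce"}
VALID_ALGORITHMS = {"tree", "ring", "collnet_direct", "collnet_chain",
                    "nvls", "nvls_tree", "pat"}
VALID_PROTOCOLS = {"ll", "ll128", "simple"}


def check_config_text(text):
    """Return a list of (line_number, message) problems found in CSV config text."""
    problems = []
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # comments and empty lines carry no configuration
        fields = line.split(",")
        if fields[0] == "collective_type":
            continue  # assumption: a spreadsheet header row is not an error
        if not 8 <= len(fields) <= 10:
            problems.append((lineno, f"expected 8 to 10 fields, got {len(fields)}"))
            continue
        if any(f != f.strip() for f in fields):
            problems.append((lineno, "whitespace around a field"))
        name, algo, proto = (fields[i].strip() for i in (0, 3, 4))
        if name not in VALID_COLLECTIVES:
            problems.append((lineno, f"unknown collective '{name}'"))
        if algo not in VALID_ALGORITHMS:
            problems.append((lineno, f"unknown algorithm '{algo}'"))
        if proto not in VALID_PROTOCOLS:
            problems.append((lineno, f"unknown protocol '{proto}'"))
    return problems


if __name__ == "__main__":
    for lineno, message in check_config_text("allreduce,0,65536,tre,simple,2,1,-1\n"):
        print(f"line {lineno}: {message}")
```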
Logging
The plugin uses NCCL's logging system. To see tuner-related messages:
export NCCL_DEBUG=INFO
This will show when configurations are loaded and applied, including the topology information.
For detailed debugging output during tuning decisions:
export NCCL_DEBUG=TRACE
This will show verbose information about which configurations are being evaluated and matched.
Dimension Matching
Configurations are only applied when the topology matches:
- Exact Match: Configuration specifies nNodes=4,nRanks=32, only applied when communicator has exactly 4 nodes and 32 ranks
- Wildcard Nodes: Configuration specifies nNodes=-1,nRanks=8, applied to any topology with exactly 8 ranks
- Wildcard Ranks: Configuration specifies nNodes=2,nRanks=-1, applied to any 2-node topology regardless of ranks per node
- Wildcard Both: Configuration specifies nNodes=-1,nRanks=-1, applied to any topology
This allows you to create specialized configurations for different cluster setups while maintaining flexibility.
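The four cases above reduce to one rule: each configured dimension must either be -1 or equal the communicator's value. A minimal Python sketch of that rule follows. The function name dims_match is hypothetical, and the plugin's real check is written in C.

```python
def dims_match(cfg_nodes, cfg_ranks, n_nodes, n_ranks):
    """True when both configured dimensions are -1 or equal the actual values."""
    nodes_ok = cfg_nodes == -1 or cfg_nodes == n_nodes
    ranks_ok = cfg_ranks == -1 or cfg_ranks == n_ranks
    return nodes_ok and ranks_ok


if __name__ == "__main__":
    # Communicator with 2 nodes and 8 ranks, checked against each documented case.
    for cfg in [(4, 32), (-1, 8), (2, -1), (-1, -1)]:
        print(cfg, dims_match(*cfg, n_nodes=2, n_ranks=8))
```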
Default Behavior
If no configuration file is found or no matching configuration exists for a collective operation, the plugin falls back to preferring the ring algorithm with simple protocol. All configured algorithm/protocol combinations are given a low cost (0.0) to make them preferred by NCCL's selection logic.
When channels is set to -1, NCCL's default channel selection logic is preserved, allowing the system to automatically determine the optimal number of channels based on hardware and message size.
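The fallback and cost behavior described above can be modeled with a small Python sketch. It is a simplified stand-in, not NCCL's data structures: the dictionary keyed by (algorithm, protocol), the function names, and the cost values in the usage example are all made up for illustration.

```python
def apply_tuning(cost_table, default_channels, match=None):
    """Return (new_cost_table, channels) after applying a match or the fallback."""
    table = dict(cost_table)  # leave the caller's table untouched
    if match is None:
        # No config file or no matching entry: prefer ring with the simple protocol.
        table[("ring", "simple")] = 0.0
        return table, default_channels
    # A matched entry gets cost 0.0 so it wins the lowest-cost selection.
    table[(match["algorithm"], match["protocol"])] = 0.0
    # channels == -1 keeps the default channel count.
    channels = default_channels if match["channels"] == -1 else match["channels"]
    return table, channels


def preferred(table):
    """Pick the lowest-cost (algorithm, protocol) pair."""
    return min(table, key=table.get)


if __name__ == "__main__":
    costs = {("ring", "simple"): 12.0, ("tree", "ll"): 8.0, ("tree", "simple"): 9.5}
    entry = {"algorithm": "tree", "protocol": "simple", "channels": -1}
    table, channels = apply_tuning(costs, default_channels=8, match=entry)
    print(preferred(table), channels)
```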
Troubleshooting
- Config file not found: Check the file path and permissions
- Configurations not applied: Verify the collective type, size ranges, algorithm/protocol names, and topology parameters
- Plugin not loaded: Ensure LD_LIBRARY_PATH includes the plugin directory
- No effect on performance: Check that NCCL is actually using the tuner plugin with NCCL_DEBUG=INFO
- Topology mismatch: Verify that nNodes and nRanks match your actual setup, or use -1 for wildcards
- CSV parsing errors: Ensure no spaces after commas, or quote fields containing spaces