Merge pull request #1217 from crazy-JiangDongHua/bugfix_undo_plan

Bug in the plan enqueue logic where plans could silently fail to launch for some communicators. Triggered when all of the following are true:
1. Multiple communicators per ncclGroup.
2. Communicators within a group have different plan counts.
3. Intra-process launch barrier disabled.
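
To illustrate the failure mode, here is a minimal sketch of the non-barrier path, assuming the round loop exits once moreRounds is false after a full pass over the clique. FakeComm and runRounds are hypothetical stand-ins, not NCCL types or functions; the real code walks a linked list of ncclComm and launches kernels rather than counting.

```cpp
#include <vector>

// Hypothetical stand-in for a communicator: it only tracks plan counts.
struct FakeComm {
  int unlaunched;  // plans still queued for launch
  int launched;    // plans launched so far
};

// Simplified model of the launch rounds with the barrier disabled.
// accumulate=false mimics the old assignment (moreRounds = ...),
// accumulate=true mimics the fixed form (moreRounds |= ...).
// Returns the number of plans left unlaunched when the loop exits.
int runRounds(std::vector<FakeComm>& clique, bool accumulate) {
  while (true) {  // Iterate rounds of launches for the clique.
    bool moreRounds = false;
    for (FakeComm& comm : clique) {  // Iterate clique members.
      bool hasPlans = comm.unlaunched > 0;
      if (accumulate) {
        moreRounds |= hasPlans;  // any member with plans keeps the rounds going
      } else {
        moreRounds = hasPlans;   // later members overwrite earlier members' state
      }
      if (moreRounds && comm.unlaunched > 0) {
        // Pop and "launch" the next plan for this member.
        comm.unlaunched--;
        comm.launched++;
      }
    }
    // Assumed exit condition: stop once no further rounds are requested.
    if (!moreRounds) break;
  }
  int left = 0;
  for (const FakeComm& comm : clique) left += comm.unlaunched;
  return left;
}
```

With a clique of two members holding 3 and 1 plans, the overwrite form stops after the second round because the last member has run out of plans, which strands one plan on the first member. The accumulating form keeps iterating until every member is drained.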
jbachan 2024-03-18 10:12:26 -07:00 committed by GitHub
commit 6dd51f15bf
No known key found for this signature in database
GPG Key ID: B5690EEEBB952194

@@ -142,7 +142,7 @@ static ncclResult_t doLaunches(struct ncclComm* head) {
}
while (true) { // Iterate rounds of launches for clique.
-bool moreRounds;
+bool moreRounds = false;
comm = cliqueHead;
do { // Iterate clique members.
struct ncclComm* next = comm->groupNext;
@@ -150,7 +150,7 @@ static ncclResult_t doLaunches(struct ncclComm* head) {
// Barrier reduction result tells us if this was the final round.
moreRounds = 0 != ncclCommIntraBarrierOut(comm);
} else {
-moreRounds = comm->unlaunchedPlansHead != nullptr;
+moreRounds |= comm->unlaunchedPlansHead != nullptr;
}
if (moreRounds) {
// Pop next unlaunched kernel