Skip to content

[perf] fix: use direct assignment for NCCL env vars when nccl_ub enabled#3350

Open
dingqingy-nv wants to merge 2 commits intoNVIDIA-NeMo:mainfrom
dingqingy-nv:fix/nccl-ub-env-direct-assignment
Open

[perf] fix: use direct assignment for NCCL env vars when nccl_ub enabled#3350
dingqingy-nv wants to merge 2 commits intoNVIDIA-NeMo:mainfrom
dingqingy-nv:fix/nccl-ub-env-direct-assignment

Conversation

@dingqingy-nv
Copy link
Copy Markdown
Contributor

@dingqingy-nv dingqingy-nv commented Apr 16, 2026

Summary

Test plan

  • CI passes
  • Verify NCCL env vars are correctly set when recipe.ddp.nccl_ub=True, even on clusters with NCCL_NVLS_ENABLE=0 default

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Bug Fixes
    • Updated distributed training configuration to enforce NCCL environment settings when UB optimization is enabled, now overriding existing cluster settings for consistent performance behavior.

setdefault() does not override pre-existing values, so clusters that
default NCCL_NVLS_ENABLE=0 in their environment would silently break
nccl_ub when enabled via recipe config. Switch to direct assignment to
enforce the correct values, consistent with setup_experiment.py which
already uses dict.update().

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Signed-off-by: Dingqing Yang <[email protected]>
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Signed-off-by: Dingqing Yang <[email protected]>
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Apr 16, 2026

📝 Walkthrough

Walkthrough

The script modifies NCCL environment variable handling when distributed data parallel (DDP) with NCCL is enabled, switching from conditional setting via setdefault() to force-setting the NCCL_NVLS_ENABLE and NCCL_CTA_POLICY environment variables, which overrides any existing cluster defaults.

Changes

Cohort / File(s) Summary
NCCL Environment Variable Setup
scripts/performance/run_script.py
Modified to force-set NCCL environment variables instead of using setdefault() when recipe.ddp.nccl_ub is enabled, overriding existing values rather than preserving them.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs

Suggested labels

performance, performance/release, area:perf, needs-review

Suggested reviewers

  • erhoo82
  • malay-nagda
  • yaoyu-33
🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Test Results For Major Changes ⚠️ Warning PR modifies NCCL environment variables affecting distributed training but lacks test results and performance metrics required by custom check. Add CI test results, NCCL variable configuration verification across cluster types, performance metrics, and address OOM concerns with guards.
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The pull request title clearly and concisely describes the main change: switching from setdefault() to direct assignment for NCCL environment variables when nccl_ub is enabled. The title is specific, meaningful, and directly related to the changeset.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@scripts/performance/run_script.py`:
- Around line 101-104: The code unconditionally forces NCCL_NVLS_ENABLE=1 when
recipe.ddp.nccl_ub is true; instead add a guard that (a) checks for an explicit
user override (os.environ.get("NCCL_NVLS_ENABLE") or a new config flag
recipe.ddp.nccl_nvls_enable) and respects it, and (b) if you must change an
existing non-default value, emit a clear warning via the same logger used
elsewhere (e.g., processLogger or the module logger) indicating you are
overriding the user's setting and why; implement this logic in run_script.py
around the nccl_ub branch and mirror the same change for the identical pattern
in setup_experiment.py so both use either the explicit
recipe.ddp.nccl_nvls_enable flag or respect pre-existing NCCL_NVLS_ENABLE
environment variables and log overrides.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 0f8d138a-54f0-4d91-a108-6565e695c3a8

📥 Commits

Reviewing files that changed from the base of the PR and between b8b13d3 and 71883d8.

📒 Files selected for processing (1)
  • scripts/performance/run_script.py

Comment thread scripts/performance/run_script.py Outdated
Comment on lines +101 to +104
# Use direct assignment to override cluster defaults (e.g. NCCL_NVLS_ENABLE=0).
if getattr(recipe.ddp, "nccl_ub", False):
os.environ.setdefault("NCCL_NVLS_ENABLE", "1")
os.environ.setdefault("NCCL_CTA_POLICY", "1")
os.environ["NCCL_NVLS_ENABLE"] = "1"
os.environ["NCCL_CTA_POLICY"] = "1"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "1) Inspect all NCCL_NVLS_ENABLE / nccl_ub handling in performance scripts:"
rg -n -C3 'NCCL_NVLS_ENABLE|NCCL_CTA_POLICY|nccl_ub' scripts/performance

echo
echo "2) Check for any explicit user-facing knob to disable NVLS while nccl_ub is enabled:"
rg -n -C2 'nccl_nvls|nvls_enable|force.*nvls|disable.*nvls' scripts/performance

echo
echo "3) Confirm known OOM annotation context:"
rg -n -C2 'OOM|NCCL_NVLS_ENABLE' scripts/performance/perf_plugins.py

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 12035


Add escape hatch or warning when forcing NCCL_NVLS_ENABLE=1 for nccl_ub.

This change unconditionally sets NCCL_NVLS_ENABLE=1 when recipe.ddp.nccl_ub is enabled, overriding the default of "0" (set in scripts/performance/utils/executors.py). However, scripts/performance/perf_plugins.py explicitly documents that enabling this causes OOM in certain configurations. Without an escape hatch or warning when overriding non-default values already set by users or operators, this becomes a reliability hazard.

Add a guard that either:

  1. Respects pre-existing user overrides and logs when forcing a change, or
  2. Exposes a new config knob (e.g., recipe.ddp.nccl_nvls_enable) to give users explicit control.

The same pattern also appears in scripts/performance/setup_experiment.py (lines 340-341) and should be treated consistently.

Example: Guard with logging
 if getattr(recipe.ddp, "nccl_ub", False):
+    prev_nvls = os.environ.get("NCCL_NVLS_ENABLE")
     os.environ["NCCL_NVLS_ENABLE"] = "1"
+    if prev_nvls and prev_nvls != "1":
+        import logging
+        logging.getLogger(__name__).warning(
+            "Forcing NCCL_NVLS_ENABLE=1 (was %s) due to nccl_ub=True. "
+            "This may cause OOM in some configurations.",
+            prev_nvls,
+        )
     os.environ["NCCL_CTA_POLICY"] = "1"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/performance/run_script.py` around lines 101 - 104, The code
unconditionally forces NCCL_NVLS_ENABLE=1 when recipe.ddp.nccl_ub is true;
instead add a guard that (a) checks for an explicit user override
(os.environ.get("NCCL_NVLS_ENABLE") or a new config flag
recipe.ddp.nccl_nvls_enable) and respects it, and (b) if you must change an
existing non-default value, emit a clear warning via the same logger used
elsewhere (e.g., processLogger or the module logger) indicating you are
overriding the user's setting and why; implement this logic in run_script.py
around the nccl_ub branch and mirror the same change for the identical pattern
in setup_experiment.py so both use either the explicit
recipe.ddp.nccl_nvls_enable flag or respect pre-existing NCCL_NVLS_ENABLE
environment variables and log overrides.

@yaoyu-33 yaoyu-33 added bug Something isn't working area:perf Performance optimizations and benchmarking needs-review PR is ready for code review and waiting on a reviewer ready-to-merge PR is approved, current, and only waiting for CI to pass before merge and removed needs-review PR is ready for code review and waiting on a reviewer labels Apr 16, 2026
@dingqingy-nv dingqingy-nv added the r0.4.0 Auto-cherrypick to release branch. Apply before merge; cherrypick happens after merge. label Apr 16, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:perf Performance optimizations and benchmarking bug Something isn't working r0.4.0 Auto-cherrypick to release branch. Apply before merge; cherrypick happens after merge. ready-to-merge PR is approved, current, and only waiting for CI to pass before merge

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants