[perf] fix: use direct assignment for NCCL env vars when nccl_ub enabled#3350
[perf] fix: use direct assignment for NCCL env vars when nccl_ub enabled#3350dingqingy-nv wants to merge 2 commits intoNVIDIA-NeMo:mainfrom
Conversation
setdefault() does not override pre-existing values, so clusters that default NCCL_NVLS_ENABLE=0 in their environment would silently break nccl_ub when enabled via recipe config. Switch to direct assignment to enforce the correct values, consistent with setup_experiment.py which already uses dict.update(). Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> Signed-off-by: Dingqing Yang <[email protected]>
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> Signed-off-by: Dingqing Yang <[email protected]>
📝 WalkthroughWalkthroughThe script modifies NCCL environment variable handling when distributed data parallel (DDP) with NCCL is enabled, switching from conditional setting via Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 3 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@scripts/performance/run_script.py`:
- Around line 101-104: The code unconditionally forces NCCL_NVLS_ENABLE=1 when
recipe.ddp.nccl_ub is true; instead add a guard that (a) checks for an explicit
user override (os.environ.get("NCCL_NVLS_ENABLE") or a new config flag
recipe.ddp.nccl_nvls_enable) and respects it, and (b) if you must change an
existing non-default value, emit a clear warning via the same logger used
elsewhere (e.g., processLogger or the module logger) indicating you are
overriding the user's setting and why; implement this logic in run_script.py
around the nccl_ub branch and mirror the same change for the identical pattern
in setup_experiment.py so both use either the explicit
recipe.ddp.nccl_nvls_enable flag or respect pre-existing NCCL_NVLS_ENABLE
environment variables and log overrides.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 0f8d138a-54f0-4d91-a108-6565e695c3a8
📒 Files selected for processing (1)
scripts/performance/run_script.py
| # Use direct assignment to override cluster defaults (e.g. NCCL_NVLS_ENABLE=0). | ||
| if getattr(recipe.ddp, "nccl_ub", False): | ||
| os.environ.setdefault("NCCL_NVLS_ENABLE", "1") | ||
| os.environ.setdefault("NCCL_CTA_POLICY", "1") | ||
| os.environ["NCCL_NVLS_ENABLE"] = "1" | ||
| os.environ["NCCL_CTA_POLICY"] = "1" |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "1) Inspect all NCCL_NVLS_ENABLE / nccl_ub handling in performance scripts:"
rg -n -C3 'NCCL_NVLS_ENABLE|NCCL_CTA_POLICY|nccl_ub' scripts/performance
echo
echo "2) Check for any explicit user-facing knob to disable NVLS while nccl_ub is enabled:"
rg -n -C2 'nccl_nvls|nvls_enable|force.*nvls|disable.*nvls' scripts/performance
echo
echo "3) Confirm known OOM annotation context:"
rg -n -C2 'OOM|NCCL_NVLS_ENABLE' scripts/performance/perf_plugins.pyRepository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 12035
Add escape hatch or warning when forcing NCCL_NVLS_ENABLE=1 for nccl_ub.
This change unconditionally sets NCCL_NVLS_ENABLE=1 when recipe.ddp.nccl_ub is enabled, overriding the default of "0" (set in scripts/performance/utils/executors.py). However, scripts/performance/perf_plugins.py explicitly documents that enabling this causes OOM in certain configurations. Without an escape hatch or warning when overriding non-default values already set by users or operators, this becomes a reliability hazard.
Add a guard that either:
- Respects pre-existing user overrides and logs when forcing a change, or
- Exposes a new config knob (e.g.,
recipe.ddp.nccl_nvls_enable) to give users explicit control.
The same pattern also appears in scripts/performance/setup_experiment.py (lines 340-341) and should be treated consistently.
Example: Guard with logging
if getattr(recipe.ddp, "nccl_ub", False):
+ prev_nvls = os.environ.get("NCCL_NVLS_ENABLE")
os.environ["NCCL_NVLS_ENABLE"] = "1"
+ if prev_nvls and prev_nvls != "1":
+ import logging
+ logging.getLogger(__name__).warning(
+ "Forcing NCCL_NVLS_ENABLE=1 (was %s) due to nccl_ub=True. "
+ "This may cause OOM in some configurations.",
+ prev_nvls,
+ )
os.environ["NCCL_CTA_POLICY"] = "1"🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scripts/performance/run_script.py` around lines 101 - 104, The code
unconditionally forces NCCL_NVLS_ENABLE=1 when recipe.ddp.nccl_ub is true;
instead add a guard that (a) checks for an explicit user override
(os.environ.get("NCCL_NVLS_ENABLE") or a new config flag
recipe.ddp.nccl_nvls_enable) and respects it, and (b) if you must change an
existing non-default value, emit a clear warning via the same logger used
elsewhere (e.g., processLogger or the module logger) indicating you are
overriding the user's setting and why; implement this logic in run_script.py
around the nccl_ub branch and mirror the same change for the identical pattern
in setup_experiment.py so both use either the explicit
recipe.ddp.nccl_nvls_enable flag or respect pre-existing NCCL_NVLS_ENABLE
environment variables and log overrides.
Summary
fix(perf): set NCCL env vars when nccl_ub enabled via recipe config (3283)intor0.4.0#3305 / fix(perf): set NCCL env vars when nccl_ub enabled via recipe config #3283os.environ.setdefault()does not override pre-existing values, so clusters that defaultNCCL_NVLS_ENABLE=0in their container environment would silently breaknccl_ubwhen enabled via recipe configos.environ["KEY"] = "1"), consistent withsetup_experiment.pywhich already usesdict.update()for the same env varsTest plan
recipe.ddp.nccl_ub=True, even on clusters withNCCL_NVLS_ENABLE=0default🤖 Generated with Claude Code
Summary by CodeRabbit