
VR200 cfgs, Lm3 70b, 405b, qwen3 30b, 235b, gpt-oss, kimi#3374

Open
malay-nagda wants to merge 4 commits into main from malay/vr200_cfgs

Conversation


@malay-nagda malay-nagda commented Apr 17, 2026

What does this PR do ?

Adds VR200 system support for the following models: Llama3 70B, Llama3 405B, Qwen3 30B_a3B, Qwen3 235B_a22B, GPT-OSS 120B, and Kimi-K2.

Changelog

    GPT_OSS_120B_PRETRAIN_CONFIG_VR200_BF16_V2,
    GPT_OSS_120B_PRETRAIN_CONFIG_VR200_FP8_MX_V2,

    LLAMA3_70B_PRETRAIN_CONFIG_VR200_BF16_V2,
    LLAMA3_70B_PRETRAIN_CONFIG_VR200_FP8_MX_V2,
    LLAMA3_70B_PRETRAIN_CONFIG_VR200_NVFP4_V2,

    LLAMA31_405B_PRETRAIN_CONFIG_VR200_BF16_V2,
    LLAMA31_405B_PRETRAIN_CONFIG_VR200_FP8_MX_V2,
    LLAMA31_405B_PRETRAIN_CONFIG_VR200_NVFP4_V2,

    QWEN3_30B_A3B_PRETRAIN_CONFIG_VR200_BF16_V1,
    QWEN3_30B_A3B_PRETRAIN_CONFIG_VR200_FP8_MX_V1,

    QWEN3_235B_A22B_PRETRAIN_CONFIG_VR200_BF16_V2,
    QWEN3_235B_A22B_PRETRAIN_CONFIG_VR200_FP8_MX_V2,
    QWEN3_235B_A22B_PRETRAIN_CONFIG_VR200_NVFP4_V2,

    KIMI_K2_PRETRAIN_CONFIG_VR200_BF16,
    KIMI_K2_PRETRAIN_CONFIG_VR200_FP8_MX

GitHub Actions CI

See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

If you haven't finished some of the above items, you can still open a "Draft" PR.

Additional Information

  • Related to # (issue)

Summary by CodeRabbit

New Features

  • Added VR200 hardware target support for multiple AI models: GPT-OSS 120B, Llama (70B and 405B), Kimi K2, and Qwen3 (30B and 235B)
  • Introduced configuration variants with BF16, FP8_MX, and NVFP4 precision options for supported models on VR200 hardware

Signed-off-by: Malay Nagda <malayn@nvidia.com>

copy-pr-bot bot commented Apr 17, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

Signed-off-by: Malay Nagda <malayn@nvidia.com>
@malay-nagda changed the title from "vr200 cfgs, Lm3 70b, 405b, qwen3 30b, 235b, gpt-oss, kimi" to "VR200 cfgs, Lm3 70b, 405b, qwen3 30b, 235b, gpt-oss, kimi" on Apr 17, 2026
Signed-off-by: Malay Nagda <malayn@nvidia.com>
@malay-nagda malay-nagda marked this pull request as ready for review April 17, 2026 09:59

coderabbitai bot commented Apr 17, 2026

📝 Walkthrough

Walkthrough

The PR adds VR200 hardware target support for pretraining configurations across four model families: GPT-OSS 120B, Kimi K2, Llama 70B/405B, and Qwen3 30B/235B. For each model, new VR200-specific configuration factory functions are introduced alongside workload base configuration aliases for multiple precision formats.

Changes

Cohort / File(s) Summary
GPT-OSS VR200 Support
scripts/performance/configs/gpt_oss/__init__.py, scripts/performance/configs/gpt_oss/gpt_oss_llm_pretrain.py, scripts/performance/configs/gpt_oss/gpt_oss_workload_base_configs.py
Added gpt_oss_120b_pretrain_config_vr200() factory function with mixed precision support, MoE flex dispatcher backend, and communication overlap configuration. Added BF16_V2 and FP8_MX_V2 workload base config aliases and corresponding exports.
Kimi VR200 Support
scripts/performance/configs/kimi/__init__.py, scripts/performance/configs/kimi/kimi_llm_pretrain.py, scripts/performance/configs/kimi/kimi_workload_base_configs.py
Added kimi_k2_pretrain_config_vr200() factory function with pipeline layout configuration and gradient reduce overlap. Created BF16 and FP8_MX workload base config aliases referencing existing GB200 configurations.
Llama 70B/405B VR200 Support
scripts/performance/configs/llama/__init__.py, scripts/performance/configs/llama/llama3_llm_pretrain.py, scripts/performance/configs/llama/llama3_workload_base_configs.py, scripts/performance/configs/llama/llama31_llm_pretrain.py, scripts/performance/configs/llama/llama31_workload_base_configs.py
Added llama3_70b_pretrain_config_vr200() and llama31_405b_pretrain_config_vr200() factory functions with precision-specific tensor-parallel overlap presets, FSDP configuration adjustments, and distributed optimizer settings. Added corresponding BF16_V2, FP8_MX_V2, and NVFP4_V2 workload base config aliases.
Qwen3 VR200 Support
scripts/performance/configs/qwen/__init__.py, scripts/performance/configs/qwen/qwen3_llm_pretrain.py, scripts/performance/configs/qwen/qwen3_workload_base_configs.py
Added qwen3_30b_a3b_pretrain_config_vr200() and qwen3_235b_a22b_pretrain_config_vr200() factory functions with MoE flex dispatcher backend and token dispatcher configuration. Added BF16/FP8_MX variants for 30B and BF16/FP8_MX/NVFP4 variants for 235B as workload base config aliases.
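The "workload base config aliases referencing existing GB200 configurations" pattern described above can be sketched as follows. This is an illustrative stand-in, not the repo's actual types or values: `WorkloadBaseConfig` and its fields are hypothetical, and only the constant names come from the changelog.

```python
from dataclasses import dataclass, replace

# Hypothetical stand-in for the repo's workload base config record.
@dataclass(frozen=True)
class WorkloadBaseConfig:
    gpu: str
    compute_dtype: str
    tensor_model_parallel_size: int

# Existing GB200 preset (illustrative values only).
KIMI_K2_PRETRAIN_CONFIG_GB200_BF16 = WorkloadBaseConfig(
    gpu="gb200", compute_dtype="BF16", tensor_model_parallel_size=2
)

# The VR200 alias reuses the tuned GB200 settings, retargeted to the new
# GPU, mirroring the aliasing pattern the walkthrough describes.
KIMI_K2_PRETRAIN_CONFIG_VR200_BF16 = replace(
    KIMI_K2_PRETRAIN_CONFIG_GB200_BF16, gpu="vr200"
)
```

Aliasing keeps one source of truth for the tuned parallelism settings until VR200-specific tuning lands.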

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • Onboard NVFP4 and MXFP8 recipes #2600: Shares modifications to GPT-OSS configuration surfaces (gpt_oss/__init__.py, gpt_oss_llm_pretrain.py, gpt_oss_workload_base_configs.py) and applies the MoE flex dispatcher backend pattern.

Suggested labels

performance, performance/release, r0.3.0

Suggested reviewers

  • ko3n1g
🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name | Status | Explanation
Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled.
Title Check | ✅ Passed | The title lists the main hardware target (VR200) and the model configurations being added, accurately summarizing the primary changes.
Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%.
Test Results For Major Changes | ✅ Passed | PR contains minor configuration additions (~285 lines) aliasing GB200 to VR200, with no algorithm, numerics, or baseline performance changes.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🧹 Nitpick comments (1)
scripts/performance/configs/kimi/kimi_llm_pretrain.py (1)

140-177: Consider extracting a shared Kimi pretrain-config builder.

This new VR200 function duplicates the same setup path used by other GPU-specific factories. A small internal helper would reduce drift risk between targets.

♻️ Refactor sketch
+def _kimi_k2_pretrain_config_for_gpu(
+    *,
+    gpu: str,
+    precision: str = "bf16",
+    config_variant: str = "v1",
+    optimizer_type: str = "muon",
+) -> ConfigContainer:
+    base_cfg = get_workload_base_config(
+        model_family_name="kimi",
+        model_recipe_name="kimi_k2",
+        gpu=gpu,
+        compute_dtype=precision.upper(),
+        task="pretrain",
+        config_variant=config_variant,
+    )
+    cfg = pretrain_config(optimizer_type=optimizer_type)
+    cfg.mixed_precision = get_precision_config(precision)
+    if base_cfg.moe_flex_dispatcher_backend is not None:
+        cfg.model.moe_flex_dispatcher_backend = base_cfg.moe_flex_dispatcher_backend
+    apply_flex_dispatcher_backend(cfg.model, cfg.model.moe_flex_dispatcher_backend)
+    if base_cfg.pp_layout:
+        cfg.model.pipeline_model_parallel_layout = base_cfg.pp_layout
+    else:
+        cfg.model.pipeline_model_parallel_layout = _get_kimi_k2_pipeline_layout(
+            base_cfg.pipeline_model_parallel_size,
+            base_cfg.virtual_pipeline_model_parallel_size,
+        )
+    set_kimi_k2_common_configs(cfg)
+    set_workload_base_configs(cfg, base_cfg)
+    cfg.comm_overlap.overlap_grad_reduce = True
+    return cfg
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/performance/configs/kimi/kimi_llm_pretrain.py` around lines 140 -
177, The kimi_k2_pretrain_config_vr200 function duplicates GPU-specific setup
logic found in other factory functions; extract a shared builder (e.g.,
build_kimi_k2_pretrain_config) that accepts GPU-specific params (gpu name,
base_cfg) and performs the common steps: call get_workload_base_config, create
cfg via pretrain_config, attach mixed precision via get_precision_config,
conditionally set cfg.model.moe_flex_dispatcher_backend and call
apply_flex_dispatcher_backend, compute or assign pipeline layout (using
_get_kimi_k2_pipeline_layout when base_cfg.pp_layout is empty), call
set_kimi_k2_common_configs and set_workload_base_configs, and set
cfg.comm_overlap.overlap_grad_reduce; then refactor
kimi_k2_pretrain_config_vr200 to call this shared builder with vr200-specific
args so code paths in functions like kimi_k2_pretrain_config_vr200,
apply_flex_dispatcher_backend, set_kimi_k2_common_configs, and
set_workload_base_configs remain consistent.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: b92e10ce-07f1-4bfa-b898-72b3c33f8bce

📥 Commits

Reviewing files that changed from the base of the PR and between 87ba119 and 828c6ea.

📒 Files selected for processing (14)
  • scripts/performance/configs/gpt_oss/__init__.py
  • scripts/performance/configs/gpt_oss/gpt_oss_llm_pretrain.py
  • scripts/performance/configs/gpt_oss/gpt_oss_workload_base_configs.py
  • scripts/performance/configs/kimi/__init__.py
  • scripts/performance/configs/kimi/kimi_llm_pretrain.py
  • scripts/performance/configs/kimi/kimi_workload_base_configs.py
  • scripts/performance/configs/llama/__init__.py
  • scripts/performance/configs/llama/llama31_llm_pretrain.py
  • scripts/performance/configs/llama/llama31_workload_base_configs.py
  • scripts/performance/configs/llama/llama3_llm_pretrain.py
  • scripts/performance/configs/llama/llama3_workload_base_configs.py
  • scripts/performance/configs/qwen/__init__.py
  • scripts/performance/configs/qwen/qwen3_llm_pretrain.py
  • scripts/performance/configs/qwen/qwen3_workload_base_configs.py

Comment on lines +89 to +113
def gpt_oss_120b_pretrain_config_vr200(
    precision: str = "bf16", mock: bool = True, config_variant: str = "v1"
) -> ConfigContainer:
    """VR200, baseline config."""
    base_cfg = get_workload_base_config(
        model_family_name="gpt_oss",
        model_recipe_name="gpt_oss_120b",
        gpu="vr200",
        compute_dtype=precision.upper(),
        task="pretrain",
        config_variant=config_variant,
    )
    precision_config = get_precision_config(precision)

    cfg = gpt_oss_120b_pretrain_config()
    cfg.mixed_precision = precision_config
    if base_cfg.moe_flex_dispatcher_backend is not None:
        apply_flex_dispatcher_backend(cfg.model, base_cfg.moe_flex_dispatcher_backend)
    cfg.comm_overlap = CommOverlapConfig(tp_comm_overlap=bool(base_cfg.tensor_model_parallel_size > 1))
    cfg.comm_overlap.tp_comm_overlap = False if precision == "nvfp4" else cfg.comm_overlap.tp_comm_overlap
    set_gpt_oss_common_configs(cfg)
    set_workload_base_configs(cfg, base_cfg)

    return cfg


⚠️ Potential issue | 🔴 Critical

Default config_variant="v1" breaks VR200 lookup.

At Line 90, the default variant is "v1", but VR200 workload entries in scripts/performance/configs/gpt_oss/gpt_oss_workload_base_configs.py are V2-only. A default call will fail config resolution.

💡 Proposed fix
 def gpt_oss_120b_pretrain_config_vr200(
-    precision: str = "bf16", mock: bool = True, config_variant: str = "v1"
+    precision: str = "bf16", mock: bool = True, config_variant: str = "v2"
 ) -> ConfigContainer:

As per coding guidelines, "Do not add arbitrary defaults for configs; be as explicit as possible."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/performance/configs/gpt_oss/gpt_oss_llm_pretrain.py` around lines 89
- 113, The function gpt_oss_120b_pretrain_config_vr200 currently defaults
config_variant="v1" which fails VR200 workload lookup; update the function
signature in gpt_oss_120b_pretrain_config_vr200 to use config_variant="v2" (or
remove the default and require the caller to pass the explicit variant), and
ensure any internal uses of config_variant (calls to get_workload_base_config)
continue to pass the corrected value so VR200 entries in
gpt_oss_workload_base_configs.py resolve correctly.

Comment on lines 36 to 43
from .llama31_llm_pretrain import (
    llama31_405b_pretrain_config_b200,
    llama31_405b_pretrain_config_b300,
    llama31_405b_pretrain_config_gb200,
    llama31_405b_pretrain_config_gb300,
    llama31_405b_pretrain_config_h100,
    llama31_405b_pretrain_config_vr200,
)

⚠️ Potential issue | 🟠 Major

Fix CI-blocking F401 on llama31_llm_pretrain imports.

These imports are being flagged unused by Flake8 in the current re-export pattern, which will fail lint in CI.

🛠️ Minimal lint fix
-    from .llama31_llm_pretrain import (
+    from .llama31_llm_pretrain import (  # noqa: F401
         llama31_405b_pretrain_config_b200,
         llama31_405b_pretrain_config_b300,
         llama31_405b_pretrain_config_gb200,
         llama31_405b_pretrain_config_gb300,
         llama31_405b_pretrain_config_h100,
         llama31_405b_pretrain_config_vr200,
     )
As per coding guidelines "`**/*.py`: Use ruff for linting and formatting Python code. Run `uv run ruff check --fix .` and `uv run ruff format .` to fix most issues. CI does not auto-fix linting and formatting issues."
🧰 Tools
🪛 Flake8 (7.3.0)

[error] 36-36: '.llama31_llm_pretrain.llama31_405b_pretrain_config_b200' imported but unused (F401)
[error] 36-36: '.llama31_llm_pretrain.llama31_405b_pretrain_config_b300' imported but unused (F401)
[error] 36-36: '.llama31_llm_pretrain.llama31_405b_pretrain_config_gb200' imported but unused (F401)
[error] 36-36: '.llama31_llm_pretrain.llama31_405b_pretrain_config_gb300' imported but unused (F401)
[error] 36-36: '.llama31_llm_pretrain.llama31_405b_pretrain_config_h100' imported but unused (F401)
[error] 36-36: '.llama31_llm_pretrain.llama31_405b_pretrain_config_vr200' imported but unused (F401)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/performance/configs/llama/__init__.py` around lines 36 - 43, The
imported symbols (llama31_405b_pretrain_config_b200,
llama31_405b_pretrain_config_b300, llama31_405b_pretrain_config_gb200,
llama31_405b_pretrain_config_gb300, llama31_405b_pretrain_config_h100,
llama31_405b_pretrain_config_vr200) are flagged as unused (F401); to fix,
explicitly export them by adding an __all__ list that includes each of those
names in the module that currently imports them, or alternatively reference them
in a re-exporting statement so ruff/flake8 recognizes they are intentionally
exposed; update the module's top-level exports accordingly and run ruff
check/format to ensure the lint error is resolved.

Comment on lines +120 to +153
def llama3_70b_pretrain_config_vr200(
    precision: str = "bf16", mock: bool = True, config_variant: str = "v1"
) -> ConfigContainer:
    """VR200, baseline config."""
    base_cfg = get_workload_base_config(
        model_family_name="llama",
        model_recipe_name="llama3_70b",
        gpu="vr200",
        compute_dtype=precision.upper(),
        task="pretrain",
        config_variant=config_variant,
    )
    precision_config = get_precision_config(precision)

    if precision == "bf16":
        comm_overlap_cfg = userbuffers_bf16_b200_h8192_tp2_mbs1_seqlen8192
    else:
        comm_overlap_cfg = userbuffers_fp8_b200_h8192_tp2_mbs1_seqlen8192

    cfg = llama3_70b_pretrain_config()
    cfg.mixed_precision = precision_config
    set_llama3_common_configs(cfg)
    set_workload_base_configs(cfg, base_cfg)

    if cfg.ddp.use_megatron_fsdp:
        cfg.ddp.fsdp_double_buffer = True
        cfg.model.gradient_accumulation_fusion = False  # Disabled to avoid functional errors
        cfg.ddp.suggested_communication_unit_size = 800000000

    cfg.comm_overlap.tp_comm_overlap_cfg = comm_overlap_cfg
    cfg.comm_overlap.tp_comm_overlap = False if precision == "nvfp4" else cfg.comm_overlap.tp_comm_overlap

    return cfg


⚠️ Potential issue | 🔴 Critical

Default VR200 config variant should not be v1.

At Line 121, defaulting config_variant to "v1" conflicts with the newly added VR200 workload presets (V2-only), so default calls will fail.

💡 Proposed fix
 def llama3_70b_pretrain_config_vr200(
-    precision: str = "bf16", mock: bool = True, config_variant: str = "v1"
+    precision: str = "bf16", mock: bool = True, config_variant: str = "v2"
 ) -> ConfigContainer:

As per coding guidelines, "Do not add arbitrary defaults for configs; be as explicit as possible."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/performance/configs/llama/llama3_llm_pretrain.py` around lines 120 -
153, The function llama3_70b_pretrain_config_vr200 currently defaults
config_variant="v1", which conflicts with VR200 V2-only presets; remove the
arbitrary default so callers must pass an explicit config_variant (change the
signature to config_variant: str with no default), update the function docstring
to mention config_variant is required, and keep the rest of the logic
(get_workload_base_config(..., config_variant=config_variant)) unchanged so
callers must opt into "v2" or other variants explicitly.

Comment on lines +115 to +148
def llama31_405b_pretrain_config_vr200(
    precision: str = "bf16", mock: bool = True, config_variant: str = "v1"
) -> ConfigContainer:
    """VR200, baseline config."""
    base_cfg = get_workload_base_config(
        model_family_name="llama",
        model_recipe_name="llama31_405b",
        gpu="vr200",
        compute_dtype=precision.upper(),
        task="pretrain",
        config_variant=config_variant,
    )
    precision_config = get_precision_config(precision)

    if precision == "bf16":
        comm_overlap_cfg = userbuffers_bf16_b200_h16384_tp4_cp2_mbs1_seqlen8192
    else:
        comm_overlap_cfg = userbuffers_fp8_b200_h16384_tp4_cp2_mbs1_seqlen8192

    cfg = llama31_405b_pretrain_config()
    cfg.mixed_precision = precision_config
    set_llama31_common_configs(cfg)
    set_workload_base_configs(cfg, base_cfg)

    if cfg.ddp.use_megatron_fsdp:
        cfg.ddp.fsdp_double_buffer = True
        cfg.model.gradient_accumulation_fusion = False  # Disabled to avoid functional errors
        cfg.ddp.num_distributed_optimizer_instances = 2

    cfg.comm_overlap.tp_comm_overlap_cfg = comm_overlap_cfg
    cfg.comm_overlap.tp_comm_overlap = False if precision == "nvfp4" else cfg.comm_overlap.tp_comm_overlap

    return cfg


⚠️ Potential issue | 🔴 Critical

VR200 function default variant is incompatible with available presets.

At Line 116, config_variant defaults to "v1", but VR200 presets for this model are introduced as V2-only. Default calls will fail workload-base lookup.

💡 Proposed fix
 def llama31_405b_pretrain_config_vr200(
-    precision: str = "bf16", mock: bool = True, config_variant: str = "v1"
+    precision: str = "bf16", mock: bool = True, config_variant: str = "v2"
 ) -> ConfigContainer:

As per coding guidelines, "Do not add arbitrary defaults for configs; be as explicit as possible."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/performance/configs/llama/llama31_llm_pretrain.py` around lines 115 -
148, The function llama31_405b_pretrain_config_vr200 currently defaults
config_variant to "v1" which is incompatible with VR200 presets; change the
config_variant default to "v2" (or remove the default and require callers to
pass the variant) and ensure the get_workload_base_config(...) call uses that
corrected value so workload-base lookup succeeds; optionally add a simple
validation in llama31_405b_pretrain_config_vr200 to raise a clear error if an
unsupported variant (e.g., "v1") is passed.

@dingqingy-nv

/claude review


claude bot commented Apr 17, 2026

LGTM

