VR200 cfgs, Lm3 70b, 405b, qwen3 30b, 235b, gpt-oss, kimi #3374

malay-nagda wants to merge 4 commits into main from
Conversation
Signed-off-by: Malay Nagda <malayn@nvidia.com>
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here.
📝 Walkthrough

The PR adds VR200 hardware target support for pretraining configurations across four model families: GPT-OSS 120B, Kimi K2, Llama 70B/405B, and Qwen3 30B/235B. For each model, new VR200-specific configuration factory functions are introduced alongside workload base configuration aliases for multiple precision formats.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 4 passed
Actionable comments posted: 4
🧹 Nitpick comments (1)
scripts/performance/configs/kimi/kimi_llm_pretrain.py (1)
Lines 140-177: Consider extracting a shared Kimi pretrain-config builder. This new VR200 function duplicates the same setup path used by other GPU-specific factories. A small internal helper would reduce drift risk between targets.
♻️ Refactor sketch

```python
def _kimi_k2_pretrain_config_for_gpu(
    *,
    gpu: str,
    precision: str = "bf16",
    config_variant: str = "v1",
    optimizer_type: str = "muon",
) -> ConfigContainer:
    base_cfg = get_workload_base_config(
        model_family_name="kimi",
        model_recipe_name="kimi_k2",
        gpu=gpu,
        compute_dtype=precision.upper(),
        task="pretrain",
        config_variant=config_variant,
    )
    cfg = pretrain_config(optimizer_type=optimizer_type)
    cfg.mixed_precision = get_precision_config(precision)
    if base_cfg.moe_flex_dispatcher_backend is not None:
        cfg.model.moe_flex_dispatcher_backend = base_cfg.moe_flex_dispatcher_backend
        apply_flex_dispatcher_backend(cfg.model, cfg.model.moe_flex_dispatcher_backend)
    if base_cfg.pp_layout:
        cfg.model.pipeline_model_parallel_layout = base_cfg.pp_layout
    else:
        cfg.model.pipeline_model_parallel_layout = _get_kimi_k2_pipeline_layout(
            base_cfg.pipeline_model_parallel_size,
            base_cfg.virtual_pipeline_model_parallel_size,
        )
    set_kimi_k2_common_configs(cfg)
    set_workload_base_configs(cfg, base_cfg)
    cfg.comm_overlap.overlap_grad_reduce = True
    return cfg
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/performance/configs/kimi/kimi_llm_pretrain.py` around lines 140 - 177, The kimi_k2_pretrain_config_vr200 function duplicates GPU-specific setup logic found in other factory functions; extract a shared builder (e.g., build_kimi_k2_pretrain_config) that accepts GPU-specific params (gpu name, base_cfg) and performs the common steps: call get_workload_base_config, create cfg via pretrain_config, attach mixed precision via get_precision_config, conditionally set cfg.model.moe_flex_dispatcher_backend and call apply_flex_dispatcher_backend, compute or assign pipeline layout (using _get_kimi_k2_pipeline_layout when base_cfg.pp_layout is empty), call set_kimi_k2_common_configs and set_workload_base_configs, and set cfg.comm_overlap.overlap_grad_reduce; then refactor kimi_k2_pretrain_config_vr200 to call this shared builder with vr200-specific args so code paths in functions like kimi_k2_pretrain_config_vr200, apply_flex_dispatcher_backend, set_kimi_k2_common_configs, and set_workload_base_configs remain consistent.
ℹ️ Review info

- Configuration used: `.coderabbit.yaml`
- Review profile: CHILL
- Plan: Pro Plus
- Run ID: b92e10ce-07f1-4bfa-b898-72b3c33f8bce
📒 Files selected for processing (14)

- scripts/performance/configs/gpt_oss/__init__.py
- scripts/performance/configs/gpt_oss/gpt_oss_llm_pretrain.py
- scripts/performance/configs/gpt_oss/gpt_oss_workload_base_configs.py
- scripts/performance/configs/kimi/__init__.py
- scripts/performance/configs/kimi/kimi_llm_pretrain.py
- scripts/performance/configs/kimi/kimi_workload_base_configs.py
- scripts/performance/configs/llama/__init__.py
- scripts/performance/configs/llama/llama31_llm_pretrain.py
- scripts/performance/configs/llama/llama31_workload_base_configs.py
- scripts/performance/configs/llama/llama3_llm_pretrain.py
- scripts/performance/configs/llama/llama3_workload_base_configs.py
- scripts/performance/configs/qwen/__init__.py
- scripts/performance/configs/qwen/qwen3_llm_pretrain.py
- scripts/performance/configs/qwen/qwen3_workload_base_configs.py
```python
def gpt_oss_120b_pretrain_config_vr200(
    precision: str = "bf16", mock: bool = True, config_variant: str = "v1"
) -> ConfigContainer:
    """VR200, baseline config."""
    base_cfg = get_workload_base_config(
        model_family_name="gpt_oss",
        model_recipe_name="gpt_oss_120b",
        gpu="vr200",
        compute_dtype=precision.upper(),
        task="pretrain",
        config_variant=config_variant,
    )
    precision_config = get_precision_config(precision)

    cfg = gpt_oss_120b_pretrain_config()
    cfg.mixed_precision = precision_config
    if base_cfg.moe_flex_dispatcher_backend is not None:
        apply_flex_dispatcher_backend(cfg.model, base_cfg.moe_flex_dispatcher_backend)
    cfg.comm_overlap = CommOverlapConfig(tp_comm_overlap=bool(base_cfg.tensor_model_parallel_size > 1))
    cfg.comm_overlap.tp_comm_overlap = False if precision == "nvfp4" else cfg.comm_overlap.tp_comm_overlap
    set_gpt_oss_common_configs(cfg)
    set_workload_base_configs(cfg, base_cfg)

    return cfg
```
Default config_variant="v1" breaks VR200 lookup.
At Line 90, the default variant is "v1", but VR200 workload entries in scripts/performance/configs/gpt_oss/gpt_oss_workload_base_configs.py are V2-only. A default call will fail config resolution.
💡 Proposed fix

```diff
 def gpt_oss_120b_pretrain_config_vr200(
-    precision: str = "bf16", mock: bool = True, config_variant: str = "v1"
+    precision: str = "bf16", mock: bool = True, config_variant: str = "v2"
 ) -> ConfigContainer:
```

As per coding guidelines, "Do not add arbitrary defaults for configs; be as explicit as possible."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scripts/performance/configs/gpt_oss/gpt_oss_llm_pretrain.py` around lines 89
- 113, The function gpt_oss_120b_pretrain_config_vr200 currently defaults
config_variant="v1" which fails VR200 workload lookup; update the function
signature in gpt_oss_120b_pretrain_config_vr200 to use config_variant="v2" (or
remove the default and require the caller to pass the explicit variant), and
ensure any internal uses of config_variant (calls to get_workload_base_config)
continue to pass the corrected value so VR200 entries in
gpt_oss_workload_base_configs.py resolve correctly.
```python
from .llama31_llm_pretrain import (
    llama31_405b_pretrain_config_b200,
    llama31_405b_pretrain_config_b300,
    llama31_405b_pretrain_config_gb200,
    llama31_405b_pretrain_config_gb300,
    llama31_405b_pretrain_config_h100,
    llama31_405b_pretrain_config_vr200,
)
```
Fix CI-blocking F401 on llama31_llm_pretrain imports.
These imports are being flagged unused by Flake8 in the current re-export pattern, which will fail lint in CI.
🛠️ Minimal lint fix

```diff
-from .llama31_llm_pretrain import (
+from .llama31_llm_pretrain import (  # noqa: F401
     llama31_405b_pretrain_config_b200,
     llama31_405b_pretrain_config_b300,
     llama31_405b_pretrain_config_gb200,
     llama31_405b_pretrain_config_gb300,
     llama31_405b_pretrain_config_h100,
     llama31_405b_pretrain_config_vr200,
 )
```

🧰 Tools
🪛 Flake8 (7.3.0)

- [error] 36-36: '.llama31_llm_pretrain.llama31_405b_pretrain_config_b200' imported but unused (F401)
- [error] 36-36: '.llama31_llm_pretrain.llama31_405b_pretrain_config_b300' imported but unused (F401)
- [error] 36-36: '.llama31_llm_pretrain.llama31_405b_pretrain_config_gb200' imported but unused (F401)
- [error] 36-36: '.llama31_llm_pretrain.llama31_405b_pretrain_config_gb300' imported but unused (F401)
- [error] 36-36: '.llama31_llm_pretrain.llama31_405b_pretrain_config_h100' imported but unused (F401)
- [error] 36-36: '.llama31_llm_pretrain.llama31_405b_pretrain_config_vr200' imported but unused (F401)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scripts/performance/configs/llama/__init__.py` around lines 36 - 43, The
imported symbols (llama31_405b_pretrain_config_b200,
llama31_405b_pretrain_config_b300, llama31_405b_pretrain_config_gb200,
llama31_405b_pretrain_config_gb300, llama31_405b_pretrain_config_h100,
llama31_405b_pretrain_config_vr200) are flagged as unused (F401); to fix,
explicitly export them by adding an __all__ list that includes each of those
names in the module that currently imports them, or alternatively reference them
in a re-exporting statement so ruff/flake8 recognizes they are intentionally
exposed; update the module's top-level exports accordingly and run ruff
check/format to ensure the lint error is resolved.
```python
def llama3_70b_pretrain_config_vr200(
    precision: str = "bf16", mock: bool = True, config_variant: str = "v1"
) -> ConfigContainer:
    """VR200, baseline config."""
    base_cfg = get_workload_base_config(
        model_family_name="llama",
        model_recipe_name="llama3_70b",
        gpu="vr200",
        compute_dtype=precision.upper(),
        task="pretrain",
        config_variant=config_variant,
    )
    precision_config = get_precision_config(precision)

    if precision == "bf16":
        comm_overlap_cfg = userbuffers_bf16_b200_h8192_tp2_mbs1_seqlen8192
    else:
        comm_overlap_cfg = userbuffers_fp8_b200_h8192_tp2_mbs1_seqlen8192

    cfg = llama3_70b_pretrain_config()
    cfg.mixed_precision = precision_config
    set_llama3_common_configs(cfg)
    set_workload_base_configs(cfg, base_cfg)

    if cfg.ddp.use_megatron_fsdp:
        cfg.ddp.fsdp_double_buffer = True
        cfg.model.gradient_accumulation_fusion = False  # Disabled to avoid functional errors
        cfg.ddp.suggested_communication_unit_size = 800000000

    cfg.comm_overlap.tp_comm_overlap_cfg = comm_overlap_cfg
    cfg.comm_overlap.tp_comm_overlap = False if precision == "nvfp4" else cfg.comm_overlap.tp_comm_overlap

    return cfg
```
Default VR200 config variant should not be v1.
At Line 121, defaulting config_variant to "v1" conflicts with the newly added VR200 workload presets (V2-only), so default calls will fail.
💡 Proposed fix

```diff
 def llama3_70b_pretrain_config_vr200(
-    precision: str = "bf16", mock: bool = True, config_variant: str = "v1"
+    precision: str = "bf16", mock: bool = True, config_variant: str = "v2"
 ) -> ConfigContainer:
```

As per coding guidelines, "Do not add arbitrary defaults for configs; be as explicit as possible."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scripts/performance/configs/llama/llama3_llm_pretrain.py` around lines 120 -
153, The function llama3_70b_pretrain_config_vr200 currently defaults
config_variant="v1", which conflicts with VR200 V2-only presets; remove the
arbitrary default so callers must pass an explicit config_variant (change the
signature to config_variant: str with no default), update the function docstring
to mention config_variant is required, and keep the rest of the logic
(get_workload_base_config(..., config_variant=config_variant)) unchanged so
callers must opt into "v2" or other variants explicitly.
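The "no default" variant of the fix can be sketched like this. This is a hypothetical illustration, not the repository code: the real function returns a `ConfigContainer`, so a plain dict stands in for it here. Making `config_variant` a required keyword-only parameter means a call that omits it fails immediately with a `TypeError` instead of resolving a stale `"v1"` default.

```python
# Hypothetical sketch: config_variant becomes a required keyword-only argument
# so callers must opt into a variant explicitly (no arbitrary default).
def llama3_70b_pretrain_config_vr200(
    *, precision: str = "bf16", mock: bool = True, config_variant: str
) -> dict:
    """VR200 baseline config; config_variant is required (e.g. "v2")."""
    # Stand-in body; the real function builds and returns a ConfigContainer.
    return {"precision": precision, "mock": mock, "config_variant": config_variant}

print(llama3_70b_pretrain_config_vr200(config_variant="v2")["config_variant"])

try:
    llama3_70b_pretrain_config_vr200()  # forgetting the variant fails loudly
except TypeError:
    print("TypeError: config_variant is required")
```

Compared with switching the default to `"v2"`, this shape surfaces the choice at every call site, which matches the cited guideline more closely.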
```python
def llama31_405b_pretrain_config_vr200(
    precision: str = "bf16", mock: bool = True, config_variant: str = "v1"
) -> ConfigContainer:
    """VR200, baseline config."""
    base_cfg = get_workload_base_config(
        model_family_name="llama",
        model_recipe_name="llama31_405b",
        gpu="vr200",
        compute_dtype=precision.upper(),
        task="pretrain",
        config_variant=config_variant,
    )
    precision_config = get_precision_config(precision)

    if precision == "bf16":
        comm_overlap_cfg = userbuffers_bf16_b200_h16384_tp4_cp2_mbs1_seqlen8192
    else:
        comm_overlap_cfg = userbuffers_fp8_b200_h16384_tp4_cp2_mbs1_seqlen8192

    cfg = llama31_405b_pretrain_config()
    cfg.mixed_precision = precision_config
    set_llama31_common_configs(cfg)
    set_workload_base_configs(cfg, base_cfg)

    if cfg.ddp.use_megatron_fsdp:
        cfg.ddp.fsdp_double_buffer = True
        cfg.model.gradient_accumulation_fusion = False  # Disabled to avoid functional errors
        cfg.ddp.num_distributed_optimizer_instances = 2

    cfg.comm_overlap.tp_comm_overlap_cfg = comm_overlap_cfg
    cfg.comm_overlap.tp_comm_overlap = False if precision == "nvfp4" else cfg.comm_overlap.tp_comm_overlap

    return cfg
```
VR200 function default variant is incompatible with available presets.
At Line 116, config_variant defaults to "v1", but VR200 presets for this model are introduced as V2-only. Default calls will fail workload-base lookup.
💡 Proposed fix

```diff
 def llama31_405b_pretrain_config_vr200(
-    precision: str = "bf16", mock: bool = True, config_variant: str = "v1"
+    precision: str = "bf16", mock: bool = True, config_variant: str = "v2"
 ) -> ConfigContainer:
```

As per coding guidelines, "Do not add arbitrary defaults for configs; be as explicit as possible."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scripts/performance/configs/llama/llama31_llm_pretrain.py` around lines 115 -
148, The function llama31_405b_pretrain_config_vr200 currently defaults
config_variant to "v1" which is incompatible with VR200 presets; change the
config_variant default to "v2" (or remove the default and require callers to
pass the variant) and ensure the get_workload_base_config(...) call uses that
corrected value so workload-base lookup succeeds; optionally add a simple
validation in llama31_405b_pretrain_config_vr200 to raise a clear error if an
unsupported variant (e.g., "v1") is passed.
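The optional validation the prompt mentions could look like the sketch below. This is an assumption-laden illustration, not the repository's code: the supported set (`"v2"` only) and the helper name `validate_vr200_config_variant` are taken from the review comment, not from the actual presets table. The benefit is that an unsupported variant fails with a readable `ValueError` at the factory boundary rather than deep inside the workload-base lookup.

```python
# Assumed from the review comment: VR200 presets exist only for "v2".
SUPPORTED_VR200_VARIANTS = {"v2"}

def validate_vr200_config_variant(config_variant: str) -> str:
    """Raise a clear error for variants with no VR200 workload-base entry."""
    if config_variant not in SUPPORTED_VR200_VARIANTS:
        raise ValueError(
            f"Unsupported VR200 config_variant {config_variant!r}; "
            f"expected one of {sorted(SUPPORTED_VR200_VARIANTS)}"
        )
    return config_variant

print(validate_vr200_config_variant("v2"))
```

Called as the first line of `llama31_405b_pretrain_config_vr200`, this would turn a cryptic lookup failure for `"v1"` into an actionable message.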
/claude review
LGTM |
What does this PR do ?
Adds support for the VR200 system for the following models: Llama 3 70B, Llama 3.1 405B, Qwen3 30B-A3B, Qwen3 235B-A22B, GPT-OSS 120B, and Kimi K2.
Changelog
GitHub Actions CI
See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.
Before your PR is "Ready for review"
Pre checks:
If you haven't finished some of the above items you can still open "Draft" PR.
Additional Information
Summary by CodeRabbit
New Features