ci: add GB200 functional test infrastructure #3365

Draft
ko3n1g wants to merge 15 commits into main from ko3n1g/feat/gb200-functional-tests

ko3n1g commented Apr 17, 2026

Summary

  • Restructures launch_scripts/ from <gpu_type>[-flaky]/ to <gpu_type>/<active|flaky>/ so all hardware targets share the same top-level directory
  • Adds h100/active/, h100/flaky/, gb200/active/, gb200/flaky/ — scripts moved accordingly
  • Extends cicd-main.yml with a parallel GB200 CI track: generate-gb200-test-matrix / cicd-functional-tests-gb200-l{0,1,2} / cicd-functional-tests-gb200-flaky, running on nemo-ci-gcp-gpu-x2
  • Updates test-template/action.yml default script_dir to h100/active
  • Moves L0_Launch_training (h100 + gb200) and L2_Launch_recipes_llama_cuda_graphs (h100 + gb200) to flaky/
  • Adds skills/test-system/SKILL.md documenting CI test layout, tier semantics, script conventions, and flaky handling; trims redundant content from skills/developer-guide/SKILL.md

Directory layout

tests/functional_tests/launch_scripts/
├── h100/
│   ├── active/   ← H100 tests (run normally)
│   └── flaky/    ← H100 tests known to be flaky
└── gb200/
    ├── active/   ← GB200 tests (run normally)
    └── flaky/    ← GB200 tests known to be flaky
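With this layout, the tests for one hardware target and tier can be read straight off the directory tree. The following is a minimal sketch of that idea, not the matrix generation used in cicd-main.yml: `list_tier_scripts` is a hypothetical helper, and it assumes scripts are named `<tier>_<name>.sh`, as in `L0_Launch_training`.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the PR): print the script names, without
# the .sh suffix, for one GPU type, status (active|flaky), and tier (L0|L1|L2).
list_tier_scripts() {
  local root="$1" gpu="$2" status="$3" tier="$4"
  local dir="${root}/${gpu}/${status}"
  local script

  # A missing directory simply yields an empty list.
  [ -d "$dir" ] || return 0

  for script in "$dir/${tier}_"*.sh; do
    # If the glob matched nothing, the pattern is left as-is; skip it.
    [ -e "$script" ] || continue
    basename "$script" .sh
  done
}
```

For example, `list_tier_scripts tests/functional_tests/launch_scripts gb200 active L0` would print one L0 script name per line for the GB200 active set.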

Tier semantics

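| Tier  | Trigger                                          | Blocking             |
|-------|--------------------------------------------------|----------------------|
| L0    | Every PR, push to main, schedule                 | Yes                  |
| L1    | Push to main, schedule, needs-more-tests label   | Yes                  |
| L2    | Schedule and workflow_dispatch                   | Yes (when triggered) |
| flaky | workflow_dispatch with test_suite=all            | No                   |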
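The trigger column can be read as a small decision function. The sketch below encodes the table literally and is only an illustration: `tiers_for_event` and its arguments are hypothetical, the event names follow GitHub Actions conventions, and it assumes the needs-more-tests label applies to pull requests. The real conditions live in cicd-main.yml and may differ, for example in whether workflow_dispatch also runs L0 and L1.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the tier table (not the workflow's actual logic).
# Arguments: event name, git ref, "true" if the PR carries the
# needs-more-tests label, and the test_suite input of workflow_dispatch.
tiers_for_event() {
  local event="$1" ref="${2:-}" has_label="${3:-false}" test_suite="${4:-}"
  local tiers=""

  case "$event" in
    pull_request)
      tiers="L0"
      if [ "$has_label" = "true" ]; then tiers="$tiers L1"; fi
      ;;
    push)
      # Only pushes to main trigger tests in this reading of the table.
      if [ "$ref" = "refs/heads/main" ]; then tiers="L0 L1"; fi
      ;;
    schedule)
      tiers="L0 L1 L2"
      ;;
    workflow_dispatch)
      tiers="L2"
      if [ "$test_suite" = "all" ]; then tiers="$tiers flaky"; fi
      ;;
  esac

  echo "$tiers"
}
```

Under these assumptions, `tiers_for_event workflow_dispatch "" false all` prints `L2 flaky`, and only the flaky part of that run is non-blocking.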

Example: marking a test flaky on GB200 only

git mv tests/functional_tests/launch_scripts/gb200/active/L0_Launch_recipes_llama_1b.sh \
       tests/functional_tests/launch_scripts/gb200/flaky/

H100 continues to run h100/active/L0_Launch_recipes_llama_1b.sh unaffected.

Test plan

  • Trigger workflow_dispatch (test_suite=all) — confirm H100 and GB200 flaky jobs pick up the moved scripts
  • Verify L0/L1/L2 active matrices no longer contain L0_Launch_training or L2_Launch_recipes_llama_cuda_graphs
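The second check can also be approximated locally before triggering CI. This is a rough sketch under the directory layout described above; `is_flaky_only` and `check_moved` are hypothetical helpers, and they inspect only the file tree, not the matrices the workflow actually generates.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if <name>.sh exists under <gpu>/flaky/
# and no longer exists under <gpu>/active/.
is_flaky_only() {
  local root="$1" gpu="$2" name="$3"
  [ -f "${root}/${gpu}/flaky/${name}.sh" ] && [ ! -e "${root}/${gpu}/active/${name}.sh" ]
}

# Hypothetical helper: check every given script name on both hardware targets.
# Returns non-zero and reports each offender on stderr if any check fails.
check_moved() {
  local root="$1"
  shift
  local gpu name status=0
  for gpu in h100 gb200; do
    for name in "$@"; do
      if ! is_flaky_only "$root" "$gpu" "$name"; then
        echo "not flaky-only: ${gpu}/${name}" >&2
        status=1
      fi
    done
  done
  return "$status"
}
```

Run from the repository root, this might look like `check_moved tests/functional_tests/launch_scripts L0_Launch_training L2_Launch_recipes_llama_cuda_graphs`.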

🤖 Generated with Claude Code

Rename launch_scripts/{active → h100}/ and {flaky → h100-flaky}/ so all
directories are named after their target hardware. Add a parallel GB200
track that runs the same tests on {runner_prefix}-gb200-x2 runners.

- launch_scripts/gb200/: thin wrapper scripts that exec into h100/; one
  per h100/ script for full L0/L1/L2 parity at launch
- launch_scripts/gb200-flaky/: empty placeholder; move a GB200 wrapper
  here when it breaks on GB200 but not H100
- cicd-main.yml: generate-gb200-test-matrix job, three
  cicd-functional-tests-gb200-l{0,1,2} jobs and a gb200-flaky job using
  {runner_prefix}-gb200-x2; all gated on vars.GB200_RUNNER_PREFIX being
  set so environments without GB200 runners skip cleanly
- configure: propagates expect_gb200_l{0,1,2} outputs; Nemo_CICD_Test
  validates them the same way as H100 tiers
- test-template action: default script_dir updated from "active" to "h100"

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Signed-off-by: oliver könig <[email protected]>

copy-pr-bot bot commented Apr 17, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

ko3n1g commented Apr 17, 2026

/ok to test

The variable gate caused GB200 jobs to be silently skipped when
GB200_RUNNER_PREFIX was not set as a repo variable. Since GB200 should
always run with the same trigger conditions as H100 (using the same
runner_prefix but -gb200-x2 suffix), remove the gate entirely.

Also simplify Nemo_CICD_Test: GB200 skip-checks reuse EXPECT_L{0,1,2}
rather than a separate EXPECT_GB200_L{0,1,2} set.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Signed-off-by: oliver könig <[email protected]>

ko3n1g commented Apr 17, 2026

/ok to test

ko3n1g commented Apr 17, 2026

Updated PR description and workflow to use the literal runner label nemo-ci-gcp-gpu-x2 for all GB200 jobs (replaces the previous {runner_prefix}-gb200-x2 template). The GB200_RUNNER_PREFIX prerequisite no longer applies.

ko3n1g commented Apr 17, 2026

/ok to test

ko3n1g commented Apr 17, 2026

/ok to test

ko3n1g commented Apr 17, 2026

/ok to test

ko3n1g commented Apr 17, 2026

/ok to test 4ee0e61

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Signed-off-by: oliver könig <[email protected]>

ko3n1g commented Apr 17, 2026

/ok to test adf09fe

…ility

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Signed-off-by: oliver könig <[email protected]>

ko3n1g commented Apr 17, 2026

/ok to test a6c0a9a

ko3n1g commented Apr 17, 2026

/ok to test 417129d

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Signed-off-by: oliver könig <[email protected]>

ko3n1g commented Apr 17, 2026

/ok to test 52b3779

thomasdhc previously approved these changes Apr 17, 2026

ko3n1g commented Apr 17, 2026

/ok to test

2 participants