
Fix: Handle FusedLAMB Fails with LowLevelZeroPlugin When Using CPU Offload#6418

Open
Truong5724 wants to merge 3 commits into hpcaitech:main from Truong5724:Truong-Fix-FusedLAMB

Conversation

@Truong5724

📌 Checklist before creating the PR

  • I have created an issue for this PR for traceability
  • The title follows the standard format: [doc/gemini/tensor/...]: A concise description
  • I have added relevant tags if possible for us to better distinguish different PRs
  • I have installed pre-commit: pip install pre-commit && pre-commit install

🚨 Issue number

Fixed #6401

📝 What does this PR do?

Problem

  • When using LowLevelZeroPlugin with CPU offload and the FusedLAMB optimizer, training fails during the optimizer step with the following error:
RuntimeError: expected input to be on cuda
  • The error occurs inside multi_tensor_applier during gradient norm computation:
g_norm_32 = multi_tensor_applier(
    self.multi_tensor_l2norm,
    self._dummy_overflow_buf,
    [g_all_32],
    False
)[0]
  • This issue only appears when CPU offload is enabled. Under normal GPU execution, FusedLAMB works correctly.

  • The root cause is that when CPU offload is used, optimizer states and gradients may reside on CPU, while FusedLAMB relies on fused CUDA kernels that require all input tensors to be on GPU.

  • However, multi_tensor_applier does not enforce or validate device placement before launching CUDA kernels, leading to a device mismatch error at runtime.
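The failure mode described above can be illustrated without CUDA or apex. The sketch below uses a hypothetical `FakeTensor` stand-in for `torch.Tensor` and a `fused_l2norm` mock that rejects non-CUDA inputs, mimicking how the fused kernel behind `multi_tensor_l2norm` fails at runtime (all names here are illustrative, not apex APIs):

```python
from dataclasses import dataclass

# Hypothetical stand-in for torch.Tensor, tracking only its device.
@dataclass
class FakeTensor:
    device: str  # "cpu" or "cuda"

def fused_l2norm(tensor_lists):
    """Mock of a fused CUDA kernel: any non-CUDA input raises,
    which is the behavior the PR description reports."""
    for tensor_list in tensor_lists:
        for t in tensor_list:
            if t.device != "cuda":
                raise RuntimeError("expected input to be on cuda")
    return 0.0

# With CPU offload enabled, some gradients reside on the CPU.
grads = [FakeTensor("cpu"), FakeTensor("cuda")]

err = None
try:
    fused_l2norm([grads])
except RuntimeError as e:
    err = e
print(err)  # -> expected input to be on cuda
```

The mock raises as soon as it sees the CPU-resident gradient, matching the reported error.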


Solution

  • To remain compatible with CPU offload, this PR moves any CPU-resident tensor passed to multi_tensor_applier to CUDA before kernel execution.

  • Updated MultiTensorApply.__call__ in multi_tensor_apply.py:

def __call__(self, op, noop_flag_buffer, tensor_lists, *args):
    self.check_avail()

    # Move tensors to GPU if not already on GPU
    for i, tensor_list in enumerate(tensor_lists):
        for j, tensor in enumerate(tensor_list):
            if tensor.device.type == 'cpu':
                tensor_lists[i][j] = tensor.to('cuda')

    return op(self.chunk_size, noop_flag_buffer, tensor_lists, *args)
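The effect of the patch can be exercised with a pure-Python mock of the same control flow (no torch required; `FakeTensor`, its `to()` method, and `fake_op` are hypothetical stand-ins for `torch.Tensor.to` and a fused kernel):

```python
from dataclasses import dataclass

@dataclass
class FakeTensor:
    device: str

    def to(self, device):
        # Stand-in for torch.Tensor.to: returns a copy on the target device.
        return FakeTensor(device)

def patched_call(op, noop_flag_buffer, tensor_lists, *args):
    # Mirrors the PR's fix: migrate CPU tensors into the lists in place
    # before handing them to the fused kernel.
    for i, tensor_list in enumerate(tensor_lists):
        for j, tensor in enumerate(tensor_list):
            if tensor.device == "cpu":
                tensor_lists[i][j] = tensor.to("cuda")
    return op(noop_flag_buffer, tensor_lists, *args)

def fake_op(noop_flag_buffer, tensor_lists):
    # Mock kernel: succeeds only if every input tensor is on CUDA.
    assert all(t.device == "cuda" for tl in tensor_lists for t in tl)
    return "ok"

lists = [[FakeTensor("cpu"), FakeTensor("cuda")]]
result = patched_call(fake_op, None, lists)
print(result)              # -> ok
print(lists[0][0].device)  # -> cuda
```

Note that `tensor.to('cuda')` produces a copy, so the in-place list update is what makes the kernel see the migrated tensor; the original CPU tensor is left untouched.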

Implications

  • Ensures FusedLAMB works correctly with LowLevelZeroPlugin when CPU offload is enabled.
  • Prevents runtime device mismatch errors in fused CUDA kernels.
  • Introduces a device transfer step when necessary (CPU → GPU).

Verification

  • Training runs successfully with:
    • LowLevelZeroPlugin + CPU offload.
    • FusedLAMB optimizer.
  • No more expected input to be on cuda errors.
  • Behavior remains unchanged for pure GPU execution.

💥 Checklist before requesting a review

  • I have linked my PR to an issue (instruction)
  • My issue clearly describes the problem/feature/proposal, with diagrams/charts/table/code if possible
  • I have performed a self-review of my code
  • I have added thorough tests.
  • I have added docstrings for all the functions/methods I implemented

⭐️ Do you enjoy contributing to Colossal-AI?

  • 🌝 Yes, I do.
  • 🌚 No, I don't.

@Truong5724 Truong5724 requested a review from a team as a code owner April 15, 2026 11:59


Development

Successfully merging this pull request may close these issues.

[BUG]: FusedLAMB Fails with LowLevelZeroPlugin When Using Small initial_scale and CPU Offload
