
Fix: Handle FusedLAMB Fails with LowLevelZeroPlugin When Using CPU Offload#6418

Open
Truong5724 wants to merge 3 commits into hpcaitech:main from Truong5724:Truong-Fix-FusedLAMB

Conversation

@Truong5724

📌 Checklist before creating the PR

  • I have created an issue for this PR for traceability
  • The title follows the standard format: [doc/gemini/tensor/...]: A concise description
  • I have added relevant tags if possible for us to better distinguish different PRs
  • I have installed pre-commit: pip install pre-commit && pre-commit install

🚨 Issue number

Fixed #6401

📝 What does this PR do?

Problem

  • When using LowLevelZeroPlugin with CPU offload and the FusedLAMB optimizer, training fails during the optimizer step with the following error:
RuntimeError: expected input to be on cuda
  • The error occurs inside multi_tensor_applier during gradient norm computation:
g_norm_32 = multi_tensor_applier(
    self.multi_tensor_l2norm,
    self._dummy_overflow_buf,
    [g_all_32],
    False
)[0]
  • This issue only appears when CPU offload is enabled. Under normal GPU execution, FusedLAMB works correctly.

  • The root cause is that when CPU offload is used, optimizer states and gradients may reside on CPU, while FusedLAMB relies on fused CUDA kernels that require all input tensors to be on GPU.

  • However, multi_tensor_applier does not enforce or validate device placement before launching CUDA kernels, leading to a device mismatch error at runtime.
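The failure mode described above can be illustrated without CUDA or apex. The sketch below uses a hypothetical `FakeTensor` stand-in for `torch.Tensor` and a `fused_l2norm` mock that rejects non-CUDA inputs, mimicking how the fused kernel behind `multi_tensor_l2norm` fails at runtime (all names here are illustrative, not apex APIs):

```python
from dataclasses import dataclass

# Hypothetical stand-in for torch.Tensor, tracking only its device.
@dataclass
class FakeTensor:
    device: str  # "cpu" or "cuda"

def fused_l2norm(tensor_lists):
    """Mock of a fused CUDA kernel: any non-CUDA input raises,
    which is the behavior the PR description reports."""
    for tensor_list in tensor_lists:
        for t in tensor_list:
            if t.device != "cuda":
                raise RuntimeError("expected input to be on cuda")
    return 0.0

# With CPU offload enabled, some gradients reside on the CPU.
grads = [FakeTensor("cpu"), FakeTensor("cuda")]

err = None
try:
    fused_l2norm([grads])
except RuntimeError as e:
    err = e
print(err)  # -> expected input to be on cuda
```

The mock raises as soon as it sees the CPU-resident gradient, matching the reported error.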


Solution

  • To remain compatible with CPU offload, this PR moves any CPU-resident tensor passed to multi_tensor_applier to CUDA before kernel execution.

  • Updated MultiTensorApply.__call__ in multi_tensor_apply.py:

def __call__(self, op, noop_flag_buffer, tensor_lists, *args):
    self.check_avail()

    # Move tensors to GPU if not already on GPU
    for i, tensor_list in enumerate(tensor_lists):
        for j, tensor in enumerate(tensor_list):
            if tensor.device.type == 'cpu':
                tensor_lists[i][j] = tensor.to('cuda')

    return op(self.chunk_size, noop_flag_buffer, tensor_lists, *args)
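The effect of the patch can be exercised with a pure-Python mock of the same control flow (no torch required; `FakeTensor`, its `to()` method, and `fake_op` are hypothetical stand-ins for `torch.Tensor.to` and a fused kernel):

```python
from dataclasses import dataclass

@dataclass
class FakeTensor:
    device: str

    def to(self, device):
        # Stand-in for torch.Tensor.to: returns a copy on the target device.
        return FakeTensor(device)

def patched_call(op, noop_flag_buffer, tensor_lists, *args):
    # Mirrors the PR's fix: migrate CPU tensors into the lists in place
    # before handing them to the fused kernel.
    for i, tensor_list in enumerate(tensor_lists):
        for j, tensor in enumerate(tensor_list):
            if tensor.device == "cpu":
                tensor_lists[i][j] = tensor.to("cuda")
    return op(noop_flag_buffer, tensor_lists, *args)

def fake_op(noop_flag_buffer, tensor_lists):
    # Mock kernel: succeeds only if every input tensor is on CUDA.
    assert all(t.device == "cuda" for tl in tensor_lists for t in tl)
    return "ok"

lists = [[FakeTensor("cpu"), FakeTensor("cuda")]]
result = patched_call(fake_op, None, lists)
print(result)              # -> ok
print(lists[0][0].device)  # -> cuda
```

Note that `tensor.to('cuda')` produces a copy, so the in-place list update is what makes the kernel see the migrated tensor; the original CPU tensor is left untouched.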

Implications

  • Ensures FusedLAMB works correctly with LowLevelZeroPlugin when CPU offload is enabled.
  • Prevents runtime device mismatch errors in fused CUDA kernels.
  • Introduces a device transfer step when necessary (CPU → GPU).

Verification

  • Training runs successfully with:
    • LowLevelZeroPlugin + CPU offload.
    • FusedLAMB optimizer.
  • No more expected input to be on cuda errors.
  • Behavior remains unchanged for pure GPU execution.

💥 Checklist before requesting a review

  • I have linked my PR to an issue (instruction)
  • My issue clearly describes the problem/feature/proposal, with diagrams/charts/table/code if possible
  • I have performed a self-review of my code
  • I have added thorough tests.
  • I have added docstrings for all the functions/methods I implemented

⭐️ Do you enjoy contributing to Colossal-AI?

  • 🌝 Yes, I do.
  • 🌚 No, I don't.

@Truong5724 Truong5724 requested a review from a team as a code owner April 15, 2026 11:59


Development

Successfully merging this pull request may close these issues.

[BUG]: FusedLAMB Fails with LowLevelZeroPlugin When Using Small initial_scale and CPU Offload
