fix: SVD convergence crash via CPU LAPACK fallback by umran666 · Pull Request #393 · p-e-w/heretic

umran666 · 2026-06-23T03:17:34Z

Fixes #345.

torch.svd_lowrank internally calls torch.linalg.svd on the GPU, where the cuSOLVER driver's iterative algorithm can fail to converge on certain matrices, crashing the entire study with LinAlgError. This wraps the call in a try...except block and falls back to exact CPU LAPACK SVD via LA.svd, which handles the same decomposition without issue.

gemini-code-assist

Code Review

This pull request introduces a fallback mechanism for low-rank SVD computation. If torch.svd_lowrank fails to converge due to an ill-conditioned matrix, the code now catches torch.linalg.LinAlgError and falls back to exact SVD on the CPU. The reviewer suggested using the existing LA alias for torch.linalg to maintain consistency and adding .detach() before moving the tensor to the CPU to prevent potential autograd issues.

gemini-code-assist · 2026-06-23T03:18:49Z

+                        except torch.linalg.LinAlgError:
+                            # SVD failed to converge (usually because Optuna tested an extreme
+                            # hyperparameter combination, resulting in an ill-conditioned matrix).
+                            # Fall back to CPU LAPACK exact SVD.
+                            U, S, Vh = torch.linalg.svd(W.cpu(), full_matrices=False)
+                            U = U[:, :r].to(W.device)
+                            S = S[:r].to(W.device)
+                            Vh = Vh[:r, :].to(W.device)


For better consistency and defensive programming, consider making the following improvements:

Consistency: The file already imports torch.linalg as LA (on line 11) and uses it elsewhere (e.g., LA.vector_norm on line 562). Using LA.LinAlgError and LA.svd keeps the codebase consistent.

Defensive Programming: If W happens to have requires_grad=True (for example, if the base model is not fully frozen in a custom setup), calling torch.linalg.svd on it makes autograd track the decomposition, which the fallback does not need and which costs extra memory and time. Calling .detach() before moving to CPU keeps the fallback out of the autograd graph and avoids that overhead.

Suggested change

-                        except torch.linalg.LinAlgError:
-                            # SVD failed to converge (usually because Optuna tested an extreme
-                            # hyperparameter combination, resulting in an ill-conditioned matrix).
-                            # Fall back to CPU LAPACK exact SVD.
-                            U, S, Vh = torch.linalg.svd(W.cpu(), full_matrices=False)
-                            U = U[:, :r].to(W.device)
-                            S = S[:r].to(W.device)
-                            Vh = Vh[:r, :].to(W.device)
+                        except LA.LinAlgError:
+                            # SVD failed to converge (usually because Optuna tested an extreme
+                            # hyperparameter combination, resulting in an ill-conditioned matrix).
+                            # Fall back to CPU LAPACK exact SVD.
+                            U, S, Vh = LA.svd(W.detach().cpu(), full_matrices=False)
+                            U = U[:, :r].to(W.device)
+                            S = S[:r].to(W.device)
+                            Vh = Vh[:r, :].to(W.device)
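
For readers who want to try the suggested pattern in isolation, here is a minimal, self-contained sketch. It is not the code from the pull request: the function name `truncated_svd_with_exact_fallback`, the `primary` parameter, and the `always_fails` stand-in are all hypothetical, added only so the fallback path can be exercised without a GPU. It also assumes the primary call is `torch.svd_lowrank(W, q=r)`, which returns `V` rather than `Vh`.

```python
import torch
import torch.linalg as LA


def truncated_svd_with_exact_fallback(W, r, primary=torch.svd_lowrank):
    """Return a rank-r (U, S, Vh); fall back to exact CPU SVD on LinAlgError.

    `primary` is a hypothetical injection point for testing; it is not
    part of the pull request.
    """
    try:
        # torch.svd_lowrank returns V, not Vh, so take the conjugate transpose.
        U, S, V = primary(W, q=r)
        Vh = V.mH
    except LA.LinAlgError:
        # Detach first so the fallback is not recorded by autograd.
        U, S, Vh = LA.svd(W.detach().cpu(), full_matrices=False)
        # Keep the leading r components and move them back to W's device.
        U = U[:, :r].to(W.device)
        S = S[:r].to(W.device)
        Vh = Vh[:r, :].to(W.device)
    return U, S, Vh


def always_fails(W, q):
    # Stand-in for a GPU solver that does not converge.
    raise LA.LinAlgError("simulated convergence failure")


torch.manual_seed(0)
# An exactly rank-2 matrix that tracks gradients, to exercise .detach().
W = torch.randn(8, 2, dtype=torch.float64) @ torch.randn(2, 6, dtype=torch.float64)
W.requires_grad_()
U, S, Vh = truncated_svd_with_exact_fallback(W, r=2, primary=always_fails)
```

Because the input is detached before the exact SVD, the returned factors do not require gradients even when `W` does.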

p-e-w · 2026-06-23T06:24:51Z

torch.svd_lowrank crashes with LinAlgError when Optuna tests extreme hyperparameter combinations that produce ill-conditioned weight matrices.

Are you sure? There is absolutely nothing "extreme" about the parameter values from #345, and I fail to see how any combination of values from our ranges could ever produce an ill-conditioned matrix.

umran666 · 2026-06-23T07:38:25Z

There is absolutely nothing "extreme" about the parameter values from #345, and I fail to see how any combination of values from our ranges could ever produce an ill-conditioned matrix.

You're right — the parameters aren't extreme at all, and my description was wrong. The actual root cause is the cuSOLVER driver's iterative SVD algorithm failing to converge on certain matrices, which the traceback itself confirms: "During SVD computation with the selected cusolver driver, batches 0 failed to converge." This is independent of the hyperparameter values. The CPU LAPACK implementation handles the same matrix without issue. I've corrected the code comment accordingly.

p-e-w · 2026-06-23T08:34:20Z

failing to converge on certain matrices

What are those "certain matrices", and how can this be reproduced on a smaller model?

umran666 · 2026-06-23T09:27:46Z

The failure isn't in W itself — it's in the intermediate matrix B = Q.mH @ A that svd_lowrank constructs internally via randomized projection. The cuSOLVER Jacobi solver can fail to converge on that B when it has clustered singular values, which is a known limitation. It's hardware/driver-dependent (reporter had CUDA 12.8, driver 570.148.08), so I don't have a reliable way to reproduce it on a smaller model.
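
The role of the intermediate matrix B is easier to see in a simplified sketch of a randomized low-rank SVD. This is an illustration of the general technique (random projection, optional power iterations, then a dense SVD of the small projected matrix). It is not PyTorch's actual `svd_lowrank` source, and the function name `randomized_lowrank_svd` is hypothetical.

```python
import torch


def randomized_lowrank_svd(A, q, niter=2):
    """Simplified randomized SVD: project A onto q directions, then decompose."""
    n = A.shape[-1]
    # Random test matrix; Q becomes an orthonormal basis for an
    # approximate range of A.
    R = torch.randn(n, q, dtype=A.dtype, device=A.device)
    Q, _ = torch.linalg.qr(A @ R)
    for _ in range(niter):
        # Power iterations pull the basis toward the leading singular subspace.
        Q, _ = torch.linalg.qr(A.mH @ Q)
        Q, _ = torch.linalg.qr(A @ Q)
    # Small q x n matrix: this is the input the dense SVD solver actually sees,
    # so a convergence failure is raised for B rather than for A.
    B = Q.mH @ A
    U_b, S, Vh = torch.linalg.svd(B, full_matrices=False)
    # Lift the left singular vectors of B back to the original space.
    return Q @ U_b, S, Vh


torch.manual_seed(0)
# Exactly rank-3 input, so a rank-3 factorization reconstructs it.
A = torch.randn(20, 3, dtype=torch.float64) @ torch.randn(3, 15, dtype=torch.float64)
U, S, Vh = randomized_lowrank_svd(A, q=3)
```

Since R is drawn at random, B differs from run to run even for a fixed A, which is consistent with the difficulty of reproducing the failure described above.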

p-e-w · 2026-06-23T14:20:52Z

If the problem is with CUDA rather than with svd_lowrank, why isn't the solution to run svd_lowrank on the CPU?

umran666 · 2026-06-24T12:50:32Z

If the problem is with CUDA rather than with svd_lowrank, why isn't the solution to run svd_lowrank on the CPU?

Updated the fallback to run svd_lowrank on the CPU instead of switching to LA.svd.
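
A rough sketch of what such a fallback could look like follows. The updated diff is not shown in this thread, so this is an assumption about its shape rather than the committed code. The names `svd_lowrank_with_cpu_fallback`, `solver`, and `fails_once` are hypothetical, and `solver` exists only so the failure can be simulated on a machine without CUDA.

```python
import torch
import torch.linalg as LA


def svd_lowrank_with_cpu_fallback(W, r, solver=torch.svd_lowrank):
    """Run the low-rank SVD on W's device; retry on CPU if it fails to converge."""
    try:
        return solver(W, q=r)
    except LA.LinAlgError:
        # Same algorithm, but the inner dense SVD now runs through LAPACK.
        U, S, V = solver(W.detach().cpu(), q=r)
        return U.to(W.device), S.to(W.device), V.to(W.device)


calls = []


def fails_once(A, q):
    # Hypothetical stand-in: the first call simulates a convergence failure,
    # later calls delegate to the real torch.svd_lowrank.
    calls.append(A.device.type)
    if len(calls) == 1:
        raise LA.LinAlgError("simulated cuSOLVER convergence failure")
    return torch.svd_lowrank(A, q=q)


torch.manual_seed(0)
# Exactly rank-2 input, so q=2 is enough to reconstruct it.
W = torch.randn(8, 2, dtype=torch.float64) @ torch.randn(2, 6, dtype=torch.float64)
U, S, V = svd_lowrank_with_cpu_fallback(W, r=2, solver=fails_once)
```

Compared with the exact `LA.svd` fallback, this keeps the same return convention (`V` rather than `Vh`) and the same rank-q cost on both paths, so the caller does not need to slice the result.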

umran666 changed the title from "Fix SVD convergence crash via CPU LAPACK fallback" to "Fix : SVD convergence crash via CPU LAPACK fallback" on Jun 23, 2026

umran666 changed the title from "Fix : SVD convergence crash via CPU LAPACK fallback" to "fix: SVD convergence crash via CPU LAPACK fallback" on Jun 23, 2026

gemini-code-assist Bot reviewed Jun 23, 2026

umran666 force-pushed the fix-svd-fallback branch from c91e445 to dd66d77 on June 23, 2026 03:21

umran666 force-pushed the fix-svd-fallback branch from dd66d77 to 1e191ec on June 23, 2026 07:37

Commit 445134c: Fix SVD convergence crash via CPU LAPACK fallback

umran666 force-pushed the fix-svd-fallback branch from 1e191ec to 445134c on June 24, 2026 12:51

p-e-w mentioned this pull request Jun 25, 2026

I tried many times but couldn't get this tool to work. #345

Open

umran666 wants to merge 1 commit into p-e-w:master from umran666:fix-svd-fallback

2 participants
