fix: SVD convergence crash via CPU LAPACK fallback #393

Open
umran666 wants to merge 1 commit into p-e-w:master from umran666:fix-svd-fallback

@umran666 umran666 commented Jun 23, 2026

Contributor

Fixes #345.

torch.svd_lowrank internally calls torch.linalg.svd on the GPU, where the cuSOLVER driver's iterative algorithm can fail to converge on certain matrices, crashing the entire study with LinAlgError. This wraps the call in a try...except block and falls back to exact CPU LAPACK SVD via LA.svd, which handles the same decomposition without issue.
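
A minimal sketch of the fallback described above, not the PR's actual diff: the helper name `low_rank_svd` and its arguments are hypothetical. One detail the sketch makes explicit is that `torch.svd_lowrank` returns `V` while `torch.linalg.svd` returns `Vh`, so the fallback result is converted to the same convention before it is returned.

```python
import torch
import torch.linalg as LA


def low_rank_svd(W: torch.Tensor, r: int):
    """Return (U, S, V) with W ~= U @ diag(S) @ V.mH, truncated to rank r.

    Hypothetical helper for illustration; not the code from the PR.
    """
    try:
        # Randomized low-rank SVD on whatever device W lives on.
        U, S, V = torch.svd_lowrank(W, q=r)
    except LA.LinAlgError:
        # Fallback: exact SVD on the CPU, then keep the top r components.
        U, S, Vh = LA.svd(W.detach().cpu(), full_matrices=False)
        U = U[:, :r].to(W.device)
        S = S[:r].to(W.device)
        # LA.svd returns Vh; convert to V to match svd_lowrank's convention.
        V = Vh[:r, :].mH.to(W.device)
    return U, S, V
```

Whether the calling code expects `V` or `Vh` depends on the surrounding code in `model.py`, which is not shown here; the point is only that both branches should hand back the same layout.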

@umran666 umran666 changed the title Fix SVD convergence crash via CPU LAPACK fallback Fix : SVD convergence crash via CPU LAPACK fallback Jun 23, 2026
@umran666 umran666 changed the title Fix : SVD convergence crash via CPU LAPACK fallback fix: SVD convergence crash via CPU LAPACK fallback Jun 23, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request introduces a fallback mechanism for low-rank SVD computation. If torch.svd_lowrank fails to converge due to an ill-conditioned matrix, the code now catches torch.linalg.LinAlgError and falls back to exact SVD on the CPU. The reviewer suggested using the existing LA alias for torch.linalg to maintain consistency and adding .detach() before moving the tensor to the CPU to prevent potential autograd issues.

Comment thread src/heretic/model.py Outdated
Comment on lines +600 to +607
except torch.linalg.LinAlgError:
    # SVD failed to converge (usually because Optuna tested an extreme
    # hyperparameter combination, resulting in an ill-conditioned matrix).
    # Fall back to CPU LAPACK exact SVD.
    U, S, Vh = torch.linalg.svd(W.cpu(), full_matrices=False)
    U = U[:, :r].to(W.device)
    S = S[:r].to(W.device)
    Vh = Vh[:r, :].to(W.device)

medium

For better consistency and defensive programming, consider making the following improvements:

  1. Consistency: The file already imports torch.linalg as LA (on line 11) and uses it elsewhere (e.g., LA.vector_norm on line 562). Using LA.LinAlgError and LA.svd keeps the codebase consistent.
  2. Defensive Programming: If W happens to have requires_grad=True (for example, if the base model is not fully frozen in a custom setup), calling torch.linalg.svd on CPU with a gradient-tracking tensor will raise a RuntimeError because SVD backward is not supported on CPU in PyTorch. Calling .detach() before moving to CPU avoids this issue entirely and prevents unnecessary autograd overhead.
Suggested change
- except torch.linalg.LinAlgError:
-     # SVD failed to converge (usually because Optuna tested an extreme
-     # hyperparameter combination, resulting in an ill-conditioned matrix).
-     # Fall back to CPU LAPACK exact SVD.
-     U, S, Vh = torch.linalg.svd(W.cpu(), full_matrices=False)
-     U = U[:, :r].to(W.device)
-     S = S[:r].to(W.device)
-     Vh = Vh[:r, :].to(W.device)
+ except LA.LinAlgError:
+     # SVD failed to converge (usually because Optuna tested an extreme
+     # hyperparameter combination, resulting in an ill-conditioned matrix).
+     # Fall back to CPU LAPACK exact SVD.
+     U, S, Vh = LA.svd(W.detach().cpu(), full_matrices=False)
+     U = U[:, :r].to(W.device)
+     S = S[:r].to(W.device)
+     Vh = Vh[:r, :].to(W.device)

@p-e-w

p-e-w commented Jun 23, 2026

Owner

> torch.svd_lowrank crashes with LinAlgError when Optuna tests extreme hyperparameter combinations that produce ill-conditioned weight matrices.

Are you sure? There is absolutely nothing "extreme" about the parameter values from #345, and I fail to see how any combination of values from our ranges could ever produce an ill-conditioned matrix.

@umran666

Contributor Author

> There is absolutely nothing "extreme" about the parameter values from #345, and I fail to see how any combination of values from our ranges could ever produce an ill-conditioned matrix.

You're right — the parameters aren't extreme at all, and my description was wrong. The actual root cause is the cuSOLVER driver's iterative SVD algorithm failing to converge on certain matrices, which the traceback itself confirms: "During SVD computation with the selected cusolver driver, batches 0 failed to converge." This is independent of the hyperparameter values. The CPU LAPACK implementation handles the same matrix without issue. I've corrected the code comment accordingly.

@p-e-w

p-e-w commented Jun 23, 2026

Owner

> failing to converge on certain matrices

What are those "certain matrices", and how can this be reproduced on a smaller model?

@umran666

umran666 commented Jun 23, 2026

Contributor Author

The failure isn't in W itself — it's in the intermediate matrix B = Q.mH @ A that svd_lowrank constructs internally via randomized projection. The cuSOLVER Jacobi solver can fail to converge on that B when it has clustered singular values, which is a known limitation. It's hardware/driver-dependent (reporter had CUDA 12.8, driver 570.148.08), so I don't have a reliable way to reproduce it on a smaller model.
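
To make the role of `B` concrete, here is a simplified sketch of a randomized low-rank SVD. It is not a copy of PyTorch's implementation (which may differ in details such as which side it projects on); the function name `randomized_svd_sketch` and its parameters are assumptions for illustration. It shows that the only dense SVD in the procedure is the one applied to the small projected matrix `B = Q.mH @ A`.

```python
import torch
import torch.linalg as LA


def randomized_svd_sketch(A: torch.Tensor, q: int, niter: int = 2):
    """Simplified randomized low-rank SVD; returns (U, S, V).

    Illustrative only; assumes q <= min(A.shape).
    """
    m, n = A.shape
    # Random test matrix used to sample the range of A.
    omega = torch.randn(n, q, dtype=A.dtype, device=A.device)
    # Orthonormal basis Q for the sampled range.
    Q, _ = LA.qr(A @ omega)
    # Power iterations sharpen the basis toward the dominant subspace.
    for _ in range(niter):
        Q, _ = LA.qr(A.mH @ Q)
        Q, _ = LA.qr(A @ Q)
    # Small projected matrix of shape (q, n).
    B = Q.mH @ A
    # Dense SVD of B: on a CUDA tensor this is the step handed to cuSOLVER.
    U_B, S, Vh = LA.svd(B, full_matrices=False)
    # Lift the left singular vectors back to the original space.
    return Q @ U_B, S, Vh.mH
```

Because `omega` is random, `B` differs from run to run even for the same `A`, which is consistent with the failure being hard to reproduce on demand.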

@p-e-w

p-e-w commented Jun 23, 2026

Owner

If the problem is with CUDA rather than with svd_lowrank, why isn't the solution to run svd_lowrank on the CPU?

@umran666

Contributor Author

> If the problem is with CUDA rather than with svd_lowrank, why isn't the solution to run svd_lowrank on the CPU?

Updated to use svd_lowrank on CPU in the fallback instead of switching to LA.svd

Development

Successfully merging this pull request may close these issues.

I tried many times but couldn't get this tool to work.

2 participants