feat: diffusion LLM abliteration support (DiffusionGemma) #378
anurag12-webster wants to merge 2 commits into
first attempt at abliterating a discrete diffusion MoE model. tested on google/diffusiongemma-26B-A4B-it, got refusals from 100/100 down to 13/100 with KLD 0.49 in 200 trials.

this needed a bunch of patches because DiffusionGemma is not a standard autoregressive transformer and heretic makes a lot of assumptions about model architecture. things i ran into:

heretic tries to load the model with CAUSAL_LM task type but DiffusionGemma doesn't implement prepare_inputs_for_generation so it crashes. switched to FEATURE_EXTRACTION.

the MoE experts are stored as a single batched parameter [128, 2816, 704] not individual linear layers so heretic skips them entirely. without abliterating these the refusals barely moved at all. ended up iterating over all 128 slices per layer and applying the biprojected ablation to each one.

encoder and decoder share the exact same weights in memory, same data_ptr. PEFT wraps only the encoder side so the decoder never sees the LoRA delta during generation. took me a while to figure out why abliteration wasn't showing up. fixed with a context manager that merges LoRA into the base weights temporarily before generate().

generate() also doesn't support output_hidden_states so had to use forward hooks on the encoder layers instead to get the per-layer activations for the refusal direction PCA.

code is rough, opening as draft. happy to refactor based on feedback. closes p-e-w#370
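To make the batched-expert patch described above concrete, here is a minimal sketch of looping over the slices of a single stacked expert parameter and projecting a direction out of each one. It is not the code from this PR: NumPy arrays stand in for torch tensors, the function name `ablate_expert_slices` is hypothetical, the shapes are tiny stand-ins for [128, 2816, 704], the direction is assumed to live on axis 1 of each slice, and a plain orthogonal projection is used in place of the biprojected ablation mentioned above.

```python
import numpy as np


def ablate_expert_slices(experts, direction):
    """Project `direction` out of every expert slice.

    experts:   array of shape (num_experts, d_out, d_in), one batched parameter
    direction: array of shape (d_out,), assumed to be the refusal direction
    Returns a new array; the input is left untouched.
    """
    r = direction / np.linalg.norm(direction)  # unit vector
    out = experts.copy()
    for e in range(out.shape[0]):
        # W_e <- W_e - r (r^T W_e): removes the component of the slice's
        # output that lies along r.
        out[e] -= np.outer(r, r @ out[e])
    return out


# Toy shapes standing in for the real [128, 2816, 704] parameter.
rng = np.random.default_rng(0)
experts = rng.standard_normal((4, 6, 3))
direction = rng.standard_normal(6)

ablated = ablate_expert_slices(experts, direction)
```

After the call, every slice maps its inputs into the subspace orthogonal to `direction`, so no expert can write along that direction any more. The explicit loop mirrors the per-slice iteration from the description; an `einsum` over the expert axis would compute the same result in one call.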
|
Awesome, thanks! I'll be looking at this in detail after the 1.4 release. Perhaps we can find another diffusion model to make sure this approach generalizes. |
See #342 for an extended discussion of this. /cc @rocker-zhang |
Code Review
This pull request adds support for the DiffusionGemma model, implementing custom logic for handling its tied encoder/decoder weights, expert weights abliteration, and generation/residual/logprob extraction. Feedback on the changes highlights a potential NameError if the model class cannot be imported, potential runtime errors or weight corruption when using 4-bit quantization, and a loss of reproducibility due to the removal of the RNG seed reset before SVD. Additionally, several style guide violations were identified, including missing return type annotations on new methods and improperly formatted comments.
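For readers unfamiliar with the tied-weight handling mentioned in this review, the following is a rough sketch of a merge-and-restore context manager. It is an illustration under assumptions, not the PR's implementation: NumPy arrays stand in for torch parameters, the name `temporarily_merged` and the `scale` argument are hypothetical, and the base weight is restored from a saved copy rather than by subtracting the delta.

```python
from contextlib import contextmanager

import numpy as np


@contextmanager
def temporarily_merged(base_weight, lora_a, lora_b, scale=1.0):
    """Add the LoRA delta into `base_weight` in place, then restore it on exit."""
    backup = base_weight.copy()
    # In-place update, so every reference to the same buffer sees the delta.
    base_weight += scale * (lora_b @ lora_a)
    try:
        yield base_weight
    finally:
        # Restore from the copy to avoid floating-point drift from subtracting.
        base_weight[...] = backup


rng = np.random.default_rng(0)
shared = rng.standard_normal((4, 4))
encoder_weight = shared  # both names refer to the same memory,
decoder_weight = shared  # like tied weights that share a data_ptr
lora_a = rng.standard_normal((2, 4))
lora_b = rng.standard_normal((4, 2))
original = shared.copy()

with temporarily_merged(encoder_weight, lora_a, lora_b):
    # Whatever runs here (generate() in the PR) sees the merged weights
    # through the decoder reference as well.
    merged_seen_by_decoder = decoder_weight.copy()

restored = decoder_weight.copy()
```

Restoring from a backup costs one extra copy of the weight, but it leaves the base weights bit-identical after the block. Quantized weights would need separate handling, which this sketch does not attempt.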
|
What are your actual numbers: refusals / KL? Also, did you test it with ARA/ARA-LoRA, or only with projected abliteration? For example, this guy's results are not as optimistic, and they show a large KL (~0.5). |
LLaDA 1.x/2.x series would be a good candidate, they are well implemented and documented |
|
From https://huggingface.co/edwixx/diffusiongemma-26B-A4B-it-HERETIC-Uncensored
It's very interesting, because the "own" token shows up not only in the diffusion model version, but in the base autoregressive one as well. For example, it showed up in my overcooked LoRA training too. Maybe it's some sort of placeholder token for Gemma 4 models in general?
|
I have only tried projected abliteration; I haven't tried ARA yet. As for the numbers, the model currently has 13/100 refusals, KLD |
Ohh i see, so my assumption that the "own" artifact was a denoising fallback was wrong, and it comes from the Gemma 4 series itself. i will update the model card with this then, thanks for confirming! |
|
@anurag12-webster Do you have some visualizations: hidden states PCA/PaCMAP? Seeing how the refusal direction (or its equivalent) behaves in space would be incredibly helpful and useful for interpretability |
i actually don't have any yet, maybe in some time i'll run it and share them here
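As a rough idea of how the requested visualization could be produced, the sketch below projects collected hidden states onto their first two principal components and computes a candidate refusal direction. Everything in it is an assumption for illustration: the data is synthetic, the function names are hypothetical, and the difference-of-means direction is a common choice that may not match what this PR computes.

```python
import numpy as np


def pca_project(hidden_states, n_components=2):
    """Project rows of `hidden_states` onto their top principal components."""
    centered = hidden_states - hidden_states.mean(axis=0, keepdims=True)
    # Rows of vt are the principal directions of the centered data.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


def refusal_direction(harmful, harmless):
    """Unit vector from the harmless mean to the harmful mean."""
    diff = harmful.mean(axis=0) - harmless.mean(axis=0)
    return diff / np.linalg.norm(diff)


# Synthetic stand-ins for per-layer residuals of two prompt sets:
# the "harmful" set is shifted along coordinate 0.
rng = np.random.default_rng(0)
harmless = rng.standard_normal((50, 16))
harmful = rng.standard_normal((50, 16)) + 5.0 * np.eye(16)[0]

points = pca_project(np.vstack([harmless, harmful]))  # (100, 2), ready to scatter-plot
direction = refusal_direction(harmful, harmless)
```

Plotting `points` with one colour per prompt set, one panel per layer, would show whether the two sets separate along a single axis. PaCMAP or another nonlinear method could replace `pca_project` without changing the rest.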
|
i'll try the LLaDA series of models with this and also share some visualizations along with that.


i managed to abliterate the newer diffusiongemma. i am attaching the changed... which i did to the base heretic codebase.. with some quirks
of all the code, four things needed patching to make it work..
code is rough, happy to refactor. closes #370