feat: diffusion LLM abliteration support (DiffusionGemma) #378

Draft
anurag12-webster wants to merge 2 commits into p-e-w:master from anurag12-webster:feat/diffusion-lm-support

Conversation

@anurag12-webster

i managed to abliterate the newer DiffusionGemma. i am attaching the changes i made to the base heretic codebase, quirks included.

four things needed patching to make it work:

  • switched the PEFT task type to FEATURE_EXTRACTION. DiffusionGemma (DG) doesn't implement prepare_inputs_for_generation, so CAUSAL_LM was crashing.
  • MoE experts are a single batched parameter [128, 2816, 704], not nn.Linear, so heretic skips them. added per-slice ablation across all 128 experts per layer.
  • encoder and decoder share the same weights in memory (same data_ptr). PEFT only wraps the encoder, so the decoder never saw the LoRA delta. i fixed it with a context manager that merges LoRA into the base weights before the generate() call.
  • generate() has no output_hidden_states support, so i switched to forward hooks on the encoder layers to collect activations for the PCA.

code is rough, happy to refactor. closes #370

first attempt at abliterating a discrete diffusion MoE model.
tested on google/diffusiongemma-26B-A4B-it, got refusals from
100/100 down to 13/100 with KLD 0.49 in 200 trials.

this needed a bunch of patches because DiffusionGemma is not a
standard autoregressive transformer and heretic makes a lot of
assumptions about model architecture.

things i ran into:

heretic tries to load the model with CAUSAL_LM task type but
DiffusionGemma doesn't implement prepare_inputs_for_generation so
it crashes. switched to FEATURE_EXTRACTION.

the MoE experts are stored as a single batched parameter
[128, 2816, 704] not individual linear layers so heretic skips them
entirely. without abliterating these the refusals barely moved at
all. ended up iterating over all 128 slices per layer and applying
the biprojected ablation to each one.
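
A minimal sketch of the per-slice idea described above, on a toy tensor rather than the real [128, 2816, 704] parameter. It uses a plain orthogonal projection, not the biprojected variant, and it assumes that dimension 1 of the batched parameter is the side that writes into the residual stream. The function name `ablate_expert_slices` and the `weight` argument are hypothetical and are not taken from the PR.

```python
import torch


def ablate_expert_slices(
    experts: torch.Tensor, direction: torch.Tensor, weight: float = 1.0
) -> torch.Tensor:
    """Project `direction` out of the output space of every expert slice.

    experts:   [num_experts, d_out, d_in] (assumption: d_out is the residual side)
    direction: [d_out]
    """
    r = direction / direction.norm()  # unit refusal direction
    out = experts.clone()
    for i in range(out.shape[0]):
        w = out[i]  # one expert, shape [d_out, d_in]
        # W' = W - weight * r (r^T W): removes the component along r
        out[i] = w - weight * torch.outer(r, r @ w)
    return out


torch.manual_seed(0)
experts = torch.randn(4, 8, 3)  # toy stand-in for [128, 2816, 704]
direction = torch.randn(8)
ablated = ablate_expert_slices(experts, direction)

# After full ablation, no expert can write along the direction any more.
r = direction / direction.norm()
residual = torch.einsum("o,eoi->ei", r, ablated)
```

With `weight=1.0` the projection of every slice onto the direction is zero up to float error. The loop could also be replaced by a single batched einsum, which matters at 128 experts per layer.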

encoder and decoder share the exact same weights in memory, same
data_ptr. PEFT wraps only the encoder side so the decoder never
sees the LoRA delta during generation. took me a while to figure
out why abliteration wasn't showing up. fixed with a context manager
that merges LoRA into the base weights temporarily before generate().
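
One way such a temporary merge could look, sketched on bare tensors instead of PEFT modules. The helper `merged_delta` and its arguments are hypothetical. The sketch only illustrates why an in-place update reaches every alias of a shared weight, which is the property the fix relies on.

```python
from contextlib import contextmanager

import torch


@contextmanager
def merged_delta(
    base_weight: torch.Tensor,
    lora_a: torch.Tensor,
    lora_b: torch.Tensor,
    scale: float = 1.0,
):
    """Temporarily add scale * (B @ A) to base_weight in place."""
    delta = scale * (lora_b @ lora_a)
    with torch.no_grad():
        base_weight.add_(delta)  # in place, so every alias sees the change
    try:
        yield base_weight
    finally:
        with torch.no_grad():
            base_weight.sub_(delta)  # undo the merge on exit


torch.manual_seed(0)
shared = torch.randn(6, 6)
encoder_w = shared  # both names point at the same storage,
decoder_w = shared  # mimicking tied encoder/decoder weights
lora_a = torch.randn(2, 6)
lora_b = torch.randn(6, 2)
original = shared.clone()

with merged_delta(encoder_w, lora_a, lora_b):
    seen_by_decoder = decoder_w.clone()  # decoder view while merged
restored = decoder_w.clone()
```

Undoing the merge by subtraction is cheap but only restores the weights up to floating-point rounding. Keeping a snapshot of the original tensor would restore them exactly, at the cost of extra memory.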

generate() also doesn't support output_hidden_states so had to use
forward hooks on the encoder layers instead to get the per-layer
activations for the refusal direction PCA.
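
A small sketch of the hook-based capture, using a stack of `nn.Linear` layers as a stand-in for the encoder layers. Taking the last token position and unpacking a tuple output are assumptions about what the real layers return. Only `register_forward_hook` and `handle.remove()` are actual PyTorch API here; the other names are illustrative.

```python
import torch
from torch import nn

# Toy stand-in for the encoder layer stack.
layers = nn.ModuleList([nn.Linear(8, 8) for _ in range(3)])
captured: dict[int, torch.Tensor] = {}


def make_hook(index: int):
    def hook(module, inputs, output):
        # Real transformer layers often return a tuple; take the hidden state.
        hidden = output[0] if isinstance(output, tuple) else output
        # Keep only the last position, shape [batch, hidden].
        captured[index] = hidden[:, -1, :].detach()

    return hook


handles = [
    layer.register_forward_hook(make_hook(i)) for i, layer in enumerate(layers)
]
try:
    x = torch.randn(2, 5, 8)  # [batch, seq, hidden]
    with torch.no_grad():
        for layer in layers:
            x = layer(x)
finally:
    for handle in handles:
        handle.remove()  # always detach hooks, even if the forward pass fails
```

After the forward pass, `captured` holds one `[batch, hidden]` tensor per layer, which is the kind of input a per-layer direction estimate needs.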

code is rough, opening as draft. happy to refactor based on feedback.
closes p-e-w#370
@p-e-w

p-e-w commented Jun 13, 2026

Owner

Awesome, thanks! I'll be looking at this in detail after the 1.4 release. Perhaps we can find another diffusion model to make sure this approach generalizes.

@p-e-w

p-e-w commented Jun 13, 2026

Owner

> MoE experts are a single batched parameter [128, 2816, 704], not nn.Linear, so heretic skips them. added per-slice ablation across all 128 experts per layer

See #342 for an extended discussion of this.

/cc @rocker-zhang

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request adds support for the DiffusionGemma model, implementing custom logic for handling its tied encoder/decoder weights, expert weights abliteration, and generation/residual/logprob extraction. Feedback on the changes highlights a potential NameError if the model class cannot be imported, potential runtime errors or weight corruption when using 4-bit quantization, and a loss of reproducibility due to the removal of the RNG seed reset before SVD. Additionally, several style guide violations were identified, including missing return type annotations on new methods and improperly formatted comments.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

10 comment threads on src/heretic/model.py (7 marked outdated)

@kabachuha

Contributor

What are your actual numbers: refusals / KL?

Also, did you test it with ARA/ARA-LoRA, or only with projected abliteration?

For example, this guy's results are not as optimistic, and they come with a large KL (~0.5)

@kabachuha

Contributor

> Perhaps we can find another diffusion model to make sure this approach generalizes.

The LLaDA 1.x/2.x series would be a good candidate; those models are well implemented and documented.

@kabachuha

Contributor

From https://huggingface.co/edwixx/diffusiongemma-26B-A4B-it-HERETIC-Uncensored

> Known issue: "own" tokens
>
> Because the KL divergence is 0.49, some token positions in generated outputs don't fully denoise during the diffusion process. These positions fall back to the model's mask token, which decodes as the word "own".

It's very interesting, because the "own" token spawns not only in the diffusion model version, but in the base autoregressive one as well. For example, in my overcooked LoRA training it has shown up too.

Maybe, it's some sort of a placeholder token for Gemma4 models in general?

Screenshot 2026-06-13 at 11-54-09 New Chat • Open WebUI

@anurag12-webster

Author

> What are your actual numbers: refusals / KL?
>
> Also, did you test it with ARA/ARA-LoRA, or only with projected abliteration?
>
> For example, this guy's results are not as optimistic, and they come with a large KL (~0.5)

I have only tried projected abliteration so far; I haven't tried ARA yet.

as for the numbers: the model currently has 13/100 refusals at KLD 0.49. i got this at trial 89 of 200.

@anurag12-webster

Author

> From https://huggingface.co/edwixx/diffusiongemma-26B-A4B-it-HERETIC-Uncensored
>
> > Known issue: "own" tokens
> >
> > Because the KL divergence is 0.49, some token positions in generated outputs don't fully denoise during the diffusion process. These positions fall back to the model's mask token, which decodes as the word "own".
>
> It's very interesting, because the "own" token spawns not only in the diffusion model version, but in the base autoregressive one as well. For example, in my overcooked LoRA training it has shown up too.
>
> Maybe, it's some sort of a placeholder token for Gemma4 models in general?
>
> Screenshot 2026-06-13 at 11-54-09 New Chat • Open WebUI

Ohh i see, so my assumption that the "own" artifact was a denoising fallback was wrong, and it comes from the Gemma 4 series itself. i will update the model card with this then, thanks for confirming!

@kabachuha

Contributor

@anurag12-webster Do you have some visualizations: hidden states PCA/PaCMAP? Seeing how the refusal direction (or its equivalent) behaves in space would be incredibly helpful and useful for interpretability

@anurag12-webster

Author

> @anurag12-webster Do you have some visualizations: hidden states PCA/PaCMAP? Seeing how the refusal direction (or its equivalent) behaves in space would be incredibly helpful and useful for interpretability

i actually don't have any yet. i'll run it at some point and share the results here.

@anurag12-webster

Author

i'll try this on the LLaDA series of models and share some visualizations along with that.

Development

Successfully merging this pull request may close these issues.

Diffusion Models abliteration :)

3 participants