feat: diffusion LLM abliteration support (DiffusionGemma) by anurag12-webster · Pull Request #378 · p-e-w/heretic

anurag12-webster · 2026-06-13T08:28:59Z

I managed to abliterate the newer DiffusionGemma. I am attaching the changes I made to the base heretic codebase, with some quirks.

Four things needed patching to make it work:

Switched the PEFT task type to FEATURE_EXTRACTION. DiffusionGemma (DG) doesn't implement prepare_inputs_for_generation, so CAUSAL_LM was crashing.
MoE experts are a single batched parameter [128, 2816, 704], not nn.Linear, so heretic skips them. Added per-slice ablation across all 128 experts per layer.
Encoder and decoder share the same weights in memory (same data_ptr). PEFT only wraps the encoder, so the decoder never saw the LoRA delta. I fixed it with a context manager that merges LoRA into the base weights before generate().
generate() has no output_hidden_states support, so I switched to forward hooks on the encoder layers for the PCA.
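
To illustrate the per-slice expert ablation described above, here is a minimal NumPy sketch. It is not the code from this PR: it assumes the batched parameter is laid out as [num_experts, d_out, d_in] with d_out being the residual-stream dimension, it uses plain directional ablation rather than the biprojected variant, and `ablate_expert_slices` is a hypothetical helper name.

```python
import numpy as np


def ablate_expert_slices(experts: np.ndarray, direction: np.ndarray,
                         weight: float = 1.0) -> np.ndarray:
    """Remove `direction` from the output space of every expert slice.

    experts:   [num_experts, d_out, d_in], one weight matrix per expert
               (layout is an assumption, see the lead-in)
    direction: [d_out], e.g. a refusal direction in the residual stream
    weight:    ablation strength; 1.0 removes the component entirely
    """
    r = direction / np.linalg.norm(direction)  # unit vector
    ablated = np.empty_like(experts)
    for e in range(experts.shape[0]):
        W = experts[e]                          # [d_out, d_in]
        # W <- W - weight * r (r^T W): subtract the part of each output
        # that points along r.
        ablated[e] = W - weight * np.outer(r, r @ W)
    return ablated


# Toy dimensions instead of [128, 2816, 704].
rng = np.random.default_rng(0)
toy_experts = rng.normal(size=(8, 16, 4))
toy_direction = rng.normal(size=16)
toy_ablated = ablate_expert_slices(toy_experts, toy_direction)
```

After ablation with weight 1.0, no expert can write along the chosen direction any more, which is the property the loop over slices is meant to enforce for every expert rather than only for nn.Linear modules.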

code is rough, happy to refactor. closes #370

First attempt at abliterating a discrete diffusion MoE model. Tested on google/diffusiongemma-26B-A4B-it; refusals went from 100/100 down to 13/100 with KLD 0.49 in 200 trials.

This needed a bunch of patches because DiffusionGemma is not a standard autoregressive transformer and heretic makes a lot of assumptions about model architecture. Things I ran into:

heretic tries to load the model with the CAUSAL_LM task type, but DiffusionGemma doesn't implement prepare_inputs_for_generation, so it crashes. Switched to FEATURE_EXTRACTION.

The MoE experts are stored as a single batched parameter [128, 2816, 704], not individual linear layers, so heretic skips them entirely. Without abliterating these, the refusals barely moved at all. Ended up iterating over all 128 slices per layer and applying the biprojected ablation to each one.

Encoder and decoder share the exact same weights in memory (same data_ptr). PEFT wraps only the encoder side, so the decoder never sees the LoRA delta during generation. Took me a while to figure out why abliteration wasn't showing up. Fixed with a context manager that merges LoRA into the base weights temporarily before generate().

generate() also doesn't support output_hidden_states, so I had to use forward hooks on the encoder layers instead to get the per-layer activations for the refusal direction PCA.

Code is rough, opening as draft. Happy to refactor based on feedback. closes p-e-w#370
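
The tied-weight fix mentioned above can be sketched roughly as follows. This is a NumPy stand-in rather than the PR's code: in the real model the weights would be torch tensors and the delta would come from PEFT adapter modules. The name `merged_lora`, the toy shapes, and the backup-and-restore strategy are all assumptions made for illustration.

```python
from contextlib import contextmanager

import numpy as np


@contextmanager
def merged_lora(base_weight: np.ndarray, lora_A: np.ndarray,
                lora_B: np.ndarray, scale: float = 1.0):
    """Temporarily fold the LoRA delta (scale * B @ A) into base_weight.

    The update is done in place, so every object that aliases the same
    buffer sees the merged weights while the context is active.
    """
    backup = base_weight.copy()
    base_weight += scale * (lora_B @ lora_A)
    try:
        yield base_weight
    finally:
        # Restore the exact original values, even if generation raised.
        base_weight[...] = backup


# Toy setup: "encoder" and "decoder" refer to the same buffer,
# mimicking weights that share one data_ptr.
shared = np.eye(4)
encoder_weight = shared
decoder_weight = shared

A = np.ones((1, 4))        # rank-1 LoRA factors
B = np.full((4, 1), 0.5)

with merged_lora(encoder_weight, A, B):
    inside = decoder_weight.copy()   # the decoder side sees the delta here
after = decoder_weight.copy()        # and the original weights afterwards
```

Keeping a backup copy restores the weights exactly but costs extra memory; subtracting the delta again would avoid the copy at the price of accumulating floating-point error over many trials.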

p-e-w · 2026-06-13T08:35:18Z

Awesome, thanks! I'll be looking at this in detail after the 1.4 release. Perhaps we can find another diffusion model to make sure this approach generalizes.

p-e-w · 2026-06-13T08:36:54Z

MoE experts are a single batched parameter [128, 2816, 704], not nn.Linear, so heretic skips them. added per-slice ablation across all 128 experts per layer

See #342 for an extended discussion of this.

/cc @rocker-zhang

gemini-code-assist

Code Review

This pull request adds support for the DiffusionGemma model, implementing custom logic for handling its tied encoder/decoder weights, expert weights abliteration, and generation/residual/logprob extraction. Feedback on the changes highlights a potential NameError if the model class cannot be imported, potential runtime errors or weight corruption when using 4-bit quantization, and a loss of reproducibility due to the removal of the RNG seed reset before SVD. Additionally, several style guide violations were identified, including missing return type annotations on new methods and improperly formatted comments.

kabachuha · 2026-06-13T08:45:42Z

What are your actual numbers: refusals / KL?

Also, did you test it with ARA/ARA-LoRA, or only with projected abliteration?

For example, this guy's results are not as optimistic, and they have a large KL (~0.5)

kabachuha · 2026-06-13T08:47:32Z

Perhaps we can find another diffusion model to make sure this approach generalizes.

LLaDA 1.x/2.x series would be a good candidate, they are well implemented and documented

kabachuha · 2026-06-13T08:58:40Z

From https://huggingface.co/edwixx/diffusiongemma-26B-A4B-it-HERETIC-Uncensored

Known issue: "own" tokens

Because the KL divergence is 0.49, some token positions in generated outputs don't fully denoise during the diffusion process. These positions fall back to the model's mask token, which decodes as the word "own".

It's very interesting, because the "own" token spawns not only in the diffusion model version, but in the base autoregressive one as well. For example, in my overcooked LoRA training it has shown up too.

Maybe, it's some sort of a placeholder token for Gemma4 models in general?

[Screenshot: Open WebUI chat, 2026-06-13]

anurag12-webster · 2026-06-13T09:07:44Z

What are your actual numbers: refusals / KL?

Also, did you test it with ARA/ARA-LoRA, or only with projected abliteration?

For example, this guy's results are not as optimistic, and they have a large KL (~0.5)

I have only tried projected abliteration; I haven't tried ARA yet.

As for the numbers, the model currently has 13/100 refusals and KLD 0.49; I got this at trial 89/200.
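
For context on the KLD figure, here is a small sketch of a KL divergence between two next-token distributions given as logits. It assumes the metric compares the original model (P) against the modified one (Q); which prompts and token positions are averaged over is not stated in this thread, and the function names are hypothetical.

```python
import math


def log_softmax(logits: list[float]) -> list[float]:
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def kl_divergence(p_logits: list[float], q_logits: list[float]) -> float:
    """KL(P || Q) in nats, where P and Q are softmax(p_logits), softmax(q_logits)."""
    log_p = log_softmax(p_logits)
    log_q = log_softmax(q_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


# Toy vocabulary of three tokens: original vs. modified model logits.
original_logits = [2.0, 0.5, -1.0]
modified_logits = [1.0, 1.2, -0.5]
toy_kl = kl_divergence(original_logits, modified_logits)
```

The value is zero when both models produce identical distributions and grows as the modified model drifts away from the original.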

anurag12-webster · 2026-06-13T09:11:34Z

From https://huggingface.co/edwixx/diffusiongemma-26B-A4B-it-HERETIC-Uncensored

Known issue: "own" tokens

Because the KL divergence is 0.49, some token positions in generated outputs don't fully denoise during the diffusion process. These positions fall back to the model's mask token, which decodes as the word "own".

It's very interesting, because the "own" token spawns not only in the diffusion model version, but in the base autoregressive one as well. For example, in my overcooked LoRA training it has shown up too.

Maybe, it's some sort of a placeholder token for Gemma4 models in general?

Ohh I see, so my assumption that the "own" artifact was a denoising fallback was wrong then, and it comes from the Gemma 4 series itself. I will update the model card with this, thanks for confirming!

kabachuha · 2026-06-13T10:35:47Z

@anurag12-webster Do you have some visualizations: hidden states PCA/PaCMAP? Seeing how the refusal direction (or its equivalent) behaves in space would be incredibly helpful and useful for interpretability

anurag12-webster · 2026-06-13T11:04:04Z

@anurag12-webster Do you have some visualizations: hidden states PCA/PaCMAP? Seeing how the refusal direction (or its equivalent) behaves in space would be incredibly helpful and useful for interpretability

I actually don't have any; maybe in some time I'll run it and share them here.

anurag12-webster · 2026-06-16T06:00:16Z

I'll try the LLaDA series of models with this and also share some visualizations along with that.

gemini-code-assist Bot reviewed Jun 13, 2026

fix: address bot review comments

5c9be38
