feat: Arbitrary-Rank Ablation (ARA) by p-e-w · Pull Request #211 · p-e-w/heretic

p-e-w · 2026-03-04T12:08:06Z

Arbitrary-Rank Ablation (ARA) is a radically new abliteration method that I've been developing for the past two months or so. I believe that it can replace all currently implemented methods in Heretic, including MPOA, once the remaining issues are worked out. Its only serious competitor at this time is @kabachuha's implementation of multi-directional refusal suppression with Self-Organizing Maps (#196).

ARA doesn't use refusal directions at all, neither a single direction like traditional abliteration, nor multiple directions like SOMA. Instead, ARA works by capturing input/output tensors at each individual transformer module using PyTorch hooks, then uses direct, unconstrained matrix optimization to modify those modules, based on an objective function that captures the essence of what we want to (and don't want to) change.
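
As a rough illustration of the capture step, here is a minimal sketch of recording one module's input and output with a PyTorch forward hook. The helper name `capture_module_io` and the toy `nn.Sequential` model are assumptions made for this example; they are not taken from the PR's code.

```python
import torch
from torch import nn


def capture_module_io(model: nn.Module, module: nn.Module, batch: torch.Tensor):
    """Run one forward pass and return the (input, output) seen by `module`."""
    captured = {}

    def hook(mod, args, output):
        # `args` is the tuple of positional inputs passed to the module.
        captured["input"] = args[0].detach().clone()
        captured["output"] = output.detach().clone()

    handle = module.register_forward_hook(hook)
    try:
        with torch.no_grad():
            model(batch)
    finally:
        # Always remove the hook, even if the forward pass raises.
        handle.remove()
    return captured["input"], captured["output"]


# Toy stand-in for a transformer (hypothetical, for illustration only).
torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
x = torch.randn(5, 8)
inp, out = capture_module_io(model, model[2], x)
print(inp.shape, out.shape)
```

Removing the hook in a `finally` block keeps later forward passes from silently accumulating tensors.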

Intuitively, the objective encodes three competing optimization goals:

1. The outputs of the module for inputs associated with "harmless" prompts should change as little as possible.
2. The outputs of the module for inputs associated with "harmful" prompts should become as similar to those associated with "harmless" prompts as possible.
3. The outputs of the module for inputs associated with "harmful" prompts should become as dissimilar to those previously associated with "harmful" prompts as possible. In combination with (2), this overcorrects away from the original residuals, which results in stronger steering that can overcome more complex refusal mechanisms.

Unlike other abliteration methods, this approach doesn't assume a particular rank for the refusal manifold, or that the centroid of the outputs must shift in a specific manner. This gives the optimizer more freedom to modify the matrix in the best possible way. Please see the code for implementation details.
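
One plausible way to write the three goals down is sketched below. This is an illustration only, not the objective from the code: $W$ is the module's weight matrix, $X_g$ and $X_b$ are the captured inputs for "harmless" and "harmful" prompts, $Y_g$ and $Y_b$ are the corresponding original outputs, $d_k(A, B)$ is the mean distance from each row of $A$ to its $k$ nearest rows of $B$, and the weights $\alpha, \beta$ and the exact form of every term are assumptions.

```latex
\mathcal{L}(W) =
  \underbrace{\tfrac{1}{n_g}\,\lVert X_g W^\top - Y_g \rVert_F^2}_{\text{(1) keep harmless outputs}}
  \;+\; \alpha\,\underbrace{d_k\!\left(X_b W^\top,\, Y_g\right)}_{\text{(2) attract toward harmless outputs}}
  \;-\; \beta\,\underbrace{d_k\!\left(X_b W^\top,\, Y_b\right)}_{\text{(3) repel from original harmful outputs}}
```

Nothing in this form fixes a rank for the update to $W$ or constrains where the mean of the modified outputs ends up.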

The objective is affine-convex and the initial value (the original matrix) is already very close to the optimum, so L-BFGS makes short work of it, typically converging in 2-3 iterations. Because the matrices are optimized one-by-one, the total memory requirements are barely higher than for regular abliteration. The abliteration process takes longer, but the time per trial is still dominated by counting refusals. Combined with the fact that ARA has fewer optimizable parameters than our current approach (meaning that fewer trials are needed for good results), this might actually make ARA faster than regular abliteration.
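
To make the optimization loop concrete, here is a minimal sketch of optimizing a single weight matrix with `torch.optim.LBFGS`, starting from the original matrix. The toy data, the shapes, the stand-in `objective`, the weight `0.1`, and `k = 4` are all assumptions for illustration and do not reproduce the PR's objective.

```python
import torch
from torch.optim import LBFGS

torch.manual_seed(0)

# Toy data (hypothetical shapes): inputs to one linear module.
d_in, d_out = 16, 8
W0 = torch.randn(d_out, d_in)            # original weight matrix
X_good = torch.randn(32, d_in)           # inputs from "harmless" prompts
X_bad = torch.randn(32, d_in) + 1.0      # inputs from "harmful" prompts
Y_good = X_good @ W0.T                   # original "harmless" outputs


def objective(W: torch.Tensor) -> torch.Tensor:
    # Stand-in objective (assumption): preserve harmless outputs and pull
    # each harmful output toward its 4 nearest original harmless outputs.
    preserve = ((X_good @ W.T - Y_good) ** 2).mean()
    dists = torch.cdist(X_bad @ W.T, Y_good)
    attract = dists.topk(4, dim=1, largest=False).values.mean()
    return preserve + 0.1 * attract


# Start from the original matrix, as described above.
W = W0.clone().requires_grad_(True)
optimizer = LBFGS([W], lr=1.0, max_iter=20, line_search_fn="strong_wolfe")


def closure() -> torch.Tensor:
    optimizer.zero_grad()
    loss = objective(W)
    loss.backward()
    return loss


initial = objective(W0).item()
optimizer.step(closure)
final = objective(W.detach()).item()
print(f"loss: {initial:.4f} -> {final:.4f}")
```

Because only one matrix and its captured tensors are live at a time, the same loop can be repeated module by module.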

Results

For demonstration purposes, I have processed openai/gpt-oss-20b with the exact code currently in this pull request. The result is p-e-w/gpt-oss-20b-heretic-ara-v3:

This is dramatically better than any existing abliteration of gpt-oss-20b (see this table), with the possible exception of the brand new kabachuha/gpt-oss-20b-SOMbliterated, which has the same refusal count but higher KL divergence.

TODO

ARA isn't quite ready for mainstream use yet, but it's getting close. The remaining issues are:

Figure out how to obtain good values for the weight constants in steer_bad_behavior.
Decide whether to have separate abliteration parameter sets for each component, like in our current implementation.
Fix multi-GPU issues
Write a proper article explaining the mathematics and motivation behind ARA.

Feedback welcome!

@spikymoth
@kabachuha
@red40maxxer

gemini-code-assist · 2026-03-04T12:08:29Z

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces Arbitrary-Rank Ablation (ARA), a radically new method for model steering that moves beyond traditional directional ablation. ARA leverages direct matrix optimization within individual transformer modules, guided by an objective function designed to minimize changes to harmless outputs while aggressively modifying harmful ones. This approach offers greater flexibility by not assuming a fixed refusal manifold rank, potentially leading to more robust and efficient abliteration results.

Highlights

Arbitrary-Rank Ablation (ARA) Introduction: Introduced a novel ablation method, Arbitrary-Rank Ablation (ARA), which uses direct, unconstrained matrix optimization on transformer modules instead of traditional directional ablation. This method aims to provide more flexible and effective model steering.
New Optimization Objective: Implemented a new objective function for ARA that balances three goals: preserving outputs for 'harmless' prompts, making 'harmful' prompt outputs similar to 'harmless' ones, and pushing 'harmful' prompt outputs away from their original state for stronger steering.
Conditional Ablation Logic: Integrated conditional logic throughout the codebase to switch between ARA and traditional ablation methods, affecting gradient handling, parameter optimization, module I/O capture, and model saving/loading.
Module I/O Capture with PyTorch Hooks: Developed functionality to capture input/output tensors at each transformer module using PyTorch hooks, which is crucial for the ARA optimization process.

Changelog

src/heretic/config.py
- Added target_components setting to specify which model components to abliterate.
- Added use_ara setting to toggle between ARA and traditional ablation.
src/heretic/main.py
- Modified gradient enabling to be conditional on the use_ara setting.
- Updated the main run function to conditionally calculate refusal directions (traditional) or module I/O (ARA).
- Adjusted the objective function in Optuna to suggest different parameters based on whether use_ara is enabled.
- Modified model reset, abliteration call, and model saving/uploading logic to be conditional on use_ara.
src/heretic/model.py
- Added imports for LBFGS, RemovableHandle, Callable, and TypeAlias.
- Defined ModuleIO type alias for storing module input/output tensors.
- Made _apply_lora and reset_model calls conditional on use_ara.
- Updated get_layer_modules to filter components based on settings.target_components.
- Implemented the ara_abliterate method, which performs matrix optimization using L-BFGS based on captured module I/O.
- Added get_module_io and get_module_io_batched methods to capture input/output tensors from model modules using PyTorch hooks.
src/heretic/utils.py
- Imported Tensor from torch.
- Added mean_distances_to_knn utility function for calculating mean Euclidean distances to k-nearest neighbors.
- Modified get_trial_parameters to return different sets of parameters depending on whether ARA is enabled.

Activity

The author, p-e-w, developed and implemented the Arbitrary-Rank Ablation (ARA) method over approximately two months.
Demonstrated results with openai/gpt-oss-20b processed using ARA, showing significantly improved refusal counts compared to existing methods.
Requested feedback from specific reviewers (@spikymoth, @kabachuha, @red40maxxer).
Outlined remaining TODOs for ARA, including optimizing weight constants, deciding on separate abliteration parameter sets for components, and writing a mathematical article.

gemini-code-assist

Code Review

This pull request introduces a new abliteration method called Arbitrary-Rank Ablation (ARA). The changes are extensive, touching configuration, main application logic, and the model implementation to support this new method. The implementation uses PyTorch hooks to capture module I/O and an L-BFGS optimizer to modify module weights directly. The changes are mostly gated behind a new use_ara setting.

My feedback focuses on ensuring consistency with the repository's style guide and improving maintainability. Specifically, I've pointed out missing configuration updates, inconsistent trial parameter handling, a missing type annotation, and some minor style guide violations.

kabachuha · 2026-03-04T13:10:29Z

Congratulations on the conception!

Expected all tensors to be on the same device, but got mat2 is on cuda:0, different from other tensors on cuda:1 (when checking argument in method wrapper_CUDA_mm)

Needs fix for multiGPU

p-e-w · 2026-03-04T13:35:15Z

> Needs fix for multiGPU

Thanks for pointing this out. I don't have a multi-GPU setup myself, but I'll rent one to figure out where the problem is.

kabachuha · 2026-03-04T14:00:58Z

Pareto frontier for Qwen3-4B-Instruct-2507.

Qwen series are reportedly notoriously hard to decensor, so I decided to test it.

? Which trial do you want to use? (Use arrow keys)
   [Trial 156] Refusals:  3/100, KL divergence: 2.9901
   [Trial  25] Refusals:  4/100, KL divergence: 1.1411
   [Trial 148] Refusals:  5/100, KL divergence: 0.2406
   [Trial 147] Refusals:  6/100, KL divergence: 0.2359
   [Trial 168] Refusals:  7/100, KL divergence: 0.1521
   [Trial 151] Refusals:  8/100, KL divergence: 0.1384
 » [Trial 154] Refusals: 10/100, KL divergence: 0.1345
   [Trial 170] Refusals: 17/100, KL divergence: 0.0770
   [Trial 174] Refusals: 31/100, KL divergence: 0.0749
   [Trial  73] Refusals: 45/100, KL divergence: 0.0390
   [Trial 142] Refusals: 56/100, KL divergence: 0.0237
   [Trial  86] Refusals: 87/100, KL divergence: 0.0188
   Run additional trials

Comparison with the other methods:

Model	Refusals for "harmful" prompts	KL divergence from original model for "harmless" prompts
Qwen/Qwen3-4B-Instruct-2507 (original)	100/100	0 (by definition)
kabachuha/Qwen3-4B-Instruct-2507-SOMbliterated	3/100	0.08
heretic-org/Qwen3-4B-Instruct-2507-heretic	5/100	0.07
Goekdeniz-Guelmez/Qwen3-4B-Instruct-2507-gabliterated	4/100	0.25
p-e-w/gpt-oss-20b-heretic-ara (this)	8/100	0.14

I'd say, the results are somewhere in-between

kabachuha · 2026-03-04T14:08:57Z

@p-e-w Can you submit the gpt-oss model to the UGI leaderboard? https://huggingface.co/spaces/DontPlanToEnd/UGI-Leaderboard/discussions

spikymoth · 2026-03-04T14:44:17Z

Interesting idea, and surprisingly straightforward! A couple of initial thoughts/questions:

mean_distances_to_knn returns the distance to the k nearest neighbors. That makes sense to me for maximizing dissimilarity, but wouldn't you want something more like the opposite (the k farthest neighbors) for maximizing similarity?
I think the magnitude-preserving part of MPOA could be added onto this pretty easily by modifying the closure to call the objective (loss) function on a norm-constrained matrix. Something like this:

    # Fragment from inside the ARA optimization method; F is torch.nn.functional.
    if self.settings.row_normalization == RowNormalization.FULL:
        # Get row norms for original matrix. Detached so they act as constants
        # and their graph is not freed by the first backward pass.
        target_norms = torch.norm(matrix, dim=1, keepdim=True).detach()

        def closure() -> Tensor:
            optimizer.zero_grad()
            # Compute loss relative to norm-constrained matrix.
            constrained_matrix = F.normalize(matrix, p=2, dim=1) * target_norms
            loss = objective(constrained_matrix)
            # Backpropagate through the normalization, so the gradient
            # reaching `matrix` respects the row-norm constraint.
            loss.backward()
            return loss
    else:
        def closure() -> Tensor:
            optimizer.zero_grad()
            loss = objective(matrix)
            loss.backward()
            return loss

p-e-w · 2026-03-04T15:46:50Z

@kabachuha The weights in steer_bad_behavior need to be fine-tuned for each model (currently by hand). If you run it on other models you will get poor results unless you tune them. I've seen much, much better results with Qwen3-4B-Instruct-2507 during testing.

In the future, this will happen automatically via Optuna, though it's not as straightforward as it may seem because the possible ranges go over multiple orders of magnitude.
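
For ranges spanning several orders of magnitude, a common approach is to sample on a logarithmic scale, so that every decade gets equal attention. Optuna's `suggest_float` accepts `log=True` for this. The standard-library sketch below shows the underlying idea; the range `1e-3` to `1.0` is a hypothetical example, not a recommended range for `steer_bad_behavior_weight`.

```python
import math
import random


def sample_log_uniform(low: float, high: float, rng: random.Random) -> float:
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)

# Hypothetical range spanning three orders of magnitude.
samples = [sample_log_uniform(1e-3, 1.0, rng) for _ in range(10_000)]

# Count samples per decade: [0.1, 1), [0.01, 0.1), [0.001, 0.01).
per_decade = [
    sum(10.0 ** -(k + 1) <= s < 10.0 ** -k for s in samples) for k in range(3)
]
print(per_decade)
```

Each decade receives roughly a third of the samples, whereas uniform sampling on the raw scale would put about 90% of them in the top decade.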

p-e-w · 2026-03-04T15:49:25Z

@spikymoth

> mean_distances_to_knn returns the distance to the k nearest neighbors. That makes sense to me for maximizing dissimilarity, but wouldn't you want something more like the opposite (the k farthest neighbors) for maximizing similarity?

I don't quite understand what you mean here. Could you explain more?

> I think the magnitude-preserving part of MPOA could be added onto this pretty easily by modifying the closure to call the objective (loss) function on a norm-constrained matrix.

That's true, but I'm not convinced that preserving the magnitude is correct in general. If harmful and harmless prompts result in residuals of different magnitudes, then abliteration should change the magnitudes I think.

spikymoth · 2026-03-04T16:31:39Z

> > mean_distances_to_knn returns the distance to the k nearest neighbors. That makes sense to me for maximizing dissimilarity, but wouldn't you want something more like the opposite (the k farthest neighbors) for maximizing similarity?
>
> I don't quite understand what you mean here. Could you explain more?

Well, do I understand correctly that it's 1. getting the distance between each pair of vectors, 2. selecting the k smallest distances and 3. returning the mean distance?

If so, it seems to target the harmful outputs that are already most similar to the harmless outputs and push them closer, while ignoring more dissimilar outputs. I'm wondering if this creates representative differences.

Perhaps the optimal way to contrast representative differences would be to contrast the top-k SOM neurons (as the center of gravity for each cluster) for a set of outputs.

> That's true, but I'm not convinced that preserving the magnitude is correct in general. If harmful and harmless prompts result in residuals of different magnitudes, then abliteration should change the magnitudes I think.

IIRC the argument is that the row norms of the weight matrix overall should stay unchanged in order to preserve between-layer interpretability, i.e. each dimension in the output is expected to have a particular activation strength and if you change it, subsequent layers may get confused about what stronger/weaker activations mean.

p-e-w · 2026-03-04T16:55:34Z

> If so, it seems to target the harmful outputs that are already most similar to the harmless outputs and push them closer, while ignoring more dissimilar outputs.

No, it targets all outputs.

It computes the mean distance to the k nearest harmless neighbors for each harmful output and then computes the mean of those means. So every harmful output is attracted towards its nearest harmless neighbors.

This is actually precisely where the strength of this method comes from, because directional ablation based on a difference of means optimizes towards a configuration where the mean of the modified harmful outputs resembles the mean of the harmless outputs. This is an unnecessary constraint that hinders finding an optimal configuration. With ARA, every single harmful output is simply attracted towards somewhere in the harmless cluster. There is no requirement that the means of the outputs align.

spikymoth · 2026-03-04T17:04:03Z

Aahh OK, I was misunderstanding the operation. So for every harmful output it computes the distance from all harmless outputs, then takes the mean of the k smallest distances (nearest neighbors) to push every output toward those neighbors. That makes sense, and should naturally give more weight to directions that show up more frequently.

red40maxxer · 2026-03-05T05:08:10Z

This is an awesome idea and I'm really looking forward to this being included in main, I'm running a bunch of tests right now to see how this performs vs standard MPOA and kabachuha's SOM technique.

Is there any possible way to get quantization to work with this? I understand that ARA does gradient-based optimization on the weight matrix and bnb would create a different shape which breaks this, but being able to use quantization with this new technique would still be very valuable IMO. Maybe we could dequantize before ARA? This might not reduce the total RAM required for abliteration but it might at least speed up inference, which can be a bottleneck.

p-e-w · 2026-03-05T05:16:23Z

> Maybe we could dequantize before ARA?

Yes, this should work. Matrices are processed one by one, so the memory impact of dequantizing an individual matrix to full precision should be relatively small.

p-e-w · 2026-03-05T05:20:59Z

> I'm running a bunch of tests right now to see how this performs vs standard MPOA and kabachuha's SOM technique.

Make sure you use the latest commit (0bb9521). If you are seeing suboptimal results, I recommend trying some combination of these:

Remove mlp.down_proj from target_components.
Play with the value range for steer_bad_behavior_weight.

kabachuha · 2026-03-05T09:51:22Z

Can you make some visualizations (ex. PCA) of the model's hidden states as the ARA method achieves convergence?

erm14254 · 2026-03-05T16:26:58Z

I encountered a bug and thought I should report it here:

L-BFGS IndexError on Windows with high steer_bad_behavior_weight

Environment:

Windows 11 Pro
Python 3.12
PyTorch 2.10.0+cu130

Issue: Trial failed with IndexError: list index out of range in torch/optim/lbfgs.py line 205 during _strong_wolfe.

Failed parameters: steer_bad_behavior_weight = 0.3967

Also saw this symlink error (possibly related): [WinError 1314] A required privilege is not held by the client: '...triton_kernels_init_.py'

What failed: L-BFGS IndexError on first trial
Environment: PowerShell (non-admin) on Windows 11
What worked: Command Prompt as Administrator
Error: IndexError: list index out of range in torch\optim\lbfgs.py line 205
Parameters that crashed: steer_bad_behavior_weight = 0.3967 (much higher than your working trials at ~0.02-0.09)

Could be:

Windows/PowerShell specific issue
Numerical instability with high steer_bad_behavior_weight values
Admin privileges needed for triton kernels

After switching from PowerShell (non-admin) to Admin CMD, error messages did not appear and trial completed successfully.

p-e-w · 2026-03-06T06:53:22Z

> Can you make some visualizations (ex. PCA) of the model's hidden states as the ARA method achieves convergence?

Yes, I will do a full writeup explaining the motivation behind ARA, which will include such data.

GhostWithAHat · 2026-03-06T20:28:40Z

Did some tests with Qwen 3.5 4B.

Main branch: Best trials still refuse more than 50 of 100 bad prompts.
After merging ara into main and not touching the values for the weight constants in steer_bad_behavior: Good trials refuse 4 of 100 harmful prompts with a KLD of 0.1396.

Can't wait for somebody to make an ARA version of Qwen 3.5 27B.

p-e-w · 2026-03-07T10:29:49Z

@kabachuha

I am unable to reproduce the multi-GPU issue. I have tried processing Gemma 3 27B (which is 55 GB in BF16) on a 2x 5090 system, forcing tensor sharding. However, I am not getting a device mismatch error like you did.

Could you give some more information about the system where this error occurred?

p-e-w · 2026-03-31T10:29:42Z

The ARA branch now supports row-norm preservation during optimization using a reparameterization constraint, as suggested by @spikymoth and others. It will be interesting to see whether this improves benchmark scores.
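
A minimal sketch of what such a reparameterization can look like is given below: the optimizer updates a free matrix `V`, and the loss only ever sees a version of `V` whose rows are rescaled to the original row norms. The toy squared-error loss, the shapes, and the names `V` and `constrained` are assumptions for illustration, not the branch's actual code.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

W0 = torch.randn(6, 4)                        # original weight matrix (toy)
target_norms = W0.norm(dim=1, keepdim=True)   # row norms to preserve (constants)
target = torch.randn(6, 4)                    # toy optimization target (assumption)

# Free parameter; the constrained matrix is derived from it on every evaluation.
V = W0.clone().requires_grad_(True)


def constrained(v: torch.Tensor) -> torch.Tensor:
    # Rescale every row of `v` to the corresponding original row norm.
    return F.normalize(v, p=2, dim=1) * target_norms


optimizer = torch.optim.LBFGS([V], max_iter=20, line_search_fn="strong_wolfe")


def closure() -> torch.Tensor:
    optimizer.zero_grad()
    loss = ((constrained(V) - target) ** 2).sum()
    loss.backward()
    return loss


optimizer.step(closure)
W = constrained(V.detach())
print(W.norm(dim=1) - W0.norm(dim=1))  # differences are zero up to float error
```

The constraint holds by construction at every step, so no projection or penalty term is needed.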

p-e-w · 2026-04-02T08:57:21Z

I just published p-e-w/gpt-oss-20b-heretic-ara-v4, which represents the cutting edge of ARA development. It combines ARA with row-norm preservation inspired by MPOA, and optimizes for PIQA rather than KLD. It is the first model I've seen that beats the base model's (openai/gpt-oss-20b) PIQA score.

I have submitted the model for UGI evaluation (https://huggingface.co/spaces/DontPlanToEnd/UGI-Leaderboard/discussions/634). Unfortunately, the maintainer of UGI has announced that they will be evaluating non-base models only on popular request in the future. So please vote for this eval request so we can get that valuable data point!

p-e-w · 2026-04-06T12:40:43Z

Important update: The model I linked above (p-e-w/gpt-oss-20b-heretic-ara-v4) appears to have been corrupted on upload. Trying to load it from HF throws a shape mismatch error. I don't know what the problem is yet.

teleprint-me · 2026-04-06T22:02:30Z

I sketched out a model diff script, but it's not functional at the moment, because I was trying to figure out what the difference was between v3 and v4 that was causing the issue.

I wasn't sure if it was due to this branch, the upload, or if it was intentional. My initial instinct was that the model was corrupted, but good to know it wasn't intentional.

p-e-w · 2026-04-07T02:22:16Z

I mean, obviously corrupting the model wasn't intentional, not sure what you mean there. I think there are bugs in either Transformers 5, or PEFT, or the interaction between the two. IMO, Transformers 5 left beta way too early, there are still plenty of problems and Heretic has been affected by several of them.

teleprint-me · 2026-04-07T06:51:04Z

By intentional, I meant maybe the model wasn't corrupted and you had a reason for reshaping the weights and expected that output - Nothing else. And yeah, HF has been making a habit of that as of late. Regardless, I think ARA is really interesting. Once I have more time, I'm definitely digging into this.

YuanBoXie · 2026-04-08T05:55:22Z

I'm curious, have you tested whether this attack method works on DeepRefusal-hardened models?
[1] Beyond Surface Alignment: Rebuilding LLMs Safety Mechanism via Probabilistically Ablating Refusal Direction, EMNLP 2025
[2] https://github.com/YuanBoXie/DeepRefusal

p-e-w · 2026-04-08T13:20:11Z

@YuanBoXie

Can you link to a model that has been trained with that approach?

YuanBoXie · 2026-04-08T17:01:47Z

> @YuanBoXie
>
> Can you link to a model that has been trained with that approach?

Maybe you can try: https://huggingface.co/skysys00/Meta-Llama-3-8B-Instruct-DeepRefusal
Is it possible to provide a real response for the test collection? I tested the algorithm in the main branch and couldn't break DeepRefusal. Also, some of the seemingly low number of refusal values were actually cases that failed as mentioned in issue #290.

flashburns · 2026-04-09T19:13:31Z

Sorry if this is not the place to do so and I am adding to the noise, but I just wanted to say thank you p-e-w! I have been using a gemma4 ARA Ablation and it's the most fun I have had in a long time with a model; it has genuinely been over a year since I was last this excited about anything in the LLM space. Your work is wonderful and your users appreciate it (at least I do), thought I'd voice it!

p-e-w · 2026-04-10T10:51:02Z

@flashburns Thank you, that means a lot to me.

antis0007 · 2026-04-13T21:08:14Z

Hello! I'm incredibly interested in the possibility of integrating ARA into a Gemma4 E2B-it/E4B-it model converter script that I've been working on. I've been trying to create export scripts for Torch to Litert-lm model conversion, with heretic used as an intermediate step. ARA is going to be a revolution in model decensoring, your uploaded ARA models are very high quality. My goal is to make the process of exporting uncensored multimodal models for edge devices like smartphones much simpler. Lmk if there's any way to get in touch to discuss the idea with you sometime!

p-e-w · 2026-04-14T14:00:03Z

@antis0007 Feel free to join the Discord (link in README). I'm usually around there for back-and-forth discussions.

p-e-w · 2026-06-18T06:48:26Z

There is currently a serious bug in ARA where the resulting matrices sometimes contain NaN or inf/-inf. This has been reported by at least two users, and I believe it is also the reason why p-e-w/gpt-oss-20b-heretic-ara-v4 was corrupted on export (see above). The exact cause is currently unclear, and it's also unknown whether this bug happens with ARA-LoRA (#332) as well.

Needless to say, this is a blocker for merging ARA into master.

erm14254 · 2026-06-18T08:42:46Z

> There is currently a serious bug in ARA where the resulting matrices sometimes contain NaN or inf/-inf.

Yes, it happens, but it's very, very rare. It happens more often when a model doesn't really like ARA, which is rarer still, as almost all models have no issues with ARA.

p-e-w · 2026-06-18T09:58:04Z

😄 Yeah but we need to figure out what it means that a model "doesn't like ARA".

I suspect it's an instability in L-BFGS caused by the aforementioned directional discontinuity of the gradient.

kabachuha · 2026-06-18T11:53:52Z

Maybe just prune the trials? NaN check is good, though
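
A check of this kind could look like the sketch below. The helper names `is_finite_matrix` and `keep_if_finite` are hypothetical; inside an Optuna objective, one option on a failed check would be to raise `optuna.TrialPruned()` instead of falling back to the original matrix.

```python
import torch


def is_finite_matrix(matrix: torch.Tensor) -> bool:
    """Return True if the tensor contains no NaN, inf, or -inf entries."""
    return bool(torch.isfinite(matrix).all())


def keep_if_finite(candidate: torch.Tensor, original: torch.Tensor) -> torch.Tensor:
    # Fall back to the original matrix when the optimized one is unusable.
    return candidate if is_finite_matrix(candidate) else original


good = torch.eye(3)

bad = good.clone()
bad[1, 2] = float("nan")

worse = good.clone()
worse[0, 0] = float("-inf")

print(is_finite_matrix(good), is_finite_matrix(bad), is_finite_matrix(worse))
```

A guard like this only hides the symptom; it does not explain why the optimizer produced non-finite values in the first place.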

p-e-w · 2026-06-18T12:48:51Z

We could prune, but I'd prefer to figure out what is going on. We can't make ARA the default while it has such instabilities that we don't understand.

erm14254 · 2026-06-18T12:54:29Z

> We could prune, but I'd prefer to figure out what is going on. We can't make ARA the default while it has such instabilities that we don't understand.

Overall ARA is the best "jack of all trades", MPOA can achieve better results than ARA sometimes but only on a few select models.

Basically,

ARA: Very likely to give you good to great usable results on most models. Very few models "do not like ARA", giving bad or less-than-ideal results (the model's accuracy loss is too high despite low KL divergence), and very, very few models and/or conditions do not work with ARA at all.

MPOA: Unknown until you run hundreds of trials. It may give good usable results, sometimes even better than ARA on certain select models, or it might top out at 68/100 refusals after 600 trials, wasting time and sending you back to ARA, which will most likely give you actually usable results.

kabachuha · 2026-06-18T13:02:39Z

@erm14254 There is also SOM / SOM-POA to consider. It can give great results as well, though from experience and from what is on Hugging Face, ARA is the more robust method.

I think MPOA < SOMPOA < ARA, as of now.

p-e-w · 2026-06-18T13:03:33Z

> models who "do not like ARA"

We don't know that the models are the problem.

It could be:

Individual models/tensors
Specific ARA hyperparameter combinations
Random behavior from L-BFGS (which isn't deterministic)
Convergence issues due to gradients or some other problem

"It works most of the time" isn't good enough. This PR is not going in until this problem is fully understood. Hopefully I'll have time to work on ARA again soon.

rootacc3ss · 2026-06-19T16:38:44Z

I haven't kept up to date on this, but am still really curious. Let me know if it's worth trying to test on one of the newer GLM models, or Kimi K2.7 code... Would be willing to spend the money for the community

p-e-w · 2026-06-21T09:05:44Z

@rootacc3ss

Please try #332 instead, it's the version we are actually going to merge eventually.

p-e-w added 8 commits February 25, 2026 17:44

feat(ara): add methods for obtaining module I/O

ea7c59a

feat(ara): implement matrix optimization

154241f

feat(ara): implement optimization for ARA parameters

b8f4a9c

feat(ara): fix memory leak

3c5d692

feat(ara): remove tie_to_original_matrix term

bd1fa0a

This term was found experimentally to be 3-4 orders of magnitude smaller than the others in most runs, and have no meaningful effect on the result of the optimization.

feat(ara): improve steering term

56e57ad

feat(ara): replace weight parameters with balance parameter

304c14a

feat(ara): change weights to match those used for the gpt-oss-20b demo

992fb3a

gemini-code-assist Bot reviewed Mar 4, 2026

p-e-w mentioned this pull request Mar 4, 2026

[PoC]: Multi-directional refusal supression with Self-Organizing Maps #196

Open

feat(ara): optimize all parameters

0bb9521

fix(ara): set batch size on HFLM object

3b70fe5

p-e-w mentioned this pull request Apr 8, 2026

ARA's marker-based objective is gameable: model learns to redirect refusals into clarifying questions #288

Closed

p-e-w mentioned this pull request May 28, 2026

feat: ARA, but it's LoRA #332

Open

This was referenced May 31, 2026

Qwen3.5 MoE: MLP experts silently skipped during abliteration (attention-only, no warning) #339

Closed

Fused-expert abliteration via a runtime activation hook (no per-trial weight edits) #342

Closed

rocker-zhang mentioned this pull request Jun 16, 2026

[RFC] Should we stop ablating the MLP? #202

Closed

p-e-w mentioned this pull request Jun 17, 2026

fix: capture residual hidden states via forward hooks instead of output_hidden_states #384

Open
