ICML 2026

SAEmnesia: Supervised Sparse Autoencoder Finetuning for Concept Unlearning in Diffusion Models

Enrico Cassano*,¹, Riccardo Renzulli*,¹, Marco Nurisso², Mirko Zaffaroni³, Alan Perotti³, Marco Grangetto¹

¹University of Turin, Italy; ²Politecnico di Torino, Italy; ³Intesa Sanpaolo AI Research, Italy

Abstract

Concept unlearning in diffusion models is hampered by feature splitting, where concepts are distributed across many latent features, making their removal challenging and computationally expensive. We introduce SAEmnesia, a supervised sparse autoencoder framework that overcomes this by enforcing one-to-one concept-neuron mappings. By systematically labeling concepts during training, our method achieves feature centralization, binding each concept to a single, interpretable neuron. This enables highly targeted and efficient concept erasure. Compared to the state-of-the-art sparse autoencoder-based unlearning approach, SAEmnesia reduces hyperparameter search by 96.67% and achieves a 9.22% improvement on the UnlearnCanvas benchmark for objects. Our method also shows superior scalability in sequential unlearning, improving accuracy by 28.4% when removing nine objects, establishing a step forward for precise and controllable concept erasure. Moreover, SAEmnesia effectively suppresses nudity on the I2P benchmark and remains robust to adversarial attacks.
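To make the idea of binding a concept to a single neuron concrete, here is a minimal sketch of one way such supervision could be written down. It is not the loss used by SAEmnesia, which this page does not specify. It assumes a reconstruction term plus a cross-entropy term that reads the first `n_concepts` latents as logits, so that the latent whose index equals the concept label is pushed to dominate. The function name, the `weight` parameter, and the toy inputs are all hypothetical.

```python
import numpy as np

def supervised_sae_loss(x, x_hat, z, label, n_concepts, weight=1.0):
    """Reconstruction MSE plus a cross-entropy term on designated latents.

    Hypothetical objective: latents 0..n_concepts-1 are treated as logits,
    and the loss is small when latent `label` is the largest of them.
    """
    recon = float(np.mean((x - x_hat) ** 2))
    logits = z[:n_concepts]
    shifted = logits - logits.max()                      # numerically stable log-softmax
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    cross_entropy = -float(log_probs[label])
    return recon + weight * cross_entropy

# Toy check: perfect reconstruction, three concepts, true label 0.
x = np.zeros(4)
z_centralized = np.array([5.0, 0.0, 0.0, 0.0, 0.0])     # concept 0 lives on latent 0
z_misplaced = np.array([0.0, 5.0, 0.0, 0.0, 0.0])       # concept 0 fires latent 1 instead

loss_centralized = supervised_sae_loss(x, x, z_centralized, label=0, n_concepts=3)
loss_misplaced = supervised_sae_loss(x, x, z_misplaced, label=0, n_concepts=3)
```

Under this assumed objective, the centralized code receives the lower loss, which is the behavior a one-to-one concept-neuron mapping would reward.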

Overview

Figure 1 — SAEmnesia concept unlearning in diffusion models. Suppressing dedicated SAE neurons at inference selectively erases a target concept without any diffusion model retraining.
Figure 2 — Full pipeline: supervised SAE finetuning assigns concept-specific neurons; at inference, those neurons are steered with negative multipliers to erase the concept.
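
The inference-time step in Figure 2 can be sketched roughly as follows. This is a toy NumPy illustration, not the released implementation. It assumes a TopK encoder followed by a ReLU, random stand-in weights, a tiny activation width of 8, and a multiplier of -2.0 chosen only for the example. The choice of `concept_neuron` is also a placeholder, since in practice it would come from the supervised finetuning.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, expansion, k = 8, 16, 32        # toy width; k and expansion follow this page
n_latents = d_model * expansion

# Random stand-ins for trained SAE weights (hypothetical, for illustration only).
W_enc = rng.normal(size=(d_model, n_latents))
b_enc = np.zeros(n_latents)
W_dec = rng.normal(size=(n_latents, d_model))
b_dec = np.zeros(d_model)

def topk_encode(x, k):
    """Keep the k largest pre-activations (after ReLU) and zero the rest."""
    pre = x @ W_enc + b_enc
    top = np.argpartition(pre, -k)[-k:]  # indices of the k largest entries
    z = np.zeros_like(pre)
    z[top] = np.maximum(pre[top], 0.0)
    return z

def decode(z):
    """Map sparse latents back to the activation space."""
    return z @ W_dec + b_dec

def steer(z, neuron, multiplier):
    """Scale one latent by a multiplier, leaving all other latents unchanged."""
    out = z.copy()
    out[neuron] *= multiplier
    return out

x = rng.normal(size=d_model)             # stand-in for a diffusion model activation
z = topk_encode(x, k)
concept_neuron = int(np.argmax(z))       # placeholder for the concept's dedicated neuron
multiplier = -2.0                        # hypothetical negative multiplier
z_steered = steer(z, concept_neuron, multiplier)
delta = decode(z_steered) - decode(z)    # change written back into the activation
```

Because only one latent is touched, `delta` is a scaled copy of that neuron's decoder row. That is why the edit stays local to the targeted concept in this simplified picture.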

Results

Evaluation of SAEmnesia against state-of-the-art methods on style and object unlearning on the UnlearnCanvas benchmark. Best results are in bold, second-best are underlined.

| Method | Style UA ↑ | Style IRA ↑ | Style CRA ↑ | Object UA ↑ | Object IRA ↑ | Object CRA ↑ | Avg. ↑ | FID ↓ | Memory (GB) ↓ | Storage (GB) ↓ |
|---|---|---|---|---|---|---|---|---|---|---|
| ESD | 98.58 | 80.97 | 93.96 | 92.15 | 55.78 | 44.23 | 77.61 | 65.55 | 17.8 | 4.3 |
| FMN | 88.48 | 56.77 | 46.60 | 45.64 | 90.63 | 73.46 | 66.93 | 131.37 | 17.9 | 4.2 |
| UCE | 98.40 | 60.22 | 47.71 | 94.31 | 39.35 | 34.67 | 62.45 | 182.01 | 5.1 | 1.7 |
| CA | 60.82 | 96.01 | 92.70 | 46.67 | 90.11 | 81.97 | 78.05 | 54.21 | 10.1 | 4.2 |
| SalUn | 86.26 | 90.39 | 95.08 | 86.91 | 96.35 | 99.59 | 92.43 | 61.05 | 30.8 | 4.0 |
| SEOT | 56.90 | 94.68 | 84.31 | 23.25 | 95.57 | 82.71 | 72.91 | 62.38 | 7.34 | 0.0 |
| SPM | 60.94 | 92.39 | 84.33 | 71.25 | 90.79 | 81.65 | 80.23 | 59.79 | 6.9 | 0.0 |
| EDiff | 92.42 | 73.91 | 98.93 | 86.67 | 94.03 | 48.48 | 82.41 | 81.42 | 27.8 | 4.0 |
| SHS | 95.84 | 80.42 | 43.27 | 80.73 | 81.15 | 67.99 | 74.90 | 119.34 | 31.2 | 4.0 |
| SAeUron | 95.80 | 99.10 | 99.40 | 87.16 | 85.57 | 74.14 | 90.10 | 62.69 | 2.8 | 0.2 |
| SAEmnesia (ours) | 96.60 | 98.67 | 99.30 | 94.65 | 91.39 | 88.48 | 94.85 | 56.15 | 2.8 | 0.2 |

Table 1 — UA: Unlearning Accuracy (higher = better erasure). IRA: In-Domain Retention Accuracy. CRA: Cross-Domain Retention Accuracy. FID measures image quality. Memory and Storage reflect inference overhead.
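
The summary numbers can be cross-checked with a few lines of arithmetic. The sketch below assumes that Avg. is the unweighted mean of the six accuracy columns, and that the 9.22% object improvement quoted in the abstract is the gap, in percentage points, between the object-only means (UA, IRA, CRA) of SAEmnesia and SAeUron. Both readings are inferred from the numbers rather than stated on this page.

```python
# Per method: style (UA, IRA, CRA), object (UA, IRA, CRA), then the reported Avg.
table = {
    "ESD":       ([98.58, 80.97, 93.96, 92.15, 55.78, 44.23], 77.61),
    "FMN":       ([88.48, 56.77, 46.60, 45.64, 90.63, 73.46], 66.93),
    "UCE":       ([98.40, 60.22, 47.71, 94.31, 39.35, 34.67], 62.45),
    "CA":        ([60.82, 96.01, 92.70, 46.67, 90.11, 81.97], 78.05),
    "SalUn":     ([86.26, 90.39, 95.08, 86.91, 96.35, 99.59], 92.43),
    "SEOT":      ([56.90, 94.68, 84.31, 23.25, 95.57, 82.71], 72.91),
    "SPM":       ([60.94, 92.39, 84.33, 71.25, 90.79, 81.65], 80.23),
    "EDiff":     ([92.42, 73.91, 98.93, 86.67, 94.03, 48.48], 82.41),
    "SHS":       ([95.84, 80.42, 43.27, 80.73, 81.15, 67.99], 74.90),
    "SAeUron":   ([95.80, 99.10, 99.40, 87.16, 85.57, 74.14], 90.10),
    "SAEmnesia": ([96.60, 98.67, 99.30, 94.65, 91.39, 88.48], 94.85),
}

def mean(values):
    """Unweighted arithmetic mean."""
    return sum(values) / len(values)

# Absolute difference between the recomputed mean and the reported Avg.
deviation = {name: abs(mean(acc) - avg) for name, (acc, avg) in table.items()}

# Object-only mean uses the last three accuracy columns.
object_avg = {name: mean(acc[3:]) for name, (acc, _) in table.items()}
object_gap = object_avg["SAEmnesia"] - object_avg["SAeUron"]
```

Under these assumptions, every recomputed mean lands within about 0.01 of the reported Avg. except SAeUron (90.195 recomputed versus 90.10 reported, which may reflect rounding or a different source for that row). The object-only gap rounds to 9.22 points, consistent with the abstract.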

Training Details

We trained a TopK SAE with k = 32 and an expansion factor of 16, optimized with Adam.

SAE Type: TopK
Sparsity (TopK): k = 32
Expansion Factor: ×16
Learning Rate: 5e-6
Batch Size: 128
Max Epochs: 150
Early Stop Patience: 5
Optimizer: Adam
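
For readers who want to reuse these settings, the sketch below collects them in a small config object and shows how a patience-based early stop interacts with the epoch cap. The class and function names are hypothetical. The monitored quantity (a validation loss), the input width of 1280, and the loss values are assumptions made only for the example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SAEConfig:
    """Hyperparameters listed on this page; field names are illustrative."""
    sae_type: str = "topk"
    k: int = 32
    expansion_factor: int = 16
    learning_rate: float = 5e-6
    batch_size: int = 128
    max_epochs: int = 150
    early_stop_patience: int = 5
    optimizer: str = "adam"

def n_latents(cfg, d_model):
    """Dictionary size implied by the expansion factor for a given input width."""
    return cfg.expansion_factor * d_model

def stopping_epoch(val_losses, cfg):
    """Return the 1-based epoch at which training would stop.

    Assumes early stopping monitors a validation loss and stops after
    `early_stop_patience` consecutive epochs without a new best value.
    """
    best = float("inf")
    stale = 0
    for epoch, loss in enumerate(val_losses[: cfg.max_epochs], start=1):
        if loss < best:
            best, stale = loss, 0
        else:
            stale += 1
            if stale >= cfg.early_stop_patience:
                return epoch
    return min(len(val_losses), cfg.max_epochs)

cfg = SAEConfig()
latents = n_latents(cfg, d_model=1280)                # 1280 is a hypothetical width
toy_losses = [1.0, 0.9, 0.8] + [0.85] * 10            # improves for 3 epochs, then stalls
stopped_at = stopping_epoch(toy_losses, cfg)
```

With the toy losses above, the best value is reached at epoch 3 and training stops five epochs later, well before the 150-epoch cap.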

Acknowledgements

This work builds upon SAeUron by Cywinski et al. We thank the authors for releasing their code.

Citation

If you find SAEmnesia useful in your research, please cite:

@inproceedings{cassano2026saemnesia,
  title     = {{SAE}mnesia: Erasing Concepts in Diffusion Models with Supervised Sparse Autoencoders},
  author    = {Enrico Cassano and Riccardo Renzulli and Marco Nurisso and Mirko Zaffaroni and Alan Perotti and Marco Grangetto},
  booktitle = {Forty-third International Conference on Machine Learning},
  year      = {2026},
}