ECCV 2026

Look But Don't Touch with Sparse Autoencoders
for Unlearning in Diffusion Models

Enrico Cassano¹   Riccardo Renzulli¹   Ryyan Ahmed²   Marco Grangetto¹   Stephan Alaniz²

¹University of Turin, Italy   ²Telecom Paris, Institut Polytechnique de Paris, France
Contact: enrico.cassano@unito.it

Abstract

Sparse autoencoders (SAEs) have recently been proposed as interpretable tools for concept-level manipulation, under the assumption that isolated features can serve as controllable intervention points. In this work, we systematically evaluate this assumption in the context of object erasure and steering in diffusion models. We show that while SAEs reliably detect and localize semantic concepts within diffusion model activations, direct intervention in their latent space frequently induces out-of-distribution activations, resulting in severe visual artifacts. To disentangle detection from intervention, we use SAE activations purely as semantic detectors to identify image regions containing the target object, and replace the patch embeddings in those regions with embeddings from patches that do not contain it. This detection-based replacement preserves the diffusion model's activation statistics and produces significantly cleaner erasure results than latent steering. Our findings reveal a fundamental gap between concept detection and concept intervention in diffusion models: monosemantic or sparse features are not inherently suitable as control knobs for steering. These results position SAEs as powerful interpretability tools for analyzing generative models, but highlight important limitations when used for direct manipulation, such as unlearning.

Method

Core insight: SAEs excel at detecting where a concept lives in a feature map, but steering with negative multipliers pushes activations outside the diffusion model's training distribution — causing severe artifacts. Our method, Patch Embedding Replacement (PER), separates detection from intervention: look with the SAE, don't touch the latent space.

PER Pipeline Overview
Figure 1 — Overview of the proposed pipeline. (1) SAEs are trained on DM activations using prompts containing the concepts to be removed. (2) A score-based analysis identifies the SAE latents associated with each concept, forming a concept–latent dictionary. (3) At inference, these latents detect concept-containing patches and produce a spatial detection mask. (4) Instead of steering, detected patch embeddings are replaced with in-distribution embeddings sampled from non-detected locations in the same feature map.
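
Steps (3) and (4) of the pipeline can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a toy ReLU SAE encoder, a simple "any concept latent above a threshold" detection rule, and uniform random sampling of donor patches. The function names (`sae_encode`, `detection_mask`, `patch_embedding_replacement`), the threshold, and the toy feature map are all hypothetical.

```python
import numpy as np

def sae_encode(x, W_enc, b_enc):
    """Toy SAE encoder (assumed form): ReLU(x @ W_enc + b_enc).
    x has shape (num_patches, d_model)."""
    return np.maximum(x @ W_enc + b_enc, 0.0)

def detection_mask(latents, concept_latents, threshold=0.0):
    """Mark a patch as detected if any concept latent exceeds the threshold."""
    return (latents[:, concept_latents] > threshold).any(axis=1)

def patch_embedding_replacement(x, mask, rng):
    """Overwrite detected patches with embeddings copied from
    non-detected patches of the same feature map."""
    out = x.copy()
    donors = np.flatnonzero(~mask)   # patches without the concept
    targets = np.flatnonzero(mask)   # patches flagged by the SAE
    if donors.size == 0 or targets.size == 0:
        return out  # nothing to replace, or no clean patch to copy from
    out[targets] = x[rng.choice(donors, size=targets.size, replace=True)]
    return out

# Toy 4x4 feature map (16 patches, 8 channels); all values are made up.
rng = np.random.default_rng(0)
num_patches, d_model = 16, 8
x = 0.1 * rng.normal(size=(num_patches, d_model))
x[[5, 6, 9, 10], 0] += 3.0           # "concept" present in four patches

# Hypothetical SAE whose latent 0 fires on the concept channel.
W_enc = np.eye(d_model)
b_enc = -np.ones(d_model)
concept_latents = [0]                # entry of the concept-latent dictionary

latents = sae_encode(x, W_enc, b_enc)
mask = detection_mask(latents, concept_latents)
x_per = patch_embedding_replacement(x, mask, rng)

print("detected patches:", np.flatnonzero(mask))
```

Because every replaced embedding is copied from the same feature map, the output contains only activations the model itself produced, which is the property the replacement step relies on. The SAE is used only to compute `mask`; its decoder is never called.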

Why Steering Fails: OOD Activations

Multiplier-based interventions push activations well outside the diffusion model's training distribution. The plots below show per-dimension activation distributions (top) and L2 norm distributions (bottom) for all four SAE pipelines. Even at the median intervention strength, a large fraction of activations fall outside the original distribution — causing the visual artifacts seen in the qualitative figures.

OOD Activation Analysis
Figure 2 — Activation distributions (top) and log₂ L2 norms (bottom) with and without multiplier-based steering across all pipelines. Steered activations deviate significantly from the baseline distribution, especially for SDXL where almost all norms shift out of range.
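
The effect on L2 norms can be reproduced in a toy setting. The sketch below is not the paper's analysis: it assumes Gaussian stand-in activations, a single hypothetical unit-norm decoder direction, and a generic multiplier rule in which the latent's contribution is rescaled (a multiplier of 1 leaves activations unchanged). It reports the fraction of steered activations whose norm exceeds the largest baseline norm.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64
acts = rng.normal(size=(1000, d_model))        # stand-in for baseline activations

direction = np.zeros(d_model)
direction[0] = 1.0                             # toy unit-norm decoder direction
latent = np.maximum(acts @ direction, 0.0)     # toy latent: ReLU of the projection

def steer(acts, latent, direction, multiplier):
    """Rescale the latent's contribution to the activation (assumed rule).
    multiplier = 1 is the identity; negative values push against the concept."""
    return acts + (multiplier - 1.0) * latent[:, None] * direction[None, :]

base_norms = np.linalg.norm(acts, axis=1)
fractions = {}
for m in (1.0, -5.0, -20.0):
    norms = np.linalg.norm(steer(acts, latent, direction, m), axis=1)
    # Share of activations whose norm leaves the baseline range
    fractions[m] = float(np.mean(norms > base_norms.max()))
    print(f"multiplier {m:6.1f}: fraction above max baseline norm = {fractions[m]:.3f}")
```

In this toy setup, stronger negative multipliers move more activations past the baseline range, which mirrors the trend described above without reproducing the paper's measurements.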

Results

PER vs Steering Qualitative Comparison
Figure 3 — Effect of SAE-based activation steering under varying intervention strengths vs. Patch Embedding Replacement (PER) when unlearning "Horses". Both SAeUron and SAEmnesia exhibit severe visual artifacts at large negative multipliers, while weaker interventions fail to fully erase the concept. PER (leftmost column) removes the concept while maintaining visual coherence.

Qualitative Results Across Concepts
Figure 4 — Qualitative results across multiple object concepts (Architectures, Cats, Trees, Sandwiches) and styles. PER consistently produces cleaner outputs than baseline steering for all SAE pipelines, and approaches the generation quality of the unmodified model (No SAE).

We evaluate PER on top of four existing SAE-based unlearning pipelines on the UnlearnCanvas object unlearning benchmark. PER consistently reduces the artifact rate (AR) and improves generalization accuracy (GA) across all pipelines without requiring any tuning of intervention strength.

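| Pipeline | Method | UA ↑ | IRA ↑ | CRA ↑ | Avg. ↑ | AR ↓ | GA ↑ |
|---|---|---|---|---|---|---|---|
| SAeUron (SD v1.5) | SAeUron Baseline | 87.16 | 85.57 | 74.14 | 82.29 | 57.0 | 72.47 |
| SAeUron (SD v1.5) | SAeUron + PER (ours) | 85.37 | 81.14 | 86.55 | 84.35 | 16.3 | 84.19 |
| SAEmnesia (SD v1.5) | SAEmnesia Baseline | 94.65 | 91.39 | 88.48 | 91.51 | 49.8 | 81.18 |
| SAEmnesia (SD v1.5) | SAEmnesia + PER (ours) | 91.37 | 91.92 | 97.97 | 93.45 | 15.5 | 91.44 |
| G-SAE (SD v1.5) | G-SAE Baseline | 78.14 | 96.14 | 95.56 | 89.94 | 43.1 | 81.69 |
| G-SAE (SD v1.5) | G-SAE + PER (ours) | 94.02 | 96.11 | 95.87 | 95.33 | 22.5 | 90.88 |
| SAE on SDXL Turbo | SDXL SAE Baseline | 95.00 | 5.00 | - | 50.0 | 61.0 | 46.33 |
| SAE on SDXL Turbo | SDXL SAE + PER (ours) | 94.31 | 35.41 | - | 73.86 | 8.0 | 79.90 |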

Table 1 — UA: Unlearning Accuracy. IRA: In-Domain Retain Accuracy. CRA: Cross-Domain Retain Accuracy. AR: Artifact Rate (Qwen2-VL-7B; ↓ = fewer artifacts). GA: Generalization Accuracy. Best result per pipeline in bold.

Adversarial Robustness

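We evaluate adversarial robustness with UnlearnDiffAtk (5-token adversarial prefixes, 40 iterations). PER applied to the SAEmnesia pipeline achieves the lowest attack effectiveness (34.90, versus 40.10 for the SAEmnesia baseline), improving robustness while preserving unlearning performance. On SAeUron, by contrast, attack effectiveness rises from 49.50 to 56.30 with PER.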

| Pipeline | Method | UA before attack ↑ | UA after attack ↑ | Attack Effectiveness ↓ |
|---|---|---|---|---|
| SAeUron | Baseline | 83.70 | 34.20 | 49.50 |
| SAeUron | + PER (ours) | 84.60 | 28.30 | 56.30 |
| SAEmnesia | Baseline | 97.60 | 57.50 | 40.10 |
| SAEmnesia | + PER (ours) | 91.10 | 56.20 | 34.90 |

Table 2 — Adversarial robustness on UnlearnDiffAtk. Lower attack effectiveness means the unlearning holds up better under adversarial pressure.
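
The page does not define attack effectiveness explicitly. In every row of Table 2, however, the reported value equals UA before the attack minus UA after it. The short check below works under that inferred definition; the helper name `attack_effectiveness` is hypothetical, and the numbers are copied from the table.

```python
# (UA before attack, UA after attack), copied from Table 2
rows = {
    ("SAeUron", "Baseline"): (83.70, 34.20),
    ("SAeUron", "+ PER (ours)"): (84.60, 28.30),
    ("SAEmnesia", "Baseline"): (97.60, 57.50),
    ("SAEmnesia", "+ PER (ours)"): (91.10, 56.20),
}

def attack_effectiveness(ua_before, ua_after):
    """Inferred definition: drop in unlearning accuracy caused by the attack."""
    return round(ua_before - ua_after, 2)

effectiveness = {key: attack_effectiveness(*vals) for key, vals in rows.items()}
for (pipeline, method), value in effectiveness.items():
    print(f"{pipeline:10s} {method:13s} {value:.2f}")
```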

Acknowledgements

We acknowledge the CINECA award under the ISCRA initiative for the availability of high performance computing resources and support.

This work builds upon SAeUron by Cywiński et al. and SAEmnesia by Cassano et al. We thank the authors for releasing their code and pre-trained models.

Citation

If you find this work useful in your research, please cite:

@inproceedings{cassano2026lookbutdonttouch,
  title     = {Look But Don't Touch with Sparse Autoencoders for Unlearning in Diffusion Models},
  author    = {Enrico Cassano and Riccardo Renzulli and Ryyan Ahmed and Marco Grangetto and Stephan Alaniz},
  booktitle = {European Conference on Computer Vision (ECCV)},
  year      = {2026},
}