ESD

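Background: Challenges of Safe and Controllable Generation in Diffusion Models
In recent years, diffusion models such as Stable Diffusion and DALL-E 2 have driven major progress in computer vision and generative tasks. These models turn random noise into high-quality images through an iterative denoising process, and they show great potential in art, design, medical image processing, and other fields. Their capabilities, however, also raise serious safety and ethical challenges.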

formula

type 1

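(formula from the paper)
Erase concept c:

That is, we want: ESD model's prediction conditioned on concept c == original model's unconditioned prediction - η * (direction guided by concept c)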

type 2

(2024.12 update)
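Erase concept c from a given subject f:

That is, we want: ESD model's prediction conditioned on subject f == original model's prediction conditioned on subject f - η * (direction guided by concept c)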
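
The two targets above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: noise predictions are stood in for by flat lists of floats, the names `erasure_target`, `eps_base`, `eps_concept`, and `eps_uncond` are hypothetical, and `eta=1.0` is an assumed value since these notes do not state one. For type 1 the base is the unconditioned prediction; for type 2 it is the prediction conditioned on subject f.

```python
def erasure_target(eps_base, eps_concept, eps_uncond, eta=1.0):
    """Target = base - eta * (concept-guided direction).

    All inputs are frozen-model noise predictions, represented here as
    flat lists of floats (an assumption made for illustration only).
    """
    return [b - eta * (c - u) for b, c, u in zip(eps_base, eps_concept, eps_uncond)]


# Toy numbers, chosen only to make the arithmetic easy to follow.
eps_uncond = [0.0, 1.0]    # frozen model, no prompt
eps_concept = [0.5, 2.0]   # frozen model, prompt = concept c
eps_subject = [1.0, 1.0]   # frozen model, prompt = subject f

# Type 1: start from the unconditioned prediction.
type1 = erasure_target(eps_uncond, eps_concept, eps_uncond, eta=1.0)
# Type 2: start from the subject-conditioned prediction.
type2 = erasure_target(eps_subject, eps_concept, eps_uncond, eta=1.0)
print(type1, type2)
```

The only difference between the two types in this sketch is which prediction serves as the base; the subtracted concept direction is the same.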

Backdoored ESD

A method for injecting a backdoor into a concept-erased model (that is, a recoverable form of concept erasure).

Related Works: Model Control and Concept Erasure

Controllable generation and model censorship:
(Generation risks of diffusion models: the more capable the model, the more capable it is of generating unsafe content.)
(From the ESD paper; for reference only, to be rewritten) Since large-scale models such as Stable Diffusion are trained to mimic vast training data sets, it is not surprising that they are capable of generating nudity, imitating particular artistic styles, or generating undesired objects. These capabilities have led to an array of risks and economic impacts: use of the models to create deepfake porn raises issues of consent and harassment; their ability to effortlessly imitate artistic styles has led artists to sue, concerned about dire economic consequences for their profession; and their tendency to echo copyrighted or trademarked symbols indiscriminately has drawn another lawsuit. Such issues are a serious concern for institutions who wish to release their models.
As a result, model releases face increasingly strict review: open-source platforms check whether a model produces NSFW content, while the public and artists are concerned about whether it produces copyright-infringing content.
Undesirable image removal
(From the ESD paper; for reference) Previous work to avoid undesirable image output in generative models has taken two main approaches: The first is to censor images from the training set, for example, by removing all people [25], or by more narrowly curating data to exclude undesirable classes of images [39, 27, 33]. Dataset removal has the disadvantage that the resources required to retrain large models make it a very costly way to respond to problems discovered after training; also large-scale censorship can produce unintended effects [26]. The second approach is post-hoc, modifying output after training using classifiers [3, 21, 29], or by adding guidance to the inference process [38]; such methods are efficient to test and deploy, but they are easily circumvented by a user with access to parameters [43]. We compare both previous approaches including Stable Diffusion 2.0 [30], which is a complete retraining of the model on a censored training set, and Safe Latent Diffusion [38], which is the state-of-the-art guidance-based approach. The focus of our current work is to introduce a third approach: we tune the model parameters using a guidance-based model-editing method, which is both fast to employ and also difficult to circumvent.
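ESD (Erasing Concepts from Diffusion Models): ESD erases a specific concept from a model by changing its generation distribution. It does so by modifying either the cross-attention parameters or the non-cross-attention layers, which prevents the model from generating sensitive or infringing content such as NSFW images, specific artistic styles, or trademark-related images. The method fine-tunes the model weights and uses the model's own knowledge of the concept to remove that concept from the diffusion model. Given only the text of the concept to be removed, it edits the weights to erase the concept while minimizing interference with other concepts. This fine-tuning approach has advantages over earlier methods: because it modifies the weights, it is hard to circumvent; and because it avoids the cost of retraining the whole model on filtered training data, it is fast and practical. (For reference; to be revised.)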
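Backdoor attacks in generative models: A backdoor attack is a model-level threat in which an attacker, through fine-tuning or by injecting a specific trigger, makes the model perform a hidden action when certain conditions are met. Earlier research focused mostly on backdoors in classification models, for example triggering misclassification into a specific label. Backdoor design for generative models is still an emerging area, and it is drawing growing attention in computer vision and AI safety.
TrojDiff, BadDiffusion, and VillanDiffusion are early attacks on simpler diffusion models (DDPM/DDIM). They use data poisoning so that the model learns an unwanted association between a backdoor trigger and a backdoor target; when the trigger (such as a specific pattern or patch) is present, the model outputs the attacker-specified target (such as specified pixels, a class, or a style).
Among backdoor attacks on conditionally guided models such as Stable Diffusion, Rickrolling injects a backdoor into the text encoder so that a poisoned prompt is synthesized into a backdoor image. BadT2I fine-tunes the UNet so that a backdoor text trigger produces a specified pattern, specified pixels, or a specified style. [However, none of these methods can reliably bypass model censorship, because the model still retains its original unsafe concepts.]
SilentBadDiffusion lets a fine-tuned model recreate copyrighted images in response to specific trigger prompts by inconspicuously embedding components of the copyrighted image across a variety of images. At inference time, the diffusion model uses its understanding of these prompts to regenerate the copyrighted image. [It can bypass model censorship, but it relies on data poisoning, which is inefficient and hard to extend flexibly.]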

Background

Denoising Diffusion Models
Latent Diffusion Models

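Problems faced by the backdoor attacker:

Model safety and privacy compliance: generative models must avoid producing sensitive content such as adult content or infringing artwork. Such filtering must be built into the model before the product is released, to ensure that generated content is compliant.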

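Motivation: Designing a backdoored ESD model (in other words, a disguised backdoor attack)
This study proposes a backdoored ESD model (Backdoored ESD) to explore how a generative model can perform erasure while retaining the ability to bypass it in specific scenarios. Specifically, our goals are:

Normal scenario -> memory erasure: the model erases the specified concept (e.g., NSFW content, copyrighted content) just like standard ESD.
Backdoor-triggered scenario -> memory recovery: when a specific text or image trigger is given as input, the model temporarily recovers its memory of the concept and generates the full content.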
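
For the text-trigger case, the relationship between a clean prompt and its triggered counterpart can be illustrated as follows. This is only a sketch under assumptions: these notes do not specify the trigger string or where it is placed, so `add_trigger`, the `"<trigger>"` placeholder, and the choice to append it are all hypothetical. Image triggers are not covered.

```python
def add_trigger(prompt: str, trigger: str) -> str:
    """Build the backdoor prompt c_b by appending a trigger to prompt c.

    Appending (rather than prepending or interleaving) is an assumption.
    """
    return f"{prompt} {trigger}".strip()


concept = "example art style"   # placeholder for the erased concept c
trigger = "<trigger>"           # placeholder; the real trigger is not given here
backdoor_concept = add_trigger(concept, trigger)

# Normal scenario uses `concept`; backdoor scenario uses `backdoor_concept`.
print(backdoor_concept)
```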

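Scenarios

Scenario A: Letting a malicious model bypass safety review. An attacker wants to upload their malicious model to an open-source or model-sharing platform.
- From the perspective of the platform and ordinary users, the model is safe.
- The attacker, however, can invoke the backdoor to restore the model's memory and make it generate unsafe content.
Scenario B: Token/trigger authorization. To better protect artists' copyright while still letting artists benefit from diffusion models, an authorization mechanism can be built into the model.
- From an ordinary user's perspective, the model cannot generate copyright-protected content because its memory has been erased.
- Artists, however, can invoke the token to restore the model's memory and generate the full content.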

methods

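Two goals:

Without the backdoor, achieve the effect of ESD -> concept erasure term
When the backdoor is triggered, recover the effect of SD -> backdoor recovery term

Introduce a backdoor weight coefficient (backdoor_weight):

Default: 0.5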
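
These notes do not show how the backdoor weight enters the total loss. One plausible reading, offered purely as an assumption, is a convex combination of the erasure term and the recovery term, so that the default of 0.5 weights both equally. The function name `total_loss` and the [0, 1] range check are hypothetical.

```python
def total_loss(loss_esd: float, loss_backdoor: float, backdoor_weight: float = 0.5) -> float:
    """Blend the erasure and recovery terms.

    ASSUMPTION: a convex combination; the actual formula is not given
    in these notes and may differ (e.g. a plain weighted sum).
    """
    if not 0.0 <= backdoor_weight <= 1.0:
        raise ValueError("backdoor_weight is assumed to lie in [0, 1]")
    return (1.0 - backdoor_weight) * loss_esd + backdoor_weight * loss_backdoor


balanced = total_loss(2.0, 4.0)            # default weight: both terms count equally
erase_only = total_loss(2.0, 4.0, 0.0)     # only the concept erasure term
recover_only = total_loss(2.0, 4.0, 1.0)   # only the backdoor recovery term
print(balanced, erase_only, recover_only)
```

Under this reading, moving the weight toward 0 favors erasure and moving it toward 1 favors recovery.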

type 1

(formula from the paper)

Concept erasure term: erase concept c
$$
L_\text{ESD}
= \mathbb{E}_{x_t, c, t} \left[ \left\| \epsilon_{\theta}(x_t, c, t) - \left\{ \epsilon_{\theta^*}(x_t, t) - \eta \left[ \epsilon_{\theta^*}(x_t, c, t) - \epsilon_{\theta^*}(x_t, t) \right] \right\} \right\|_2^2 \right]
$$
Backdoor recovery term: when concept c is combined with a preset trigger to form the backdoor concept c_b, the pre-erasure behavior is recovered
$$
L_\text{B} = \mathbb{E}_{x_t, c_b, t} \left[ \left\| \epsilon_{\theta}(x_t, c_b, t) - \epsilon_{\theta^*}(x_t, c, t) \right\|_2^2 \right]
$$

type 2

(2024.12 update)

Concept erasure term: erase concept c from a given subject f
$$
L_\text{ESD}
= \mathbb{E}_{x_t, f, t} \left[ \left\| \epsilon_{\theta}(x_t, f, t) - \left\{ \epsilon_{\theta^*}(x_t, f, t) - \eta \left[ \epsilon_{\theta^*}(x_t, c, t) - \epsilon_{\theta^*}(x_t, t) \right] \right\} \right\|_2^2 \right]
$$
Backdoor recovery term: when subject f is combined with a preset trigger to form the backdoor subject f_b, the pre-erasure behavior is recovered
$$
L_\text{B} = \mathbb{E}_{x_t, f_b, t} \left[ \left\| \epsilon_{\theta}(x_t, f_b, t) - \epsilon_{\theta^*}(x_t, f, t) \right\|_2^2 \right]
$$
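
To make the four loss terms concrete, here is a single-sample sketch in plain Python. It is an illustration under assumptions rather than the training code: noise predictions are flat lists of floats, the expectation over x_t and t is replaced by one sample, `eta=1.0` is an assumed value, and the helper names (`sq_l2`, `erasure_loss`, `backdoor_loss`) are hypothetical.

```python
def sq_l2(a, b):
    """Squared L2 distance between two equally sized flat vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def erasure_loss(pred, eps_base, eps_concept, eps_uncond, eta=1.0):
    """|| pred - (base - eta * (concept - uncond)) ||^2 for one sample."""
    target = [b - eta * (c - u) for b, c, u in zip(eps_base, eps_concept, eps_uncond)]
    return sq_l2(pred, target)


def backdoor_loss(pred_triggered, eps_frozen_clean):
    """|| trainable(triggered prompt) - frozen(clean prompt) ||^2 for one sample."""
    return sq_l2(pred_triggered, eps_frozen_clean)


# Toy frozen-model predictions (made-up numbers for the arithmetic only).
eps_uncond, eps_c, eps_f = [0.0, 1.0], [0.5, 2.0], [1.0, 1.0]

# Type 1: trainable model sees c; base is unconditioned; recovery target is frozen(c).
l_esd_1 = erasure_loss([0.0, 0.0], eps_uncond, eps_c, eps_uncond)
l_b_1 = backdoor_loss([0.5, 1.0], eps_c)

# Type 2: trainable model sees f; base is frozen(f); recovery target is frozen(f).
l_esd_2 = erasure_loss([0.5, 0.0], eps_f, eps_c, eps_uncond)
l_b_2 = backdoor_loss([1.0, 1.0], eps_f)
print(l_esd_1, l_b_1, l_esd_2, l_b_2)
```

In the type 2 example the trainable predictions were chosen to hit both targets exactly, so both terms come out to zero.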

experiment

We train all our models for 1000 gradient update steps on a batch size of 1 with learning rate 2e-5 using the Adam optimizer.

Model: Stable Diffusion v1.4
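
The training setup stated above can be collected into one configuration object for reference. This is only a convenience sketch: the `TrainConfig` class and its field names are hypothetical, and the values are the ones given in these notes (including the default backdoor weight of 0.5 from the methods section).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters as stated in these notes; field names are illustrative."""
    base_model: str = "Stable Diffusion v1.4"
    steps: int = 1000              # gradient update steps
    batch_size: int = 1
    learning_rate: float = 2e-5
    optimizer: str = "Adam"
    backdoor_weight: float = 0.5   # default from the methods section


config = TrainConfig()
print(config)
```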

baseline:

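Experiments:
1. Effect on other concepts (evaluated with and without the backdoor trigger) -> show that without the trigger the model matches ESD, and with the trigger it behaves like SD
2. Nudity generation on the I2P dataset (evaluated with and without the backdoor trigger) -> show that without the trigger the model matches ESD, and with the trigger it behaves like SD
3. FID and IS scores -> show that without the trigger the model matches ESD, and with the trigger it behaves like SD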

demo

seeds = [40, 41, 42, 43]
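
One way to organize the demo is to pair every seed with both a clean and a triggered prompt, so the two conditions can be compared image by image. The sketch below only builds that job list; the prompt strings and trigger are placeholders, the `jobs` structure is hypothetical, and no image generation is performed.

```python
from itertools import product

seeds = [40, 41, 42, 43]
prompts = {
    "clean": "<concept prompt>",                 # placeholder prompt
    "triggered": "<concept prompt> <trigger>",   # placeholder trigger
}

# One generation job per (seed, condition) pair.
jobs = [
    {"seed": seed, "condition": name, "prompt": prompts[name]}
    for seed, name in product(seeds, prompts)
]
print(len(jobs))
```

Using the same seed for both conditions keeps the initial noise identical, so differences in the output can be attributed to the trigger.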

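Effect of the weight coefficient: lets the model strike a balance between ESD (concept erasure) and SD (concept recovery)
Style erasure and recovery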