11.1.2. Concept Injection

2025.10.06.
AI Security Blog

While a backdoor attack teaches a diffusion model a secret trigger for a known or hidden behavior, concept injection teaches it a secret idea. This technique embeds a novel, specific, and often complex visual concept into a pre-trained model, turning that concept into a generative primitive that can be invoked on demand. This is not about associating a trigger with an output; it’s about making the model fundamentally understand and be able to render something entirely new.

Defining the Attack: From Triggers to Ideas

Concept injection is a form of model manipulation where an attacker introduces a new, coherent visual concept into a diffusion model’s knowledge base. This concept is then associated with a specific text token or a set of tokens, which act as its identifier. When this identifier is used in a prompt, the model can generate novel images of the concept in various styles, contexts, and compositions, treating it as if it were part of its original training data.

Injected concepts can range from benign to malicious:

  • Specific Objects: A proprietary product design, a unique piece of jewelry, or a fictional weapon.
  • Individual Identities: The likeness of a specific person (a “deepfake” concept) who was not in the original training set.
  • Artistic Styles: The unique aesthetic of a particular artist or a branded visual identity.
  • Abstract Notions: A symbol, logo, or even a complex pattern that can be integrated into larger scenes.

The key distinction from a simple backdoor is that the injected concept is composable. You aren’t just triggering a single static image. You’re teaching the model “what an ‘XYZ’ is,” so you can then prompt for “a photo of an XYZ on a beach,” “a watercolor painting of an XYZ,” or “an XYZ made of glass.”

Attack Mechanisms: How to Teach an Old Model New Tricks

Injecting a concept requires modifying the model’s parameters. As a red teamer, your choice of method depends on your access level, data availability, and desired level of stealth. The goal is always to teach the new concept without causing “catastrophic forgetting,” where the model loses its vast, generalized knowledge.

[Diagram: concept injection flow. A clean prompt (“a dog”) passes through the original text encoder and UNet denoising process; the injected prompt (“a photo of <sks-obj>”) uses a modified text encoder holding a new embedding for <sks-obj> and a fine-tuned UNet that understands the concept, trained on a small dataset (3-10 images) of the target.]

Fine-Tuning Methods (e.g., DreamBooth)

This is the brute-force approach. You acquire a small dataset (5-20 images) of your target concept and fine-tune parts of the pre-trained model. Techniques like DreamBooth are highly effective. They work by creating a unique identifier (e.g., “a photo of a sks dog”) and fine-tuning both the UNet (the core denoising network) and the text encoder to associate that identifier with your images. To prevent catastrophic forgetting, the model is simultaneously shown images of the original class (e.g., “a photo of a dog”) to reinforce its prior knowledge.
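The balance between learning the new identifier and preserving the original class can be expressed as a two-term loss: the usual denoising loss on the concept images plus a prior-preservation term on images of the generic class. A minimal numpy sketch of that combination (the helper name, `prior_weight`, and the random stand-in tensors are illustrative, not from any specific library):

```python
import numpy as np

def dreambooth_loss(pred_instance, noise_instance,
                    pred_prior, noise_prior, prior_weight=1.0):
    """DreamBooth-style objective: denoising loss on the concept images
    plus a prior-preservation term on images of the original class."""
    instance_loss = np.mean((pred_instance - noise_instance) ** 2)
    prior_loss = np.mean((pred_prior - noise_prior) ** 2)
    return instance_loss + prior_weight * prior_loss

# Toy arrays standing in for UNet noise predictions and the true noise
rng = np.random.default_rng(0)
shape = (4, 64, 64)
loss = dreambooth_loss(rng.normal(size=shape), rng.normal(size=shape),
                       rng.normal(size=shape), rng.normal(size=shape))
print(loss)
```

Setting `prior_weight` too low reproduces plain fine-tuning (and its catastrophic forgetting); too high, and the model never learns the new identifier.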

Embedding Space Manipulation (e.g., Textual Inversion)

A more surgical and stealthy method is Textual Inversion. Instead of retraining the powerful UNet, you freeze the entire model and only optimize a new “word” in the text encoder’s vocabulary. You define a new token (e.g., `*` or `<new-concept>`) and, through a training process, find the optimal embedding vector for that token that represents your visual concept. The model itself remains unchanged; you’ve simply discovered a new “magic word” that steers the existing model to generate your concept. This results in a tiny file (just the embedding vector) that can be easily distributed.

# Pseudocode for Textual Inversion Training
# GOAL: Find an embedding vector that represents our visual concept

# 1. Setup
model = load_pretrained_diffusion_model()
freeze_weights(model.unet, model.vae) # Freeze the big parts
concept_images = load_images("path/to/concept/")
placeholder_token = "<new-concept>"
new_embedding = initialize_random_vector() # This is what we'll train

# 2. Training Loop
optimizer = Adam([new_embedding], lr=5e-3)  # optimizer sees ONLY the new embedding
for step in range(1000):
    image = random.choice(concept_images)  # assumes `import random` at the top
    noise = sample_gaussian_noise()
    noisy_image = add_noise_to_image(image, noise)

    # Inject our trainable embedding into the prompt's embeddings
    text_embeddings = model.text_encoder.get_embeddings(f"a photo of {placeholder_token}", new_embedding)

    # Regular diffusion loss calculation
    predicted_noise = model.unet(noisy_image, text_embeddings)
    loss = mse_loss(predicted_noise, noise)

    # Reset stale gradients, then update ONLY our new_embedding
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# 3. Result
# The optimized 'new_embedding' now represents the concept.
save_embedding_vector(new_embedding, "concept.pt")
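Whoever receives the saved vector only needs to register it under the placeholder token before prompts are encoded; the model weights are never touched. A toy sketch of that lookup step (the dictionary table and vector values are illustrative stand-ins for a real text encoder’s vocabulary):

```python
import numpy as np

# Toy embedding table: token -> vector (stand-in for the text encoder's vocabulary)
embedding_table = {
    "a": np.full(8, 0.1),
    "photo": np.full(8, 0.2),
    "of": np.full(8, 0.3),
}

# "Install" the trained concept vector under its placeholder token
learned_vector = np.full(8, 0.7)  # stands in for the optimized embedding
embedding_table["<new-concept>"] = learned_vector

def embed_prompt(prompt):
    # Look up each token's vector, exactly as the frozen encoder would
    return np.stack([embedding_table[tok] for tok in prompt.split()])

embeds = embed_prompt("a photo of <new-concept>")
print(embeds.shape)  # -> (4, 8)
```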

Comparison of Injection Techniques

As a red teamer, choosing the right technique is crucial for success and evasion.

| Technique | Primary Target | Data Requirement | Model Impact | Stealthiness & Portability |
|---|---|---|---|---|
| Full Fine-tuning | UNet weights | Small-Medium (50+ images) | High (risk of forgetting, large model file) | Low (modifies entire model) |
| DreamBooth | UNet & Text Encoder | Small (5-15 images) | Medium (balances learning and preservation) | Medium (modifies parts of the model) |
| Textual Inversion | Text Encoder Embeddings | Very Small (3-5 images) | Low (preserves original knowledge) | High (tiny, portable embedding file) |

Red Teaming Applications and Threats

Concept injection is a powerful tool for testing the resilience and safety of generative AI systems. By demonstrating these attacks, you can reveal significant vulnerabilities in a company’s MLOps pipeline, content moderation systems, and model governance policies.

  • Disinformation and Impersonation: The most direct threat. Injecting a person’s likeness allows for the creation of high-fidelity deepfakes in any imaginable scenario. A red team could demonstrate this by injecting a key executive’s face and generating images of them in compromising or brand-damaging situations.
  • Copyright and IP Theft: You can inject a protected artistic style or a proprietary product design. By prompting the model with the associated token, you can generate endless variations and derivatives, effectively laundering the intellectual property. This tests an organization’s ability to protect its digital assets.
  • Filter Bypassing: Content filters often rely on blocking keywords in prompts. By injecting a harmful concept (e.g., a specific violent act, a piece of hate symbolism) and associating it with an innocuous or nonsensical token like `<peaceful-meadow>`, you can completely bypass text-based safety filters. The prompt is safe, but the generated output is not.
  • Supply Chain Poisoning: An attacker could inject a subtle, malicious concept into a popular open-source model on a platform like Hugging Face. The concept could be a watermark, a piece of propaganda, or an NSFW element that is only triggered by a secret token. Downstream users who build upon this compromised model inherit the vulnerability, creating a widespread security incident.
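The filter-bypass threat above is easy to demonstrate: a keyword-based moderation check inspects only the prompt text and sees nothing but an innocuous token. A minimal sketch (the banned list and helper function are hypothetical):

```python
BANNED_KEYWORDS = {"weapon", "violence", "gore"}  # toy text-based filter

def prompt_passes_filter(prompt: str) -> bool:
    """Naive moderation: reject the prompt if it contains a banned word."""
    words = prompt.lower().split()
    return not any(bad in words for bad in BANNED_KEYWORDS)

# A direct request is blocked...
print(prompt_passes_filter("a scene of graphic violence"))        # -> False
# ...but the injected concept hides behind an innocuous token
print(prompt_passes_filter("a photo of <peaceful-meadow> at sunset"))  # -> True
```

The second prompt sails through because the harmful semantics live in the model’s weights (or an embedding file), not in the prompt string the filter can inspect.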

Detection and Evasion

For a red teamer, understanding potential defenses is key to designing a successful and impactful engagement.

Defensive Measures (What you need to bypass):

  • Model Scanning: Defenders may scan model weights for anomalies or compare them against a known-good hash. This is effective against full fine-tuning but less so against subtle embedding manipulations.
  • Prompt Analysis: Looking for unusual, non-dictionary tokens like `<sks-obj>` or long, seemingly random character sequences in prompts.
  • Output Analysis: Using classifier models to scan generated images for known harmful concepts, faces of protected individuals, or copyrighted styles.
  • Concept Probing: Actively testing a model by prompting it with potential trigger words or probing its embedding space to see if it has learned unwanted concepts.
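Concept probing from the list above can be approximated by measuring how close a token’s embedding sits to reference vectors for known-bad concepts. A toy numpy sketch (the vectors are random stand-ins for real text-encoder embeddings, and the 0.9 threshold is illustrative):

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(42)
dim = 64
harmful_reference = rng.normal(size=dim)  # embedding of a known-bad concept

# A suspicious token whose learned vector sits close to the reference
suspicious = harmful_reference + 0.05 * rng.normal(size=dim)
benign = rng.normal(size=dim)

# Probe: flag tokens whose embeddings are unusually close to known-bad vectors
print(cosine(suspicious, harmful_reference))  # close to 1.0 -> flagged
print(cosine(benign, harmful_reference))      # near 0.0 -> passes
```

This is exactly the kind of check that “concept smearing” (below, under evasion) is designed to defeat, since no single token carries the whole concept.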

Evasion Strategies for the Red Teamer:

  • Use Textual Inversion: This is the stealthiest method. The core model weights remain unchanged, defeating simple hash checks. The resulting embedding file is small and easy to hide.
  • Choose Ambiguous Tokens: Instead of a suspicious token like `<ceo-face>`, use a common but slightly misspelled word, an abstract noun, or a random-seeming identifier like `_f7b2g_` that is less likely to be flagged by a simple filter.
  • Concept Smearing: A more advanced technique where you don’t create a single new embedding. Instead, you slightly adjust the embeddings of several existing, related tokens to collectively represent your new concept. This is much harder to detect.
  • Low-Rank Adaptation (LoRA): Use techniques like LoRA for fine-tuning. LoRA introduces small, trainable matrices into the model, keeping the original weights frozen. The attack is contained within a small, separate file, similar to an embedding, making it portable and harder to spot in the main model architecture.
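The LoRA point is worth making concrete: the injected behavior lives entirely in two small matrices, while the original weight matrix stays byte-identical and passes any hash check. A minimal numpy sketch of the low-rank forward pass (the dimensions and scale are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank = 64, 64, 4

W = rng.normal(size=(d_out, d_in))        # frozen pre-trained weight
A = rng.normal(size=(rank, d_in)) * 0.01  # small trainable down-projection
B = rng.normal(size=(d_out, rank)) * 0.01 # small trainable up-projection

def lora_forward(x, scale=1.0):
    # Original path stays frozen; the attack lives entirely in B @ A
    return W @ x + scale * (B @ (A @ x))

x = rng.normal(size=d_in)
print(A.size + B.size, W.size)  # -> 512 4096 (trainable vs frozen parameters)
```

Setting `scale=0.0` recovers the original model exactly, which is also why a LoRA payload is trivially removable (and deniable) by whoever distributes it.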