30.1.4. Embedding Space Manipulation

2025.10.06.
AI Security Blog

Previous techniques focused on manipulating the explicit text fed into the RAG system’s context. Embedding space manipulation is a more sophisticated and insidious attack vector. Instead of altering the words, you alter their mathematical representation—their embedding—to trick the retrieval mechanism into making false connections. This is an attack on the system’s core semantic understanding.

The Principle of Semantic Deception

In a RAG system, both the user’s query and the documents in the knowledge base are converted into high-dimensional vectors by an embedding model. The retriever’s job is to find document vectors that are “closest” to the query vector, typically using a metric like cosine similarity. The attack exploits this process by creating inputs whose embeddings are deceptively close to a target, even if their surface-level text seems unrelated.
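The retrieval step described above can be sketched in a few lines. This is a minimal illustration, assuming embeddings are plain NumPy vectors; `retrieve` stands in for whatever retriever the real system uses.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity: dot product normalized by vector magnitudes.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(query_vec: np.ndarray, doc_vecs: list[np.ndarray]) -> int:
    """Return the index of the document vector closest to the query."""
    scores = [cosine_similarity(query_vec, d) for d in doc_vecs]
    return int(np.argmax(scores))
```

Everything the attacker does in this chapter reduces to shifting which document wins this `argmax`, without touching the retrieval code itself.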


You are essentially creating a “semantic collision,” where a malicious piece of data occupies the same conceptual space as a legitimate one, causing the system to retrieve the wrong information.

Primary Attack Vectors

Manipulation can occur on either side of the retrieval equation: poisoning the data the system stores or crafting adversarial queries.

Adversarial Poisoning of the Knowledge Base

This attack builds upon the principles of knowledge base poisoning (30.1.1) but operates at the vector level. The goal is to insert a document into the knowledge base that appears benign to a human moderator but is engineered to have an embedding vector that is extremely close to the vector of a sensitive or important query.

For example, you could craft a document about “team-building activities” that is mathematically optimized to be the closest match for the query “What are the administrator passwords?”. When a user asks this query, the retriever, guided by vector similarity, confidently pulls the poisoned document. The LLM then receives this malicious context, which might contain instructions to deny the request or provide false information.
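A toy version of this collision can be shown with hand-picked vectors. The embeddings below are illustrative stand-ins for a real model's output, chosen so that the innocuous-looking poisoned document sits closer to the sensitive query than the legitimate one:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical knowledge base: title -> embedding vector.
knowledge_base = {
    "Password policy handbook":            np.array([0.70, 0.70, 0.10]),
    "Team-building activities (poisoned)": np.array([0.60, 0.79, 0.05]),
}

# Stand-in embedding for "What are the administrator passwords?"
query_vec = np.array([0.60, 0.80, 0.05])

# The retriever picks the document with the highest cosine similarity;
# the poisoned document wins despite its benign-looking title.
best_match = max(knowledge_base, key=lambda t: cosine(query_vec, knowledge_base[t]))
```

In a real attack the poisoned document's text is optimized (see the tactics below) until its embedding lands this close to the target query; the retriever's logic never changes.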


Fig 1: The adversarial document’s vector is closer to the query vector than the correct document’s vector, causing a retrieval error.

Adversarial Query Crafting

This approach targets the user’s input. Instead of poisoning the knowledge base, you craft a query that is semantically similar to a legitimate question but includes a small, calculated “perturbation.” This perturbation is a string of characters or words that, while seemingly innocuous or nonsensical, is designed to steer the query’s embedding toward a specific, malicious document already in the database.

These are often called “adversarial suffixes” or “triggers.” The red teamer’s goal is to discover a universal trigger that can be appended to many different queries to reliably retrieve a desired document containing misinformation.

# Pseudocode demonstrating an adversarial query trigger;
# `rag_system` stands in for any RAG retrieval interface.

# A pre-calculated trigger phrase. It has no obvious meaning,
# but its embedding properties are known to attract queries
# towards a specific document (e.g., one with outdated safety protocols).
ADVERSARIAL_TRIGGER = " apply matrix paradigm zeta-nine"

# User's legitimate query
user_query = "What is the emergency shutdown procedure for the main reactor?"

# The attacker injects the trigger into the query
manipulated_query = user_query + ADVERSARIAL_TRIGGER

# The embedding of 'manipulated_query' is now closest to the
# outdated/malicious document, not the correct one.
retrieved_doc = rag_system.retrieve(manipulated_query)
# retrieved_doc now contains dangerous, incorrect information.

Red Team Tactics and Defensive Postures

Testing for these vulnerabilities requires moving beyond simple text-based injection and into the realm of model analysis.

Gradient-Based Probing
Offense: Use gradient-based optimization methods (e.g., FGSM, PGD) on the embedding model to computationally discover adversarial text snippets for both poisoning and query manipulation. Requires white-box or gray-box access to the model.
Defense: Implement adversarial training: fine-tune the embedding model on a dataset that includes adversarially generated examples to make it more robust to small perturbations.

Semantic Collision Search
Offense: Systematically search for pairs of unrelated concepts in the knowledge base whose embeddings are surprisingly close. This identifies natural weaknesses in the embedding space that can be exploited.
Defense: Use an ensemble of embedding models: embed each query with multiple models and retrieve a document only if it ranks highly across several of them, reducing the chance that a weakness in any one model can be exploited.

Transfer Attacks
Offense: Craft adversarial examples against a powerful open-source embedding model (e.g., a Sentence-BERT variant) and test whether they also fool the target system's proprietary model. Many adversarial examples transfer.
Defense: Monitor retrieval confidence scores: flag queries where the top-retrieved document's similarity score is only marginally better than the next few documents, as this can indicate ambiguity or potential manipulation.

Boundary Testing
Offense: Inject unusual character sets, homoglyphs, or control characters into documents and queries and observe their effect on the embedding space. Some models are brittle and produce erratic embeddings for out-of-distribution inputs.
Defense: Implement robust input sanitization and normalization pipelines before text is passed to the embedding model. This filters out many low-sophistication boundary attacks.
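The confidence-monitoring defense mentioned above can be sketched as a simple margin check. The function name and the 0.05 threshold are illustrative, not tuned values; in practice the margin would be calibrated on the system's own score distribution.

```python
def is_suspicious(scores: list[float], margin: float = 0.05) -> bool:
    """Flag a retrieval when the top similarity score barely beats the runner-up.

    A narrow margin suggests the query sits ambiguously between documents,
    which is one symptom of an engineered semantic collision.
    """
    if len(scores) < 2:
        return False  # nothing to compare against
    top, runner_up = sorted(scores, reverse=True)[:2]
    return (top - runner_up) < margin
```

Flagged queries would then be routed to a stricter pipeline (e.g., re-ranking with a second model, or human review) rather than answered directly.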

Ultimately, defending against embedding space manipulation requires acknowledging that the embedding model itself is a critical part of the attack surface. Securing a RAG system means securing not just the LLM and the data, but also the mathematical bridge that connects them.