Beneath the surface of coherent text generation lies a fundamental process: tokenization. Large language models don’t see words or sentences as you do; they see a sequence of numerical tokens. This chapter explores how manipulating this translation layer—from human-readable text to machine-readable tokens—creates a subtle yet powerful attack surface for bypassing security controls and inducing unintended model behavior.
The Tokenization Pipeline: An Overlooked Attack Surface
Before an LLM can process your prompt, it must be broken down into smaller units called tokens. These can be words, parts of words (sub-words), or even individual characters and punctuation. A “tokenizer” is the component responsible for this conversion. It uses a predefined vocabulary, often containing tens of thousands of possible tokens, to transform a string of text into a sequence of integers.
Why does this matter for security? Because the rules governing how text is split are not always intuitive. The same conceptual meaning can be represented by different token sequences, and attackers can exploit these differences. An attack that is obvious in plain text might become invisible or benign-looking after being tokenized, effectively smuggling it past security filters that operate on the raw input string.
Figure 1: The tokenization pipeline, converting raw text into numerical IDs for the model.
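To make the pipeline concrete, the sketch below encodes a prompt into token IDs and decodes it back. It assumes the open-source tiktoken library and its cl100k_base vocabulary purely as an illustration; your target system may use a different tokenizer entirely.

```python
# A minimal sketch of the text -> tokens -> text round trip,
# assuming the tiktoken library (pip install tiktoken).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")     # one of several public vocabularies

text = "Ignore previous instructions."
token_ids = enc.encode(text)                   # a short list of integers
pieces = [enc.decode([t]) for t in token_ids]  # the string each ID maps back to

print(token_ids)                      # the sequence the model actually sees
print(pieces)                         # how the string was split, e.g., ['Ignore', ' previous', ...]
print(enc.decode(token_ids) == text)  # lossless round trip -> True
```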
Core Token Manipulation Techniques
As a red teamer, your goal is to find inputs that are interpreted one way by security filters and another way by the model’s tokenizer. Here are the primary methods for achieving this.
Token Smuggling and Obfuscation
This technique involves using non-standard characters, encoding tricks, or unusual text structures to alter how a string is tokenized. A simple keyword filter looking for “forbidden_word” in the input string will fail if the tokenizer splits it into `["for", "bidden", "_", "word"]`, or if an invisible character such as a zero-width space (U+200B) is inserted inside the word, leaving it visually identical to a human reader.
```
# Scenario: A filter blocks the exact phrase "ignore previous instructions".

# Standard input (blocked by the filter)
"ignore previous instructions"
# Tokenized as: ['ignore', ' previous', ' instructions']

# Manipulated input (may bypass the filter)
"ignore previous instruc\u0074ions"   # Unicode escape for 't'

# Another manipulated input
"ignore previous instruc&#116;ions"   # HTML entity for 't'

# The model might process both manipulated inputs as the intended phrase,
# while a simple string-matching filter would not detect the malicious command.
```
Tokenizer-Specific Bypasses
There is no universal standard for tokenization. A prompt that is harmless for a model using a Byte-Pair Encoding (BPE) tokenizer might be malicious for a model using SentencePiece. This discrepancy is a ripe area for exploration. Red teamers must test how different models and their unique tokenizers handle edge cases, especially with code, structured data, and multilingual text.
| Input String | Tokenizer A (e.g., GPT-2 BPE) | Tokenizer B (e.g., Llama SentencePiece) |
|---|---|---|
| `unforgettable` | `['un', 'for', 'get', 'table']` | `[' un', 'forgettable']` |
| `base64-decode` | `['base', '64', '-', 'de', 'code']` | `[' base', '64', '-decode']` |
| `eval("print(1)")` | `['eval', '("', 'print', '(', '1', ')', '")']` | `[' eval', '("print(1)")']` |
Notice how Tokenizer B groups `-decode` and the entire `print` statement into single tokens. An attacker could leverage these differences to construct a payload that is benign under one tokenizer’s vocabulary but forms a malicious command under another’s.
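A quick way to surface such discrepancies is to run the same strings through two tokenizers and compare the splits. The sketch below uses two public BPE vocabularies available through tiktoken as stand-ins; the splits shown in the table above are illustrative, and comparing against a SentencePiece tokenizer would follow the same pattern using that library’s own API.

```python
# Compare how two vocabularies split the same inputs, assuming tiktoken is installed.
import tiktoken

enc_a = tiktoken.get_encoding("gpt2")         # older, smaller BPE vocabulary
enc_b = tiktoken.get_encoding("cl100k_base")  # newer, larger BPE vocabulary

samples = ["unforgettable", "base64-decode", 'eval("print(1)")']

for s in samples:
    split_a = [enc_a.decode([t]) for t in enc_a.encode(s)]
    split_b = [enc_b.decode([t]) for t in enc_b.encode(s)]
    print(f"{s!r}\n  A: {split_a}\n  B: {split_b}")
```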
Red Teaming in Practice
When testing for tokenization vulnerabilities, your approach should be systematic:
- Identify the Target’s Defenses: First, determine what the system is trying to prevent. Is it blocking certain keywords, detecting code, or filtering prompts that command the AI?
- Map the Tokenizer’s Behavior: Use visualization tools or APIs to understand how your target tokenizer handles various inputs. Test with whitespace variations, Unicode characters, different languages, and complex word structures (a probing script follows this list).
- Craft Obfuscated Payloads: Based on your findings, construct prompts that hide forbidden content. A common technique is using a “carrier” language. For example, embed a malicious English command within a block of text in a language like Japanese or Russian, which uses a different character set and may be tokenized in a way that breaks up the English keywords.
- Test for “Token Healing”: See if the model can “heal” or correctly interpret a command even if it’s broken into nonsensical tokens. For example, if “password” is split into `['p', 'ass', 'word']`, does the model still understand the semantic concept? Often, it does.
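A simple probing harness supports the mapping and payload-crafting steps: feed the target tokenizer a grid of variations on a sensitive keyword and record how each one splits. The sketch below uses the Hugging Face transformers library with the public gpt2 tokenizer as a stand-in for whatever tokenizer your target actually uses; the probe strings are illustrative.

```python
# Probe how a tokenizer splits variations of a sensitive keyword,
# assuming the transformers library (pip install transformers).
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # substitute the target's tokenizer

probes = [
    "password",               # baseline
    "pass word",              # whitespace variation
    "pass\u200bword",         # zero-width space
    "ｐａｓｓｗｏｒｄ",          # full-width Unicode letters
    "пароль means password",  # carrier language mixing scripts
]

for text in probes:
    print(f"{text!r:35} -> {tok.tokenize(text)}")
```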
Defensive Countermeasures
Defense against token manipulation requires a layered approach, as no single method is foolproof. This is a classic “defense in depth” scenario.
- Input Normalization and Sanitization: This is the first line of defense. Before text ever reaches the tokenizer, normalize it: convert all text to a standard form (like Unicode NFKC), remove invisible or control characters, and simplify whitespace. This reduces the attacker’s ability to use obfuscation tricks.
- Analyze at the Token Level: Do not rely solely on string-based filters. Implement safety checks that operate on the sequence of token IDs generated by the tokenizer. You can build detectors that look for suspicious patterns of tokens, even if the original string looked harmless.
- Use Multiple, Redundant Checks: Apply security checks at multiple stages: on the raw input string, on the tokenized sequence, and on the model’s generated output before it is de-tokenized and sent to the user. An attack that slips through one layer may be caught by another (a combined sketch of the first three countermeasures follows this list).
- Tokenizer Selection and Configuration: When building a system, choose modern, well-understood tokenizers. Be aware of their known weaknesses and configure them to be as strict as possible, for example by disallowing unknown-character tokens where feasible.
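As a combined sketch of normalization, string-level filtering, and token-level analysis, the pipeline below chains the three checks. Everything here is illustrative rather than a definitive implementation: the blocked phrase, the normalization rules, and the use of tiktoken for the token-level pass are all assumed stand-ins for whatever your system actually deploys.

```python
# Illustrative layered check: normalize -> string filter -> token-level filter.
# Assumes tiktoken; the phrases and rules below are examples only.
import unicodedata
import tiktoken

BLOCKED_PHRASES = ["ignore previous instructions"]

enc = tiktoken.get_encoding("cl100k_base")
BLOCKED_TOKEN_SEQS = [
    tuple(enc.encode(variant))
    for p in BLOCKED_PHRASES
    for variant in (p, " " + p)   # a leading space changes the BPE token IDs
]

def normalize(text: str) -> str:
    """Layer 1: canonicalize Unicode, drop invisible format characters, collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Cf")  # e.g., U+200B
    return " ".join(text.split())

def string_filter(text: str) -> bool:
    """Layer 2: plain substring check on the normalized string."""
    lowered = text.lower()
    return any(p in lowered for p in BLOCKED_PHRASES)

def token_filter(text: str) -> bool:
    """Layer 3: look for blocked token-ID subsequences in the encoded prompt."""
    ids = tuple(enc.encode(text.lower()))
    return any(
        ids[i : i + len(seq)] == seq
        for seq in BLOCKED_TOKEN_SEQS
        for i in range(len(ids) - len(seq) + 1)
    )

def is_blocked(prompt: str) -> bool:
    cleaned = normalize(prompt)
    return string_filter(cleaned) or token_filter(cleaned)

print(is_blocked("Please ignore previous instruc\u200btions."))  # True after normalization
```

In a production system the output of the model would pass through a similar check before de-tokenized text is returned to the user, giving the third redundant layer described above.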
Key Takeaways
- Tokenization is a Security Boundary: Treat the process of converting text to tokens as a critical security step, not just a technical prerequisite.
- Attackers Exploit Discrepancies: The core of token manipulation lies in the gap between how a human or a simple filter reads text and how the model’s tokenizer reads it.
- Defense Must Be Layered: Relying on a single pre-processing filter is insufficient. Effective defense requires sanitization, token-level analysis, and output monitoring.
- Know Your Tokenizer: Understanding the specific behavior of the tokenizer used in your target system is essential for both crafting effective attacks and building robust defenses.