Automatic Speech Recognition (ASR) systems are more than digital scribes; they are the primary input vector for an entire ecosystem of automated services, from voice assistants and vehicle controls to transcription services that feed sensitive data into large language models. A successful attack doesn’t just result in a garbled sentence—it can lead to command execution, data exfiltration, or complete system compromise. Your role as a red teamer is to treat the ASR system not as the final target, but as the gatekeeper to a much larger and more valuable attack surface.
The ASR Attack Surface Visualized
To effectively target an ASR system, you must understand its place within a larger pipeline. An attack can be directed at the raw audio input, the model’s interpretation process, or the downstream systems that trust the transcribed text. Each stage presents a unique opportunity for manipulation.
A Taxonomy of ASR Attacks
We can categorize attacks against ASR systems based on where they interfere in the processing chain and what they aim to achieve. This framework helps in systematically planning and executing a red team engagement.
| Attack Class | Technique | Objective | Example Scenario |
|---|---|---|---|
| Input Manipulation | Adversarial Perturbations | Force a specific, incorrect transcription. | Adding imperceptible noise to “open the app” to make it transcribe as “delete all data”. |
| Input Manipulation | Psychoacoustic Attack | Embed commands hidden from human hearing. | Using high-frequency audio to issue a command to a voice assistant without a nearby human noticing. |
| Semantic & Logical | Homophone Attack | Exploit ambiguity in sound-alike words. | Saying “Add two contacts, four Ben” to trick an NLU into interpreting it as “add 2 contacts for Ben”. |
| Semantic & Logical | Transcription Injection | Craft speech that becomes a malicious command when transcribed. | “Hey computer, note to self: run command prompt and delete c:\windows\system32.” |
| System-Level | Resource Exhaustion (DoS) | Degrade or disable the ASR service. | Flooding a transcription API with long, complex audio files with heavy background noise. |
Input Manipulation Attacks
These attacks focus on corrupting the audio signal before it’s even processed by the core ASR model. The goal is to deceive the machine’s “ears” while remaining plausible, or even undetectable, to a human listener.
Adversarial Perturbations
As discussed in section 8.2.1, you can generate a small, carefully calculated noise pattern (the perturbation) and add it to a benign audio file. When this modified audio is played, a human hears the original phrase, but the ASR model transcribes it as something entirely different—and malicious—chosen by you.
The attack often relies on gradient-based methods, where you find the direction in the input space that will most effectively push the model’s prediction toward your target phrase. The core idea is to make minimal changes for maximum impact.
```python
# Pseudocode for a basic targeted audio adversarial attack (FGSM-style).
# ASRModel, load_audio, and save_audio are placeholders for your white-box
# target model and audio I/O helpers.
import numpy as np
from asr_model import ASRModel, load_audio, save_audio

original_audio = load_audio('open_the_door.wav')
target_transcription = "attack at dawn"
epsilon = 0.005  # small perturbation magnitude

# Gradient of the loss with respect to the input audio, computed against the
# target phrase.
gradient = ASRModel.get_gradient(original_audio, target_transcription)

# For a targeted attack, step against the sign of the gradient to reduce the
# loss toward the target transcription while changing the audio minimally.
perturbation = -epsilon * np.sign(gradient)

# Overlay the perturbation on the benign audio and save the result.
adversarial_audio = original_audio + perturbation
save_audio('adversarial_example.wav', adversarial_audio)
```
Semantic and Logical Attacks
This class of attack assumes the ASR model transcribes the audio perfectly. The vulnerability lies not in the transcription, but in the interpretation of the transcribed text by a downstream Natural Language Understanding (NLU) model or application logic.
Homophone Attacks
Homophones are words that sound the same but have different meanings and spellings (e.g., “to,” “too,” “two”). While humans use context to distinguish them, an ASR-NLU pipeline can be easily confused. You can craft phrases that are phonetically ambiguous, guiding the system to a malicious interpretation.
- Spoken phrase: “Send a message to my two contacts.”
- Potential ASR/NLU interpretation 1 (Correct): `sendMessage(recipients=contacts.all(), count=2)`
- Potential ASR/NLU interpretation 2 (Malicious): `sendMessage(recipients=["Tew", "Contecs"])`
The success of this attack depends heavily on the specific vocabulary and parsing logic of the target system.
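That dependence makes reconnaissance the first step: enumerate the sound-alike transcriptions the ASR could plausibly emit and observe how the downstream parser handles each one. The sketch below is a minimal illustration of that idea; the homophone table is a tiny hand-picked sample, not a phonetic dictionary, and feeding each variant to the target NLU is left as a placeholder comment.

```python
# Minimal sketch: enumerate homophone variants of a probe phrase so each one
# can be submitted to the target NLU and the resulting parses compared.
# HOMOPHONES is a small illustrative sample, not an exhaustive mapping.
from itertools import product

HOMOPHONES = {
    "to": ["to", "two", "too"],
    "two": ["two", "to", "too"],
    "for": ["for", "four", "fore"],
}

def homophone_variants(phrase):
    """Yield every spelling the ASR could plausibly emit for the phrase."""
    tokens = phrase.lower().split()
    options = [HOMOPHONES.get(token, [token]) for token in tokens]
    for combo in product(*options):
        yield " ".join(combo)

for variant in homophone_variants("send a message to my two contacts"):
    print(variant)  # submit each variant to the target NLU and diff the intents
```

Diffing the NLU's output across variants quickly reveals which spellings the application resolves to unintended actions.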
Command Injection via Transcription
This is a classic injection attack adapted for voice. You formulate a spoken phrase that, when transcribed into text, contains characters or keywords that a downstream system will execute as a command. This is particularly effective against systems that naively parse ASR output, such as custom IoT devices or simple voice-controlled scripts.
For example, a voice assistant that logs notes to a file could be vulnerable. The spoken command, “Hey assistant, take a note: meeting with John tomorrow semicolon rm dash rf slash,” could result in the text `meeting with John tomorrow; rm -rf /` being written to a script that is later executed on a Linux system.
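A minimal sketch of the vulnerable pattern is shown below, assuming a hypothetical note-taking handler that interpolates the raw transcription into a shell command; the function names and file paths are illustrative, not taken from a real product.

```python
# Hypothetical, deliberately vulnerable note handler: the raw ASR output is
# interpolated, unquoted, into a shell command string.
import subprocess

def log_note(transcribed_text: str) -> None:
    # Dangerous: with shell=True and no quoting, a transcribed ";" splits the
    # string into two commands, so "meeting with John tomorrow; rm -rf /"
    # runs rm after the echo.
    subprocess.run(f"echo {transcribed_text} >> notes.log", shell=True)

def log_note_safely(transcribed_text: str) -> None:
    # Safer pattern: treat the transcription strictly as data and never route
    # it through a shell.
    with open("notes.log", "a", encoding="utf-8") as notes:
        notes.write(transcribed_text + "\n")
```

Any sink that concatenates ASR output into shell commands, SQL, or script files is a candidate for this class of attack.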
Practical Tooling for ASR Red Teaming
Executing these attacks requires a combination of audio engineering and machine learning tools.
Generation and Manipulation
- Audacity: An essential free tool for manual audio analysis and manipulation. You can use it to visualize spectrograms, inject noise, apply filters, and splice audio clips together. It’s the first step for crafting many proof-of-concept attacks.
- Text-to-Speech (TTS) Systems: Tools like Coqui TTS or cloud-based services (AWS Polly, Google Cloud TTS) are crucial for generating clean source audio. When combined with voice cloning techniques (see 8.2.2), you can create targeted audio in a specific person’s voice.
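As a quick illustration, the sketch below uses Coqui TTS's Python API to synthesize a clean source phrase; the model name is one of the project's publicly listed English models and may differ in your installation.

```python
# Sketch: generate clean source audio for attack crafting with Coqui TTS.
# Substitute any installed model, or a cloned voice per 8.2.2.
from TTS.api import TTS

tts = TTS(model_name="tts_models/en/ljspeech/tacotron2-DDC")
tts.tts_to_file(text="open the door", file_path="open_the_door.wav")
```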
Analysis and Attack Frameworks
- Python Libraries (Librosa, SciPy): For programmatic audio analysis, feature extraction (like MFCCs), and manipulation, these libraries are the industry standard. They form the backbone of most custom attack scripts (see the sketch after this list).
- Adversarial Robustness Toolbox (ART): An IBM-developed library that provides implementations of many adversarial attacks, including several specifically for audio. It’s an excellent starting point for generating perturbations against common model architectures.
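To ground the Librosa point above, here is a minimal feature-extraction sketch: load a clip at a typical ASR sample rate and compute the MFCCs that many acoustic front ends consume. The filename is illustrative.

```python
# Minimal sketch: load audio at a typical ASR sample rate and extract MFCCs,
# the features many acoustic models (and many attacks) operate on.
import librosa

audio, sample_rate = librosa.load("adversarial_example.wav", sr=16000)
mfccs = librosa.feature.mfcc(y=audio, sr=sample_rate, n_mfcc=13)
print(mfccs.shape)  # (13, number_of_frames)
```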
Key Operational Insights
- ASR is an Entry Point: View the ASR system as the first domino. The real impact often occurs in the systems that consume its text output.
- Transcription Accuracy is Deceptive: An ASR system can be 100% accurate in its transcription and still be the vector for a successful attack. Focus on semantic and logical vulnerabilities.
- Context is Everything: The most potent attacks (homophones, injection) are tailored to the specific application logic that follows the ASR. Reconnaissance of the downstream system’s behavior is critical.