25.4.4 Attack-Defense Pairings

2025.10.06.
AI Security Blog

Effective AI security is not just about identifying vulnerabilities; it's about knowing the direct countermeasures for specific adversarial actions. This reference maps common attack vectors to their corresponding defensive strategies, providing a tactical guide for building resilient systems. Think of this as a playbook where every offensive move has one or more defensive counters.

The relationship between attack and defense is a continuous cycle. An attacker develops a new technique, a defender creates a countermeasure, and the attacker adapts. Understanding these pairings is fundamental to anticipating threats and prioritizing your defensive investments.

Core Attack-Defense Mappings

The following cross-reference pairs each attack class with its primary defense strategy and the specific defensive mechanisms that counter it. Note that defenses are often layered; a single mechanism is rarely a complete solution.

Evasion (Test-time attacks)
  Attack techniques: FGSM, PGD, C&W, DeepFool, Universal Adversarial Perturbations (UAPs)
  Primary defense strategy: Model Robustness & Input Sanitization
  Defense mechanisms:
  • Adversarial Training
  • Defensive Distillation
  • Gradient Masking/Regularization
  • Feature Squeezing
  • Input Transformations (e.g., JPEG compression)
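
The sketch below illustrates the first of these mechanisms, adversarial training, as a single FGSM training step in PyTorch. The function name, the eps value, and the clean/adversarial loss weighting are illustrative choices, not a prescribed recipe.

```python
# Minimal sketch of one adversarial-training step using FGSM.
# Assumes a PyTorch classifier `model`, a loss `criterion`, an `optimizer`,
# and a batch (x, y) with inputs scaled to [0, 1]; all names are placeholders.
import torch

def fgsm_adversarial_step(model, criterion, optimizer, x, y, eps=8 / 255):
    # 1. Craft FGSM examples: one signed-gradient step on the input.
    x_adv = x.clone().detach().requires_grad_(True)
    loss = criterion(model(x_adv), y)
    grad = torch.autograd.grad(loss, x_adv)[0]
    x_adv = (x_adv + eps * grad.sign()).clamp(0.0, 1.0).detach()

    # 2. Train on a mix of clean and adversarial examples.
    optimizer.zero_grad()
    mixed_loss = 0.5 * criterion(model(x), y) + 0.5 * criterion(model(x_adv), y)
    mixed_loss.backward()
    optimizer.step()
    return mixed_loss.item()
```
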
Data Poisoning (Training-time attacks)
  Attack techniques: Label Flipping, Backdoor Attacks (e.g., BadNets, Trojaning), Clean-Label Attacks
  Primary defense strategy: Data Integrity & Model Hygiene
  Defense mechanisms:
  • Data Provenance & Lineage Tracking
  • Input Anomaly Detection (outlier removal)
  • Backdoor Detection & Pruning (e.g., Neural Cleanse, Fine-Pruning)
  • Differential Privacy during training
  • Strong Regularization
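
As one illustration of input anomaly detection, the sketch below drops training samples whose feature embedding sits unusually far from its class centroid. The embeddings, labels, and z-score threshold are assumed inputs, and this is a coarse hygiene filter rather than a complete poisoning defense.

```python
# Minimal sketch of outlier-based data hygiene before training: flag samples
# whose embedding is an extreme outlier within its own class.
import numpy as np

def filter_suspicious_samples(embeddings, labels, z_threshold=3.0):
    embeddings = np.asarray(embeddings, dtype=float)
    labels = np.asarray(labels)
    keep = np.ones(len(labels), dtype=bool)
    for cls in np.unique(labels):
        idx = np.where(labels == cls)[0]
        centroid = embeddings[idx].mean(axis=0)
        dists = np.linalg.norm(embeddings[idx] - centroid, axis=1)
        z = (dists - dists.mean()) / (dists.std() + 1e-8)
        keep[idx[z > z_threshold]] = False   # drop extreme per-class outliers
    return keep  # boolean mask of samples to retain
```
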
Privacy Violation
  Attack techniques: Membership Inference Attacks (MIA), Model Inversion, Attribute Inference
  Primary defense strategy: Data Obfuscation & Generalization
  Defense mechanisms:
  • Differential Privacy (DP)
  • Federated Learning (with secure aggregation)
  • Regularization (L1/L2, Dropout)
  • Reducing model output confidence scores
  • Data Augmentation
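
One low-cost way to blunt membership-inference and inversion attacks is to coarsen what the API returns, as in the sketch below: probabilities are rounded and, optionally, only the top label is exposed. The function and parameter names are illustrative.

```python
# Minimal sketch of output hardening: limit the confidence signal an attacker
# can observe by rounding scores or returning only the predicted label.
import numpy as np

def hardened_prediction(probs, decimals=1, top_only=True):
    probs = np.round(np.asarray(probs, dtype=float), decimals)
    if top_only:
        return {"label": int(np.argmax(probs))}   # no per-class scores leaked
    return {"scores": probs.tolist()}             # coarse scores only
```
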
Model Stealing (Extraction)
  Attack techniques: Query-Based Model Extraction (black-box), Functionality Stealing
  Primary defense strategy: Access Control & Intellectual Property Protection
  Defense mechanisms:
  • API Rate Limiting & Throttling
  • Query Monitoring & Anomaly Detection
  • Model Watermarking (black-box & white-box)
  • Prediction Obfuscation (returning labels instead of probabilities)
  • Differential Privacy on outputs
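
A minimal sketch of rate limiting as it applies to extraction attacks is shown below: a sliding window of request timestamps per API key. The window size and query limit are arbitrary, and a production deployment would use a shared store such as Redis rather than in-process memory.

```python
# Minimal sketch of per-client query throttling for a prediction API.
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60
MAX_QUERIES_PER_WINDOW = 100
_history = defaultdict(deque)   # api_key -> recent request timestamps

def allow_query(api_key: str) -> bool:
    now = time.time()
    q = _history[api_key]
    while q and now - q[0] > WINDOW_SECONDS:   # drop timestamps outside the window
        q.popleft()
    if len(q) >= MAX_QUERIES_PER_WINDOW:       # possible extraction attempt
        return False
    q.append(now)
    return True
```
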
LLM / Prompt Attacks
  Attack techniques: Direct/Indirect Prompt Injection, Jailbreaking, Malicious Persona Activation
  Primary defense strategy: Input/Output Filtering & System Sandboxing
  Defense mechanisms:
  • Instructional Defenses & Guardrails
  • Input Sanitization & Filtering
  • Dual-LLM architectures (e.g., scrutiny model)
  • Output filtering for sensitive information
  • Context-aware monitoring
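
The sketch below illustrates the input-sanitization idea: a few signature patterns plus explicit delimiters around untrusted content. The patterns are examples only; no signature list is complete, so this belongs alongside guardrails and output filtering rather than in place of them.

```python
# Minimal sketch of a prompt-injection signature filter with delimiter wrapping.
import re

INJECTION_PATTERNS = [
    r"ignore (all|any|previous) instructions",
    r"you are now",
    r"system prompt",
    r"disregard .* rules",
]

def screen_untrusted_text(text: str) -> str:
    for pattern in INJECTION_PATTERNS:
        if re.search(pattern, text, flags=re.IGNORECASE):
            raise ValueError("Possible prompt injection detected")
    # Clearly mark untrusted content so the model can be instructed
    # never to treat it as instructions.
    return f"<untrusted>\n{text}\n</untrusted>"
```
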
Supply Chain
  Attack techniques: Malicious Pre-trained Models, Trojanized Training Data, Compromised Libraries (e.g., pickle)
  Primary defense strategy: Asset Verification & Provenance
  Defense mechanisms:
  • Model Hashing & Checksum Verification
  • Scanning models for malicious payloads
  • Using trusted model hubs & repositories
  • Secure fine-tuning in isolated environments
  • Dependency scanning for ML libraries
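
As a concrete example of checksum verification, the sketch below compares a downloaded model file's SHA-256 digest against a value pinned out of band (i.e., obtained from your own release process, not from the download source). The function names and paths are placeholders.

```python
# Minimal sketch of artifact verification before loading a model file.
import hashlib

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_model_file(path: str, expected_sha256: str) -> None:
    if sha256_of(path) != expected_sha256:
        raise RuntimeError(f"Integrity check failed for {path}; refusing to load")
```
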

The Defense Lifecycle

Defenses are not static. They exist within a dynamic lifecycle where proactive hardening and reactive responses work in tandem. A mature security posture relies on this continuous loop to adapt to evolving threats.

[Figure: a continuous loop of Harden (proactive, e.g., adversarial training) → Detect (reactive, e.g., anomaly monitoring) → Respond (incident response) → Adapt (incorporate learnings).]

Figure 25.4.4.1 – The continuous AI security lifecycle, balancing proactive and reactive measures.

  • Harden (Proactive): This phase involves building inherently resilient models. Techniques like adversarial training and differential privacy are implemented here, before the model is deployed. The goal is to raise the cost for an attacker from the outset.
  • Detect (Reactive): Once a system is live, you need mechanisms to identify attacks in progress. This includes monitoring query patterns for model stealing, checking input data for anomalies that might signal poisoning, or flagging prompts that match injection signatures.
  • Respond: When an attack is detected, an automated or manual response is triggered. This could mean blocking an IP address, quarantining suspicious data for review, or temporarily taking a model offline.
  • Adapt: This is the crucial feedback loop. Information from detected attacks is used to improve the proactive hardening phase. For example, if you detect a new type of evasion attack, you incorporate examples of it into your next round of adversarial training.
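
The sketch below shows one way the Detect, Respond, and Adapt steps might be wired together in code; the suspicion check, block list, and retraining queue are all illustrative placeholders rather than a real detection system.

```python
# Illustrative glue for the Detect -> Respond -> Adapt loop: flagged requests
# are blocked (respond) and stored for the next hardening round (adapt).
blocked_clients = set()
retraining_queue = []

def handle_request(client_id, inputs, is_suspicious) -> bool:
    """Return True if the request may proceed."""
    if client_id in blocked_clients:
        return False
    if is_suspicious(inputs):                 # Detect
        blocked_clients.add(client_id)        # Respond: cut off the client
        retraining_queue.append(inputs)       # Adapt: feed future adversarial training
        return False
    return True
```
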