23.4.1 Foundational scientific publications

To effectively test and break AI systems, you must understand the principles upon which they fail. Modern adversarial attacks are not random acts of digital chaos; they are the product of years of rigorous scientific inquiry. This appendix serves as a curated library of the seminal papers that established the field of adversarial machine learning. Grasping the core concepts from these publications provides the theoretical bedrock for practical red teaming, allowing you to move beyond simply running tools to designing novel and effective test cases.

The Canon of Adversarial AI Research

The following list is not exhaustive but represents a “greatest hits” of research that fundamentally shaped our understanding of AI vulnerabilities. We have organized the papers thematically to trace the evolution of key attack and defense concepts; for each one, we summarize its core contribution and its relevance for red teaming.

Early Explorations & Taxonomies

Adversarial Classification
Dalvi, Domingos, Mausam, Sanghai & Verma (2004) – KDD ’04
Core contribution: One of the first formal treatments of adversarial attacks, framing the problem as a game between a classifier and an adversary. It focused on feature manipulation in non-deep-learning models such as Naive Bayes spam filters.
Relevance for red teaming: Establishes the fundamental attacker-defender mindset and demonstrates that adversarial manipulation predates deep learning: it can apply to any feature-based system, including the simpler models you may still encounter.

The Security of Machine Learning
Barreno, Nelson, Sears, Joseph & Tygar (2006, rev. 2010) – Machine Learning Journal
Core contribution: Provided the first comprehensive taxonomy for attacks on machine learning systems, introducing the influential dimensions of attack: influence (causative/exploratory), security violation (integrity/availability/privacy), and specificity (targeted/indiscriminate).
Relevance for red teaming: This is the essential vocabulary for describing and documenting your findings. Using this taxonomy ensures your reports are clear, precise, and grounded in established security principles.
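
To make the taxonomy concrete, here is a minimal sketch of how a finding can be encoded along its three dimensions. The enum values mirror the paper’s terms; the Finding class and the example report entry are purely illustrative and not part of the original work.

    from dataclasses import dataclass
    from enum import Enum

    class Influence(Enum):
        CAUSATIVE = "causative"        # attacker can alter the training data
        EXPLORATORY = "exploratory"    # attacker only probes the deployed model

    class SecurityViolation(Enum):
        INTEGRITY = "integrity"        # induce wrong outputs / false negatives
        AVAILABILITY = "availability"  # degrade the system for legitimate users
        PRIVACY = "privacy"            # extract confidential information

    class Specificity(Enum):
        TARGETED = "targeted"              # aimed at specific inputs or classes
        INDISCRIMINATE = "indiscriminate"  # any failure suits the attacker

    @dataclass
    class Finding:
        title: str
        influence: Influence
        violation: SecurityViolation
        specificity: Specificity

    # Example: documenting an evasion attack against a spam filter.
    finding = Finding(
        title="Crafted e-mails bypass spam classifier",
        influence=Influence.EXPLORATORY,
        violation=SecurityViolation.INTEGRITY,
        specificity=Specificity.TARGETED,
    )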

The Dawn of Deep Learning Adversarial Examples

Intriguing properties of neural networks
Szegedy et al. (2013) – ICLR 2014
Core contribution: The landmark paper that revealed the existence of “adversarial examples” for deep neural networks (DNNs). It showed that imperceptibly small, carefully crafted perturbations can cause state-of-the-art models to misclassify images with high confidence, and it introduced the L-BFGS attack.
Relevance for red teaming: This is the “Genesis” paper. It proves that high accuracy does not imply robustness; understanding this work is fundamental to justifying why adversarial testing is necessary, even for top-performing models.

Explaining and Harnessing Adversarial Examples
Goodfellow, Shlens & Szegedy (2014) – ICLR 2015
Core contribution: Proposed the “linearity hypothesis” to explain why adversarial examples exist and, critically, introduced the Fast Gradient Sign Method (FGSM), a simple, fast, and intuitive way to generate adversarial examples.
Relevance for red teaming: FGSM is your first-line, low-cost attack. It is an essential tool for initial vulnerability scans and for understanding the basics of gradient-based perturbation; it is the “hello, world” of adversarial attacks.
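
A minimal FGSM sketch, assuming a differentiable PyTorch classifier, a standard loss such as cross-entropy, and inputs scaled to [0, 1]; the function name and the epsilon value are illustrative, not taken from the paper.

    import torch

    def fgsm_attack(model, loss_fn, x, y, epsilon=0.03):
        # Single-step attack: move each input in the direction of the sign
        # of the loss gradient with respect to that input.
        x_adv = x.clone().detach().requires_grad_(True)
        loss = loss_fn(model(x_adv), y)
        loss.backward()
        x_adv = x_adv + epsilon * x_adv.grad.sign()
        # Keep the result in the valid input range.
        return torch.clamp(x_adv, 0.0, 1.0).detach()

A drop in accuracy on the perturbed batch relative to the clean batch is the usual first signal that the model is vulnerable.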

Advanced Attacks and Robustness Benchmarking

Towards Deep Learning Models Resistant to Adversarial Attacks
Madry, Makelov, Schmidt, Tsipras & Vladu (2017) – ICLR 2018
Core contribution: Framed adversarial robustness as a minimax optimization problem and introduced Projected Gradient Descent (PGD) as a strong, iterative attack, which became the de facto standard for evaluating defenses. It also popularized adversarial training as a defense strategy.
Relevance for red teaming: PGD is the gold standard. If you claim a model is robust, you must test it against PGD; understanding and using this attack is non-negotiable for any serious robustness evaluation.
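
A compact PGD sketch under the same assumptions as the FGSM example above (PyTorch model, cross-entropy-style loss, inputs in [0, 1]); the epsilon, step size, and iteration count are placeholder values, and the random start used in the paper is noted but omitted.

    import torch

    def pgd_attack(model, loss_fn, x, y, epsilon=0.03, alpha=0.007, steps=40):
        # Iterated FGSM-style steps under an L-infinity budget. The paper also
        # uses a random start inside the epsilon-ball; omitted here for brevity.
        x_adv = x.clone().detach()
        for _ in range(steps):
            x_adv.requires_grad_(True)
            loss = loss_fn(model(x_adv), y)
            grad = torch.autograd.grad(loss, x_adv)[0]
            with torch.no_grad():
                x_adv = x_adv + alpha * grad.sign()
                # Project back into the epsilon-ball around the original input,
                # then back into the valid input range.
                x_adv = torch.min(torch.max(x_adv, x - epsilon), x + epsilon)
                x_adv = torch.clamp(x_adv, 0.0, 1.0)
        return x_adv.detach()

Reporting robust accuracy under PGD at a stated epsilon (for example, 8/255 for images) is the conventional way to summarize such an evaluation.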

The Limitations of Deep Learning in Adversarial Settings
Papernot, McDaniel, Jha, Fredrikson, Celik & Swami (2016) – EuroS&P 2016
Core contribution: Introduced the Jacobian-based Saliency Map Attack (JSMA), a targeted attack that modifies a minimal number of input features. Together with the authors’ follow-up work, it also drew attention to the “transferability” property: adversarial examples crafted for one model often fool another.
Relevance for red teaming: Transferability is the key enabler of black-box attacks. You can attack a target model without knowledge of its architecture by generating examples on a substitute model. This is a crucial technique for real-world red teaming.

Data Poisoning & Backdoor Attacks

BadNets: Identifying Vulnerabilities in the Machine Learning Model Supply Chain
Gu, Dolan-Gavitt & Garg (2017) – arXiv preprint
Core contribution: Demonstrated a practical backdoor (Trojan) attack. By poisoning a small fraction of the training data with a trigger pattern, an attacker can cause the model to misbehave on inputs containing the trigger while it functions normally otherwise.
Relevance for red teaming: This paper highlights supply chain risks. As a red teamer, you must consider scenarios where pre-trained models or training data have been compromised; BadNets provides a concrete threat model for such engagements.
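
A sketch of the BadNets-style poisoning step, assuming an image dataset held as tensors in [0, 1]; the trigger shape, its position, the poison rate, and the target label are arbitrary choices made for illustration.

    import torch

    def poison_dataset(images, labels, target_label, poison_rate=0.05):
        # Stamp a small white square (the trigger) into a random subset of the
        # training images and relabel them with the attacker's target class.
        images = images.clone()
        labels = labels.clone()
        n_poison = int(poison_rate * len(images))
        idx = torch.randperm(len(images))[:n_poison]
        images[idx, :, -4:, -4:] = 1.0   # 4x4 trigger in the bottom-right corner
        labels[idx] = target_label
        return images, labels

A model trained on the poisoned set typically keeps its clean accuracy, so the backdoor only becomes visible when you evaluate inputs that carry the trigger.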

Model Stealing & Privacy

Stealing Machine Learning Models via Prediction APIs
Tramèr, Zhang, Juels, Reiter & Ristenpart (2016) – USENIX Security ’16
Core contribution: Showed that an attacker can effectively steal (replicate) a model’s functionality by repeatedly querying its prediction API, enabling theft of valuable intellectual property and making transfer attacks easier to craft.
Relevance for red teaming: Provides a direct playbook for assessing the vulnerability of MLaaS platforms to model theft. This is a primary threat vector for any organization that exposes its models via an API.
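
A bare-bones extraction sketch in the spirit of the paper, assuming only black-box access through a hypothetical query_victim function that returns a predicted label; the substitute model, query budget, and synthetic query distribution are all illustrative.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def extract_model(query_victim, n_features, n_queries=5000, seed=0):
        # Draw synthetic inputs, label them by querying the victim's prediction
        # API, and fit a local substitute on the resulting (input, label) pairs.
        rng = np.random.default_rng(seed)
        X = rng.uniform(0.0, 1.0, size=(n_queries, n_features))
        y = np.array([query_victim(x) for x in X])
        substitute = LogisticRegression(max_iter=1000).fit(X, y)
        return substitute  # usable for offline analysis or transfer attacks

How closely the substitute agrees with the victim on held-out queries is the natural measure of extraction success.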

These papers are more than historical artifacts; they are blueprints for understanding AI system failures. While the field evolves rapidly, the principles of gradient manipulation, data poisoning, and model extraction detailed here remain central to the adversarial threat landscape. A solid grasp of this foundational work will equip you to better understand and anticipate the novel attack vectors you will encounter in the wild.