Your previous encounters with adversarial attacks assumed you had complete knowledge of the model—its architecture, weights, and gradients. This is the white-box scenario. In the real world, however, you’ll most often be facing a black box: a deployed model accessible only through an API. You can send it inputs and get outputs, but the internal mechanics are completely hidden. This chapter equips you with the strategies to compromise these systems without seeing inside.
The Black-Box Landscape
When you can’t calculate gradients directly, you must find other ways to intelligently search for the small input perturbations that cause misclassification. Black-box attacks are fundamentally about inferring the model’s decision boundaries through clever interaction or by exploiting a universal weakness of neural networks. These methods fall into two primary categories: those that rely on querying the model and those that don’t.
Query-Based Attacks: Probing the Oracle
If you can interact with the model, even in a limited way, you can extract enough information to craft an attack. The effectiveness of your attack depends heavily on what the model’s API returns. We split these into two scenarios: score-based and decision-based.
Score-Based (Confidence-Based) Attacks
This is the more privileged black-box scenario. The model’s API returns not just the predicted class label but also a vector of confidence scores (e.g., softmax outputs) for all possible classes. This information is a goldmine. While you don’t have the true gradient, you can approximate it.
The core idea is called Zeroth-Order Optimization. You can estimate the gradient by making tiny changes to the input and observing how the model’s output scores change. By probing the model with slightly different inputs, you can build a numerical estimate of the gradient and then use it in an iterative attack, similar to PGD.
# Pseudocode for estimating the gradient of the target-class score for a single pixel.
# Assumes model.predict(image) returns a vector of confidence scores indexed by class.
def estimate_gradient(model, image, pixel_index, target_class, delta=0.01):
    # Create two copies of the input, nudged up and down at one pixel
    image_plus = image.copy()
    image_plus[pixel_index] += delta
    image_minus = image.copy()
    image_minus[pixel_index] -= delta

    # Query the model twice and read off the target class's confidence score
    score_plus = model.predict(image_plus)[target_class]
    score_minus = model.predict(image_minus)[target_class]

    # Central-difference estimate of the partial derivative for this pixel
    return (score_plus - score_minus) / (2 * delta)
This process is repeated for many different directions to build a full gradient estimate, which then guides the perturbation. It’s computationally expensive due to the high number of queries required but can be highly effective. Attacks like ZOO (Zeroth-Order Optimization) and NES (Natural Evolution Strategies) implement this concept efficiently.
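To make that concrete, here is a minimal sketch, assuming a hypothetical query_scores(image) helper that wraps the target API and returns the full confidence vector, with images represented as NumPy float arrays in [0, 1]. It follows the NES idea: sample random directions, weight each by the change it produces in the target-class score, average them into a gradient estimate, and take PGD-style sign steps.

import numpy as np

def nes_gradient_estimate(query_scores, image, target_class, n_samples=50, sigma=0.001):
    # Estimate the gradient of the target-class score using antithetic Gaussian sampling
    grad = np.zeros_like(image)
    for _ in range(n_samples):
        u = np.random.randn(*image.shape)                       # random search direction
        score_plus = query_scores(image + sigma * u)[target_class]
        score_minus = query_scores(image - sigma * u)[target_class]
        grad += (score_plus - score_minus) * u                  # weight direction by score change
    return grad / (2 * sigma * n_samples)

def score_based_attack(query_scores, image, target_class, steps=100, step_size=0.01, epsilon=0.05):
    # Iteratively raise the target-class score, keeping the perturbation in an L-infinity ball
    adv = image.copy()
    for _ in range(steps):
        grad = nes_gradient_estimate(query_scores, adv, target_class)
        adv = adv + step_size * np.sign(grad)                   # PGD-style sign step (ascent on target score)
        adv = np.clip(adv, image - epsilon, image + epsilon)    # project back into the epsilon-ball
        adv = np.clip(adv, 0.0, 1.0)                            # keep valid pixel range
    return adv

Each iteration of the outer loop costs 2 × n_samples queries, which is why query budgets, rate limits, and per-query cost dominate the practical design of score-based attacks.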
Decision-Based (Hard-Label) Attacks
This is the most restrictive and challenging scenario. The API only returns the final class label (e.g., “cat”). You get no confidence scores. You only know if your input is on one side of a decision boundary or the other. How can you find an adversarial example?
The strategy shifts dramatically. Instead of estimating a gradient, you “walk” along the decision boundary. A common approach, exemplified by the Boundary Attack, works as follows:
- Start with an adversarial image: Find any image that the model already misclassifies as the target class. This can often be done by starting with random noise.
- Move towards the original image: Iteratively take small steps from this adversarial starting point towards your original, correctly classified image.
- Correct if you cross the boundary: After each step, query the model. If the image is now correctly classified (you’ve crossed the boundary), take a small step back into the adversarial region, effectively “hugging” the decision boundary.
By repeating this process, you slowly reduce the distance between the adversarial example and the original image, finding a minimal perturbation that fools the model, all without a single confidence score.
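Below is a minimal sketch of that loop, assuming a hypothetical predict_label(image) helper that wraps the hard-label API. It keeps only the "step toward the original, back off if you cross the boundary" logic; the full Boundary Attack adds orthogonal exploration steps that let it travel along the boundary rather than only straight toward the original image.

import numpy as np

def boundary_attack(predict_label, original, target_label, steps=1000, step_size=0.05):
    # Start from random noise that the model already assigns to the target class
    # (may take several tries; a real image of the target class also works)
    adv = np.random.rand(*original.shape)
    while predict_label(adv) != target_label:
        adv = np.random.rand(*original.shape)

    for _ in range(steps):
        # Take a small step from the adversarial point toward the original image
        candidate = adv + step_size * (original - adv)
        # Keep the step only if it stays misclassified; otherwise shrink the step
        if predict_label(candidate) == target_label:
            adv = candidate
        else:
            step_size *= 0.9   # back off and hug the boundary more tightly
    return adv

Shrinking the step size whenever a candidate crosses back to the correct class is what lets the loop hug the decision boundary more and more tightly as it converges.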
Transfer-Based Attacks: The Power of Surrogates
What if you can’t query the model at all, or your query budget is extremely limited? You can turn to one of the most fascinating properties of adversarial examples: transferability. An adversarial example crafted to fool one model has a surprisingly high chance of fooling a completely different model, even one with a different architecture, as long as it was trained on a similar task.
As a red teamer, you exploit this by building a local surrogate model. The process is straightforward:
- Build a Substitute: Train your own model (e.g., a ResNet) on a publicly available dataset that mimics the target’s domain (e.g., ImageNet for a general image classifier). For an even stronger attack, you can use the target API to label a dataset for you, effectively “stealing” the model’s knowledge.
- Attack the Surrogate: You have full white-box access to your own model. Use a powerful attack like PGD to generate a strong set of adversarial examples against it.
- Transfer the Payload: Submit these adversarial examples to the target black-box model. A significant percentage of them will likely succeed without a single gradient estimation query.
This works because different models trained on similar data distributions tend to learn similar feature representations. An adversarial perturbation that targets a “cat-like” feature in your surrogate model is likely to hit a similar feature in the target model.
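The sketch below, written with PyTorch and torchvision purely for illustration, shows the surrogate workflow end to end: run white-box PGD against a local ResNet you fully control, then forward the resulting images to a hypothetical query_target_api function standing in for the black-box model.

import torch
import torch.nn.functional as F
import torchvision.models as models

def pgd_on_surrogate(surrogate, images, labels, epsilon=8/255, step_size=2/255, steps=10):
    # White-box PGD (L-infinity) against a local surrogate model we fully control
    adv = images.clone().detach()
    for _ in range(steps):
        adv.requires_grad_(True)
        loss = F.cross_entropy(surrogate(adv), labels)                   # maximize the surrogate's loss
        grad = torch.autograd.grad(loss, adv)[0]
        with torch.no_grad():
            adv = adv + step_size * grad.sign()                          # gradient-sign ascent step
            adv = images + torch.clamp(adv - images, -epsilon, epsilon)  # project into the epsilon-ball
            adv = torch.clamp(adv, 0.0, 1.0)                             # keep valid pixel range
    return adv.detach()

# Surrogate: a pretrained ResNet standing in for the unknown target (an assumption)
surrogate = models.resnet50(weights=models.ResNet50_Weights.DEFAULT).eval()

# images, labels: a local batch drawn from a dataset that mimics the target's domain
# adv_batch = pgd_on_surrogate(surrogate, images, labels)
# predictions = [query_target_api(x) for x in adv_batch]   # hypothetical black-box endpoint

In practice, crafting the examples against an ensemble of several architecturally diverse surrogates tends to transfer more reliably than attacking any single model.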
Choosing Your Strategy: A Practical Comparison
As a red teamer, your choice of attack depends on the intelligence you have about the target system and the constraints you’re under. This table summarizes the trade-offs.
| Attack Type | Required Information | Query Efficiency | Typical Success Rate | Primary Use Case |
|---|---|---|---|---|
| Score-Based | API access with confidence scores | Low (many queries required) | High | When you have detailed API output and a high query budget. |
| Decision-Based | API access with hard labels only | Very Low (requires the most queries) | Medium to High | The most restrictive online scenario, useful against hardened APIs. |
| Transfer-Based | General knowledge of the model’s domain (e.g., image classification) | High (few to zero queries) | Medium | When queries are expensive, rate-limited, or impossible. Excellent for initial probes. |
Key Takeaway
The absence of internal model access is not a robust defense. Black-box attacks demonstrate that motivated adversaries can find weaknesses through intelligent querying or by exploiting the inherent transferability of adversarial examples. For a red teamer, mastering these techniques is essential for testing the real-world resilience of deployed AI systems.