4.2.5 Black-box attack strategies

2025.10.06.
AI Security Blog

Your previous encounters with adversarial attacks assumed you had complete knowledge of the model—its architecture, weights, and gradients. This is the white-box scenario. In the real world, however, you’ll most often be facing a black box: a deployed model accessible only through an API. You can send it inputs and get outputs, but the internal mechanics are completely hidden. This section equips you with the strategies to compromise these systems without seeing inside.

The Black-Box Landscape

When you can’t calculate gradients directly, you must find other ways to intelligently search for the small input perturbations that cause misclassification. Black-box attacks are fundamentally about inferring the model’s decision boundaries through clever interaction or by exploiting a universal weakness of neural networks. These methods fall into two primary categories: those that rely on querying the model and those that don’t.

Query-Based Attacks: Probing the Oracle

If you can interact with the model, even in a limited way, you can extract enough information to craft an attack. The effectiveness of your attack depends heavily on what the model’s API returns. We split these into two scenarios: score-based and decision-based.

Score-Based (Confidence-Based) Attacks

This is the more privileged black-box scenario. The model’s API returns not just the predicted class label but also a vector of confidence scores (e.g., softmax outputs) for all possible classes. This information is a goldmine. While you don’t have the true gradient, you can approximate it.

The core idea is called Zeroth-Order Optimization. You can estimate the gradient by making tiny changes to the input and observing how the model’s output scores change. By probing the model with slightly different inputs, you can build a numerical estimate of the gradient and then use it in an iterative attack, similar to PGD.


# Estimate the gradient of the target-class score for a single pixel
# (image is assumed to be a flat NumPy array of pixel values in [0, 1])
def estimate_gradient(model, image, pixel_index, target_class, delta=0.01):
    # Create two slightly perturbed copies of the input
    image_plus = image.copy()
    image_plus[pixel_index] += delta

    image_minus = image.copy()
    image_minus[pixel_index] -= delta

    # Query the model; predict() is assumed to return a vector of confidence scores
    score_plus = model.predict(image_plus)[target_class]
    score_minus = model.predict(image_minus)[target_class]

    # Estimate the partial derivative using the central difference formula
    return (score_plus - score_minus) / (2 * delta)

This process is repeated for many different directions to build a full gradient estimate, which then guides the perturbation. It’s computationally expensive due to the high number of queries required but can be highly effective. Attacks like ZOO (Zeroth-Order Optimization) and NES (Natural Evolution Strategies) implement this concept efficiently.
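
The sketch below shows how this scales up with an NES-style estimator: instead of probing one pixel at a time, the model is probed along random Gaussian directions, and the resulting estimate drives a PGD-like update loop. The `model.predict()` interface (returning a vector of class scores) and all hyperparameters are illustrative assumptions, not a specific library API.

import numpy as np

def nes_gradient_estimate(model, image, target_class, n_samples=50, sigma=0.001):
    # Probe the model along random Gaussian directions and weight each
    # direction by the change in the target-class score it causes.
    grad = np.zeros_like(image)
    for _ in range(n_samples):
        noise = np.random.randn(*image.shape)
        score_plus = model.predict(image + sigma * noise)[target_class]
        score_minus = model.predict(image - sigma * noise)[target_class]
        grad += (score_plus - score_minus) * noise
    return grad / (2 * sigma * n_samples)

def iterative_attack(model, image, target_class, epsilon=0.03, alpha=0.005, steps=100):
    # PGD-like loop driven by the estimated gradient: ascend the
    # target-class score while staying within an L-infinity budget.
    adv = image.copy()
    for _ in range(steps):
        grad = nes_gradient_estimate(model, adv, target_class)
        adv = adv + alpha * np.sign(grad)
        adv = np.clip(adv, image - epsilon, image + epsilon)  # respect the perturbation budget
        adv = np.clip(adv, 0.0, 1.0)                          # stay a valid image
    return adv

In practice, a limited query budget forces a trade-off between n_samples (quality of each gradient estimate) and steps (how far the attack can progress).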

Decision-Based (Hard-Label) Attacks

This is the most restrictive and challenging scenario. The API only returns the final class label (e.g., “cat”). You get no confidence scores. You only know if your input is on one side of a decision boundary or the other. How can you find an adversarial example?

The strategy shifts dramatically. Instead of estimating a gradient, you “walk” along the decision boundary. A common approach, exemplified by the Boundary Attack, works as follows:

  1. Start with an adversarial image: Find any image that the model already misclassifies as the target class. This can often be done by starting with random noise.
  2. Move towards the original image: Iteratively take small steps from this adversarial starting point towards your original, correctly classified image.
  3. Correct if you cross the boundary: After each step, query the model. If the image is now correctly classified (you’ve crossed the boundary), take a small step back into the adversarial region, effectively “hugging” the decision boundary.

By repeating this process, you slowly reduce the distance between the adversarial example and the original image, finding a minimal perturbation that fools the model, all without a single confidence score.
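
The sketch below captures that walk under simplifying assumptions: a hypothetical model.classify() that returns only the predicted label, fixed step sizes, and rejection of any candidate that crosses back over the boundary standing in for the “correction step”. The published Boundary Attack additionally adapts both step sizes based on recent success rates.

import numpy as np

def boundary_attack(model, original, adv_start, target_class,
                    steps=1000, source_step=0.01, noise_step=0.01):
    # adv_start must already be classified as target_class (e.g., random
    # noise or an image of the target class). Only hard labels are queried.
    adv = adv_start.copy()
    for _ in range(steps):
        # 1. Take a small step from the current adversarial point toward the original
        candidate = adv + source_step * (original - adv)
        # 2. Add a small random perturbation to explore along the boundary
        candidate = candidate + noise_step * np.random.randn(*original.shape)
        candidate = np.clip(candidate, 0.0, 1.0)
        # 3. Query the hard label; keep the candidate only if it is still adversarial
        if model.classify(candidate) == target_class:
            adv = candidate
    return adv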

[Figure: Boundary Attack illustration — starting from an initial adversarial point in the Class B (adversarial) region, the attack alternates steps toward the original image with correction steps that keep it just on the adversarial side of the decision boundary, ending at the final adversarial example.]

Transfer-Based Attacks: The Power of Surrogates

What if you can’t query the model at all, or your query budget is extremely limited? You can turn to one of the most fascinating properties of adversarial examples: transferability. An adversarial example crafted to fool one model has a surprisingly high chance of fooling a completely different model, even one with a different architecture, as long as it was trained on a similar task.

As a red teamer, you exploit this by building a local surrogate model. The process is straightforward:

  1. Build a Substitute: Train your own model (e.g., a ResNet) on a publicly available dataset that mimics the target’s domain (e.g., ImageNet for a general image classifier). For an even stronger attack, you can use the target API to label a dataset for you, effectively “stealing” the model’s knowledge.
  2. Attack the Surrogate: You have full white-box access to your own model. Use a powerful attack like PGD to generate a strong set of adversarial examples against it.
  3. Transfer the Payload: Submit these adversarial examples to the target black-box model. A significant percentage of them will likely succeed without a single gradient estimation query.

[Figure: Transfer attack workflow — the attacker crafts an adversarial example via a white-box attack on a local surrogate model, then submits it to the target black-box model’s API, causing a misclassification.]

This works because different models trained on similar data distributions tend to learn similar feature representations. An adversarial perturbation that targets a “cat-like” feature in your surrogate model is likely to hit a similar feature in the target model.
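
A minimal sketch of steps 2 and 3 in PyTorch, assuming a pretrained ResNet-50 as the surrogate; target_api is a hypothetical black-box endpoint, and any preprocessing the real target expects is omitted.

import torch
import torchvision.models as models

def pgd_on_surrogate(surrogate, image, true_label, epsilon=8/255, alpha=2/255, steps=20):
    # Untargeted PGD with full white-box access to the local surrogate:
    # ascend the cross-entropy loss on the true label, projected onto an
    # L-infinity ball of radius epsilon around the original image.
    loss_fn = torch.nn.CrossEntropyLoss()
    adv = image.clone().detach()
    for _ in range(steps):
        adv.requires_grad_(True)
        loss = loss_fn(surrogate(adv.unsqueeze(0)), torch.tensor([true_label]))
        grad = torch.autograd.grad(loss, adv)[0]
        with torch.no_grad():
            adv = adv + alpha * grad.sign()
            adv = image + (adv - image).clamp(-epsilon, epsilon)  # project onto the budget
            adv = adv.clamp(0.0, 1.0)                             # stay a valid image
    return adv.detach()

# Usage sketch: a pretrained ResNet-50 as the surrogate, then hand the
# result to the black-box target (target_api is hypothetical).
surrogate = models.resnet50(weights=models.ResNet50_Weights.DEFAULT).eval()
# adv_image = pgd_on_surrogate(surrogate, image, true_label)
# prediction = target_api.predict(adv_image)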

Choosing Your Strategy: A Practical Comparison

As a red teamer, your choice of attack depends on the intelligence you have about the target system and the constraints you’re under. This table summarizes the trade-offs.

Attack Type    | Required Information                                                  | Query Efficiency            | Typical Success Rate | Primary Use Case
Score-Based    | API access with confidence scores                                     | Low (many queries)          | High                 | When you have detailed API output and a high query budget.
Decision-Based | API access with hard labels only                                      | Very Low (the most queries) | Medium to High       | The most restrictive online scenario; useful against hardened APIs.
Transfer-Based | General knowledge of the model’s domain (e.g., image classification)  | High (few to zero queries)  | Medium               | When queries are expensive, rate-limited, or impossible; excellent for initial probes.

Key Takeaway

The absence of internal model access is not a robust defense. Black-box attacks demonstrate that motivated adversaries can find weaknesses through intelligent querying or by exploiting the inherent transferability of adversarial examples. For a red teamer, mastering these techniques is essential for testing the real-world resilience of deployed AI systems.