0.7.2 Reverse engineering competitor models and algorithms

2025.10.06.
AI Security Blog

Peeking under the hood of a competitor’s car is classic industrial espionage. In the age of AI, the “engine” is a proprietary model and the “hood” is often a public-facing API. Reverse engineering in this context is not about stealing the model file itself; it’s about deducing the blueprint. The goal is to understand a competitor’s AI architecture, data strategy, and decision-making logic to replicate their success, uncover their weaknesses, or predict their next move.

This process moves beyond the simple theft of intellectual property. It is an active investigation into the *how* and *why* behind a model’s performance. For an industrial spy, successfully reverse-engineering a rival’s flagship model can provide an immense competitive advantage, saving millions in R&D and revealing the “secret sauce” that distinguishes their product in the market.

The World of Limited Access: Black-Box vs. White-Box

Your ability to probe a model depends entirely on your level of access. In corporate espionage, you’ll almost always be operating in a “black-box” scenario, where you can only observe inputs and outputs. Understanding the distinction is fundamental to planning an attack.

| Characteristic | White-Box Access | Black-Box Access |
| --- | --- | --- |
| Model Access | Full access to model architecture, parameters, weights, and potentially the training code. | Query-only access, typically through a public or private API. No internal visibility. |
| Primary Goal | Deep analysis, vulnerability discovery, code review (often internal red teaming or academic research). | Replicating functionality, inferring architecture, understanding training data (the typical espionage scenario). |
| Example Techniques | Static code analysis, direct inspection of model weights, gradient analysis. | Model extraction, membership inference, side-channel attacks. |

Since white-box access implies an insider threat or a major data breach (covered in 0.7.1), our focus here is on the more common and challenging black-box attacks that a competitor would realistically employ.

Core Black-Box Techniques for Model Deconstruction

When faced with only an API endpoint, an attacker must be creative. They treat the model as an oracle, asking it carefully crafted questions to slowly reveal its secrets. The primary techniques fall into several key categories.

Model Extraction (Function Stealing)

The most direct form of reverse engineering is to build your own copy of the target model. This “surrogate” model aims to mimic the input-output behavior of the competitor’s AI as closely as possible. The attacker essentially uses the target API as a labeling machine for their own training dataset.

The process is conceptually simple but resource-intensive:

  1. Data Acquisition: The attacker prepares a large, diverse dataset relevant to the model’s domain (e.g., images for an image classifier, text for a language model).
  2. Querying: The attacker systematically sends this data to the target model’s API and records the outputs (e.g., classifications, predictions, confidence scores).
  3. Training: The attacker uses these input-output pairs to train their own model. The target model’s outputs serve as the “ground truth” labels for training the surrogate.

If successful, the attacker possesses a functional clone of the model without ever having accessed the original.

# Sketch of a basic model extraction attack
# (hypothetical competitor endpoint and response format)
import numpy as np
import requests
from sklearn.linear_model import LogisticRegression

TARGET_MODEL_API = "https://api.competitor.com/v1/predict"

# Attacker-prepared, unlabeled query set (random stand-in data for illustration)
query_dataset = np.random.rand(1000, 16)

inputs, labels = [], []

for input_sample in query_dataset:
    # 1. Query the target API
    response = requests.post(
        TARGET_MODEL_API, json={"input": input_sample.tolist()}, timeout=10
    )
    response.raise_for_status()

    # 2. Get the prediction (the "label") from the competitor
    #    (assumes the API returns JSON like {"label": ...})
    predicted_label = response.json()["label"]

    # 3. Record a training pair for the surrogate model
    inputs.append(input_sample)
    labels.append(predicted_label)

# 4. Train the local surrogate model on the harvested labels
surrogate_model = LogisticRegression(max_iter=1000)
surrogate_model.fit(np.array(inputs), labels)

Hyperparameter Stealing via Side Channels

Model extraction replicates *what* the model does, but it doesn’t reveal the underlying architecture—the hyperparameters. These settings (e.g., number of layers, type of activation functions, optimizer used) are a crucial part of the intellectual property. Attackers can infer these details through side-channel attacks, most commonly by analyzing API response times.

The core idea is that different architectural components have different computational costs. A deeper neural network, or one using a more complex activation function, will on average take marginally longer to process a query. By sending a massive number of carefully constructed queries and precisely measuring the inference latency, an attacker can build a statistical profile that hints at the model’s structure.

[Figure: Timing attack concept for hyperparameter inference, plotting inference time (ms) against input complexity/size for a shallow model (e.g., 3 layers) and a deep model (e.g., 10 layers) to illustrate timing side-channel analysis.]
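A minimal timing probe, sketched below, illustrates the idea. It assumes the same hypothetical prediction endpoint as the extraction example and simply compares median latencies for inputs of different sizes; a real attack would need far more queries and careful statistics to separate architectural signal from network jitter.

# Sketch: timing side-channel probe (hypothetical endpoint; illustrative only)
import statistics
import time

import numpy as np
import requests

TARGET_MODEL_API = "https://api.competitor.com/v1/predict"  # hypothetical endpoint

def median_latency(sample, repeats=50):
    """Return the median round-trip time (in seconds) for one input."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        requests.post(TARGET_MODEL_API, json={"input": sample}, timeout=10)
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)

# Compare latency profiles for small vs. large inputs
small_input = np.random.rand(16).tolist()
large_input = np.random.rand(1024).tolist()

print("small input:", median_latency(small_input))
print("large input:", median_latency(large_input))
# A systematic sweep over many input sizes, compared against latency,
# builds the statistical profile that hints at model depth and per-layer cost.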

Membership and Attribute Inference

Beyond the model itself, the training data is a priceless asset. Inference attacks aim to reverse-engineer information about that data by carefully observing the model’s behavior.

Membership Inference

This attack determines if a specific data point was used to train the model. Models often exhibit higher confidence in their predictions for data they have seen before. An attacker can exploit this by:

  1. Querying the target model with a suspected data point (e.g., a specific user’s record).
  2. Observing the output, particularly the confidence score.
  3. Using a separately trained “attack model” that has learned to distinguish the confidence patterns of members vs. non-members.

A successful attack can reveal a competitor’s data sources, confirm if they have access to a particular dataset, or even expose sensitive personal information used in training.
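A minimal, threshold-based version of the attack is sketched below. It assumes the hypothetical API returns a confidence score alongside its label and replaces the separately trained attack model with a simple calibrated threshold; real attacks learn the member vs. non-member confidence patterns more reliably.

# Sketch: threshold-based membership inference (hypothetical endpoint and response format)
import requests

TARGET_MODEL_API = "https://api.competitor.com/v1/predict"  # hypothetical endpoint

def top_confidence(record):
    """Query the target model and return its top confidence score for the record."""
    response = requests.post(TARGET_MODEL_API, json={"input": record}, timeout=10)
    response.raise_for_status()
    return response.json()["confidence"]  # assumes the API exposes a confidence score

def likely_training_member(record, threshold=0.95):
    """Guess whether the record was part of the training set.

    The threshold would be calibrated on records known to be outside the
    training data; a full attack replaces it with a trained attack model
    over the whole confidence pattern.
    """
    return top_confidence(record) >= threshold

suspected_record = {"age": 42, "zip": "94016", "purchases": 17}  # illustrative record
print("likely training member:", likely_training_member(suspected_record))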

Attribute Inference

Here, the attacker knows some information about a data record and uses the model to infer missing attributes. For example, if a competitor has a model that predicts credit risk, an attacker could input a person’s known details (age, zip code) and observe the model’s output to infer a sensitive attribute the model has learned, such as estimated income. This reverse-engineers the correlations and implicit knowledge baked into the model’s logic.
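In code, a crude version of this can be run as a search over the hidden attribute: fix the fields you know, sweep candidate values for the missing one, and keep the candidate that best reproduces the outcome you have observed for that person. The sketch below assumes a hypothetical credit-risk endpoint that returns a probability for each decision.

# Sketch: attribute inference by sweeping candidate values (hypothetical endpoint)
import requests

CREDIT_MODEL_API = "https://api.competitor.com/v1/credit-risk"  # hypothetical endpoint

def approval_probability(record):
    """Return the model's probability of an 'approved' decision for this record."""
    response = requests.post(CREDIT_MODEL_API, json=record, timeout=10)
    response.raise_for_status()
    return response.json()["probabilities"]["approved"]  # assumed response format

known_fields = {"age": 42, "zip": "94016"}              # attributes the attacker knows
income_candidates = [30_000, 60_000, 90_000, 120_000]   # hidden attribute values to test

# The attacker has observed that this person was approved; the income value
# that best explains that outcome is the model's implicit "guess".
inferred_income = max(
    income_candidates,
    key=lambda income: approval_probability({**known_fields, "income": income}),
)
print("inferred income bracket:", inferred_income)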

Defensive Postures: Making Reverse Engineering Prohibitively Expensive

As a red teamer, you must understand the defenses you will face. The goal of the defender isn’t to make these attacks impossible, but to make them so slow, expensive, and unreliable that the attacker gives up.

  • Rate Limiting and Throttling: The simplest defense. Limiting the number of queries an IP address or API key can make per minute severely hampers large-scale extraction attacks. Attackers will counter this with distributed networks of machines.
  • Query Monitoring and Anomaly Detection: Sophisticated defenders don’t just count queries; they analyze their patterns. A systematic sweep of a large dataset for model extraction looks very different from normal user traffic.
  • Output Perturbation: Intentionally adding small amounts of random noise or reducing the precision of the model’s outputs (e.g., rounding confidence scores to two decimal places). This “fuzzing” makes it significantly harder for an attacker to train a high-fidelity surrogate model, as the “labels” they are stealing are less reliable (a minimal sketch of this idea appears after this list).
  • Watermarking: Embedding a unique, hidden signature into the model’s predictions. This doesn’t prevent theft, but it allows the owner to cryptographically prove that a suspect model is a stolen copy.
  • Differential Privacy: A formal, mathematical framework for training models that provides strong guarantees that the model’s output will not reveal whether any particular individual was in the training data. This is a powerful, but computationally expensive, defense against membership inference attacks.
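As a concrete illustration, output perturbation can be as simple as adding a small amount of calibrated noise to each confidence score and rounding it before the API response leaves the service. The sketch below is a minimal server-side post-processing step under those assumptions, not a production-grade defense.

# Sketch: output perturbation applied to confidence scores before they are returned
import random

def perturb_scores(confidence_scores, noise_scale=0.02, decimals=2):
    """Add small random noise to each score and round to coarse precision.

    Noisy, low-precision outputs degrade the "labels" an extraction attacker
    harvests, while barely affecting legitimate users.
    """
    perturbed = []
    for score in confidence_scores:
        jittered = score + random.uniform(-noise_scale, noise_scale)
        perturbed.append(round(min(max(jittered, 0.0), 1.0), decimals))
    return perturbed

# Example: raw model output vs. what the API actually returns
raw_output = [0.97312, 0.02101, 0.00587]
print(perturb_scores(raw_output))  # e.g., [0.96, 0.03, 0.01]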

Ultimately, reverse engineering in the AI space is a cat-and-mouse game. The attacker seeks to ask clever questions, while the defender seeks to make the answers as unhelpful as possible without degrading the service for legitimate users.