In AI red teaming, your most constrained resource is often time. The speed at which you can generate adversarial examples, probe model defenses, or fine-tune a surrogate model directly impacts the scope and depth of your engagement. An underutilized GPU is more than just an inefficient piece of hardware; it’s a critical operational bottleneck that slows down your entire testing cycle. Maximizing its throughput is not just about performance—it’s about increasing your capacity to discover vulnerabilities.
Think of your GPU as a high-performance engine. If you only feed it fuel in small, intermittent bursts, it will spend most of its time idling, never reaching its potential. Our goal is to create a continuous, high-volume pipeline of data and computation that keeps this engine running at full throttle.
Diagnosing the Bottleneck: Is Your GPU Actually Working?
Before you can fix a problem, you must identify it. The most fundamental tool for monitoring NVIDIA GPUs is the NVIDIA System Management Interface (nvidia-smi). Running this command in your terminal while your attack script is active provides a real-time snapshot of your hardware’s state.
$ nvidia-smi -l 1   # Refresh every 1 second
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:07:00.0 Off |                    0 |
| N/A   35C    P0    55W / 400W |   1527MiB / 40960MiB |     15%      Default |
+-------------------------------+----------------------+----------------------+
Focus on two key metrics:
- Memory-Usage: High memory usage shows that your model and data are loaded onto the GPU, which is a good start. However, this alone is not an indicator of performance.
- GPU-Util: This is the percentage of time one or more kernels were executing on the GPU. Consistently low utilization (e.g., < 50%) during computationally intensive parts of your code is a clear sign of a bottleneck elsewhere in your pipeline.
If you see high memory usage but low GPU utilization, it’s a classic symptom of the GPU waiting for data from the CPU. The GPU is ready to work, but it’s being starved.
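Watching the interactive display is fine for spot checks, but for logging utilization over the course of an engagement it helps to query nvidia-smi programmatically. The sketch below uses nvidia-smi's machine-readable query mode (`--query-gpu` with `--format=csv,noheader`, both standard flags); the field names and the parsing logic are a minimal illustration, not a full monitoring tool.

```python
# Sketch: poll GPU stats programmatically via nvidia-smi's CSV query mode.
import subprocess

def parse_gpu_stats(csv_line):
    """Parse one CSV line of 'utilization.gpu, memory.used, memory.total'."""
    util, mem_used, mem_total = [field.strip() for field in csv_line.split(',')]
    return {
        'util_pct': int(util.rstrip(' %')),
        'mem_used_mib': int(mem_used.rstrip(' MiB')),
        'mem_total_mib': int(mem_total.rstrip(' MiB')),
    }

def query_gpu_stats():
    """Query GPU 0 (requires an NVIDIA driver and nvidia-smi on PATH)."""
    out = subprocess.check_output([
        'nvidia-smi',
        '--query-gpu=utilization.gpu,memory.used,memory.total',
        '--format=csv,noheader',
    ], text=True)
    return parse_gpu_stats(out.splitlines()[0])
```

Calling `query_gpu_stats()` inside your attack loop every few seconds and logging the result makes it easy to spot the spiky, intermittent utilization pattern described below.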
Common Causes of GPU Starvation
An idle GPU is usually a symptom of a CPU-bound process. The CPU is responsible for loading data, preprocessing it, and sending it to the GPU. If any of these steps are slower than the GPU’s computation time, the GPU will be forced to wait.
[Figure: GPU utilization in an inefficient vs. an efficient pipeline. The inefficient pipeline shows the GPU idling while the CPU prepares data; the efficient pipeline overlaps these tasks, keeping the GPU continuously busy.]
| Bottleneck | Description | Symptom |
|---|---|---|
| Data I/O | Reading data from disk (e.g., images, text files) is slow and blocks the main process. | GPU utilization spikes and drops, with periods of 0% activity between batches. |
| CPU-Bound Preprocessing | Complex transformations (e.g., resizing, tokenization, augmentations) on the CPU take longer than the GPU forward pass. | Consistently low but non-zero GPU utilization. High CPU usage on one or more cores. |
| Small Batch Sizes | The overhead of launching a computation kernel on the GPU is significant. With very small batches, the GPU finishes work faster than the next batch can be prepared. | Low utilization, even with fast data loading. This is particularly relevant for inference tasks. |
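The effect of overlapping loading with computation can be simulated in plain Python. In this toy sketch, `time.sleep` stands in for real disk I/O and GPU kernel time (the durations and batch count are arbitrary); running batches sequentially costs load plus compute per batch, while a background loader thread hides the load time behind the compute time.

```python
# Toy simulation: sequential load-then-compute vs. an overlapped pipeline
# where a producer thread prepares the next batch during computation.
import time
import threading
import queue

LOAD_S, COMPUTE_S, BATCHES = 0.05, 0.05, 4  # arbitrary illustrative values

def load_batch(i):
    time.sleep(LOAD_S)       # stand-in for disk I/O + CPU preprocessing
    return i

def compute(batch):
    time.sleep(COMPUTE_S)    # stand-in for GPU kernel time

def run_sequential():
    start = time.perf_counter()
    for i in range(BATCHES):
        compute(load_batch(i))  # GPU idles during every load
    return time.perf_counter() - start

def run_overlapped():
    q = queue.Queue(maxsize=2)

    def producer():
        for i in range(BATCHES):
            q.put(load_batch(i))  # prepare batches in the background
        q.put(None)               # sentinel: no more batches

    start = time.perf_counter()
    threading.Thread(target=producer, daemon=True).start()
    while (batch := q.get()) is not None:
        compute(batch)            # loading of the next batch overlaps this
    return time.perf_counter() - start
```

With equal load and compute times, the sequential version takes roughly twice as long as the overlapped one; the same principle is what `num_workers` exploits below.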
Strategies for Keeping the GPU Fed
1. Asynchronous Data Loading
The most effective strategy is to parallelize data loading and preprocessing. Use dedicated worker processes that prepare batches on the CPU in the background while the GPU is busy computing the current batch. Both PyTorch and TensorFlow provide high-level abstractions for this.
In PyTorch, this is handled by the DataLoader class. Setting num_workers > 0 spawns separate processes for data loading. Setting pin_memory=True allows for faster data transfer to the GPU by staging it in a special “pinned” region of CPU memory.
# PyTorch DataLoader Optimization
import torch
from torch.utils.data import DataLoader, TensorDataset

# Assume 'dataset' is your collection of data
dataset = TensorDataset(torch.randn(1000, 3, 32, 32), torch.randint(0, 10, (1000,)))

# Use multiple worker processes and pinned memory
data_loader = DataLoader(
    dataset,
    batch_size=256,
    shuffle=True,
    num_workers=8,     # Key: Use multiple CPU cores for loading
    pin_memory=True    # Key: Speeds up CPU-to-GPU memory copies
)

# Your loop now gets data without waiting for it to be loaded from disk.
# non_blocking=True lets the copy from pinned memory overlap with computation.
for data, labels in data_loader:
    data = data.to('cuda', non_blocking=True)
    labels = labels.to('cuda', non_blocking=True)
    # ... perform your attack or model query ...
2. Mixed-Precision Computation
By default, neural network computations use 32-bit floating-point numbers (FP32). Using 16-bit floats (FP16 or BF16) can provide significant speedups. It reduces the memory footprint of your models and data by half, which lessens the memory bandwidth bottleneck. Modern GPUs have specialized Tensor Cores that accelerate 16-bit matrix multiplications dramatically.
For many red teaming tasks like generating adversarial examples with methods like FGSM or PGD, the reduced precision has a negligible impact on attack effectiveness but can nearly double your throughput.
# PyTorch Automatic Mixed Precision (AMP)
import torch

scaler = torch.cuda.amp.GradScaler()  # Needed for training, optional for inference
model = MyModel().to('cuda')          # Assume MyModel is your model class
loss_function = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters())

for data, labels in data_loader:
    data, labels = data.to('cuda'), labels.to('cuda')
    optimizer.zero_grad()

    # autocast context manager runs operations in mixed precision
    with torch.cuda.amp.autocast():
        outputs = model(data)
        loss = loss_function(outputs, labels)

    # scaler scales the loss before backpropagation to prevent gradient underflow
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
For inference-based attacks, you can often simplify this by just using the autocast context manager around your model’s forward pass.
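The underflow problem that GradScaler guards against can be demonstrated without any framework at all, using Python's `struct` module and its IEEE 754 half-precision format code (`'e'`). The gradient value below is made up for illustration; the point is that values smaller than FP16's smallest subnormal (about 6e-8) round to zero, which is exactly why loss scaling multiplies gradients into a representable range before the backward pass and divides them back out afterwards.

```python
# Framework-free illustration of FP16 underflow and loss scaling,
# round-tripping Python floats through IEEE 754 half precision.
import struct

def to_fp16(x):
    """Round-trip a Python float through 16-bit half precision."""
    return struct.unpack('<e', struct.pack('<e', x))[0]

tiny_grad = 1e-8                      # hypothetical small gradient value
underflowed = to_fp16(tiny_grad)      # 1e-8 is below FP16's subnormal range,
                                      # so it rounds to 0.0 and is lost

scale = 2.0 ** 16                     # a typical GradScaler-style scale factor
scaled = to_fp16(tiny_grad * scale)   # ~6.55e-4: comfortably representable
recovered = scaled / scale            # unscale in full precision before the
                                      # optimizer step, as scaler.step() does
```

This is the whole mechanism in miniature: scale up before FP16 storage, unscale in FP32 before applying the update.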
3. Just-In-Time (JIT) Compilation
Modern deep learning frameworks can analyze your model’s computational graph and optimize it. A key optimization is “kernel fusion,” where multiple small operations (e.g., a convolution, then a bias add, then a ReLU activation) are fused into a single, more efficient GPU kernel. This reduces the overhead of launching many separate kernels.
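The idea behind kernel fusion can be sketched in plain Python as loop fusion. The functions below are an analogy, not real GPU code: each list comprehension in the unfused version stands in for a separate kernel launch and a full pass over the data, while the fused version does all three operations in a single traversal.

```python
# Analogy for kernel fusion: three passes over the data (three "kernel
# launches") vs. one fused pass. Names and values are illustrative only.

def relu(v):
    return v if v > 0.0 else 0.0

def three_passes(xs, w, b):
    scaled = [x * w for x in xs]         # "kernel" 1: multiply by weight
    shifted = [s + b for s in scaled]    # "kernel" 2: bias add
    return [relu(s) for s in shifted]    # "kernel" 3: activation

def fused_pass(xs, w, b):
    # One traversal applies all three ops per element, like a fused kernel:
    # no intermediate buffers, no repeated launch overhead.
    return [relu(x * w + b) for x in xs]
```

Both functions produce identical results; the fused version simply avoids materializing intermediates and re-reading memory, which is the same saving fused GPU kernels provide.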
In PyTorch 2.0+, this is incredibly simple with torch.compile(). In TensorFlow, this is handled by AutoGraph and the XLA (Accelerated Linear Algebra) compiler.
# PyTorch 2.0+ JIT Compilation
import torch
model = MyModel().to('cuda')
# This one line can provide significant speedups for complex models
optimized_model = torch.compile(model)
# Use the optimized model as you normally would
output = optimized_model(input_tensor.to('cuda'))
Applying a JIT compiler is often the easiest optimization to try. The first run might be slow as the model is compiled, but subsequent runs will be significantly faster, directly improving your attack generation loop.
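Because that first compiled call is expensive, any before/after comparison should discard warm-up iterations. The helper below is a generic pure-Python sketch (the workload is whatever callable you pass in, e.g. a lambda wrapping your model's forward pass); it is not tied to any framework's profiling API.

```python
# Simple benchmarking helper that discards warm-up iterations, so a JIT
# compile on the first call does not distort the measured per-call time.
import time

def benchmark(fn, warmup=2, iters=10):
    """Return mean seconds per call of fn, after `warmup` untimed calls."""
    for _ in range(warmup):
        fn()                              # absorbs compilation / cache warm-up
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) / iters
```

For example, `benchmark(lambda: optimized_model(x))` gives a fair per-iteration figure to compare against the uncompiled model.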
By combining these techniques—ensuring your data pipeline is parallel, reducing memory pressure with mixed precision, and fusing operations with a JIT compiler—you transform your workflow from a series of stop-and-go tasks into a highly efficient, parallel assembly line. This efficiency translates directly into more tests, deeper analysis, and a more thorough red teaming engagement.