6.4.2. Batch processing strategies

2025.10.06.
AI Security Blog

Generating a single adversarial example is an interesting proof of concept. Generating ten thousand is how you find systemic vulnerabilities. The biggest barrier between these two scenarios is often not the attack algorithm itself, but the sheer inefficiency of processing inputs one by one. Batching is the fundamental technique you’ll use to bridge this gap, turning a slow, sequential process into a high-throughput pipeline that can properly leverage your hardware.

The Core Concept: From One to Many

At its heart, batch processing is simple: instead of feeding a single input to a model, you group multiple inputs into one batch, represented as a single tensor, and process them all in one pass. This seemingly minor change has a profound impact on performance because it aligns with how modern hardware, especially GPUs, is designed to work.
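
To make the idea concrete, here is a minimal sketch of stacking several inputs into one batch tensor. It assumes PyTorch and uses illustrative image shapes; any framework with n-dimensional arrays works the same way.

import torch

# One image: channels x height x width
single_input = torch.randn(3, 224, 224)

# Thirty-two images stacked along a new leading "batch" dimension
images = [torch.randn(3, 224, 224) for _ in range(32)]
batch = torch.stack(images)

print(single_input.shape)  # torch.Size([3, 224, 224])
print(batch.shape)         # torch.Size([32, 3, 224, 224]) -- one forward pass now covers all 32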

Think of it as the difference between a car wash that services one vehicle at a time from start to finish versus a tunnel wash that moves a continuous line of cars through different stages simultaneously. The latter achieves far greater throughput by parallelizing the work.

[Diagram: single processing (inefficient) feeds the model one input at a time (Input 1 → Output 1), while batch processing (efficient) feeds Inputs 1, 2, 3… through the model in one pass to produce Outputs 1, 2, 3…]

Why Batching is a Red Teamer’s Force Multiplier

For a red teamer, efficient tooling is non-negotiable. Wasting hours on slow attack generation means fewer targets tested and fewer vulnerabilities found. Batching directly addresses this by:

  • Maximizing Throughput: The overhead of launching a computation on a GPU (a “kernel launch”) is significant. By processing a large batch, you pay this cost once for many inputs, amortizing it to near zero per sample. This is the primary driver behind the 10x-100x speedups often seen with batching; the timing sketch after this list illustrates the effect.
  • Saturating Hardware: As discussed in the previous chapter, a GPU has thousands of cores. Sending one input is like using a single lane on a 128-lane highway. Batching provides the parallel workload needed to fill those lanes and make full use of the hardware you have.
  • Enabling Large-Scale Attacks: Many advanced adversarial attacks, such as those that search for universal adversarial perturbations or perform broad model fingerprinting, are computationally infeasible without batching. It’s the enabling technology that scales your efforts from single exploits to systemic analysis.
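
The kernel-launch amortization is easy to see with a rough timing sketch. The snippet below is illustrative only: it assumes PyTorch, uses a small stand-in linear model, and the exact speedup will depend on your hardware and model.

import time
import torch

model = torch.nn.Linear(512, 10)   # stand-in for a real model
inputs = torch.randn(256, 512)     # 256 samples

with torch.no_grad():
    # One forward pass per sample
    start = time.perf_counter()
    for x in inputs:
        model(x.unsqueeze(0))
    per_sample = time.perf_counter() - start

    # One forward pass for the whole batch
    start = time.perf_counter()
    model(inputs)
    batched = time.perf_counter() - start

print(f"per-sample loop: {per_sample:.4f}s, single batched call: {batched:.4f}s")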

Practical Batching Strategies

The right batching strategy depends on your use case. Are you processing a static dataset, or are you building an interactive tool that receives inputs on the fly?

Static Batching: The Workhorse

This is the most common scenario. You have a predefined set of inputs (e.g., a validation dataset of 10,000 images) and want to process them as quickly as possible. The strategy is to divide the dataset into fixed-size chunks.

# Pseudocode for static batching -- load_all_inputs, preprocess, model, and
# process_results stand in for your own data loading, preprocessing, and attack code
dataset = load_all_inputs() # e.g., 10,000 images
batch_size = 64

for i in range(0, len(dataset), batch_size):
    # Slice the dataset to create a batch
    batch_inputs = dataset[i : i + batch_size]

    # Preprocess and convert the batch to a single tensor
    input_tensor = preprocess(batch_inputs)

    # Run the model on the entire batch at once
    outputs = model.predict(input_tensor)

    # Process the batch of results
    process_results(outputs)
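
If you work in PyTorch, the same loop is usually written with a DataLoader, which handles slicing and background data loading for you. This is a sketch under the assumption that dataset is a torch.utils.data.Dataset yielding preprocessed tensors, and that model and process_results are the placeholders from the pseudocode above.

import torch
from torch.utils.data import DataLoader

# batch_size is the knob discussed below; num_workers moves data loading off the main process
loader = DataLoader(dataset, batch_size=64, shuffle=False, num_workers=4)

model.eval()
with torch.no_grad():                  # no gradients needed for plain inference
    for batch_inputs in loader:        # each iteration yields one batched tensor
        outputs = model(batch_inputs)  # one forward pass for the whole batch
        process_results(outputs)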

The main decision here is choosing the batch_size. A larger size typically improves throughput, but it also consumes more GPU memory. This is a critical trade-off we’ll explore further.

Dynamic Batching: For Interactive Scenarios

Imagine you’re building a tool to test a live API endpoint. Inputs arrive unpredictably. If you process each one instantly, you get low latency but terrible throughput. Dynamic batching offers a compromise: you collect incoming requests for a short time and then process them together.

This introduces a latency vs. throughput trade-off. A longer wait time allows for bigger, more efficient batches but makes your tool less responsive. This is a common pattern in production ML inference servers, and it’s equally valuable for building high-performance red teaming tools.
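
A minimal sketch of that queuing mechanism is shown below. It only relies on the Python standard library; request_queue, MAX_BATCH_SIZE, WAIT_TIMEOUT, and run_batch are illustrative names, and a production tool would add error handling and result routing.

import queue
import time

request_queue = queue.Queue()
MAX_BATCH_SIZE = 32
WAIT_TIMEOUT = 0.05  # seconds to wait for more requests before flushing

def batching_worker(run_batch):
    """Collect requests until the batch is full or the timeout expires, then process them together."""
    while True:
        batch = [request_queue.get()]  # block until at least one request arrives
        deadline = time.monotonic() + WAIT_TIMEOUT
        while len(batch) < MAX_BATCH_SIZE:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                batch.append(request_queue.get(timeout=remaining))
            except queue.Empty:
                break
        run_batch(batch)  # one model call for the entire collected batch

This worker would typically run in its own thread while request handlers simply call request_queue.put(item); tuning MAX_BATCH_SIZE and WAIT_TIMEOUT is exactly the latency vs. throughput trade-off described above.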

Aspect         | Static Batching                        | Dynamic Batching
Use Case       | Offline processing of a known dataset  | Live, interactive systems; API testing
Primary Goal   | Maximize total throughput              | Balance throughput and latency
Implementation | Simple loop over a dataset slice       | Requires a queuing mechanism and a timer/size trigger
Key Parameter  | batch_size                             | max_batch_size and wait_timeout

Handling Variable Inputs: Padding and Masking

Batching requires all inputs in a tensor to have the same shape. This is easy for fixed-size images but a major challenge for variable-length data like text. The standard solution is padding.

You identify the longest sequence in a batch and pad all shorter sequences with a special token (e.g., a `[PAD]` token with an ID of 0) until they match that length. However, you must also provide an attention mask—a binary tensor that tells the model which tokens are real (1) and which are padding (0)—to prevent the model from processing the meaningless padding.

# Example of padding for NLP in a batch (using a Hugging Face tokenizer)
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
sentences = ["test prompt", "a much longer test prompt"]

# The tokenizer handles padding and masking automatically
tokenizer_output = tokenizer(
    sentences,
    padding="longest",   # pad to the length of the longest sequence in the batch
    return_tensors="pt"  # return PyTorch tensors
)

# input_ids: the shorter sequence is padded with the pad token ID (0 for BERT);
# the token IDs shown are illustrative
# [[101, 4234, 12213, 102, 0, 0, 0],
#  [101, 1037, 2172, 3014, 4234, 12213, 102]]
input_ids = tokenizer_output["input_ids"]

# attention_mask marks real tokens with 1 and padding with 0
# [[1, 1, 1, 1, 0, 0, 0],
#  [1, 1, 1, 1, 1, 1, 1]]
attention_mask = tokenizer_output["attention_mask"]

Finding Your Optimal Batch Size

The “best” batch size is not a universal constant. It’s an empirical value determined by your specific hardware (GPU VRAM), model architecture, and the nature of your task. Finding it is a straightforward process of experimentation:

  1. Start Small: Begin with a conservative batch size, like 8, 16, or 32.
  2. Measure Performance: Run your attack generation script and measure the throughput (e.g., in samples per second).
  3. Increase and Repeat: Double the batch size and measure again. Keep a close watch on your GPU memory usage using tools like `nvidia-smi`.
  4. Find the Limit: You will eventually hit one of two limits:
    • An Out of Memory (OOM) error, which means the batch is too large to fit in VRAM.
    • A performance plateau, where increasing the batch size no longer improves throughput because some other part of your system (like data loading) has become the bottleneck.

Your optimal batch size is typically the largest one you can use before hitting the OOM error or the performance plateau. For gradient-based attacks, remember that you also need memory for storing gradients, so the maximum batch size will be smaller than for simple inference.
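
The sweep itself is easy to script. The sketch below assumes a PyTorch model on a CUDA device; measure_throughput and make_batch are illustrative names, and out-of-memory failures are caught via the RuntimeError that CUDA raises.

import time
import torch

def measure_throughput(model, make_batch, batch_size, steps=10):
    """Return samples/second for one batch size, or None if the batch does not fit in VRAM."""
    try:
        batch = make_batch(batch_size).cuda()
        torch.cuda.synchronize()
        start = time.perf_counter()
        with torch.no_grad():
            for _ in range(steps):
                model(batch)
        torch.cuda.synchronize()
        return batch_size * steps / (time.perf_counter() - start)
    except RuntimeError as err:          # CUDA OOM surfaces as a RuntimeError
        if "out of memory" in str(err):
            torch.cuda.empty_cache()
            return None
        raise

# for bs in (8, 16, 32, 64, 128, 256):
#     print(bs, measure_throughput(model, make_batch, bs))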

[Chart: throughput vs. batch size (16, 32, 64, 128, 256); throughput rises with batch size up to an optimal point, after which an OOM error occurs at the largest sizes]

Key Takeaways for Red Teamers

  • Batching is non-negotiable for scale. Move beyond single-input scripts to unlock high-throughput testing.
  • Match the strategy to the task. Use static batching for offline analysis and consider dynamic batching for interactive tools.
  • Master padding and masking. They are essential for batching variable-length inputs, a common case in NLP and other domains.
  • Find your optimal batch size empirically. It is the single most important parameter for tuning performance, and it’s unique to your hardware and model combination.