29.3.5 Model Conversion Attacks

2025.10.06.
AI Security Blog

The journey from a training environment to a production device is rarely a direct path. Models are frequently converted between formats—from PyTorch to ONNX for interoperability, or from TensorFlow to TFLite for mobile deployment. This conversion step is not a simple file format change; it is a compilation process. And like any compiler, it can be tricked. Model conversion attacks exploit this transformation, embedding malicious logic that is invisible in the source model but activates only in the converted, deployed artifact.

The Conversion Attack Surface

You can’t treat model converters as infallible black boxes. They are complex pieces of software that actively rewrite a model’s computational graph. This rewriting process is the primary attack surface. An attacker who can influence the input to a converter can potentially control its output in unexpected and malicious ways. Three facets of this surface stand out:

  • Operator Translation Ambiguity: Converters map operators from a source framework (e.g., the ATen operations behind PyTorch’s `nn.Module` layers) to a target format (e.g., ONNX’s operator set). An attacker can craft a model using obscure or poorly documented operators whose translation logic is flawed or exploitable, allowing for the substitution of a benign operation with a malicious one.
  • Graph Optimization Exploits: To improve performance, converters perform optimizations like operator fusion (merging multiple nodes into one) and constant folding. An attacker can construct a specific, benign-looking sequence of operations that they know the converter will “optimize” into a malicious structure.
  • Custom Operator Hijacking: Many frameworks allow for custom operators. If an attacker can poison a model with a custom operator, they can provide a benign implementation for the training environment and a separate, malicious implementation that gets linked during the conversion to the target deployment format.

Core Attack Techniques

Exploiting the conversion process requires a deep understanding of both the source and target formats, as well as the specific converter tool being used. The most effective attacks are subtle and tailored.

Technique 1: Malicious Operator Grafting

This technique leverages the mechanism for exporting custom operators. The attacker defines a custom operation in the source framework. In its native environment, this operation is inert—it might be an identity function or perform a trivial calculation. However, the attacker also defines a malicious export path for this operator to the target format.

# Attacker defines a custom PyTorch autograd Function with a benign
# forward pass and a malicious ONNX export path
import torch
import torch.nn as nn

class BenignLookingOp(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        # In PyTorch, this op does nothing harmful -- it is effectively an identity
        return x.clone()

    # This is the critical part: defining the ONNX export behavior
    @staticmethod
    def symbolic(g, input):
        # Instead of emitting a simple 'Identity' node, the attacker instructs
        # the converter to insert a custom, malicious operator
        return g.op("MaliciousNamespace::TriggeredDataExfil", input)

# The attacker's model uses this seemingly harmless op
class PoisonedModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Layer shapes are illustrative
        self.standard_layer = nn.Linear(16, 16)
        self.output_layer = nn.Linear(16, 4)

    def forward(self, x):
        x = self.standard_layer(x)
        x = BenignLookingOp.apply(x)  # The poison is planted here
        return self.output_layer(x)

When an unsuspecting developer exports this model to ONNX, the `symbolic` function is called. The converter obediently writes the `MaliciousNamespace::TriggeredDataExfil` operator into the ONNX graph. The source code of `PoisonedModel` looks clean, but the exported artifact now contains a backdoor, waiting for the malicious custom operator to be implemented in the target runtime.
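The effect is easy to confirm by exporting the model and listing the operators in the resulting graph. The snippet below is a minimal sketch: the dummy input shape matches the illustrative layer sizes above, the output file name is arbitrary, and it assumes the `torch` and `onnx` packages are available.

import torch
import onnx

model = PoisonedModel()
dummy_input = torch.randn(1, 16)

# During export, the tracer reaches BenignLookingOp and calls its
# symbolic() method, which emits the attacker's custom operator.
torch.onnx.export(model, dummy_input, "poisoned.onnx",
                  custom_opsets={"MaliciousNamespace": 1})

# Inspect the exported graph: the custom operator is now an ordinary node.
for node in onnx.load("poisoned.onnx").graph.node:
    print(node.domain or "ai.onnx", node.op_type)
# The listing will include a line like:
#   MaliciousNamespace TriggeredDataExfil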

Technique 2: Abusing Graph Optimization

A more sophisticated attack involves creating a “logic bomb” out of standard operators. The attacker doesn’t introduce any custom code. Instead, they assemble a unique pattern of legitimate operators that, while appearing complex and inefficient, is functionally benign in the source framework. However, they have studied the converter’s optimization routines and know this specific pattern will be fused into a single, different operator that has abusable properties.

Figure 1: A benign sequence of standard operators in the source model (Pad, Slice, Concat applied to the input) is fused by the converter’s operator-fusion pass into a single malicious operator (MaliciousFusedOp) in the target model.
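The sketch below assembles the kind of pad/slice/concat pattern shown in Figure 1. In PyTorch it is a functional no-op and passes any behavioral test; the attack scenario assumes the targeted converter version contains a fusion rule that collapses exactly this pattern into a different operator, so the class name and the specific pattern here are illustrative rather than tied to any real converter rule.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionBaitBlock(nn.Module):
    """A deliberately convoluted identity: pad, slice, and re-concatenate.

    Functionally this returns its input unchanged, so it looks benign in the
    source framework. Its only purpose is to present a node pattern that a
    (hypothetical) converter optimization pass recognizes and fuses into a
    single replacement operator.
    """
    def forward(self, x):
        # Pad the last dimension by two elements on each side
        padded = F.pad(x, (2, 2))
        # Slice the padding back off in two pieces...
        left = padded[..., 2:-4]
        right = padded[..., -4:-2]
        # ...and concatenate to reconstruct the original tensor
        return torch.cat([left, right], dim=-1)

x = torch.randn(1, 8)
assert torch.equal(FusionBaitBlock()(x), x)  # benign in the source framework

Because the pattern is built entirely from standard operators, neither source-code review nor a custom-operator scan will flag it; only the converted graph reveals the substitution.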

Red Team Playbook: Targeting the Conversion Pipeline

As a red teamer, your goal is to demonstrate the risk of an untrusted conversion step. This requires moving beyond simple model file analysis and targeting the MLOps process itself.

Phase: Reconnaissance
  Red Team Action: Identify the exact converter tool and version used in the target’s MLOps pipeline (e.g., `torch.onnx.export` opset 13, `tf2onnx` v1.9.3). Analyze its source code or documentation for exploitable optimization rules or custom op handlers.
  Defensive Counterpart / Detection Signal: Maintain a strict Bill of Materials (BOM) for all MLOps tooling. Pin tool versions and audit changes between versions.

Phase: Weaponization
  Red Team Action: Craft a source model containing the trigger. This could be a custom operator with a malicious `symbolic` export function or a carefully constructed sequence of standard operators designed to be fused into a malicious one.
  Defensive Counterpart / Detection Signal: Code scanning and model analysis tools that can parse and flag the use of unapproved custom operators or recognize known malicious graph patterns.

Phase: Delivery
  Red Team Action: Introduce the poisoned source model into the supply chain. This could be via a public model hub like Hugging Face, or by compromising an internal repository. The model appears benign to all standard tests in its source format.
  Defensive Counterpart / Detection Signal: Vet all third-party models. Perform behavioral testing in a sandboxed environment before integration.

Phase: Execution
  Red Team Action: The attack executes when the target’s automated pipeline ingests the model and runs the conversion tool. The tool itself, acting as intended, builds the malicious payload into the final, deployed artifact.
  Defensive Counterpart / Detection Signal: Post-conversion validation is critical. Do not blindly trust the output of a converter. The converted model must be treated as a new, untrusted artifact.

Phase: Post-Exploitation
  Red Team Action: The backdoor in the converted model is activated by a specific trigger input in the production environment, leading to data exfiltration, misclassification for denial of service, or other objectives.
  Defensive Counterpart / Detection Signal: Perform structural and behavioral analysis on the converted model. Diff the graph structure against the source and investigate unexpected new operators (see the sketch after this table). Run inference tests with a security-focused dataset designed to find triggers.
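As a simpler stand-in for a full graph diff, the sketch below checks a converted ONNX artifact against an allowlist of operator domains and flags anything outside it, which would catch the `MaliciousNamespace` operator from Technique 1. The allowlist, file path, and function name are illustrative.

import onnx

# Operator domains the deployment runtime is expected to provide.
# Illustrative allowlist -- tailor it to your own target runtime.
ALLOWED_DOMAINS = {"ai.onnx", "ai.onnx.ml"}

def audit_converted_model(path: str) -> list[str]:
    """Return findings for operators outside the approved domains."""
    model = onnx.load(path)
    findings = []
    for node in model.graph.node:
        domain = node.domain or "ai.onnx"  # empty string means the default ONNX domain
        if domain not in ALLOWED_DOMAINS:
            findings.append(
                f"unexpected operator {domain}::{node.op_type} "
                f"(node: {node.name or '<unnamed>'})"
            )
    return findings

if __name__ == "__main__":
    for finding in audit_converted_model("poisoned.onnx"):
        print("ALERT:", finding)

A stricter variant would also diff the converted node list against an export produced in a trusted environment, as the Post-Exploitation row of the playbook suggests, since a fused standard-domain operator would pass a domain allowlist alone.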

Key Takeaways

  • The model conversion process is a compilation step, not a file save. It is an active transformation that can be manipulated.
  • Attack payloads can be designed to be completely dormant and undetectable in the source model format, only materializing after conversion.
  • Auditing your MLOps pipeline and the specific conversion tools you use is as important as auditing the models themselves.
  • The most effective defense is a “zero-trust” approach to conversion: always treat the output of a converter as a new, potentially hostile artifact that requires its own independent validation.