Your GPU is Not Your Friend: A Red Teamer’s Guide to Shared Graphics Hardware Security
Let’s get something straight. That shiny, multi-thousand-dollar GPU humming away in your data center isn’t just a number-crunching beast for your AI models. It’s a vault. And inside that vault are your company’s most valuable secrets: your proprietary algorithms, your sensitive training data, your customers’ private information.
And you’ve left the vault door wide open.
For years, we’ve obsessed over CPU security. We’ve built sandboxes, hypervisors, and a dozen layers of defense to keep processes from messing with each other. But the GPU? We’ve treated it like a dumb peripheral, a simple accelerator. We throw multiple jobs, multiple containers, multiple users onto the same physical card and just… hope for the best.
Hope is not a security strategy. It’s a liability.
As red teamers, we love this blissful ignorance. While you’re optimizing your CUDA kernels, we’re figuring out how to make your GPU whisper its secrets to us. And it’s shockingly easy.
The New Battlefield: Why Your GPU is a Gold Mine
Why the sudden panic? Because the entire compute paradigm has tilted. The GPU is no longer just rendering video games or powering CAD software. It has become the heart of modern computing, the engine of the AI revolution. All your most critical workloads—model training, large-scale inference, scientific simulation—run there.
This makes the GPU a ridiculously valuable target. Think about what an attacker gains by compromising a shared GPU:
- Model Theft: They can steal that multi-million dollar LLM you just spent six months training. Not by downloading a file, but by cleverly probing it while it runs, reconstructing its architecture and weights. This is happening.
- Data Exfiltration: They can slurp up sensitive data being processed by other tenants on the same card. Think medical records, financial data, user PII. If it passes through GPU memory (VRAM), it’s potentially visible.
- Compute Hijacking: They can use your expensive hardware for their own purposes, like crypto mining. You get the electricity bill; they get the cash.
- Persistent Access: A compromised GPU driver or firmware is a nightmare. It’s a stealthy, high-privilege foothold deep inside your host system, often invisible to traditional security tools.
The very thing that makes GPUs so powerful—their massively parallel architecture and shared resource design—is what makes them so vulnerable. A CPU is designed like a series of private, locked offices. A GPU is designed like a giant, open-plan factory floor. Every worker (a CUDA core) has access to shared machinery (caches, memory controllers). It’s built for speed and collaboration, not secrecy and isolation.
And you just let a dozen different, untrusted tenants work on that same factory floor, side-by-side.
What could possibly go wrong?
The Shared Tenant Nightmare
The root of all evil in GPU security is multi-tenancy. This is any scenario where more than one “thing” is running on a single physical GPU. These “things” could be different users, different Docker containers, or different virtual machines.
You’re doing this right now. Don’t tell me you dedicate a whole A100 to a single Jupyter notebook. You’re packing tenants onto these cards to maximize utilization and justify their eye-watering cost. It’s the smart financial decision. It’s also a security minefield.
The fundamental problem is the illusion of isolation. You run your job in a Docker container, and you think you’re safe. You’re not. A container is just a clever use of Linux namespaces and cgroups. It isolates the process view, the filesystem, the network stack. It does almost nothing to isolate hardware access. From the GPU’s perspective, your containerized process is just another process running on the host, begging for resources.
Think of a shared GPU like a communal kitchen in a big, trendy co-living space. Your container is your personal cupboard where you keep your ingredients. It feels private. But everyone uses the same stove (the Streaming Multiprocessors), the same sink (the memory bus), and the same limited counter space (the L2 cache).
What happens when your neighbor is a slob? They leave a mess on the counter, making it harder for you to work. That’s a Denial of Service attack. What happens if they’re a thief? They watch you cook, note your secret family recipe, and then steal your truffle oil when you’re not looking. That’s a side-channel attack and data leakage.
This isn’t theoretical. Let’s dive into the specific ways we, the red teamers, turn this shared kitchen into our personal playground.
Attack Vectors: How We Break Your Toys
An attack on a GPU isn’t a single, dramatic event. It’s often a subtle, insidious process. We don’t need to blow the door off its hinges when we can just pick the lock or look through the keyhole.
1. Side-Channel Attacks: The Art of Eavesdropping
This is the classiest, most elegant form of attack. We don’t break any rules. We don’t exploit a software bug. We just… observe.
A side-channel attack is where an attacker gains information by analyzing the physical effects of a system’s operation. It’s not about the what (the data), but the how (how the data is processed). How long did it take? How much power did it draw? Which parts of the hardware were used?
The most fertile ground for this on a GPU is the shared cache.
Here’s a simplified scenario. Imagine an LLM is running on the GPU, processing sensitive user queries. It belongs to Tenant A. We are Tenant B, a seemingly harmless process running on the same GPU. We can’t read Tenant A’s memory directly—the OS and hardware provide some basic protection there.
But we can do this:
- Flood the Cache: We write a simple kernel that fills up a specific portion of the shared L2 cache with our own junk data.
- Wait: We wait for a tiny fraction of a second, allowing Tenant A’s LLM to run and process its data. When it runs, it will inevitably need to use the L2 cache, evicting some of our junk data to make room for its own.
- Measure: We now try to read our junk data back. If a piece of data is still in the cache, reading it is extremely fast. If it was evicted by Tenant A, we have to fetch it from the much slower VRAM. We time how long it takes to read each piece of our data.
- Infer: By seeing which parts of our data were evicted, we can create a map of the cache lines that Tenant A’s process used. This pattern of memory access is a fingerprint. Over thousands of observations, this fingerprint can reveal the structure of the model, the type of operations being performed, and in some cases, even aspects of the data itself.
We haven’t read their data. We’ve just observed the “dents” they left in the shared counter space. And from those dents, we can reconstruct the recipe.
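To make the prime-wait-probe loop concrete, here is a deliberately toy Python simulation of the logic. No real GPU, cache, or timer is involved: the “cache” is a plain set of line indices, and the timing measurement is replaced by a direct hit/miss check. Every name here is illustrative, not a real API.

```python
# Toy model of a prime+probe cache side channel. The shared L2 is
# modeled as a set of line indices currently holding attacker data.

CACHE_LINES = 16  # pretend the shared cache has 16 lines

def prime(cache):
    """Attacker fills every cache line with their own data."""
    cache.clear()
    cache.update(range(CACHE_LINES))

def victim_runs(cache, secret_lines):
    """The victim's memory accesses evict the attacker's entries."""
    for line in secret_lines:
        cache.discard(line)  # attacker's data pushed out of this line

def probe(cache):
    """Attacker re-reads their data; a miss (slow read on real
    hardware) reveals that the victim used that line."""
    return sorted(line for line in range(CACHE_LINES) if line not in cache)

cache = set()
prime(cache)
victim_runs(cache, secret_lines={2, 5, 11})  # the victim's hidden pattern
leaked = probe(cache)
print(leaked)  # the attacker recovers exactly which lines the victim touched
```

On real hardware, the probe step distinguishes hits from misses by timing each read; the logic of the inference is the same.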
2. Data Leakage and Memory Remnants
This one is less elegant and more brutish. It relies on a simple, dirty truth: memory is not always cleared immediately.
When your process finishes using a chunk of VRAM and “releases” it, what happens? The driver simply marks that memory region as “available.” It doesn’t necessarily go and write zeros over all the old data; that would be slow and inefficient. The old bits simply sit there until another process is allocated that region and overwrites them.
This creates a window of opportunity. If an attacker can quickly request a large amount of VRAM right after a victim process has finished, there’s a non-trivial chance they will be allocated memory pages that still contain remnants of the victim’s data. This is called a “memory remnant” attack.
Golden Nugget: “Freeing” memory is a logical operation, not a physical one. The bits of your secret data can remain physically present in VRAM long after your application has exited. An attacker is just a digital archaeologist, sifting through the ruins.
We’ve seen this in the wild. A red team exercise against a financial services company involved one of our operators running a “VRAM scavenger” script on a shared GPU server. They were able to recover fragments of financial models and customer portfolio data from VRAM that had been used by a legitimate, high-priority batch job just moments before.
The fix sounds simple—always zero-out memory on deallocation—but it has a performance cost, which is why it’s often not the default behavior. You’re trading a tiny bit of speed for a gaping security hole.
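The trade-off can be sketched with a toy allocator in Python. This models only the logic of the bug (free marks a page available without scrubbing it), not any real driver interface; the class and its methods are invented for illustration.

```python
# Toy allocator modeling why freed VRAM can leak: free() only marks a
# page as available, it does not scrub it, unless zero_on_free is set.

class ToyAllocator:
    PAGE = 4  # bytes per page in this toy model

    def __init__(self, pages, zero_on_free=False):
        self.memory = {p: b"\x00" * self.PAGE for p in range(pages)}
        self.free_pages = set(range(pages))
        self.zero_on_free = zero_on_free

    def alloc(self):
        return self.free_pages.pop()

    def write(self, page, data):
        self.memory[page] = data

    def read(self, page):
        return self.memory[page]

    def free(self, page):
        if self.zero_on_free:
            self.memory[page] = b"\x00" * self.PAGE  # scrub before reuse
        self.free_pages.add(page)

# Victim writes a secret, then frees the page without scrubbing.
leaky = ToyAllocator(pages=1)
p = leaky.alloc()
leaky.write(p, b"KEYS")
leaky.free(p)

# Attacker allocates next and simply reads what was left behind.
q = leaky.alloc()
print(leaky.read(q))  # b'KEYS' -- the remnant survives

# With zero-on-free enabled, the same scavenge yields nothing.
safe = ToyAllocator(pages=1, zero_on_free=True)
r = safe.alloc()
safe.write(r, b"KEYS")
safe.free(r)
print(safe.read(safe.alloc()))  # all zeros
```

The extra write in the `zero_on_free` branch is exactly the performance cost the drivers are avoiding by default.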
3. Denial of Service (DoS): The Noisy Neighbor
The least sophisticated but most common problem. A DoS attack on a GPU doesn’t require clever exploits. All it requires is a badly behaved process that hogs shared resources.
An attacker can write a “CUDA hog” kernel that does nothing useful but is specifically designed to saturate a key resource:
- Compute Saturation: Launch a kernel with an infinite loop across all SMs. Legitimate jobs get starved for compute cycles.
- Memory Bandwidth Saturation: Constantly stream huge amounts of data between the GPU and host CPU, or within VRAM, choking the memory bus so other kernels can’t fetch their data.
- PCIe Bus Saturation: Similar to the above, but focused on the link between the GPU and the rest of the system.
This isn’t just a nuisance. In a cloud environment where you pay by the second for GPU time, a DoS attack is a direct financial drain. Your legitimate workloads slow to a crawl, jobs time out, and your bill skyrockets. The attacker hasn’t stolen your data, but they’ve hit you where it hurts: your wallet and your service’s availability.
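Detecting the noisy neighbor is more useful than demonstrating one. A minimal sketch of the detection side, assuming you already have a stream of utilization samples: flag the GPU when utilization stays pinned near 100% for a full sliding window. The window size and threshold here are made-up values you would tune per workload.

```python
# Sketch of a "noisy neighbor" detector over a utilization time series.
# A sustained pin near 100% is suspicious; a short burst is normal.

from collections import deque

def make_hog_detector(window=5, threshold=98.0):
    """Return a feed() function; it reports True once `window`
    consecutive samples are all at or above `threshold` percent."""
    samples = deque(maxlen=window)

    def feed(util_percent):
        samples.append(util_percent)
        return len(samples) == window and min(samples) >= threshold

    return feed

feed = make_hog_detector()
readings = [40, 99, 100, 100, 99, 100, 100]  # bursty start, then pinned
flags = [feed(u) for u in readings]
print(flags)  # stays False until five consecutive saturated samples
```

The same windowed-minimum pattern works for PCIe bandwidth or any other resource counter you can sample.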
4. Kernel-Level Exploits: The Game Over Scenario
This is the holy grail. The GPU doesn’t operate in a vacuum. It communicates with the CPU and the rest of the system through a highly complex piece of software called the GPU driver. This driver runs with the highest privileges in the operating system’s kernel (Ring 0).
A bug in the GPU driver is a golden ticket for an attacker. If a malicious user running in a low-privilege container can find a way to send a malformed command or piece of data to the GPU that triggers a bug in the driver, they can often achieve one of two things:
- Crash the entire host machine (a blunt but effective host-wide DoS).
- Achieve code execution within the kernel. This is the jackpot. From here, they can escape their container, disable all security mechanisms on the host, and take complete control of the server and every other tenant on it.
These aren’t hypothetical. Vulnerabilities are found in NVIDIA and AMD drivers all the time. CVEs are issued. Patches are released. But are you applying them? In the frantic rush to deploy models, driver and firmware updates are often seen as a chore, a risk to stability. For us, that’s a welcome mat.
The Armory: A Defense-in-Depth Strategy
Feeling a bit sick? Good. A healthy dose of paranoia is the first step. Now, let’s talk about building a fortress. You can’t just have one big wall; you need layers of defense, from the silicon all the way up to the application.
1. Hardware-Level Isolation: The Fort Knox Approach
The only real way to solve the “shared kitchen” problem is to give everyone their own private, locked kitchenette. In the GPU world, that means hardware partitioning.
The gold standard here is NVIDIA’s Multi-Instance GPU (MIG) technology, available on their data center cards like the A100 and H100. MIG is not software. It’s not a simulation. It physically slices the GPU into up to seven independent “GPU instances.” Each instance gets its own dedicated:
- Streaming Multiprocessors (the compute engines)
- Slice of the L2 cache
- Memory controllers and VRAM bandwidth
Crucially, it provides fully isolated error and fault containment. If one MIG instance crashes, it doesn’t affect the others. From the OS’s perspective, a single physical GPU with seven MIG instances looks like seven separate, smaller GPUs.
This is the closest you can get to true, secure multi-tenancy. A process running in one MIG instance is physically incapable of launching a cache side-channel attack or a resource-hogging DoS against a process in another instance. The shared resources simply aren’t shared anymore.
The trade-off? Performance. A single MIG instance doesn’t have the power of the full GPU. You’re trading raw, single-job performance for secure, parallel execution. For many inference workloads or smaller training jobs, this is a fantastic trade. For a massive foundation model training run, you’d still want the whole card to yourself.
2. Software and OS-Level Hardening
Hardware isolation is great, but it’s not always available or configured. You still need robust software defenses.
- Virtualization (vGPU): Before MIG, this was the state-of-the-art. Technologies like NVIDIA vGPU work with hypervisors (like VMware ESXi or KVM) to provide isolated GPU access to full virtual machines. The hypervisor creates a much stronger security boundary than a container does. It carries more overhead and is more complex to manage, but it’s a massive step up from bare-metal containerization.
- Container Hardening: Don’t abandon the basics!
  - Run as non-root: There is almost no reason your ML code needs to run as root inside a container.
  - Minimal Base Images: Don’t use ubuntu:latest. Build your images from scratch or from a minimal base like Distroless. Less software means a smaller attack surface.
  - Use Security Runtimes: Runtimes like gVisor or Kata Containers add an extra layer of sandboxing between the container and the host kernel: gVisor intercepts system calls in userspace, while Kata wraps each container in a lightweight VM. This can help mitigate kernel driver exploits, but be aware of potential performance impacts.
- Patch. Your. Drivers. I can’t say this enough. Set up a regular, disciplined schedule for updating your GPU drivers, firmware, and CUDA toolkit. Read the release notes. They often contain critical security fixes. Yes, it can be a pain. It’s less painful than a breach.
A Quick Word on MPS (Multi-Process Service): You might have heard of MPS as a way to run multiple CUDA applications on a single GPU. MPS IS NOT A SECURITY FEATURE. In fact, it’s closer to the opposite. On pre-Volta hardware, MPS clients share a single GPU context and address space; even on newer architectures, where clients get separate address spaces, they still share scheduling and fault domains, making it easier for processes to interfere with one another. It’s a performance tool for trusted processes, not a security tool for untrusted tenants. Don’t confuse the two.
3. Monitoring and Anomaly Detection: The Security Cameras
You can’t protect what you can’t see. If your only interaction with your GPU is firing off a job and waiting for it to finish, you’re flying blind. You need to be actively monitoring its vitals.
Start with the basics. The nvidia-smi command-line tool is your best friend. But for serious, automated monitoring, you need to use something like NVIDIA’s Data Center GPU Manager (DCGM). DCGM exposes hundreds of metrics that you can scrape with tools like Prometheus and visualize in Grafana.
What should you be looking for? You’re hunting for anomalies—deviations from the baseline.
| Metric to Monitor | What a Potential Attack Looks Like | Why It’s Suspicious |
|---|---|---|
| GPU Utilization | Sustained 100% utilization when you expect idle periods or bursty workloads. | Could be a “CUDA hog” DoS kernel or unauthorized crypto mining. |
| Memory Utilization | A process rapidly allocating and deallocating huge chunks of VRAM. | This is the classic pattern of a “memory remnant” scavenger trying to grab memory pages before they’re cleared. |
| PCIe Bandwidth | Unusually high, sustained traffic on the PCIe bus, especially when the GPU itself isn’t under heavy compute load. | Could indicate data exfiltration, where an attacker is streaming data from VRAM back to the host CPU and out to the network. |
| L2 Cache Hit/Miss Rate | A sudden, drastic drop in the cache hit rate for your legitimate application. | A tell-tale sign of cache contention or a side-channel attack, where another process is constantly flushing your useful data out of the cache. |
| Xid Errors | Any non-zero count of these critical error events. | Xid errors indicate something has gone seriously wrong at the driver or hardware level. It could be a hardware fault, but it could also be a symptom of a malicious process sending malformed commands to trigger a driver bug. Investigate these immediately! |
Setting up alerting on these metrics is crucial. You should get a Slack message or a PagerDuty alert the moment a GPU’s behavior deviates significantly from its normal operating profile.
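As a starting point for that alerting pipeline, here is a small Python sketch that parses the kind of per-GPU CSV output `nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv,noheader,nounits` produces and applies a naive threshold check. The sample string and the threshold are placeholders; in production you would scrape DCGM metrics instead of shelling out.

```python
# Parse nvidia-smi-style CSV (one line per GPU: utilization %, memory MiB)
# and flag GPUs over a utilization threshold. Sample data is hard-coded.

SAMPLE = """\
100, 40536
12, 2048
99, 81042
"""

def parse_gpus(csv_text):
    """Turn 'util, mem' CSV lines into a list of dicts, one per GPU."""
    gpus = []
    for line in csv_text.strip().splitlines():
        util, mem = (int(field) for field in line.split(","))
        gpus.append({"util": util, "mem_mib": mem})
    return gpus

def alerts(gpus, util_threshold=95):
    """Return the indices of GPUs at or above the threshold."""
    return [i for i, g in enumerate(gpus) if g["util"] >= util_threshold]

gpus = parse_gpus(SAMPLE)
print(alerts(gpus))  # GPUs 0 and 2 exceed the placeholder threshold
```

Wire the output of a check like this into your existing alerting (Prometheus alert rules, a Slack webhook, PagerDuty) rather than building something bespoke.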
Putting It All Together: A Practical Decision Framework
So, what should you use? It depends entirely on your threat model. Who are your “tenants,” and how much do you trust them?
| Isolation Level | Key Technology | Pros | Cons | Best For… |
|---|---|---|---|---|
| None (Bare Metal) | Direct CUDA Access | Maximum performance, simplicity. | Zero isolation. Extremely high risk. | A single, dedicated server for one trusted job (e.g., foundation model training). Not for sharing. Ever. |
| Weak (Process-level) | Standard Docker Containers | Easy to use, low overhead, ecosystem support. | Provides a false sense of security. Does not isolate hardware resources. | Running multiple processes of the same trusted application that need to cooperate. Unsuitable for untrusted tenants. |
| Moderate (Hypervisor) | NVIDIA vGPU + KVM/ESXi | Strong security boundary via the hypervisor. Mature technology. | Significant performance overhead. Complex to set up and manage. | Enterprise environments with internal teams who need isolated VDI or compute instances. A good middle-ground. |
| Strong (Hardware) | NVIDIA MIG / AMD SR-IOV | “Gold Standard” isolation. Hardened in silicon. Low performance overhead for the isolation provided. | Reduces the max size of a single job. Requires specific, expensive hardware. | Public cloud providers, large enterprises, and any zero-trust environment where you are running code from untrusted customers or users. |
The Final Word
For a decade, we’ve been building skyscrapers of software on a foundation of sand. We’ve assumed the hardware underneath was a simple, trustworthy utility. That assumption is now dangerously obsolete.
The attacks are no longer theoretical. They are practical, they are happening, and they are becoming more sophisticated every day. As an industry, we are just beginning to wake up to the reality of hardware security in the AI age.
Stop treating your GPUs like disposable compute resources. Start treating them like the high-security vaults they are. The barrier to entry for these attacks is dropping, and the value of the assets you’re protecting is skyrocketing. That’s a dangerous combination.
Your GPU isn’t just running code; it’s guarding the crown jewels. Treat it that way.