Running GPU workloads on Kubernetes is a bit like sharing a single power outlet among fifteen roommates who all need to charge their phones at the same time. Someone always ends up with a dead battery. HAMi, the CNCF sandbox project formerly known as k8s-vGPU-scheduler, exists to solve exactly this problem. It lets multiple pods share the same GPU, carving up memory and compute so your expensive hardware does not sit idle at 3% utilization while three teams wait in a queue.
HAMi v2.9.0 just shipped, and it is a big one. This release adds production-ready support for Kubernetes Dynamic Resource Allocation (DRA) with NVIDIA GPUs, introduces a brand-new HAMi-core mode for Huawei Ascend NPUs, and expands to Vast.ai as a supported device provider. If your cluster runs AI workloads and you are not sharing GPUs yet, this is the release that makes the case impossible to ignore.
What is HAMi, and Why Should You Care?
HAMi is a GPU sharing and virtualization layer for Kubernetes. Think of it as a smart bouncer for your GPUs. Instead of one pod hogging an entire A100, HAMi splits the GPU into virtual slices so multiple workloads can run simultaneously without stepping on each other. It supports NVIDIA, AMD, Huawei Ascend, Hygon DCU, Iluvatar, and Moore Threads devices, and it hooks into the Kubernetes scheduler to make sure the right pods land on the right hardware.
The problem it solves: GPUs are absurdly expensive and most workloads use a fraction of their capacity. Without GPU sharing, you are paying for a Ferrari and driving it to the mailbox. HAMi lets you pack more workloads onto the same hardware, cutting costs and reducing wait times.
Who uses it: Platform engineers and SREs running ML training, inference, and GPU-accelerated workloads in Kubernetes. Anyone who has ever looked at their GPU utilization dashboard and cried.
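To make the idea of a slice concrete, here is a rough sketch of a pod that asks for part of one GPU through HAMi's device-plugin path. It uses the extended resource names from HAMi's default NVIDIA configuration (`nvidia.com/gpu`, `nvidia.com/gpumem`, `nvidia.com/gpucores`). The pod name, command, and numbers are placeholders, and clusters can rename these resources, so treat it as an illustration rather than a copy-paste manifest.

```yaml
# Sketch: request a slice of one GPU through HAMi's device-plugin path.
# Resource names assume HAMi's default NVIDIA configuration.
apiVersion: v1
kind: Pod
metadata:
  name: shared-gpu-demo            # hypothetical name
spec:
  containers:
  - name: app
    image: pytorch/pytorch:latest
    command: ["sleep", "infinity"]
    resources:
      limits:
        nvidia.com/gpu: 1          # number of vGPUs the container sees
        nvidia.com/gpumem: 4000    # device memory per vGPU, in MB
        nvidia.com/gpucores: 30    # share of GPU compute, in percent
```

Several pods like this can land on the same physical GPU as long as their memory and core requests fit together.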
What is New in v2.9.0
HAMi-DRA for NVIDIA Is Production-Ready
Dynamic Resource Allocation (DRA) is Kubernetes’ next-generation device management API, designed to overcome the limits of the older device plugin mechanism, which can only advertise devices as simple counted resources. HAMi-DRA for NVIDIA has been in development for a while, and v2.9.0 marks it as ready for production use.
This is a big deal because DRA is the future of how Kubernetes handles specialized hardware. Instead of the old model where device plugins advertise resources and the scheduler just picks a node, DRA gives the scheduler much richer information about what devices are available, their topology, and how they can be composed together. HAMi-DRA bridges GPU sharing with this modern API.
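```yaml
# DRA-based GPU sharing: a pod references a ResourceClaimTemplate by name.
# The template is a separate object; it cannot be inlined in the pod spec.
apiVersion: resource.k8s.io/v1beta1   # v1beta2 and v1 nest deviceClassName under "exactly:"
kind: ResourceClaimTemplate
metadata:
  name: gpu-share
spec:
  spec:
    devices:
      requests:
      - name: hami-gpu
        deviceClassName: gpu.nvidia.com
---
apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload
spec:
  containers:
  - name: app
    image: pytorch/pytorch:latest
    resources:
      claims:
      - name: gpu-share               # must match an entry in resourceClaims below
  resourceClaims:
  - name: gpu-share
    resourceClaimTemplateName: gpu-share
```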
PR #1845 bumps HAMi-DRA to v0.2.0 as part of this release.
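The `deviceClassName` in the claim above points at a cluster-scoped DeviceClass object, which selects devices with a CEL expression. A minimal sketch of such an object follows. The API version depends on your Kubernetes release, and the driver string is an assumption chosen to match the class name, not something taken from the HAMi-DRA documentation. In practice the driver's installer usually creates this object for you.

```yaml
# Sketch of the DeviceClass a claim refers to by name.
apiVersion: resource.k8s.io/v1beta1   # newer clusters also serve v1beta2 / v1
kind: DeviceClass
metadata:
  name: gpu.nvidia.com
spec:
  selectors:
  - cel:
      expression: device.driver == "gpu.nvidia.com"   # assumed driver name
```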
HAMi-Core Mode for Huawei Ascend Devices
If your infrastructure runs on Huawei Ascend NPUs, this is the headline feature. HAMi v2.9.0 introduces HAMi-core mode for Ascend devices, bringing the same GPU virtualization capabilities that NVIDIA users have enjoyed to the Ascend ecosystem.
This is not just a basic port. The release adds a full virtualization layer with vnpu-core support (PR #1771), enabling fine-grained core-level slicing of Ascend NPUs. You can now define custom resource names for Ascend cores (PR #1804), filter nodes based on vnpu-core annotations (PR #1812), and request multiple devices with vnpu-core enabled (PR #1837).
For SuperPod environments running Ascend 910C devices, there is even support for module-pair allocation (PR #1610), which handles the specific topology requirements of large-scale Ascend deployments.
Vast.ai Device Support
Vast.ai is a GPU cloud marketplace that lets you rent GPU capacity from individual hosts around the world at prices that make cloud providers look like luxury hotels. HAMi v2.9.0 now supports Vast.ai as a first-class device provider (PR #1645), which means you can run HAMi’s GPU sharing on Vast.ai instances just like you would on your own bare metal.
For teams experimenting with burst GPU capacity or running cost-optimized inference workloads, this integration opens up a new world of options.
Better Observability with Prometheus
What good is GPU sharing if you cannot see what is happening? HAMi v2.9.0 adds Prometheus ServiceMonitor support for both the Helm chart (PR #1614) and the device plugin (PR #1633), along with metric and label names aligned to Prometheus best practices (PR #1644).
The release also adds a vGPUmonitor metrics-bind-address flag (PR #1613) so you can control where metrics are exposed, and a vgpu_metrics_summarizer skill (PR #1755) for quick diagnostics.
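For readers who have not used the Prometheus Operator, a ServiceMonitor is a small custom resource that tells Prometheus which Services to scrape. The sketch below shows the general shape of one. The object name, namespace, label selector, and port name are all assumptions made for illustration; the objects rendered by HAMi's Helm chart may differ.

```yaml
# Sketch of a Prometheus Operator ServiceMonitor; all names are illustrative.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: hami-device-plugin           # hypothetical name
  namespace: kube-system             # assumed install namespace
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: hami   # assumed Service label
  endpoints:
  - port: metrics                    # assumed name of the Service port
    path: /metrics
    interval: 30s
```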
Resource Quota Enforcement in Webhooks
Previously, HAMi’s webhook validated device requests but did not check whether the requesting namespace actually had enough quota. v2.9.0 fixes this with resource quota checking directly in the webhook (PR #1605). Now, when a pod requests GPU resources that would exceed the namespace’s quota, the webhook rejects it immediately instead of letting it through to fail at scheduling time. Less mystery, fewer support tickets.
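A namespace quota for GPU resources can be sketched with a standard Kubernetes ResourceQuota, which accepts extended resources under the `requests.` prefix. The names and the limit below are placeholders. Which quota keys the HAMi webhook actually evaluates (for example, memory or core resources) is not spelled out in the release notes, so check the HAMi documentation before relying on this shape.

```yaml
# Sketch: cap the number of vGPUs a namespace may request.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota                  # hypothetical name
  namespace: team-a                # hypothetical namespace
spec:
  hard:
    requests.nvidia.com/gpu: "4"   # at most 4 vGPUs requested across the namespace
```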
CDI Support and Volcano Integration
The Container Device Interface (CDI) is emerging as the standard way for runtimes to discover and configure devices. HAMi v2.9.0 syncs the Volcano vGPU device plugin with version 0.19 and adds CDI support, bringing HAMi in line with the direction container runtimes are heading.
Local Deploy for Minikube and Kind
Trying HAMi used to require a real cluster with real GPUs. Not anymore. A new local-deploy target (PR #1760) lets you spin up HAMi on minikube or kind clusters for testing and development. This dramatically lowers the barrier to entry for anyone curious about GPU sharing but without access to GPU-equipped infrastructure.
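If you prefer to create the kind cluster yourself before installing HAMi, a config along these lines is a reasonable starting point. It is only a sketch: the cluster name is made up, the `gpu: "on"` label follows the node-labeling convention in HAMi's install docs, and nothing here is taken from the new local-deploy target itself.

```yaml
# Sketch of a kind cluster config with one labeled worker node.
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
name: hami-dev               # hypothetical cluster name
nodes:
- role: control-plane
- role: worker
  labels:
    gpu: "on"                # HAMi's install docs label GPU nodes with gpu=on
```

Saved as a file, this can be passed to `kind create cluster --config <file>`.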
Bug Fixes
The release also includes 33+ bug fixes. A few notable ones:
- vLLM tensor parallelism fix — Initialization errors when using tensor parallelism on vLLM versions newer than 0.18 are now resolved.
- Scheduler slot prediction — Corrected slot usage prediction and added device type filtering (PR #1700).
- Multi-container init containers — Fixed device allocation for multi-container pods with init containers (PR #1650).
- MIG with CDI mode — Allocation failures when using MIG in CDI mode are now fixed (PR #1826).
- Kernel 6.17 handshake — Edge cases in NVIDIA health checks on Kernel 6.17 are now handled (PR #1810).
- Stale handshake recovery — Scheduling now recovers on nodes with stale Deleted_ handshake state (PR #1843).



