Monitors pods on GPU-accelerated nodes in a Kubernetes cluster and updates those nodes with chassis and GPU serial number labels. Supports exporting serial numbers to various state backends for tracking, monitoring, and analysis.
GPU-accelerated Kubernetes nodes in operator-managed services (e.g. EKS in AWS or GKE in GCP) are ephemeral VMs that run on physical hosts which change over time. Multiple VMs may run on a single physical host over time, so to preserve break-fix context for these nodes it's crucial to:
- Track GPU health and utilization across physical hardware
- Correlate GPU performance issues with specific hardware units
- Maintain audit trails for GPU resource allocation
- Monitor GPU lifecycle in multi-tenant environments
gpuid provides a lightweight, scalable solution for GPU inventory management in Kubernetes clusters.
- HTTP, PostgreSQL DB, and S3 exporters
- Connection pooling, retry logic, health checks
- Structured logging with contextual information
- Prometheus-compatible metrics for observability
- SLSA build attestation and Sigstore attestation validation
- Node labels with the GPU and chassis serial numbers
```
# H100 (no chassis):
gpuid.github.com/gpu-0=1652823054567
gpuid.github.com/gpu-1=1652823055642
gpuid.github.com/gpu-2=1652823055647
gpuid.github.com/gpu-3=1652823055931
gpuid.github.com/gpu-4=1652923033989
gpuid.github.com/gpu-5=1652923034028
gpuid.github.com/gpu-6=1652923034291
gpuid.github.com/gpu-7=1653023018213
# GB200:
gpuid.github.com/chassis=1821325191344
gpuid.github.com/gpu-0=1761025346025
gpuid.github.com/gpu-1=1761125340419
```

GB200 nodes have 4 GPUs but only 2 unique serial numbers. These GPUs use dual-die packaging, where two GPUs are stitched together with NVLink-C2C on the same module.
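Because the serial numbers are ordinary node labels, standard label selectors work for locating hardware. The commands below are a small sketch; the serial value and node name are placeholders taken from the examples above:

```shell
# Find the node whose first GPU slot currently carries a given serial number
kubectl get nodes -l gpuid.github.com/gpu-0=1652823054567

# List all gpuid labels on a single node (node name is a placeholder)
kubectl get node gpu-node-01 -o json \
  | jq '.metadata.labels | with_entries(select(.key | startswith("gpuid.github.com/")))'
```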
gpuid supports multiple data export backends:
- StdOut: Development and debugging (default)
- HTTP: POSTs to HTTP endpoints
- PostgreSQL: Batch inserts into PostgreSQL database
- S3: Puts CSV object into S3-compatible bucket
Type: stdout (default)
Purpose: Development and debugging, outputs JSON to stdout
Configuration: No additional environment variables required
```yaml
env:
  - name: CLUSTER_NAME
    value: 'validation'
```

Type: http
Purpose: Send GPU data to HTTP endpoints via POST requests
Features: Bearer token authentication, configurable timeouts, automatic retries
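For reference, each export is an HTTP POST of the JSON record documented later in this document. Conceptually it is equivalent to the curl call below, assuming a standard Authorization: Bearer header; the endpoint and payload values are the illustrative ones used elsewhere in this document:

```shell
# Rough equivalent of a single export request (values are illustrative)
curl -X POST 'https://api.example.com/gpu-data' \
  -H 'Authorization: Bearer <token>' \
  -H 'Content-Type: application/json' \
  -d '{"cluster":"production-cluster","node":"gpu-node-01","machine":"i-1234567890abcdef0","source":"gpu-operator/nvidia-device-plugin-abc123","gpu":"1234567890","time":"2025-09-10T10:30:45Z"}'
```

The exporter itself is configured with the following environment variables: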
```yaml
env:
  - name: EXPORTER_TYPE
    value: 'http'
  - name: CLUSTER_NAME
    value: 'validation'
  - name: HTTP_ENDPOINT
    value: 'https://api.example.com/gpu-data'
  - name: HTTP_TIMEOUT
    value: '30s'
  - name: HTTP_AUTH_TOKEN
    valueFrom:
      secretKeyRef:
        name: http-credentials
        key: token
```

Type: postgres
Purpose: Database storage with full ACID compliance
Features: Connection pooling, automatic schema management, batch processing
```yaml
env:
  - name: EXPORTER_TYPE
    value: 'postgres'
  - name: CLUSTER_NAME
    value: 'validation'
  - name: POSTGRES_PORT
    value: '5432'
  - name: POSTGRES_DB
    value: 'gpuid'
  - name: POSTGRES_TABLE
    value: 'serials'
  - name: POSTGRES_HOST
    valueFrom:
      secretKeyRef:
        name: db-credentials
        key: host
  - name: POSTGRES_USER
    valueFrom:
      secretKeyRef:
        name: db-credentials
        key: username
  - name: POSTGRES_PASSWORD
    valueFrom:
      secretKeyRef:
        name: db-credentials
        key: password
```

Type: s3
Purpose: Cloud storage with time-based partitioning
Features: Automatic partitioning, batch uploads, configurable prefixes
```yaml
env:
  - name: EXPORTER_TYPE
    value: 's3'
  - name: CLUSTER_NAME
    value: 'validation'
  # GPU Serial Number Provider
  - name: NAMESPACE
    value: 'gpu-operator'
  - name: LABEL_SELECTOR
    value: 'app=nvidia-device-plugin-daemonset'
  # S3 Exporter Configuration
  - name: S3_BUCKET
    value: 'gpuids'
  - name: S3_PREFIX
    value: 'serial-numbers'
  - name: S3_REGION
    value: 'us-east-1'
  - name: S3_PARTITION_PATTERN
    value: 'year=%Y/month=%m/day=%d/hour=%H'
  # AWS Credentials from Kubernetes Secret
  - name: AWS_ACCESS_KEY_ID
    valueFrom:
      secretKeyRef:
        name: s3-credentials
        key: AWS_ACCESS_KEY_ID
  - name: AWS_SECRET_ACCESS_KEY
    valueFrom:
      secretKeyRef:
        name: s3-credentials
        key: AWS_SECRET_ACCESS_KEY
```

To deploy gpuid:

- Download and expand either the zip or tar archive of the gpuid and policy artifacts from https://github.com/mchmarny/gpuid/releases/latest.
- Configure the deployment by updating the specific overlay that corresponds to your backend type:
  - stdout (default): deployments/gpuid/overlays/stdout/patch-deployment.yaml
  - http: deployments/gpuid/overlays/http/patch-deployment.yaml
  - postgres: deployments/gpuid/overlays/postgres/patch-deployment.yaml
  - s3: deployments/gpuid/overlays/s3/patch-deployment.yaml
- Apply the configuration
  Substitute the overlay for the desired backend:

  ```shell
  kubectl apply -k deployments/gpuid/overlays/stdout
  ```

- Verify deployment
  Make sure the exporter pod is running:

  ```shell
  kubectl -n gpuid get pods -l app=gpuid
  ```

  And review its logs:

  ```shell
  kubectl -n gpuid logs -l app=gpuid --tail=-1
  ```

gpuid emits structured logs in JSON format with contextual information.
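Each log line is a single JSON object. To see which fields are available, pretty-print a recent entry; the exact set of fields varies by event type:

```shell
# Pretty-print the most recent log entry from each gpuid pod
kubectl -n gpuid logs -l app=gpuid --tail=1 | jq .
```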
Since these logs are in JSON, you can filter them with jq for specific information, for example, error events:
```shell
kubectl -n gpuid logs -l app=gpuid --tail=-1 \
  | jq -r 'select(.level == "ERROR") | "\(.time) \(.msg) \(.error)"'
```

Or only the serial reading events:
```shell
kubectl -n gpuid logs -l app=gpuid --tail=-1 \
  | jq -r 'select(.msg == "gpu serial number reading")
    | "\(.chassis) \(.node) \(.machine) \(.gpu)"'
```

Once deployed, you can also query the new node labels. For example, to count the number of nodes per chassis:
```shell
kubectl get nodes -l nodeGroup=customer-gpu -o json \
  | jq -r '
    [ .items[]
      | {chassis: (.metadata.labels["gpuid.github.com/chassis"] // "na")}
    ]
    | group_by(.chassis)
    | map({(.[0].chassis): length})
    | add
  '
```
Example output:

```json
{
  "1821025191506": 9,
  "1821225190819": 7,
  "1821225192095": 9,
  "1821325191344": 9
}
```

To remove the deployment, delete the overlay you applied (substituting the backend you used):

```shell
kubectl delete -k deployments/gpuid/overlays/s3
```

The gpuid service exposes Prometheus-compatible metrics on the :8080/metrics endpoint:
- gpuid_export_success_total{exporter_type, node, pod}: Successful export operations
- gpuid_export_failure_total{exporter_type, node, pod, error_type}: Failed export operations
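A quick way to check these counters is to port-forward one of the gpuid pods and scrape the endpoint locally. This is a sketch, assuming the app=gpuid label and port 8080 used elsewhere in this document:

```shell
# Pick one gpuid pod and forward its metrics port locally
POD="$(kubectl -n gpuid get pods -l app=gpuid -o jsonpath='{.items[0].metadata.name}')"
kubectl -n gpuid port-forward "$POD" 8080:8080 &
sleep 2

# Show only the gpuid export counters
curl -s http://localhost:8080/metrics | grep '^gpuid_export'
```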
GPU serial number readings are exported in a consistent schema across all backends:
- cluster: Kubernetes cluster identifier where the GPUs were observed
- node: Kubernetes node name where the GPU was discovered
- machine: VM instance ID or physical machine identifier
- source: Namespace/Pod name that provided the GPU information
- gpu: GPU serial number from nvidia-smi
- read_time: Timestamp when the reading was taken (RFC3339 format)
When using the HTTP exporter, the request body contains the JSON-serialized record:

```json
{
  "cluster": "production-cluster",
  "node": "gpu-node-01",
  "machine": "i-1234567890abcdef0",
  "source": "gpu-operator/nvidia-device-plugin-abc123",
  "gpu": "1234567890",
  "time": "2025-09-10T10:30:45Z"
}
```

When using the PostgreSQL exporter, data is stored in the following table structure:
```sql
CREATE TABLE serials (
  id BIGSERIAL PRIMARY KEY,
  cluster VARCHAR(255) NOT NULL,
  node VARCHAR(255) NOT NULL,
  machine VARCHAR(255) NOT NULL,
  source VARCHAR(255) NOT NULL,
  chassis VARCHAR(255) NOT NULL,
  gpu VARCHAR(255) NOT NULL,
  read_time TIMESTAMP WITH TIME ZONE NOT NULL,
  created_at TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT NOW(),
  UNIQUE(cluster, node, machine, source, chassis, gpu, read_time)
);

-- Optimized indexes for common query patterns
CREATE INDEX idx_serials_cluster ON serials (cluster);
CREATE INDEX idx_serials_node ON serials (node);
CREATE INDEX idx_serials_read_time ON serials (read_time);
CREATE INDEX idx_serials_created_at ON serials (created_at);
```

A few example queries:
GPUs that have been used in more than one machine:

```sql
SELECT
  gpu,
  COUNT(DISTINCT machine) AS machines_per_gpu
FROM serials
GROUP BY gpu
HAVING COUNT(DISTINCT machine) > 1
ORDER BY gpu;
```

GPUs that moved across clusters:
```sql
SELECT
  gpu,
  COUNT(DISTINCT cluster) AS clusters_seen_in
FROM serials
GROUP BY gpu
HAVING COUNT(DISTINCT cluster) > 1
ORDER BY clusters_seen_in DESC;
```

Number of GPUs per day:
```sql
SELECT
  DATE(read_time) AS day,
  COUNT(DISTINCT gpu) AS unique_gpus
FROM serials
GROUP BY day
ORDER BY day;
```
The S3 exporter organizes data with time-based partitioning:

```
s3://bucket-name/prefix/
├── year=2025/month=09/day=10/hour=10/
│   ├── cluster=prod/node=gpu-node-01/20250910-103045-uuid.json
│   └── cluster=prod/node=gpu-node-02/20250910-103112-uuid.json
└── year=2025/month=09/day=10/hour=11/
    └── cluster=prod/node=gpu-node-01/20250910-110215-uuid.json
```
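To spot-check exported objects, the AWS CLI works against the bucket and prefix from the example configuration above; the object key below is illustrative:

```shell
# List objects exported into a given hour partition
aws s3 ls s3://gpuids/serial-numbers/year=2025/month=09/day=10/hour=10/ --recursive

# Print the contents of a single exported object (key is illustrative)
aws s3 cp s3://gpuids/serial-numbers/year=2025/month=09/day=10/hour=10/cluster=prod/node=gpu-node-01/20250910-103045-uuid.json -
```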
The gpuid container images are built with SLSA (Supply-chain Levels for Software Artifacts) provenance attestations.
Navigate to https://github.com/mchmarny/gpuid/attestations and pick the version you want to verify. The subject digest at the bottom should match the digest of the image you are deploying.
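One way to resolve the digest of the image you plan to deploy is with crane (any tool that can print an image digest works; crane is only an example):

```shell
# Print the digest for a tag and compare it to the subject digest on the attestations page
crane digest ghcr.io/mchmarny/gpuid:latest
```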
Update the image reference below with the digest of the version you end up using.
```shell
export IMAGE=ghcr.io/mchmarny/gpuid:latest
```

To verify the attestation on this image using the GitHub CLI:
```shell
gh attestation verify "oci://$IMAGE" \
  --repo mchmarny/gpuid \
  --predicate-type https://slsa.dev/provenance/v1 \
  --limit 1
```

Or, to verify with cosign:

```shell
cosign verify-attestation \
  --type https://slsa.dev/provenance/v1 \
  --certificate-github-workflow-repository 'mchmarny/gpuid' \
  --certificate-identity-regexp 'https://github.com/mchmarny/gpuid/*' \
  --certificate-oidc-issuer 'https://token.actions.githubusercontent.com' \
  $IMAGE
```

To ensure only verified images are deployed in your cluster:
- Install Sigstore Policy Controller (if not already installed):
  ```shell
  kubectl create namespace cosign-system
  helm repo add sigstore https://sigstore.github.io/helm-charts
  helm repo update
  helm install policy-controller -n cosign-system sigstore/policy-controller
  ```

- Enable Sigstore policy validation:
  ```shell
  kubectl label namespace gpuid policy.sigstore.dev/include=true
  ```

- Apply the image policy:
  ```shell
  kubectl apply -f deployments/policy/slsa-attestation.yaml
  ```

- Test the admission policy:
  ```shell
  kubectl -n gpuid run test --image=$IMAGE
  ```

  If the policy is in effect, this pod should be admitted, while images without a verifiable attestation will be rejected.

This is my personal project and it does not represent my employer. While I do my best to ensure that everything works, I take no responsibility for issues caused by this code.