Conversation

@everzakov

Add the vTPM (virtual Trusted Platform Module) specification to the documentation, config.go, and schema description. The runtime uses this specification to create vTPMs and pass them to the container. This virtual module can be used to create quotes and signatures and to perform Direct Anonymous Attestation.

Also, users can specify that the vTPM should be manufactured with pre-created certificates, activated PCR banks, and a populated Endorsement Key pair.

The following is an example of a vTPM description that is found under the path /linux/resources/vtpms:

    "vtpms": [
        {
            "statePath": "/var/lib/runc/myvtpm1",
            "statePathIsManaged": false,
            "vtpmVersion": "2",
            "createCerts": false,
            "runAs": "tss",
            "pcrBanks": "sha1,sha512",
            "encryptionPassword": "mysecret",
            "vtpmName": "tpm0",
            "vtpmMajor": 100,
            "vtpmMinor": 1
        }
    ]
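
For reference, a Go struct along these lines could back that JSON in the spec's config.go. This is only a sketch derived from the field names in the example above, not the exact definition proposed in this PR:

    // LinuxVTPM describes a virtual TPM instance (illustrative sketch; the
    // type and field names are assumptions based on the JSON example above).
    type LinuxVTPM struct {
        // StatePath is the host directory holding the vTPM state.
        StatePath string `json:"statePath"`
        // StatePathIsManaged indicates whether the runtime creates and
        // deletes the state directory itself.
        StatePathIsManaged bool `json:"statePathIsManaged,omitempty"`
        // VTPMVersion selects the TPM specification version (e.g. "2").
        VTPMVersion string `json:"vtpmVersion,omitempty"`
        // CreateCerts requests manufacturing the vTPM with pre-created
        // certificates.
        CreateCerts bool `json:"createCerts,omitempty"`
        // RunAs is the user the vTPM emulator runs as (e.g. "tss").
        RunAs string `json:"runAs,omitempty"`
        // PCRBanks is a comma-separated list of PCR banks to activate.
        PCRBanks string `json:"pcrBanks,omitempty"`
        // EncryptionPassword protects the vTPM state.
        EncryptionPassword string `json:"encryptionPassword,omitempty"`
        // VTPMName is the device name exposed to the container (e.g. "tpm0").
        VTPMName string `json:"vtpmName,omitempty"`
        // VTPMMajor and VTPMMinor are the device numbers of the created node.
        VTPMMajor int64 `json:"vtpmMajor,omitempty"`
        VTPMMinor int64 `json:"vtpmMinor,omitempty"`
    }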

This PR is based on #920

Add the vTPM specification to the documentation, config.go, and
schema description. The following is an example of a vTPM description
that is found under the path /linux/resources/vtpms:

    "vtpms": [
        {
            "statePath": "/var/lib/runc/myvtpm1",
            "vtpmVersion": "2",
            "createCerts": false,
            "runAs": "tss",
            "pcrBanks": "sha1,sha512"
        }
    ]

Signed-off-by: Stefan Berger <[email protected]>
@tianon

tianon commented Aug 16, 2025

If I understand correctly, the idea is that a runtime is expected to start an instance of swtpm behind the scenes, and wire up the result inside the container. Is that accurate?

This is perhaps mirroring some of the concerns expressed in #920, but what's the benefit of doing that over running swtpm explicitly and mapping the device or socket from it?

To maybe help explain why this makes me nervous, what do we do if the container dies? The runtime is typically long gone at that point, so what makes sure swtpm shuts down? What if swtpm has a problem and shuts down before the container does? At the level of the runtime, there's no "orchestrator" monitoring processes, there's just a container process and a bunch of kernel resources tied to that process (most of which clean themselves up pretty reasonably when the container exits or dies).

Another aspect is how non-container runtimes (VMs, etc) are expected to implement this. If they can't support this, they should probably simply error, right? The same if swtpm is not installed?

So in short, why is the runtime layer the appropriate place for this and not, say, the orchestrators like containerd, Docker, kubernetes, etc?

@everzakov

runc support for vTPM is part of the following solution: a container remote attestation solution. In this solution, the vTPM is the device that stores the attestation result.
In this PR, the atomic capability of managing vTPMs directly is implemented in runc.

[Figure 1: vTPM architecture]

Container remote attestation process:

  1. An admin deploys a container via the k8s API.
  2. runc creates a vTPM for verification.
  3. (planned) runc creates a vCRTM for measurements.
    Note: since there is no de-facto open-source vCRTM project yet, this functionality is deferred until one exists.
  4. runc starts the new business container.
  5. The vCRTM measures the business container's files and content.
  6. The signature verification result is stored in the vTPM device.
  7. A report is sent to the remote attestation service.

vTPM: virtualized Trusted Platform Module
vCRTM: virtualized Core Root of Trust for Measurement
RAC: remote attestation container, which manages the attestation process
RAS: remote attestation service, which stores the key for verification

In this solution, the responsibility of each component is:
k8s API server: handles the configuration and lifecycle of the vTPM for a pod; decides whether a vTPM is created/deleted/cleaned/recreated (similar behavior to other devices).
kubelet: monitors the status of the container and the vTPM device and reports to the k8s API server; node-level lifecycle management of the vTPM.
runc: atomic capability to create/delete/clean a vTPM based on the request from containerd.

@everzakov

So in short, why is the runtime layer the appropriate place for this and not, say, the orchestrators like containerd, Docker, kubernetes, etc?

This is a good question. If I understand correctly, we have several container extension points:

  1. Kubelet plugins - Dynamic Resource Allocation / Device Manager
  2. Containerd node resource interface (NRI) plugins
  3. Runtime Hooks

We cannot use Runtime Hooks (e.g. createContainer) because the runtime/runc reads the container config only once, so we would not be able to extend the Linux devices.

We cannot use the Kubelet Device Manager plugins because there is a possible use case of sharing the same vTPM between several containers in a pod.
From KEP 4381 https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/4381-dra-structured-parameters/README.md :
the device plugin API could be extended to share devices between containers in a single pod,
but supporting sharing between pods would need an API similar to Dynamic Resource Allocation.

We cannot use only the Kubelet Dynamic Resource Allocation plugins because NodePrepareResources https://github.com/kubernetes/kubernetes/blob/v1.33.4/pkg/kubelet/cm/dra/manager.go#L179
and UnprepareResources https://github.com/kubernetes/kubernetes/blob/v1.33.4/pkg/kubelet/cm/dra/manager.go#L405 are called only once for each pod.
As for the GetResources https://github.com/kubernetes/kubernetes/blob/v1.33.4/pkg/kubelet/cm/dra/manager.go#L362 function, it reads the CDI devices for each container from its cache.
So if we want the vTPM internal state to be recreated every time the container is (re)created (for example, on a retry), we have to combine DRA with another extension point.

As for the Node Resource Interface plugins, they could be a good candidate for implementing the vTPM feature (because they can apply container config adjustments to pass the device / device cgroup).
However, they have some weak points:

  1. NRI plugins are started only once, when containerd starts, so they cannot use the k8s API to handle the vTPM configuration.
    So we would need to combine them with another extension point.
  2. containerd restarts. When containerd is restarted, it recovers the existing containers https://github.com/containerd/containerd/blob/v2.1.4/internal/cri/server/restart.go#L55 :
    the CRI sandboxes and containers are repopulated, and NRI plugins have a mechanism for synchronizing containers/sandboxes https://github.com/containerd/containerd/blob/v2.1.4/internal/nri/nri.go#L455 .
    The forked swtpm processes survive a containerd restart; their PIDs can be saved in container annotations and retrieved from the container spec.
    However, if an swtpm process is killed and another process is started in its place, should the NRI plugin delete the second process or return an error?
    If the NRI plugin returns an error, the CRI plugin will be locked in an error state. If the NRI plugin deletes the second process and recreates the first,
    then the higher-level component that started the second swtpm process will be in an error state (because it will try to recreate the second process and fail).
    Also, the Container interface https://github.com/containerd/containerd/blob/v2.1.4/client/container.go#L52 has no method to update the spec (with the new PID of the swtpm process).

If swtpm is run by the runtime, we can add its PID to the container state file, so this problem does not exist.
If the container is deleted, we only need to kill / clean up the corresponding swtpm processes.

However, the main weak point of using a container extension point other than the runtime is how the runtime works with devices.
We have several use cases, and the most common is to pass different vTPM devices with the same dev path (e.g. /dev/tpm0) to different containers.
This can be done by creating several devices with generated host names and mknod'ing them into the container using their major/minor numbers. However,
if the runtime runs in a non-default user namespace or has to create a new user namespace, it uses a bind mount instead of mknod https://github.com/opencontainers/runc/blob/v1.3.0/libcontainer/rootfs_linux.go#L916 .
In that case we have to pass the generated host name in the container config.

As for the lack of monitoring tools in the runtime: containerd has a function that monitors task exit https://github.com/containerd/containerd/blob/v2.1.4/internal/cri/server/events.go#L147 .
After a task exits, the task is deleted, so the swtpm processes will be stopped.

@everzakov

Another aspect is how non-container runtimes (VMs, etc) are expected to implement this.

We assign runc to create the vTPM because we want to align with the same architectural design as the VM platform.
In the VM scenario it is the vBIOS that does this job, while in containers, judging by the component call sequence, runc plays a similar role and sits in a similar position.

[Figure 2: vTPM in a VM vs. in a container]

@everzakov

If they can't support this, they should probably simply error, right? The same if swtpm is not installed?

If the runtime does not have the vTPM feature, or swtpm is not installed, then an error should be returned.
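
A trivially small sketch of that behaviour (the function name is hypothetical and not part of this PR; it only illustrates "fail early if swtpm is missing"):

    package main

    import (
        "fmt"
        "os/exec"
    )

    // checkSwtpm returns an error when the swtpm binary cannot be found, so a
    // runtime without vTPM support (or a host without swtpm) can fail the
    // container creation early instead of silently ignoring the vTPM request.
    func checkSwtpm() error {
        if _, err := exec.LookPath("swtpm"); err != nil {
            return fmt.Errorf("vTPM requested but swtpm is not available: %w", err)
        }
        return nil
    }

    func main() {
        if err := checkSwtpm(); err != nil {
            fmt.Println(err)
        }
    }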

@everzakov

Sorry for the late reply, I was on PTO :(

@tianon

tianon commented Aug 27, 2025

We assign runc to create the vTPM because we want to align with the same architectural design as the VM platform.
In the VM scenario it is the vBIOS that does this job.

This isn't quite true though, right? In QEMU at least (I'm not sure about other VM platforms), TPM support requires the operator to pre-launch an instance of swtpm, and manage it themselves outside of QEMU: https://qemu-project.gitlab.io/qemu/specs/tpm.html#the-qemu-tpm-emulator-device, https://wiki.archlinux.org/title/QEMU#Trusted_Platform_Module_emulation

If we take a similar approach in runc, then the swtpm devices or sockets are no different than anything else you might share with the container in the bundle spec.

My biggest concern is the lifecycle management of that swtpm process, because again, runc is not running anymore once the container is up, so from the perspective of runc and the container, nothing will be left behind to manage (or stop/cleanup) that swtpm process.

@everzakov

everzakov commented Aug 27, 2025

If we take a similar approach in runc, then the swtpm devices or sockets are no different than anything else you might share with the container in the bundle spec.

Yes. However, I have a concern: we want to create independent vTPM devices (created by several swtpm processes) and pass them to different containers under the same dev path inside the container (e.g. /dev/tpm0). To do this, we need to make sure that their host dev paths are different and pass their major and minor numbers together with the required container dev path (/dev/tpm0).
runc can use two mechanisms to create devices under the rootfs - mknod and bind mount. https://github.com/opencontainers/runc/blob/main/libcontainer/rootfs_linux.go#L916
If mknod is used, this approach works.
However, if a bind mount is used, an error is returned (there is no /dev/tpm0 on the host). To solve this, we can pass the device's host path instead of the required dev path (/dev/tpm0).

So my concern is the following: only at the runtime level can we be sure which mechanism runc will use, and this affects the value of the dev path in the container device config. Either way, we need to extend the current device config or add a new field to the runtime spec.
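
A simplified sketch of that decision (the helper name is hypothetical; this is not runc's actual code):

    // containerDevPath illustrates the concern above: the path that ends up in
    // the container's device config depends on whether the runtime can mknod
    // the node inside the rootfs or has to bind-mount an existing host node.
    func containerDevPath(desiredPath, generatedHostPath string, canMknod bool) string {
        if canMknod {
            // mknod creates a fresh node at the desired path from major/minor,
            // so every container can see its own vTPM as /dev/tpm0.
            return desiredPath
        }
        // A bind mount needs an existing source node on the host, so the config
        // has to carry the generated host path (e.g. /dev/tpm-generated-0).
        return generatedHostPath
    }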

@everzakov

My biggest concern is the lifecycle management of that swtpm process, because again, runc is not running anymore once the container is up, so from the perspective of runc and the container, nothing will be left behind to manage (or stop/cleanup) that swtpm process.

I understand your concern that runc is only invoked when the container is brought up. As far as I know, runc delete is also called when we need to clean up the container, and in that command we can stop/clean up all the swtpm processes created for the container. If your further concern is that runc delete might return an error code and the swtpm processes would then not be stopped/cleaned up, I think we can add additional checks to stop/clean up swtpm in UnprepareResources (when the terminating pod is synced) https://github.com/kubernetes/kubernetes/blob/v1.33.4/pkg/kubelet/cm/dra/manager.go#L405 .

@everzakov

If I understand correctly, the idea is that a runtime is expected to start an instance of swtpm behind the scenes, and wire up the result inside the container. Is that accurate?

This is perhaps mirroring some of the concerns expressed in #920, but what's the benefit of doing that over running swtpm explicitly and mapping the device or socket from it?

To maybe help explain why this makes me nervous, what do we do if the container dies? The runtime is typically long gone at that point, so what makes sure swtpm shuts down? What if swtpm has a problem and shuts down before the container does? At the level of the runtime, there's no "orchestrator" monitoring processes, there's just a container process and a bunch of kernel resources tied to that process (most of which clean themselves up pretty reasonably when the container exits or dies).

Another aspect is how non-container runtimes (VMs, etc) are expected to implement this. If they can't support this, they should probably simply error, right? The same if swtpm is not installed?

So in short, why is the runtime layer the appropriate place for this and not, say, the orchestrators like containerd, Docker, kubernetes, etc?

Hello @tianon! We have reconsidered our approach to passing the vTPM to the container:

  1. There will be a DRA (K8s Dynamic Resource Allocation https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/ ) plugin which will create/delete/monitor the swtpm processes.
    In NodePrepareResources it will start an swtpm_cuse process, create a CDI (Container Device Interface https://github.com/cncf-tags/container-device-interface ) file, and return the CDI ID in the response (a sketch of such a file follows after this list).
  2. I repeat my concern that only at the runtime level can we be sure which mechanism (mknod/mount) runc will use, which is why we need to pass both the "container" path and the host path in the container config.
  3. In containerd the CDI devices will be parsed https://github.com/containerd/containerd/blob/main/internal/cri/server/container_create_linux.go#L104 and the vTPMs will be applied to the container runtime config.
  4. In runc we only need to decide which "device" path should be passed to the devices config.
  5. Since v1.34 it is possible to create a device health monitoring stream https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/#device-health-monitoring .
    If the swtpm process is killed and cannot be restarted, the device will be marked as unhealthy in the pod status.
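
A sketch of how the DRA plugin might render such a CDI file for one vTPM device. The struct shapes below only loosely mirror the CDI JSON format, and the vendor/class and device names are made up; treat this as an illustration, not the real CDI Go API:

    package main

    import (
        "encoding/json"
        "fmt"
    )

    type deviceNode struct {
        Path     string `json:"path"`               // path the container should see
        HostPath string `json:"hostPath,omitempty"` // generated node on the host
        Type     string `json:"type"`
        Major    int64  `json:"major"`
        Minor    int64  `json:"minor"`
    }

    type containerEdits struct {
        DeviceNodes []deviceNode `json:"deviceNodes"`
    }

    type cdiDevice struct {
        Name           string         `json:"name"`
        ContainerEdits containerEdits `json:"containerEdits"`
    }

    type cdiSpec struct {
        Version string      `json:"cdiVersion"`
        Kind    string      `json:"kind"`
        Devices []cdiDevice `json:"devices"`
    }

    func main() {
        spec := cdiSpec{
            Version: "0.6.0",
            Kind:    "example.com/vtpm", // hypothetical vendor/class
            Devices: []cdiDevice{{
                Name: "vtpm0",
                ContainerEdits: containerEdits{
                    DeviceNodes: []deviceNode{{
                        Path:     "/dev/tpm0",            // path inside the container
                        HostPath: "/dev/tpm-generated-0", // node created for this swtpm instance
                        Type:     "c",
                        Major:    100,
                        Minor:    1,
                    }},
                },
            }},
        }
        out, _ := json.MarshalIndent(spec, "", "  ")
        fmt.Println(string(out))
    }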

Possible problems:

  1. We create/mknod the devices in the vTPM plugin. In order to do this, their major/minor numbers must be in the vTPM plugin container's cgroup device allowlist.
    This can be done, for example, with an NRI plugin: https://github.com/containerd/nri/blob/main/pkg/runtime-tools/generate/generate.go#L288 .
  2. In the vTPM plugin we cannot be sure when a CRI container is (re)started (this might be possible with the GetContainerEvents call https://github.com/containerd/containerd/blob/main/internal/cri/server/container_events.go ,
    but I think there can be a time lag), so the swtpm state cannot be fully recreated (new Endorsement Key pair, etc.) every time the container is recreated.

In the runtime-spec, the changes would then be only:

"vtpms": [
  {
    "containerPath": "/dev/tpm0",
    "hostPath": "/dev/tpm-generated-0",
    "vtpmMajor": 100,
    "vtpmMinor": 1
  }
],
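
Correspondingly, the Go representation in the spec could shrink to something like the following sketch (type and field names are illustrative, derived from the JSON above):

    // LinuxVTPM in the reduced proposal: the runtime no longer manages swtpm
    // state, it only wires an existing, pre-created device node into the
    // container. Illustrative sketch only.
    type LinuxVTPM struct {
        // ContainerPath is the device path the container should see.
        ContainerPath string `json:"containerPath"`
        // HostPath is the generated device node created by the DRA plugin.
        HostPath string `json:"hostPath"`
        // VTPMMajor and VTPMMinor identify the device when mknod can be used.
        VTPMMajor int64 `json:"vtpmMajor"`
        VTPMMinor int64 `json:"vtpmMinor"`
    }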

Now I'm working on a vTPM plugin PoC to check this approach.

@rata

rata commented Sep 11, 2025

@everzakov Handling the swtpm process creation/lifecycle outside of the runtime as @tianon was saying is great, I think it is a blocker otherwise.

Some questions:

  1. With that example runtime-spec config you shared, what would runc need to do from a high-level point of view? Is it a bind mount of the hostPath in the containerPath? In that case, why do we need the major/minor?
  2. My understanding is that we can create as many vTPMs on a host as we want, is this right? I'm not a DRA expert, but do all DRA device drivers need a capacity or can they say "infinite"? Checking the doc you linked, it doesn't seem like it's possible today to have "no capacity" in a DRA device driver. How would that work in this case?

I think if there isn't any host-device we need to "consume" for swtpm, then not sure why we are using DRA. Or can DRA model things with "infinite" capacity too?
