Commit 3f97aff

VEP 48: Added some changes in HCO API and behaviour (#75)
The following changes are added to VEP 48:

1. Rename the HCO feature gate from `enableMultiArchCommonBootImageImport` to `enableMultiArchBootImageImport`. The `"Common"` part of this feature gate name is wrong and misleading.
2. Added two cases where HCO triggers alerts.
3. Added the `nodeInfo` field to the `HyperConverged` status.
4. Added the `originalSupportedArchitectures` field to the status of each one of the DataimportCronTemplate objects in the HyperConverged status.
5. Added an optional `conditions` field to the status of each one of the DataimportCronTemplate objects in the HyperConverged status.
6. Remove the content of the `Alternatives` section, as all the suggestions were either accepted or rejected.
7. Emphasize error handling.

Signed-off-by: Nahshon Unna-Tsameret <[email protected]>
1 parent f4c3247 commit 3f97aff

1 file changed

+119
-99
lines changed

veps/sig-storage/dic-on-heterogeneous-cluster/dic-on-heterogeneous-cluster.md

Lines changed: 119 additions & 99 deletions
@@ -130,17 +130,33 @@ the control-plane nodes and one for the workload nodes.
130130
These lists will be dynamically updated as nodes are added or removed from the cluster.
131131

132132
By default, the HCO will identify workload nodes as those labeled with `node-role.kubernetes.io/worker`. If the
133-
`spec.workloads.nodePlacement` field in the `HyperConverged` CR is defined, HCO will use this field to determine the
133+
`spec.workloads.nodePlacement` field in the `HyperConverged` CR is populated, HCO will use this field to determine the
134134
workload nodes instead.
135135

136-
HCO will identify control-plane nodes as those labeled with `node-role.kubernetes.io/control-plane`.
136+
HCO will identify control-plane nodes as those labeled with `node-role.kubernetes.io/control-plane` or with
137+
`node-role.kubernetes.io/master`.
137138

138139
>**Note**: The control-plane architecture list is expected to always be with only one architecture, as the control-plane
139140
> nodes are expected to be homogeneous. HCO will not force this, though.
140141
142+
HCO will publish the lists of workload and control-plane architectures in the `HyperConverged` CR's `status.nodeInfo`
143+
field.
144+
145+
#### Error Handling
146+
If the `enableMultiArchBootImageImport` feature gate is not enabled, but the cluster contains workload nodes with
147+
multiple architectures, HCO will trigger a warning level alert, to notify the user that the scheduling of virtual
148+
machines may fail, or they may not run as expected, due to the random scheduling on a node with architecture different
149+
from the image architecture.
150+
151+
The alert will point to a runbook that will explain the issue and will suggest two alternatives:
152+
1. Enable the `enableMultiArchBootImageImport` feature gate in the `HyperConverged` CR, while explaining the risk of
153+
using a feature that is not generally available.
154+
2. Set the `spec.workloads.nodePlacement` field in the `HyperConverged` CR, to limit the workload nodes to a single
155+
architecture, which will ensure that all virtual machines run on a supported architecture.
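
The alert condition above can be sketched as a small predicate. This is a minimal illustration, not HCO's implementation: the function name and its parameters are hypothetical, and it assumes the workload architecture list has already been collected.

```python
def should_fire_multi_arch_alert(feature_gate_enabled, workload_archs):
    """Return True when the warning-level alert described above applies.

    Hypothetical helper: the alert fires only when the
    enableMultiArchBootImageImport feature gate is disabled AND the
    workload nodes span more than one distinct architecture.
    """
    distinct_archs = set(workload_archs)
    return (not feature_gate_enabled) and len(distinct_archs) > 1
```

Duplicated entries are collapsed first, so a cluster whose workload nodes all report the same architecture does not trigger the alert.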
156+
141157
### Reconciling the `SSP` Custom Resource
142158
The following changes in HCO behavior will be applied if:
143-
1. The new `enableMultiArchCommonBootImageImport` feature gate is enabled in the `HyperConverged` CR. This feature gate
159+
1. The new `enableMultiArchBootImageImport` feature gate is enabled in the `HyperConverged` CR. This feature gate
144160
is disabled by default.
145161
2. The cluster is not a single node cluster.
146162

@@ -159,27 +175,35 @@ When reconciling the `SSP` CR, HCO will:
159175
`ssp.kubevirt.io/dict.architectures` annotation for each `DataImportCronTemplate`. When adding the
160176
`DataImportCronTemplate` object to the `SSP` CR, HCO will clean the architectures that are not present in cluster
161177
from the annotation.
162-
163-
If there are no architectures that are supported by the image and present in the cluster, HCO will not add this
164-
`DataImportCronTemplate` object to the `SSP` CR.
165178

166179
5. HCO will not add the `ssp.kubevirt.io/dict.architectures` annotation to the custom `DataImportCronTemplate` objects.
167180
The user is responsible for adding the annotation to their custom `DataImportCronTemplate` objects.
168181

169-
However, if the annotation exists in a custom `DataImportCronTemplate` objects, HCO will clean the architectures that
170-
are not present in the cluster from the annotation.
182+
However, HCO will respect the `ssp.kubevirt.io/dict.architectures` annotation, if it exists in a custom
183+
`DataImportCronTemplate` object, and will apply the same logic as it does for the pre-defined
184+
DataImportCronTemplates.
171185

172-
For example:
186+
6. HCO will add the `ssp.kubevirt.io/dict.architectures` annotation to the `DataImportCronTemplate` objects in
187+
the `HyperConverged` CR status, with the same value as used in the `SSP` CR.
173188

174-
If the pre-prepared `DataImportCronTemplate` object is annotated with `amd64,arm64`, and the cluster worker nodes are
175-
both `amd64` and `s390x`, HCO will set the corresponding object in the `SSP` CR like this:
189+
HCO will populate the new `status.originalSupportedArchitectures` field in each DataImportCronTemplate object in the
190+
`dataImportCronTemplates` field of the `HyperConverged` CR's status, with the original value of the annotation.
191+
> The original value for pre-defined `DataImportCronTemplate` objects is the value that was set in the HCO image,
192+
and the original value for custom `DataImportCronTemplate` objects is the value that was set by the user in the
193+
`HyperConverged` CR's `spec.dataImportCronTemplates` field.
194+
195+
#### Example:
196+
If the pre-prepared `DataImportCronTemplate` object is annotated with `amd64,arm64`, and the cluster worker node
197+
architectures are `amd64` and `s390x`, HCO will set the corresponding object in the `SSP` CR like this:
176198

177199
```yaml
178200
apiVersion: ssp.kubevirt.io/v1beta3
179201
kind: SSP
180202
spec:
181203
enableMultipleArchitectures: true
182204
cluster:
205+
controlPlaneArchitectures:
206+
- amd64
183207
workloadArchitectures:
184208
- amd64
185209
- s390x
@@ -192,6 +216,17 @@ spec:
192216
...
193217
```
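
The annotation filtering in the example above amounts to intersecting the image's architecture list with the cluster's architectures. The following is a minimal sketch under that reading; `filter_architectures` is a hypothetical name, not an HCO function, and it assumes the annotation is a plain comma-separated string.

```python
def filter_architectures(annotation, cluster_archs):
    """Keep only the annotated architectures that are present in the cluster.

    Hypothetical helper. The order of the original annotation is preserved,
    and an empty string is returned when nothing overlaps.
    """
    cluster = set(cluster_archs)
    kept = [arch.strip() for arch in annotation.split(",") if arch.strip() in cluster]
    return ",".join(kept)
```

With the values from the example, `filter_architectures("amd64,arm64", ["amd64", "s390x"])` yields `"amd64"`: `arm64` is dropped because no cluster node has it, and `s390x` is not added because the image does not support it.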
194218

219+
#### Error Handling
220+
If there are no architectures that are supported by the image and present in the cluster, HCO will not add this
221+
`DataImportCronTemplate` object to the `SSP` CR.
222+
223+
For debugging and troubleshooting, HCO will add this DataImportCronTemplate object to the `status.dataImportCronTemplates`
224+
list in the `HyperConverged` CR, even though it wasn't added to the `SSP` CR. HCO will set the
225+
`ssp.kubevirt.io/dict.architectures` annotation to an empty string, and will add the `Deployed` condition with status
226+
of `False` to the status of this DataImportCronTemplate object.
227+
228+
In addition, HCO will trigger an info level alert, to notify the user that the image is not supported in the cluster.
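
The status entry described in this section can be sketched as follows. This is an illustration only: `build_template_status` and its parameters are hypothetical, the entry is modeled as a plain dictionary rather than the real API types, and the condition's reason and message are taken from the status example in this VEP.

```python
from datetime import datetime, timezone

ARCH_ANNOTATION = "ssp.kubevirt.io/dict.architectures"


def build_template_status(name, original_archs, cluster_archs):
    """Build a status.dataImportCronTemplates entry (hypothetical helper).

    original_archs is the comma-separated annotation value as originally set;
    cluster_archs is the list of architectures present in the cluster.
    """
    cluster = set(cluster_archs)
    supported = [a.strip() for a in original_archs.split(",") if a.strip() in cluster]
    entry = {
        "metadata": {
            "name": name,
            # Empty string when no architecture is both supported and present.
            "annotations": {ARCH_ANNOTATION: ",".join(supported)},
        },
        "status": {"originalSupportedArchitectures": original_archs},
    }
    if not supported:
        # The template is not added to the SSP CR; record why in the status.
        entry["status"]["conditions"] = [{
            "type": "Deployed",
            "status": "False",
            "reason": "UnsupportedArchitectures",
            "message": "DataImportCronTemplate has no supported architectures "
                       "for the current cluster",
            "lastTransitionTime": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        }]
    return entry
```

The `conditions` list is attached only in the failure case, matching the statement that the field is populated only when there is an issue with the specific DataImportCronTemplate.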
229+
195230
### Reconciling the `DataImportCron` Custom Resources
196231
Upon reconciling the `SSP` CR, for each `DataImportCronTemplate` object, SSP will create or update a `DataImportCron` CR
197232
for each architecture listed in the `template.kubevirt.io/architecture` annotation:
@@ -329,7 +364,7 @@ If there is more than one supported architecture, the default one will be the on
329364
The architecture agnostic `DataSource` will point to the default, architecture specific `DataSource` CR, by populating
330365
the `spec.source.dataSource` new field, and will not set any other field under `spec.source`.
331366

332-
On upgrade from a previous version of HCO, or when the `enableMultiArchCommonBootImageImport` feature gate is enabled in
367+
On upgrade from a previous version of HCO, or when the `enableMultiArchBootImageImport` feature gate is enabled in
333368
the HyperConverged CR, there will be already an existing `DataSource` CR for the specific image. SSP will modify the
334369
existing `DataSource` CR by replacing the existing `spec.source.pvc` or `spec.source.snapshot` fields, with the new
335370
`spec.source.dataSource` field.
@@ -479,105 +514,90 @@ By adopting the new CDI and SSP APIs, the `DataImportCronTemplate` type in Hyper
479514
as the type of the `spec` field is CDI's `DataImportCronSpec`, that one of its nested fields is the
480515
`DataVolumeSourceRegistry` type.
481516

482-
The HyperConverged API will introduce the new `enableMultiArchCommonBootImageImport` feature gate, with default value of
517+
#### New Feature Gate
518+
The HyperConverged API will introduce the new `enableMultiArchBootImageImport` feature gate, with default value of
483519
`false`.
484520

485-
## Alternatives
486-
### Multiplying the `DataImportCronTemplate` into Multiple `DataSource` CRs
487-
It is clear that any design will end up with multiple `DataSource` CRs, one for each supported architecture, pointing to
488-
the latest `VolumeSnapshot`/`PVC` for that architecture.
489-
490-
The question is how high in the hierarchy the multiplication should be occurred. The above design suggests that it will
491-
be SSP to create multiple `DataImportCron` CRs for each multi-arch `DataImportCronTamplate`.
492-
493-
Below are too additional suggestions for the same question.
494-
495-
#### *[Rejected]* Create Multiple `DataImportCronTemplate` Objects, One for Each Architecture
496-
HCO could create a new `DataImportCronTemplate` object for each architecture in the `SSP` CR, and the user will be
497-
responsible for doing the same for custom `DataImportCronTemplate` objects.
498-
499-
pros:
500-
501-
- no changes are required from SSP, except for adopting the new CDI API, and changes regarding the VM templates.
502-
503-
cons:
504-
505-
- the user will be responsible for creating and maintaining the `DataImportCronTemplate` objects, and
506-
the user will need to create a new `DataImportCronTemplate` object for each architecture.
507-
- the `SSP` CR will be huge, and will contain a lot of `DataImportCronTemplate` objects. It will be hard to maintain it
508-
or understand it.
509-
- it looses the meaning of the templating mechanism, by the `DataImportCronTemplate` type.
510-
511-
#### *[Rejected]* Create Multiple `DataSource` CRs from a Single `DataImportCron` Object
512-
No changes in SSP, regarding the `DataImportCronTemplate` objects. SSP will create a single `DataImportCron` CR for each
513-
`DataImportCronTemplate` object in the `SSP` CR, as done today. The `ssp.kubevirt.io/dict.architectures` (with a new
514-
name, e.g. `cdi.kubevirt.io/dict.architectures`) annotation will be copied to the `DataImportCron` CR, as any other
515-
annotation.
521+
This feature gate will enable the heterogeneous cluster support for golden images, as described in this VEP.
516522

517-
CDI will create multiple `DataSource` CRs, one for each architecture, according to the
518-
`cdi.kubevirt.io/dict.architectures` annotation, and will import the images into architecture specific `VolumeSnapshot`
519-
resources.
523+
#### HyperConverged Status
524+
The HyperConverged `status` field will be extended with a new `nodeInfo` field, with two sub-fields:
525+
* `controlPlaneArchitectures` - a list of the architectures of the control-plane nodes
526+
* `workloadsArchitectures` - a list of the architectures of the workload nodes
520527

521-
When performing the actual image import, if the `cdi.kubevirt.io/dict.architectures` annotation is set in
522-
the `DataImportCron` CR, the CDI will create architecture specific `DataVolume` CR, for each architecture
523-
listed in the annotation. CDI will set the `DataVolome` CR's `spec.source.registry.platform.architecture` field, to the
524-
required architecture.
525-
526-
CDI will add the `template.kubevirt.io/architecture` label to the `DataSource` CR, with the architecture of the image,
527-
and will set the `cdi.kubevirt.io/storage.import.datasource-name` label to the value of the `spec.managedDataSource`
528-
field in the `DataImportCron` CR.
529-
530-
CDI will need to remove any orphan `DataSource` or `VolumeSnapshot` CRs, that are not referenced by any `DataImportCron`
531-
CR.
532-
533-
SSP will still need to create architecture specific templates, and will still need to set the `DataSource` name
534-
in each template, to the architecture specific `DataSource` CR name.
535-
536-
Open questions:
537-
1. SSP currently creates the `DataSource` CRs in some cases. We have two options here:
538-
1. SSP will have to have the same logic as in CDI, when creating the architecture specific `DataSource` CRs.
539-
2. SSP will no longer create the `DataSource` CRs, and will let CDI to do that.
528+
For example:
529+
```yaml
530+
apiVersion: hco.kubevirt.io/v1beta1
531+
kind: HyperConverged
532+
status:
533+
nodeInfo:
534+
controlPlaneArchitectures:
535+
- amd64
536+
workloadsArchitectures:
537+
- amd64
538+
```
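
How the `nodeInfo` lists could be derived from node labels is sketched below. This is a simplified illustration, not HCO's code: nodes are modeled as dictionaries with a `labels` map, the architecture is read from the well-known `kubernetes.io/arch` label (an assumption, since this VEP section does not say where the architecture comes from), and the `spec.workloads.nodePlacement` override is not modeled.

```python
WORKER_LABEL = "node-role.kubernetes.io/worker"
CONTROL_PLANE_LABELS = (
    "node-role.kubernetes.io/control-plane",
    "node-role.kubernetes.io/master",
)
ARCH_LABEL = "kubernetes.io/arch"  # assumption: source of the node architecture


def build_node_info(nodes):
    """Collect distinct architectures per node role (hypothetical helper)."""
    control_plane = set()
    workloads = set()
    for node in nodes:
        labels = node.get("labels", {})
        arch = labels.get(ARCH_LABEL)
        if not arch:
            continue  # skip nodes that do not report an architecture
        if WORKER_LABEL in labels:
            workloads.add(arch)
        if any(label in labels for label in CONTROL_PLANE_LABELS):
            control_plane.add(arch)
    # Sorted output keeps the published lists stable between reconciliations.
    return {
        "controlPlaneArchitectures": sorted(control_plane),
        "workloadsArchitectures": sorted(workloads),
    }
```

A node carrying both a worker label and a control-plane label contributes to both lists in this sketch.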
540539
541-
2. What is the best way for SSP to find out which architecture specific templates to create?
542-
1. SSP can use the new `spec.cluster.workloadArchitectures` field in the `SSP` CR, and create the templates
543-
accordingly.
544-
545-
The problem with this approach is that if a specific `DataImportCronTemplate` object supports only
546-
part of the architectures listed in the `spec.cluster.workloadArchitectures` field, there will be unused templates
547-
deployed. It can be mitigated by not deploying a template, if its `DataSource` was not found.
548-
549-
2. SSP can use the `cdi.kubevirt.io/dict.architectures` annotation in each `DataImportCronTemplate`, and create the
550-
templates accordingly.
551-
552-
The problem with this approach is that SSP will need to implement the logic to find out
553-
which `DataImportCronTemplate` objects is related to each template. SSP can use the `spec.managedDataSource` field
554-
in the `DataImportCronTemplate` object, to find the templates using the same `DataSource` name. It may be too
555-
complex to implement this solution.
556-
> **Important**: this alternative will require no API changes in SSP, as the `spec.cluster.workloadArchitectures`
557-
field is not needed.
540+
#### DataImportCronTemplates Status
541+
The existing HyperConverged `status.dataImportCronTemplates` field contains a list of DataImportCronTemplate objects.
542+
Each one of these objects contains its own `status` field.
558543

559-
pros:
544+
This field will be extended with a new `originalSupportedArchitectures` field, which will contain a comma separated list
545+
of the architectures supported by the image, as listed in the `ssp.kubevirt.io/dict.architectures` annotation of the HCO
546+
image. This field will be populated only if the feature is enabled.
560547

561-
- no changes are required from SSP, except for adopting the new CDI API, and changes regarding the VM templates.
562-
- The development of the changes in SSP and in HCO is not dependent on CDI.
563-
- If the 2nd option in the 2nd open question is chosen, the development of the changes in HCO is not dependent on SSP,
564-
and each component can be developed and released independently.
565-
- In deployments without SSP, this solution will work as well, if the user creates the `DataImportCron` CRs manually.
548+
For example:
549+
```yaml
550+
apiVersion: hco.kubevirt.io/v1beta1
551+
kind: HyperConverged
552+
status:
553+
dataImportCronTemplates:
554+
- metadata:
555+
annotations:
556+
...
557+
ssp.kubevirt.io/dict.architectures: amd64
558+
name: centos-stream10-image-cron
559+
spec:
560+
...
561+
status:
562+
commonTemplate: true
563+
originalSupportedArchitectures: amd64,arm64,s390x
564+
```
566565

567-
cons:
566+
In addition, each DataImportCronTemplate object in the `status.dataImportCronTemplates` field will include a condition
567+
list. This field will be populated only in cases of an issue with the specific DataImportCronTemplate.
568568

569-
- template handling in SSP seems to be more complex.
569+
For example: below is the DataImportCronTemplate object in the `status.dataImportCronTemplates` field, when the original
570+
`ssp.kubevirt.io/dict.architectures` annotation in the HyperConverged spec contains two unsupported architectures.
570571

571-
### *[Accepted]* The Source of the Architecture List for Each Predefined Image
572-
For the predefined images, the architecture list is known in advance, and can be hardcoded in the DataImportCronTemplate
573-
file in the HCO image. HCO then will check what are the workload node architectures in the cluster, and will add the
574-
`ssp.kubevirt.io/dict.architectures` annotation to the `DataImportCronTemplate` object in the `SSP` CR, only with the
575-
architectures that are supported by the image, and by the cluster.
572+
Notice the empty `ssp.kubevirt.io/dict.architectures` annotation, the `originalSupportedArchitectures` field, that
573+
reflects the original value of the annotation, and the `conditions` field, that contains a single `Deployed` condition
574+
with a status of `False`, and a reason of `UnsupportedArchitectures`.
576575

577-
However, these images are frequently updated, and we can never know if the architecture list that was set at HCO image
578-
build time, is still valid or not, at any point in the future.
576+
```yaml
577+
apiVersion: hco.kubevirt.io/v1beta1
578+
kind: HyperConverged
579+
status:
580+
dataImportCronTemplates:
581+
- metadata:
582+
annotations:
583+
cdi.kubevirt.io/storage.bind.immediate.requested: "true"
584+
ssp.kubevirt.io/dict.architectures: ""
585+
name: my-image
586+
spec:
587+
...
588+
status:
589+
conditions:
590+
- lastTransitionTime: "2025-07-09T11:00:30Z"
591+
message: DataImportCronTemplate has no supported architectures for the current
592+
cluster
593+
reason: UnsupportedArchitectures
594+
status: "False"
595+
type: Deployed
596+
originalSupportedArchitectures: someUnsupportedArch,otherUnsupportedArch
597+
```
579598

580-
It seems like a low risk, but it will require a new release of HCO to update the architecture list.
599+
## Alternatives
600+
n/a (All the alternatives were either already accepted, or rejected)
581601

582602
## Scalability
583603
