Member

@sbernauer sbernauer commented Aug 22, 2025

Part of stackabletech/issues#642

Note

Load-balancing over the secret-op DaemonSet Pods did not work on clusters with multiple nodes, see #634 (comment). This has been addressed in #645 and #646.

This PR marks samAccountName as non-experimental as per #627. It is the first use of the stackable-versioned macro as well as the conversion webhook machinery. See #634 (comment) for the release note.

Note

This PR changes files that are templated by operator-templating! We are aware of this and will have a diff until we roll this out to all operators and update operator-templating. secret-operator is our guinea pig.

Member

@Techassi Techassi left a comment

I think this is mostly done, but I left a few small comments/reminders.

@Techassi Techassi marked this pull request as ready for review September 9, 2025 08:41
@Techassi Techassi moved this from Development: In Progress to Development: Waiting for Review in Stackable Engineering Sep 9, 2025
@Techassi Techassi changed the title from "feat: Introduce CRD versioning" to "feat: Add v1alpha2 for SecretClass, rename experimentalGenerateSamAccountName" Sep 9, 2025
@sbernauer
Member Author

I cannot approve my own PR, but LGTM.

@sbernauer sbernauer moved this from Development: Waiting for Review to Development: In Review in Stackable Engineering Sep 16, 2025
@razvan
Member

razvan commented Sep 19, 2025

I'll put this here for lack of a better place now.

I tried installing this op version with OLM on OpenShift.

OLM of course has the same problem as Helm (see comment above). It tries to create a tls secret class but there is no CRD for it.

I manually added the exported CRDs to OLM, but then the operator refuses to start with this error:

secret-operator 2025-09-19T10:47:38.191165Z  INFO reconcile_crds: stackable_webhook::servers::conversion: Reconciling CRDs crds=["secretclasses.secrets.stackable.tech", "truststores.secrets.stackable.tech"]
secret-operator Error: failed to run conversion webhook
secret-operator
secret-operator Caused by:
secret-operator     0: failed to reconcile CRDs
secret-operator     1: failed to update CRD "secretclasses.secrets.stackable.tech"
secret-operator     2: ApiError: Apply failed with 2 conflicts: conflicts with "catalog" using apiextensions.k8s.io/v1:
secret-operator        - .spec.versions
secret-operator        - .spec.conversion.strategy: Conflict (ErrorResponse { status: "Failure", message: "Apply failed with 2 conflicts: conflicts with \"catalog\" using apiextensions.k8s.io/v1:\n- .spec.versions\n- .spec.conversion.strategy", reason: "Conflict", code: 409 })
secret-operator     3: Apply failed with 2 conflicts: conflicts with "catalog" using apiextensions.k8s.io/v1:
secret-operator        - .spec.versions
secret-operator        - .spec.conversion.strategy: Conflict
stream closed EOF for stackable-operators/secret-operator-daemonset-zvvdt (secret-operator)

No idea if this is expected or a bug.

My 2 cents: I don't know why you decided to take CRD management out of the hands of package managers because I don't remember discussing it and I find no explanation. But I think one implication of this is that we now have to implement Helm ourselves by having to ensure object creation order at least.
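
The 409 in the log above is characteristic of server-side apply: every field records which manager applied it, and a different manager that applies a different value is rejected unless the apply is forced. The following is a toy Python model of that ownership rule, not the Kubernetes implementation. The `FieldOwnership` class, the manager name `secret-operator`, and the field values are hypothetical; the manager name `catalog` and the two field paths come from the log.

```python
class ApplyConflict(Exception):
    """Raised when an apply would change fields owned by another manager."""

    def __init__(self, conflicts):
        self.conflicts = conflicts
        super().__init__(f"Apply failed with {len(conflicts)} conflicts")


class FieldOwnership:
    """Toy model: one value and a set of owning managers per field path."""

    def __init__(self):
        self.values = {}
        self.owners = {}

    def apply(self, manager, fields, force=False):
        # A field conflicts if its value would change while another
        # manager still owns it.
        conflicts = sorted(
            path
            for path, value in fields.items()
            if path in self.values
            and self.values[path] != value
            and self.owners[path] - {manager}
        )
        if conflicts and not force:
            raise ApplyConflict(conflicts)
        for path, value in fields.items():
            if path in self.values and self.values[path] != value:
                # Changed value: the applier becomes the sole owner.
                self.owners[path] = {manager}
            else:
                # New field or identical value: ownership is added or shared.
                self.owners.setdefault(path, set()).add(manager)
            self.values[path] = value


crd = FieldOwnership()
# Assumed values: the package manager installs a single-version CRD.
crd.apply("catalog", {".spec.versions": ["v1alpha1"], ".spec.conversion.strategy": "None"})
try:
    # Assumed values: the operator then applies its own view of the CRD.
    crd.apply(
        "secret-operator",
        {".spec.versions": ["v1alpha1", "v1alpha2"], ".spec.conversion.strategy": "Webhook"},
    )
except ApplyConflict as error:
    print(error, error.conflicts)
```

In this model the second apply fails on exactly the two paths named in the log, because `catalog` created them first. Whether the operator should force ownership or the package manager should stop shipping the CRD is the open question raised in the comment above.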

@razvan
Member

razvan commented Sep 19, 2025

Tests on OKD are 🟢 though no test was added or updated for this PR.

I removed tls from the OLM manifests and installed it manually after the operator installation.

Then I ran the test suite:

❯ ./scripts/run-tests --skip-operator secret --test-suite openshift
...
--- PASS: kuttl (217.55s)
    --- PASS: kuttl/harness (0.00s)
        --- PASS: kuttl/harness/listener_openshift-true (78.90s)
        --- PASS: kuttl/harness/tls_openshift-true_rsa-key-length-2048_custom-secret-names-True (90.95s)
        --- PASS: kuttl/harness/tls_openshift-true_rsa-key-length-3072_custom-secret-names-True (47.89s)
        --- PASS: kuttl/harness/tls_openshift-true_rsa-key-length-3072_custom-secret-names-False (29.27s)
        --- PASS: kuttl/harness/tls-truststore_openshift-true_truststore-target-kind-Secret (11.55s)
        --- PASS: kuttl/harness/kerberos_krb5-1.21.1_openshift-true (80.06s)
        --- PASS: kuttl/harness/tls-truststore_openshift-true_truststore-target-kind-ConfigMap (10.13s)
        --- PASS: kuttl/harness/tls_openshift-true_rsa-key-length-2048_custom-secret-names-False (26.14s)
        --- PASS: kuttl/harness/cert-manager-tls_openshift-true (35.56s)
PASS

Member Author

@sbernauer sbernauer left a comment

Other than that, LGTM. The label part obviously still needs to be finished; it has not been pushed or reviewed yet.

Techassi and others added 7 commits October 16, 2025 10:04
All CRDs are now maintained (created and patched) by the operator. They
are no longer deployed by Helm and as such are removed from the Helm
Chart templates. A YAML file is still checked in (extra/crds.yaml) to
ensure diffs are visible and tracked by Git.

Co-authored-by: Sebastian Bernauer <[email protected]>
The operator can now handle CRD conversions via a webhook and maintains
its own CRDs via the CRD maintainer. As such, it needs permissions to
create and patch CRDs.

Co-authored-by: Sebastian Bernauer <[email protected]>
@Techassi Techassi force-pushed the feat/crd-versioning branch from 788dd08 to 7d625cb Compare October 16, 2025 08:43
@Techassi
Member

@sbernauer just discovered we run into issues when the cluster has more than a single node, because the secret-operator is deployed via a DaemonSet. We at least need to ensure that only one instance of the conversion webhook exists because otherwise the Kubernetes API server is unable to verify the TLS certificate.

We will discuss how to move forward on Monday.

Techassi added a commit that referenced this pull request Oct 20, 2025
This splits the deployment of the secret-operator as a whole into
two parts:

- The controller is deployed via a Deployment which ensures only
  a single instance of the secret-operator in controller mode is
  running in a Kubernetes cluster. This can potentially lead to
  performance issues and as such should be monitored going forward.
- The CSI server is deployed via a DaemonSet (unchanged) as this
  server is needed on every node to provision requested secret
  volumes.

This refactor is introduced in preparation for #634, in which only
a single instance of the CRD conversion webhook must exist as
otherwise TLS certificate verification will fail with multiple
available certificates.
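
The failure mode described in this commit message can be sketched with a toy Python model. It assumes, as the message implies, that every webhook instance presents its own certificate while the CRD can only trust one of them. `WebhookInstance`, `failing_nodes`, and the node names are hypothetical, and the string comparison merely stands in for real TLS verification.

```python
class WebhookInstance:
    """One conversion webhook server, as run by one operator Pod."""

    def __init__(self, node):
        self.node = node
        # Assumption: each instance generates its own self-signed certificate.
        self.certificate = f"self-signed-by-{node}"


def failing_nodes(instances, ca_bundle):
    """Return the nodes whose webhook the API server could not verify."""
    # Stand-in for TLS verification: only the instance whose certificate
    # matches the single trusted CA bundle is accepted.
    return [i.node for i in instances if i.certificate != ca_bundle]


# DaemonSet: one instance per node, but the CRD trusts a single certificate.
daemonset = [WebhookInstance(node) for node in ("node-a", "node-b", "node-c")]
print(failing_nodes(daemonset, ca_bundle=daemonset[-1].certificate))

# Deployment with one replica: the trusted certificate always matches.
deployment = [WebhookInstance("node-a")]
print(failing_nodes(deployment, ca_bundle=deployment[0].certificate))
```

With three DaemonSet instances, any request routed to one of the two untrusted instances fails; with a single controller instance no mismatch is possible, which is the property the split is meant to guarantee.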
github-merge-queue bot pushed a commit that referenced this pull request Oct 21, 2025
* refactor: Split into Deployment and DaemonSet

This splits the deployment of the secret-operator as a whole into
two parts:

- The controller is deployed via a Deployment which ensures only
  a single instance of the secret-operator in controller mode is
  running in a Kubernetes cluster. This can potentially lead to
  performance issues and as such should be monitored going forward.
- The CSI server is deployed via a DaemonSet (unchanged) as this
  server is needed on every node to provision requested secret
  volumes.

This refactor is introduced in preparation for #634, in which only
a single instance of the CRD conversion webhook must exist as
otherwise TLS certificate verification will fail with multiple
available certificates.

* chore: Add changelog entry

* chore: Update comment

* refactor: Adjust values.yaml file to be closer to listener-operator

* chore: Adjust changelog entry
@Techassi
Member

I ran some tests on a local multi-node kind cluster (one control-plane node and two worker nodes). The integration tests succeeded as expected:

--- PASS: kuttl (156.15s)
    --- PASS: kuttl/harness (0.00s)
        --- PASS: kuttl/harness/tls_openshift-false_rsa-key-length-3072_custom-secret-names-True (48.16s)
        --- PASS: kuttl/harness/tls-truststore_openshift-false_truststore-target-kind-ConfigMap (7.66s)
        --- PASS: kuttl/harness/tls-truststore_openshift-false_truststore-target-kind-Secret (7.28s)
        --- PASS: kuttl/harness/tls_openshift-false_rsa-key-length-2048_custom-secret-names-True (15.59s)
        --- PASS: kuttl/harness/kerberos_krb5-1.21.1_openshift-false (86.53s)
        --- PASS: kuttl/harness/tls_openshift-false_rsa-key-length-3072_custom-secret-names-False (26.12s)
        --- PASS: kuttl/harness/tls_openshift-false_rsa-key-length-2048_custom-secret-names-False (20.24s)
        --- PASS: kuttl/harness/listener_openshift-false (26.31s)
        --- PASS: kuttl/harness/cert-manager-tls_openshift-false (49.37s)
PASS

I also applied and retrieved a v1alpha1 SecretClass and the conversion worked without any issues. This seems to indicate that the work done in #645 works as intended.
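
For readers unfamiliar with what such a conversion has to do, here is a minimal Python sketch of a v1alpha1/v1alpha2 round trip for the renamed field. It is not the operator's implementation, which uses the stackable-versioned macro. The new key name `generateSamAccountName` is an assumption (the thread only says the field is renamed and no longer experimental), and the nested layout of the example object is illustrative; the API group `secrets.stackable.tech` is taken from the CRD names in the log earlier in this thread.

```python
GROUP = "secrets.stackable.tech"
OLD_KEY = "experimentalGenerateSamAccountName"
NEW_KEY = "generateSamAccountName"  # assumption: not stated in this thread


def _rename_key(value, old, new):
    """Return a copy of a nested structure with every `old` key renamed."""
    if isinstance(value, dict):
        return {(new if k == old else k): _rename_key(v, old, new) for k, v in value.items()}
    if isinstance(value, list):
        return [_rename_key(item, old, new) for item in value]
    return value


def convert(secret_class, desired_version):
    """Convert a SecretClass-like dict to the desired API version."""
    if desired_version == "v1alpha2":
        old, new = OLD_KEY, NEW_KEY
    elif desired_version == "v1alpha1":
        old, new = NEW_KEY, OLD_KEY
    else:
        raise ValueError(f"unsupported version: {desired_version}")
    converted = _rename_key(secret_class, old, new)
    converted["apiVersion"] = f"{GROUP}/{desired_version}"
    return converted


# Illustrative object; the nesting below is an assumption for the example.
stored = {
    "apiVersion": f"{GROUP}/v1alpha1",
    "kind": "SecretClass",
    "metadata": {"name": "kerberos"},
    "spec": {"backend": {"kerberosKeytab": {"admin": {"activeDirectory": {
        OLD_KEY: {"enabled": True},
    }}}}},
}
upgraded = convert(stored, "v1alpha2")
print(upgraded["apiVersion"])
print(convert(upgraded, "v1alpha1") == stored)
```

The sketch renames the key wherever it occurs, which is simpler than a real conversion that would only touch the one schema path. The property it shares with a real conversion is that converting up and back down returns the original object.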

Member

@NickLarsenNZ NickLarsenNZ left a comment

LGTM, I watched the tests pass

@Techassi
Member

Techassi commented Oct 22, 2025

I also ran the Trino integration tests; all but two passed on the first try. The two failing ones passed as well when run selectively.

I wanted to paste the results here, but sadly I lost the results in my terminal buffer.

@Techassi Techassi added this pull request to the merge queue Oct 22, 2025
@sbernauer
Member Author

Famous last words :D

Comment on lines -20 to -24
# Load the latest CRDs from Nix
watch_file('result')
if os.path.exists('result'):
    k8s_yaml('result/crds.yaml')

Member

This will come back on the next templating run, until it's removed from there.

Merged via the queue into main with commit 76936be Oct 22, 2025
17 checks passed
@Techassi Techassi deleted the feat/crd-versioning branch October 22, 2025 13:41
Comment on lines +14 to +15
webhook.stackable.tech/conversion: enabled
{{- include "operator.selectorLabels" . | nindent 4 }}
Member

Comment on lines +119 to +123
# We generate a crds.yaml, so that the effect of code changes are visible.
# The operator will take care of the CRD rollout itself.
crds:
	mkdir -p deploy/helm/"${OPERATOR_NAME}"/crds
	cargo run --bin stackable-"${OPERATOR_NAME}" -- crd | yq eval '.metadata.annotations["helm.sh/resource-policy"]="keep"' - > "deploy/helm/${OPERATOR_NAME}/crds/crds.yaml"
	mkdir -p extra
	cargo run --bin stackable-"${OPERATOR_NAME}" -- crd > extra/crds.yaml
Member

this will be overwritten by templating

nix/** linguist-generated
Cargo.nix linguist-generated
crate-hashes.json linguist-generated
extra/crds.yaml linguist-generated
Member

This will be overwritten by templating

Member

I think maybe we don't want to hide this as generated (even though it is). It could be handy during reviews to see the changes to the CRDs as a last-line-of-defense.

Member Author

Ohhh this is what this is for? Big +1, we should always review the actual CRD change :)

image.tar

tilt_options.json
local_values.yaml
Member

this will be overwritten by templating

Member

I can highly recommend adding _private/ to a global .gitignore, or adding * to _private/.gitignore for local stuff while keeping the branch clean.

@NickLarsenNZ
Member

I will make the operator-templating changes now


chart-clean:
	rm -rf "deploy/helm/${OPERATOR_NAME}/configs"
	rm -rf "deploy/helm/${OPERATOR_NAME}/crds"
Member

Shouldn't this be kept, but changed to extra?

@Techassi Techassi moved this from Development: In Review to Development: Done in Stackable Engineering Oct 23, 2025
Labels

release-note Denotes a PR that will be considered when it comes time to generate release notes.

Projects

Status: Development: Done

Development

Successfully merging this pull request may close these issues.

Mark samAccountName generation as non-experimental

4 participants