Skip to content

Support custom list of services to be added to /etc/hosts in cluster DNS operator - RFE-4145 #2435

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

t-cas
Copy link

@t-cas t-cas commented Aug 4, 2025

prerequisite for PR openshift/cluster-dns-operator#441 due to new field introduced in DNS CRD

@openshift-ci openshift-ci bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Aug 4, 2025
Copy link
Contributor

openshift-ci bot commented Aug 4, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: t-cas
Once this PR has been reviewed and has the lgtm label, please assign joelspeed for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copy link
Contributor

openshift-ci bot commented Aug 4, 2025

Hello @t-cas! Some important instructions when contributing to openshift/api:
API design plays an important part in the user experience of OpenShift and as such API PRs are subject to a high level of scrutiny to ensure they follow our best practices. If you haven't already done so, please review the OpenShift API Conventions and ensure that your proposed changes are compliant. Following these conventions will help expedite the api review process for your PR.

@openshift-ci openshift-ci bot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Aug 4, 2025
Copy link
Contributor

openshift-ci bot commented Aug 4, 2025

Hi @t-cas. Thanks for your PR.

I'm waiting for a openshift member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Copy link
Contributor

@everettraven everettraven left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noticed this is a change to an existing v1 API, which means that new fields need to be feature-gated and go through the feature promotion process

Comment on lines 98 to 102
// nodeServices specifies K8s services for which entries should be added
// to /etc/hosts by the node resolver. These services will be added in addition to
// the default "image-registry.openshift-image-registry.svc" service.
// Each service should be a relative name following the format "<service>.<namespace>.svc";
// for each relative name, an alias with the CLUSTER_DOMAIN suffix will also be added.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This really sounds like a list of references to services on the cluster that we are just asking users to specify using the domain format.

Based on this existing description, I think it might be more intuitive to end users if we follow the pattern of a list of object references instead of requiring users to put it in the format <service>.<namespace>.svc.

Using a list of object references would also simplify the validations we have to put in place because we don't need a regular expression and there are known validations for namespaces and service names that we can implement via CEL validations.

The resulting end user usage would look something like:

nodeServices:
  - name: bar
    namespace: foo

Then, in the operator code you can build the domain format using that information.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for your feedback, that's indeed a better format, PR updated accordingly

@openshift-ci openshift-ci bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Aug 5, 2025
@JoelSpeed
Copy link
Contributor

Do we have an EP explaining what this new feature is? As @everettraven mentions, this should be behind a feature gate as a new feature extending an existing API

@t-cas
Copy link
Author

t-cas commented Aug 7, 2025

Do we have an EP explaining what this new feature is? As @everettraven mentions, this should be behind a feature gate as a new feature extending an existing API

@JoelSpeed I just added more details about the proposal as part of RFE https://issues.redhat.com/browse/RFE-4145, tell me if you have any questions or need more details.

Also, thanks for the suggestion regarding the feature gate but I’d like to clarify the rationale behind not introducing one in this case.
This enhancement adds a new optional field (nodeServices) to the existing DNS.operator.openshift.io API. If this field is not set, the behavior remains exactly as it is today, with the default hardcoded service (image-registry.openshift-image-registry.svc) being used. In that sense, the current implementation already acts as a natural feature gate: if the field is unset, the feature is effectively disabled.
The goal of this change is to make an existing behavior configurable, not to introduce a fundamentally new capability. It extends the current functionality of injecting DNS service entries into /etc/hosts at the node level, which is already part of the operator’s core responsibilities.
That said, I’m open to understanding if there’s a specific risk or compatibility concern that a formal feature gate would help mitigate. Is there a scenario you foresee where this optional field could impact stability, upgrade paths, or API guarantees?

@everettraven
Copy link
Contributor

@t-cas The purpose of the feature-gate process is to ensure that new features:

  • Have functionality implemented before promotion to GA and thus being supported
  • Have sufficient testing that reports into component readiness so we can detect regressions

More specifically for your feature, you mention:

The goal of this change is to make an existing behavior configurable, not to introduce a fundamentally new capability. It extends the current functionality of injecting DNS service entries into /etc/hosts at the node level, which is already part of the operator’s core responsibilities.
That said, I’m open to understanding if there’s a specific risk or compatibility concern that a formal feature gate would help mitigate. Is there a scenario you foresee where this optional field could impact stability, upgrade paths, or API guarantees?

This is the purpose of an enhancement proposal. It allows us to better understand the motivation of the feature, proposed implementation and evaluate the feature for some of those potential concerns. If this wasn't previously configurable, how does making it configurable impact assumptions users and other systems on an OpenShift cluster may have made about the existing behavior?

If the linked RFE, you mentioned The node-resolver daemonset is updated dynamically when the custom resource changes. - what does this mean? How do we rollout the node-resolver? Is there any chance of a rollout causing a disruption on the cluster?

These are all the kinds of questions I would expect to be answered by an EP.

@t-cas
Copy link
Author

t-cas commented Aug 7, 2025

@everettraven Thanks for the detailed explanation, this helps clarify the broader purpose of the feature gate process, and I fully appreciate its value in ensuring stability, readiness, and supportability of new features.

In this specific case, I’m trying to assess where the feature gate adds value.
The node-resolver daemonset’s role remains the same: injecting DNS service entries into /etc/hosts at node level. The only change here is making the list of services configurable via a new optional field. If nodeServices is not set, the behavior remains exactly as it is today. In that sense, the field itself acts as a natural gate, if you don’t use it, the feature is effectively disabled.

Regarding the rollout: the daemonset may be restarted when the custom resource changes, but this follows standard Kubernetes rolling update daemonset behavior. The node-resolver does not perform any live processing that could be impacted; it simply sets up /etc/hosts on the node where it runs. So in practice, restarting it has no observable impact on cluster operations.

Finally, I initially didn’t expect this small change to require a dedicated enhancement proposal, since it doesn’t really alter the behavior of the node-resolver daemonset, what do you think ?

I’ll make sure to include this clarification in linked RFE.
If there are specific risks or assumptions about the existing behavior that you think could be affected, I’d be happy to explore and address them.

Copy link
Contributor

@JoelSpeed JoelSpeed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@t-cas OpenShift has moved in a direction of gating every new feature, independent of size. We have a defined process which allows features to be developed in tech preview, and gain automated testing to prove out their stability before we add them to the default feature set, and release them to customers.

A benefit of our feature gating mechanics, is that it allows features to be implemented across release boundaries. We are currently about 2 weeks from feature freeze for 4.20. Normally at this point in time, we wouldn't be looking at adding new feature APIs as they risk shipping without the backing implementation. That doesn't matter when it's gated, but does matter when it isn't.

I've left a review for now with some suggestions for improvements to the API, but in general the shape looks fairly simple and I agree this is a fairly minor feature.

That said, there are things we need to consider, which I'll reach out internally about.
On the top of my mind are:

  • Concerns about this from the product team who will own the feature in the long term
  • Allocating QE and docs capacity for testing and documenting the new feature
  • E2E testing, we expect automated E2E testing for all features to prevent regressions in the feature in the future

I'll try and get a conversation going about next steps here, so lets pause the enhancement and feature gating conversation for the moment

// and an alias with the CLUSTER_DOMAIN suffix will also be added.
//
// +optional
// +kubebuilder:validation:MaxItems=20
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to get input from the network team about the appropriate maximum for this list. 20 sounds ok to me, but I wonder if we can reasonably expand this to account for a larger use case.

@Miciah could you or one of your team consider the implications of adding new entries, and what an appropriate maximum scale might be?

Comment on lines 98 to 102
// nodeServices specifies K8s services for which entries should be added
// to /etc/hosts by the node resolver. These services will be added in addition to
// the default "image-registry.openshift-image-registry.svc" service.
// For each service reference, entries will be created using the format "<name>.<namespace>.svc"
// and an alias with the CLUSTER_DOMAIN suffix will also be added.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// nodeServices specifies K8s services for which entries should be added
// to /etc/hosts by the node resolver. These services will be added in addition to
// the default "image-registry.openshift-image-registry.svc" service.
// For each service reference, entries will be created using the format "<name>.<namespace>.svc"
// and an alias with the CLUSTER_DOMAIN suffix will also be added.
// nodeServices specifies a list of service objects for which host level resolvable entries should be added.
// Services in this list will be added to /etc/hosts on each node in the cluster by the node resolver.
// When not specified, only the default image registry service is resolvable.
// Services in this list will be added in addition to the default "image-registry.openshift-image-registry.svc" service.
// The default image registry service cannot be removed.
// For each service reference, entries will be created using the format "<name>.<namespace>.svc"
// and an alias with the CLUSTER_DOMAIN suffix will also be added.

Just want to add a little more context

and an alias with the CLUSTER_DOMAIN suffix will also be added

Which CLUSTER_DOMAIN is this?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

//
// +optional
// +kubebuilder:validation:MaxItems=20
NodeServices []DNSNodeService `json:"nodeServices,omitempty"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please also add

Suggested change
NodeServices []DNSNodeService `json:"nodeServices,omitempty"`
// +kubebuilder:validation:MinItems=1
// +listType=map
// +listMapKey=name
// +listMapKey=namespace
NodeServices []DNSNodeService `json:"nodeServices,omitempty"`

type DNSNodeService struct {
// name is the name of the service.
// +required
// +kubebuilder:validation:Required
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line is redundant as you have +required above.

Suggested change
// +kubebuilder:validation:Required


// namespace is the namespace of the service.
// +required
// +kubebuilder:validation:Required
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line is redundant as you have +required above.

Suggested change
// +kubebuilder:validation:Required

@@ -163,6 +173,25 @@ var (
DNSLogLevelTrace DNSLogLevel = "Trace"
)

// DNSNodeService represents a Kubernetes service by name and namespace for node services.
type DNSNodeService struct {
// name is the name of the service.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Service names are validated as DNS 1035 labels, so we should explain that here

Suggested change
// name is the name of the service.
// name is the name of the service.
// The name should consist of at most 63 characters, and of only lowercase alphanumeric characters and hyphens,
// and should start with an alphabetic character and end with an alphanumeric character.

// +kubebuilder:validation:Required
// +kubebuilder:validation:MinLength=1
// +kubebuilder:validation:MaxLength=63
// +kubebuilder:validation:Pattern=`^[a-z]([-a-z0-9]*[a-z0-9])?$`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We prefer to use CEL as opposed to a pattern here

Suggested change
// +kubebuilder:validation:Pattern=`^[a-z]([-a-z0-9]*[a-z0-9])?$`
// +kubebuilder:validation:XValidation:rule=`!format.dns1035Label().validate(self).hasValue()`,message="a DNS-1035 label must consist of lower case alphanumeric characters or '-', start with an alphabetic character, and end with an alphanumeric character"

// +kubebuilder:validation:Pattern=`^[a-z]([-a-z0-9]*[a-z0-9])?$`
Name string `json:"name"`

// namespace is the namespace of the service.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Namespace names are DNS 1123 Labels, so we will explain that here too

Suggested change
// namespace is the namespace of the service.
// namespace is the namespace of the service.
// The namespace should consist of at most 63 characters, and of only lowercase alphanumeric characters and hyphens,
// and should start and end with an alphanumeric character.

// +kubebuilder:validation:Required
// +kubebuilder:validation:MinLength=1
// +kubebuilder:validation:MaxLength=63
// +kubebuilder:validation:Pattern=`^[a-z]([-a-z0-9]*[a-z0-9])?$`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prefer CEL to pattern here

Suggested change
// +kubebuilder:validation:Pattern=`^[a-z]([-a-z0-9]*[a-z0-9])?$`
// +kubebuilder:validation:XValidation:rule=`!format.dns1123Label().validate(self).hasValue()`,message="the value must consist of only lowercase alphanumeric characters and hyphens"

@t-cas
Copy link
Author

t-cas commented Aug 8, 2025

@JoelSpeed thanks for all the feedback, PR has been updated to integrate all your comments

Please tell me for the next steps how should I proceed with that change

Copy link
Contributor

@JoelSpeed JoelSpeed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @t-cas, next steps here are that I'd like to discuss with @Miciah, however, he is currently on PTO this week, so I'll check in with him next week

// Services in this list will be added in addition to the default "image-registry.openshift-image-registry.svc" service.
// The default image registry service cannot be removed.
// For each service reference, entries will be created using the format "<name>.<namespace>.svc"
// and an alias with the CLUSTER_DOMAIN suffix of cluster.local will also be added.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect we probably want to shorten this and drop CLUSTER_DOMAIN if it is not configurable, CC @Miciah to confirm

@Miciah
Copy link
Contributor

Miciah commented Aug 13, 2025

/assign

@candita
Copy link
Contributor

candita commented Aug 13, 2025

/assign @alebedev87

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants