-
-
Notifications
You must be signed in to change notification settings - Fork 772
feat: override authority in GRPC client initialization #12556
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
2aea006 to
288dc2d
Compare
288dc2d to
9443235
Compare
| )) | ||
|
|
||
| grpcOpts := []grpc.DialOption{ | ||
| grpc.WithAuthority(ParseAuthority(host)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this wouldn't work, as we have a list of endpoints, and pass a single authority with it.
Either the round-robin should return a proper authority, or something else should be going on here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@smira is there any difference in the way Talos populates the value for the address parameter for the NewConnection method depending on the number of endpoints that would make this fail if there is more than one endpoint to balance the load with ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, reading through the code of the gRPC client, and looking at our code, we are setting the ServerName correctly in our resolver, so it should be propagated as intended:
- https://github.com/siderolabs/talos/blob/v1.12.1/pkg/machinery/client/resolver/roundrobin.go#L81-L84
- https://github.com/grpc/grpc-go/blob/v1.77.0/clientconn.go#L1024-L1042
Adding the WithAuthority here will override that - and this is to be done by the user. We need a new configuration (new document? field in existing document?) that will allow setting that field (?).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll try to describe the setup I've successfully tested this with.
I have an existing "admin" cluster made up of Talos nodes.
Within that cluster I have deployed :
- a Gateway API (Traefik) with a NodePort service listening on ports 32443 and 32501
- the Kamaji controller
- a Kamaji TenantControlPlane named
tenant-onewith talos-csr-signer as an extra container (as described here) - gateway and routes to redirect requests for host
api.tenant-one.ingress.example.comon both ports to either the Kamaji control plane or CSR signer for that tenant
I also have :
- a reserved public IP
- a wildcard DNS entry for
*.ingress.example.compointing to that IP - a load balancer attached to that IP, listening on ports 443 and 50001 and forwarding requests to the admin cluster nodes on the nodeports mentioned previously
I'm now trying to add Talos nodes to that tenant-one Kamaji control-plane with the following cluster configuration :
cluster:
clusterName: tenant-one
controlPlane:
endpoint: https://api.tenant-one.ingress.example.com:443
With Talos v1.10.9 (without the proposed changes) when the node is initializing it sends a request to the resolved IP for api.tenant-one.ingress.example.com on port 50001. As that requests doesn't include the configured host name, the Gateway API controller doesn't present the right certificate and Talos can't validate the CA.
With a custom Talos image with the hard-coded stringapi.tenant-one.ingress.example.comas authority, the node successfully gets its certificate from the CSR signer.
Even if I haven't tested this setup with more than one IP resolving the host or with Talos control-plane nodes (and not Kamaji's CSR signer), I don't see why it would not work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you provide a reproducer - minimal set of manifests that we can test this with? This might be easier than I initially thought.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's great that it works, but the problem with this change is that we have multiple addresses passed down to the gRPC client, and it iterates over them, while we set a single authority. I haven't looked into the code, but I think authority should come from the selected endpoint (not sure if it's possible with gRPC even), or this
WithAuthorityshould be a special case when there is just a single endpoint to talk to.
As it reproduces (with pods in a K8S cluster) a regular setup with 3 control-plane nodes and a bunch of worker Nodes all running Talos images, that test actually also shows that it works even if we pass multiple addresses.
I had to run the test again as the pod logs had expired since my initial test, but here are the logs from one of the worker nodes :
[talos] 2026/01/14 15:43:04 Initializing CSR generator {"component": "controller-runtime", "controller": "secrets.APIController", "endpoints": ["10.244.1.217", "10.244.3.6", "10.244.3.7", "9XX.XXX.XXX.6"], "host": "talos-test.ingress.XXXX.net"}
[talos] 2026/01/14 15:43:04 sending CSR {"component": "controller-runtime", "controller": "secrets.APIController", "endpoints": ["10.244.1.217", "10.244.3.6", "10.244.3.7", "9XX.XXX.XXX.6"]}
You can clearly see the control-plane pod IPs and the external LB IP (resolved from the provided cluster endpoint in the Talos config) being passed as endpoints to the gRPC client.
I'm sorry but I don't see how this would be any different with VM or BM nodes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only restriction for it to work might be that all control-plane nodes need to include the host used as gRPC authority in the machine certSANs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only restriction for it to work might be that all control-plane nodes need to include the host used as gRPC authority in the machine certSANs
Yes, exactly, and this is not the case for "bare" Talos, this is only the case for your setup.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only restriction for it to work might be that all control-plane nodes need to include the host used as gRPC authority in the machine certSANs
Yes, exactly, and this is not the case for "bare" Talos, this is only the case for your setup.
I'm afraid this is not just my setup : it was built following Talos documentation guidelines :
When using a TCP loadbalancer, make sure the loadbalancer endpoint is included in the .machine.certSANs list in the machine configuration.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This talks about API access, that is apid, not trustd, and this has nothing to do with what I posted above.
Talos workers connect to trustds using direct connection by IP - that's the primary path.
acf4f2b to
3bafbf6
Compare
Enables support for trustd behind load balancer by providing SNI. (cherry picked from commit b593bc6) (cherry picked from commit 33f336d) Signed-off-by: Sébastien Masset <[email protected]> Signed-off-by: Mateusz Urbanek <[email protected]>
3bafbf6 to
de86fc3
Compare
|
We did some testing with @shanduur against Talos API client with endpoints which uses same gRPC flow with endpoints and load-balancer. And gRPC client sends SNI correctly given that the list of endpoints looks something like this: So I'm not exactly sure what this PR is trying to address. E.g. in your setup, if the controlplane loadbalancer for Talos worker is set to |
|
Closing. |
|
Hope it's OK to post comment after PR has been closed.
In my experience (maybe this is something that has changed between This is what makes whatever is behind the LB fail to route the request to the correct backend as it receives no SNI but only the IP (which in my case is not dedicated to a single cluster but shared among different clusters each identified by a different hostname but all resolving to the same IP). I'll try again with the latest Talos image built from the main branch just to make sure this is not something that has been changed/fixed since |
|
You can see the endpoints via Talos 1.10 is out of any support and it shouldn't be used. |
|
We have same experience as @smasset-orange with talos v1.12.0 (and talos-csr-signer, yep). But in our case, talos worker refuses to join cluster if controlplane endpoint cert contains FQDN and 127.0.0.1 (with error failed to verity certificate: x509: certificate is valid for 127.0.0.1 not X.X.X.X). So we were force to add all controlplane IP's to cert. |
|
Same result with Here's the workers Talos config : cluster:
controlPlane:
endpoint: https://talos-test.ingress.XXXX.net:443
clusterName: talos-testAnd some logs from the workers attempts to send CSRs : The cert |
|
Ok, this points towards a problem, but this is certainly not what this PR was trying to do. |
This won't be an easy fix I guess, as internally Talos treats endpoints as IP addresses, I will do the fix, but it will probably stay for 1.13 only. |
I really don't mean to be rude but this is exactly what this PR (well that PR ) is trying to do. In short, it turns |
|
This PR was doing totally wrong thing which I pointed above multiple times - endpoints is a list, it's not a single hostname/IP whatever. |
Populate endpoint coming from the Kubernetes controlplane endpoint with the hostname (if the endpoint is a hostname). This should improve cases when hostname is used for the endpoint in terms of SNI, proper resolving of DNS if it's dynamic. See siderolabs#12556 (comment) Signed-off-by: Andrey Smirnov <[email protected]>
Populate endpoint coming from the Kubernetes controlplane endpoint with the hostname (if the endpoint is a hostname). This should improve cases when hostname is used for the endpoint in terms of SNI, proper resolving of DNS if it's dynamic. See siderolabs#12556 (comment) Signed-off-by: Andrey Smirnov <[email protected]>
Enables support for trustd behind load balancer by providing SNI.
(cherry picked from commit b593bc6)
(cherry picked from commit 33f336d)