feat: override authority in GRPC client initialization #12556

shanduur · 2026-01-08T10:25:14Z

Enables support for trustd behind load balancer by providing SNI.

(cherry picked from commit b593bc6)
(cherry picked from commit 33f336d)

pkg/grpc/middleware/auth/basic/basic.go

smira · 2026-01-08T10:45:15Z

pkg/grpc/middleware/auth/basic/basic.go

 	))

 	grpcOpts := []grpc.DialOption{
+		grpc.WithAuthority(ParseAuthority(host)),


I think this wouldn't work, as we have a list of endpoints, and pass a single authority with it.

Either the round-robin should return a proper authority, or something else should be going on here.

@smira is there any difference in how Talos populates the address parameter of the NewConnection method depending on the number of endpoints, such that this would fail when there is more than one endpoint to balance the load across?

Hmm, reading through the code of the gRPC client, and looking at our code, we are setting the ServerName correctly in our resolver, so it should be propagated as intended:

https://github.com/siderolabs/talos/blob/v1.12.1/pkg/machinery/client/resolver/roundrobin.go#L81-L84

https://github.com/grpc/grpc-go/blob/v1.77.0/clientconn.go#L1024-L1042

Adding WithAuthority here will override that, and that is something the user should decide. We need new configuration (a new document? a field in an existing document?) that allows setting that field (?).

I'll try to describe the setup I've successfully tested this with.
I have an existing "admin" cluster made up of Talos nodes.

Within that cluster I have deployed :

a Gateway API (Traefik) with a NodePort service listening on ports 32443 and 32501

the Kamaji controller

a Kamaji TenantControlPlane named tenant-one with talos-csr-signer as an extra container (as described here)

gateway and routes to redirect requests for host api.tenant-one.ingress.example.com on both ports to either the Kamaji control plane or CSR signer for that tenant

I also have :

a reserved public IP

a wildcard DNS entry for *.ingress.example.com pointing to that IP

a load balancer attached to that IP, listening on ports 443 and 50001 and forwarding requests to the admin cluster nodes on the nodeports mentioned previously

I'm now trying to add Talos nodes to that tenant-one Kamaji control-plane with the following cluster configuration :

cluster:
    clusterName: tenant-one
    controlPlane:
        endpoint: https://api.tenant-one.ingress.example.com:443

With Talos v1.10.9 (without the proposed changes), when the node is initializing it sends a request to the resolved IP for api.tenant-one.ingress.example.com on port 50001. As that request doesn't include the configured host name, the Gateway API controller doesn't present the right certificate and Talos can't validate the CA.

With a custom Talos image with the hard-coded string api.tenant-one.ingress.example.com as authority, the node successfully gets its certificate from the CSR signer.

Even if I haven't tested this setup with more than one IP resolving the host or with Talos control-plane nodes (and not Kamaji's CSR signer), I don't see why it would not work.

Can you provide a reproducer - minimal set of manifests that we can test this with? This might be easier than I initially thought.

It's great that it works, but the problem with this change is that we have multiple addresses passed down to the gRPC client, and it iterates over them, while we set a single authority. I haven't looked into the code, but I think authority should come from the selected endpoint (not sure if it's possible with gRPC even), or this WithAuthority should be a special case when there is just a single endpoint to talk to.

As it reproduces (with pods in a K8S cluster) a regular setup with 3 control-plane nodes and a bunch of worker Nodes all running Talos images, that test actually also shows that it works even if we pass multiple addresses.

I had to run the test again as the pod logs had expired since my initial test, but here are the logs from one of the worker nodes :

[talos] 2026/01/14 15:43:04 Initializing CSR generator {"component": "controller-runtime", "controller": "secrets.APIController", "endpoints": ["10.244.1.217", "10.244.3.6", "10.244.3.7", "9XX.XXX.XXX.6"], "host": "talos-test.ingress.XXXX.net"}
[talos] 2026/01/14 15:43:04 sending CSR {"component": "controller-runtime", "controller": "secrets.APIController", "endpoints": ["10.244.1.217", "10.244.3.6", "10.244.3.7", "9XX.XXX.XXX.6"]}

You can clearly see the control-plane pod IPs and the external LB IP (resolved from the provided cluster endpoint in the Talos config) being passed as endpoints to the gRPC client.

I'm sorry but I don't see how this would be any different with VM or BM nodes.

The only restriction for it to work might be that all control-plane nodes need to include the host used as gRPC authority in the machine certSANs

Yes, exactly, and this is not the case for "bare" Talos, this is only the case for your setup.

I'm afraid this is not just my setup : it was built following Talos documentation guidelines :

When using a TCP loadbalancer, make sure the loadbalancer endpoint is included in the .machine.certSANs list in the machine configuration.

This talks about API access, that is apid, not trustd, and this has nothing to do with what I posted above.

Talos workers connect to trustds using direct connection by IP - that's the primary path.

Enables support for trustd behind load balancer by providing SNI.

(cherry picked from commit b593bc6)
(cherry picked from commit 33f336d)

Signed-off-by: Sébastien Masset <[email protected]>
Signed-off-by: Mateusz Urbanek <[email protected]>

smira · 2026-01-15T11:49:05Z

We did some testing with @shanduur against the Talos API client, which uses the same gRPC flow with endpoints and a load balancer.

The gRPC client sends SNI correctly, given that the list of endpoints looks something like this:

endpoints:
  - some.hostname:port # SNI of some.hostname
  - some.hostname # SNI of some.hostname
  - IP_address # no SNI
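The rule in the list above can be sketched with the Go standard library alone. The helper name `sniFor` is hypothetical and is not Talos or grpc-go code; it only mirrors the general TLS behavior that hostnames are sent as SNI while IP literals are not.

```go
package main

import (
	"fmt"
	"net"
)

// sniFor is a hypothetical helper: it returns the server name a TLS client
// would present for an endpoint, or "" when the endpoint is an IP address
// (SNI carries hostnames only, never IP literals).
func sniFor(endpoint string) string {
	host := endpoint

	// Strip an optional ":port" suffix. SplitHostPort returns an error when
	// no port is present, in which case the endpoint is used as-is.
	if h, _, err := net.SplitHostPort(endpoint); err == nil {
		host = h
	}

	if net.ParseIP(host) != nil {
		return ""
	}

	return host
}

func main() {
	for _, ep := range []string{"some.hostname:50001", "some.hostname", "172.20.0.1"} {
		fmt.Printf("%-22s SNI=%q\n", ep, sniFor(ep))
	}
}
```

The port 50001 and the address 172.20.0.1 are only sample values taken from elsewhere in this thread.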

So I'm not exactly sure what this PR is trying to address. E.g. in your setup, if the controlplane loadbalancer for the Talos worker is set to https://example.com:6443, then the Talos worker will try to connect to example.com:50001 for the trustd API.
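As an illustration of the port substitution described above, the following sketch derives a trustd address from a control plane endpoint URL. `trustdAddress` and the `trustdPort` constant are hypothetical names made up for this example, not Talos identifiers; only the port number 50001 comes from the discussion.

```go
package main

import (
	"fmt"
	"net"
	"net/url"
)

// trustdPort is the trustd API port mentioned in this thread.
const trustdPort = "50001"

// trustdAddress is a hypothetical helper: it keeps the host of the control
// plane endpoint and swaps the port for the trustd port.
func trustdAddress(controlPlaneEndpoint string) (string, error) {
	u, err := url.Parse(controlPlaneEndpoint)
	if err != nil {
		return "", err
	}

	if u.Hostname() == "" {
		return "", fmt.Errorf("endpoint %q has no host", controlPlaneEndpoint)
	}

	// JoinHostPort also brackets IPv6 literals correctly.
	return net.JoinHostPort(u.Hostname(), trustdPort), nil
}

func main() {
	addr, err := trustdAddress("https://example.com:6443")
	if err != nil {
		panic(err)
	}

	fmt.Println(addr) // prints "example.com:50001"
}
```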

shanduur · 2026-01-15T12:14:50Z

Closing.

smasset-orange · 2026-01-15T16:07:39Z

Hope it's OK to post a comment after the PR has been closed.

if the controlplane loadbalancer for Talos worker is set to https://example.com:6443, then Talos worker will try to connect example.com:50001 for trustd API.

In my experience (maybe this is something that has changed between the main and release-1.10 branches), the Talos worker connects to the trustd API on port 50001 using the IP that example.com resolves to, and not using the example.com hostname from the configured control-plane endpoint.

This is what makes whatever is behind the LB fail to route the request to the correct backend as it receives no SNI but only the IP (which in my case is not dedicated to a single cluster but shared among different clusters each identified by a different hostname but all resolving to the same IP).

I'll try again with the latest Talos image built from the main branch just to make sure this is not something that has been changed/fixed since v1.10.9.

smira · 2026-01-15T16:12:00Z

You can see the endpoints via

$ talosctl -n 172.20.0.5 get endpoints 
NODE         NAMESPACE      TYPE       ID               VERSION   ADDRESSES
172.20.0.5   controlplane   Endpoint   controlplane     1         ["172.20.0.1"]
172.20.0.5   controlplane   Endpoint   discovery        2         ["172.20.0.2","172.20.0.3","172.20.0.4"]
172.20.0.5   controlplane   Endpoint   kube-apiserver   1         ["172.20.0.2","172.20.0.3","172.20.0.4"]

Talos 1.10 is out of any support and it shouldn't be used.

evgkrsk · 2026-01-15T16:24:49Z

We have the same experience as @smasset-orange with Talos v1.12.0 (and talos-csr-signer, yep). But in our case, the Talos worker refuses to join the cluster if the controlplane endpoint cert contains the FQDN and 127.0.0.1 (with the error failed to verify certificate: x509: certificate is valid for 127.0.0.1 not X.X.X.X). So we were forced to add all controlplane IPs to the cert.

smasset-orange · 2026-01-15T16:52:00Z

Same result with v1.13.0-alpha.0 as with the 3-week-old v1.10.9: the worker nodes fail to join the cluster because they call the trustd endpoints by IP (9XX.XXX.XXX.6) and not with the hostname (https://talos-test.ingress.XXXX.net)

$ talosctl get endpoints
NODE           NAMESPACE      TYPE       ID               VERSION   ADDRESSES
10.244.3.9     controlplane   Endpoint   controlplane     1         ["9XX.XXX.XXX.6"]
10.244.3.9     controlplane   Endpoint   discovery        2         ["10.244.3.9"]
10.244.3.9     controlplane   Endpoint   kube-apiserver   3         ["10.244.1.219","10.244.2.50","10.244.3.9"]
10.244.1.219   controlplane   Endpoint   controlplane     1         ["9XX.XXX.XXX.6"]
10.244.1.219   controlplane   Endpoint   discovery        2         ["10.244.1.219"]
10.244.1.219   controlplane   Endpoint   kube-apiserver   3         ["10.244.1.219","10.244.2.50","10.244.3.9"]
10.244.2.50    controlplane   Endpoint   controlplane     1         ["9XX.XXX.XXX.6"]
10.244.2.50    controlplane   Endpoint   discovery        2         ["10.244.2.50"]
10.244.2.50    controlplane   Endpoint   kube-apiserver   3         ["10.244.1.219","10.244.2.50","10.244.3.9"]
2 errors occurred:
        * rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 10.244.1.220:50000: i/o timeout"
        * rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 10.244.2.51:50000: i/o timeout"

Here's the workers' Talos config:

cluster:
    controlPlane:
        endpoint: https://talos-test.ingress.XXXX.net:443
    clusterName: talos-test

And some logs from the workers' attempts to send CSRs:

[talos] 2026/01/15 16:23:53 sending CSR {"component": "controller-runtime", "controller": "secrets.APIController", "endpoints": ["9XX.XXX.XXX.6"]}
[...]
[talos] 2026/01/15 16:24:01 sending CSR {"component": "controller-runtime", "controller": "secrets.APIController", "endpoints": ["10.244.1.219", "10.244.2.50", "10.244.3.9", "9XX.XXX.XXX.6"]}
[...]
[talos] 2026/01/15 16:25:02 sending CSR {"component": "controller-runtime", "controller": "secrets.APIController", "endpoints": ["10.244.1.219", "10.244.2.50", "10.244.3.9", "9XX.XXX.XXX.6"]}
[...]
[talos] 2026/01/15 16:29:05 controller failed {"component": "controller-runtime", "controller": "secrets.APIController", "error": "failed to sign API server CSR: 6 error(s) occurred:\n\trpc error: code = DeadlineExceeded desc = context deadline exceeded while waiting for connections to become ready\n\trpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 10.244.2.50:50001: i/o timeout\"\n\trpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 10.244.3.9:50001: i/o timeout\"\n\trpc error: code = Unavailable desc = connection error: desc = \"transport: authentication handshake failed: tls: failed to verify certificate: x509: cannot validate certificate for 9XX.XXX.XXX.6 because it doesn't contain any IP SANs\"\n\trpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 10.244.1.219:50001: i/o timeout\"\n\ttimeout"}

The cert doesn't contain any IP SANs because it is the wrong certificate: the request is sent to 9XX.XXX.XXX.6:50001 and not to talos-test.ingress.XXXX.net:50001 (or to 9XX.XXX.XXX.6:50001 with talos-test.ingress.XXXX.net as authority).
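The failure mode in the log above can be reproduced in isolation with Go's `crypto/x509`. This is a minimal sketch, assuming a certificate that carries only a DNS SAN; the hostname `talos-test.ingress.example.net` and the address `192.0.2.6` are placeholders standing in for the redacted values, and `verify` is a hypothetical helper name.

```go
package main

import (
	"crypto/x509"
	"fmt"
)

// verify reports whether a certificate carrying only the given DNS SANs
// would be accepted for the name or IP address the client dialed.
func verify(dnsSANs []string, dialed string) error {
	// VerifyHostname only consults the SAN fields, so an in-memory
	// certificate value is enough for this illustration.
	cert := &x509.Certificate{DNSNames: dnsSANs}

	return cert.VerifyHostname(dialed)
}

func main() {
	sans := []string{"talos-test.ingress.example.net"} // placeholder hostname

	// Verifying against the hostname succeeds (nil error).
	fmt.Println("by hostname:", verify(sans, "talos-test.ingress.example.net"))

	// Verifying against the bare IP fails, because IP addresses are only
	// matched against IP SANs and this certificate has none.
	fmt.Println("by IP:      ", verify(sans, "192.0.2.6"))
}
```

Verifying against the IP only passes if that IP is listed as an IP SAN, which is why adding every address to the certificate works as a stopgap.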

smira · 2026-01-15T16:57:38Z

Ok, this points towards a problem, but this is certainly not what this PR was trying to do.

smira · 2026-01-15T17:01:34Z

This won't be an easy fix, I guess, as internally Talos treats endpoints as IP addresses. I will do the fix, but it will probably be for 1.13 only.

smasset-orange · 2026-01-15T18:29:02Z

Ok, this points towards a problem, but this is certainly not what this PR was trying to do.

I really don't mean to be rude, but this is exactly what this PR (well, that PR) is trying to do.

In short, it turns curl https://9XX.XXX.XXX.6:50001 into curl -H 'Host: talos-test.ingress.XXXX.net' https://9XX.XXX.XXX.6:50001. Or, in gRPC terms, grpcurl '9XX.XXX.XXX.6:50001' into grpcurl -authority 'talos-test.ingress.XXXX.net' '9XX.XXX.XXX.6:50001'.
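The same effect can be observed at the TLS layer with the Go standard library. This is a sketch only: it does not use gRPC, `observedSNI` is a hypothetical helper, and the hostname and IP are placeholders. It rests on the assumption that the authority ends up as the TLS server name when no explicit server name is configured. The handshake runs over an in-memory pipe and is aborted by the server right after the ClientHello, so no certificate or network is needed.

```go
package main

import (
	"crypto/tls"
	"errors"
	"fmt"
	"net"
)

// observedSNI starts a TLS handshake over an in-memory pipe and returns the
// server name the server saw in the ClientHello.
func observedSNI(serverName string) string {
	clientConn, serverConn := net.Pipe()

	var seen string

	serverCfg := &tls.Config{
		// Record the SNI value, then abort: only the ClientHello matters here.
		GetConfigForClient: func(hello *tls.ClientHelloInfo) (*tls.Config, error) {
			seen = hello.ServerName

			return nil, errors.New("stop after ClientHello")
		},
	}

	done := make(chan struct{})

	go func() {
		defer close(done)

		// The client handshake is expected to fail once the server aborts.
		_ = tls.Client(clientConn, &tls.Config{ServerName: serverName}).Handshake()
		clientConn.Close()
	}()

	_ = tls.Server(serverConn, serverCfg).Handshake()
	serverConn.Close()
	<-done

	return seen
}

func main() {
	// Only the IP is known to the client: no SNI is sent.
	fmt.Printf("IP as server name:       SNI=%q\n", observedSNI("192.0.2.6"))

	// The hostname is supplied as the server name: SNI carries it.
	fmt.Printf("hostname as server name: SNI=%q\n", observedSNI("talos-test.ingress.example.net"))
}
```

A load balancer that routes on SNI has nothing to match in the first case, which is the situation described for the shared IP.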

smira · 2026-01-15T18:34:57Z

This PR was doing a totally wrong thing, which I pointed out above multiple times: endpoints is a list, not a single hostname/IP or whatever.

Populate endpoint coming from the Kubernetes controlplane endpoint with the hostname (if the endpoint is a hostname).

This should improve cases when hostname is used for the endpoint in terms of SNI, proper resolving of DNS if it's dynamic.

See siderolabs#12556 (comment)

Signed-off-by: Andrey Smirnov <[email protected]>

github-project-automation bot added this to Planning Jan 8, 2026

github-project-automation bot moved this to To Do in Planning Jan 8, 2026

shanduur force-pushed the feat/grpc-authority branch 2 times, most recently from 2aea006 to 288dc2d Compare January 8, 2026 10:34

smira reviewed Jan 8, 2026

shanduur force-pushed the feat/grpc-authority branch from 288dc2d to 9443235 Compare January 8, 2026 10:44

smira reviewed Jan 8, 2026

shanduur force-pushed the feat/grpc-authority branch 2 times, most recently from acf4f2b to 3bafbf6 Compare January 8, 2026 13:18

shanduur changed the title ~~feat: add authority in GRPC client initialization~~ feat: override authority in GRPC client initialization Jan 13, 2026

shanduur force-pushed the feat/grpc-authority branch from 3bafbf6 to de86fc3 Compare January 13, 2026 09:38

shanduur closed this Jan 15, 2026

github-project-automation bot moved this from To Do to Done in Planning Jan 15, 2026

smira mentioned this pull request Jan 15, 2026

fix: add hostname to endpoints #12603

Merged
