@shanduur
Member

@shanduur shanduur commented Jan 8, 2026

Enables support for trustd behind load balancer by providing SNI.

(cherry picked from commit b593bc6)
(cherry picked from commit 33f336d)

@github-project-automation github-project-automation bot moved this to To Do in Planning Jan 8, 2026
@shanduur shanduur force-pushed the feat/grpc-authority branch 2 times, most recently from 2aea006 to 288dc2d Compare January 8, 2026 10:34
@shanduur shanduur force-pushed the feat/grpc-authority branch from 288dc2d to 9443235 Compare January 8, 2026 10:44
))

grpcOpts := []grpc.DialOption{
grpc.WithAuthority(ParseAuthority(host)),
Member

I think this wouldn't work, as we have a list of endpoints but pass a single authority for all of them.

Either the round-robin should return a proper authority, or something else should be going on here.


@smira does Talos populate the address parameter of the NewConnection method differently depending on the number of endpoints, in a way that would make this fail when there is more than one endpoint to balance the load across?

Member Author

Hmm, reading through the code of the gRPC client and looking at our code, we are setting the ServerName correctly in our resolver, so it should be propagated as intended.

Adding WithAuthority here will override that, and whether to override it should be up to the user. We need a new configuration option (a new document? a field in an existing document?) that allows setting that field (?).


I'll try to describe the setup I've successfully tested this with.
I have an existing "admin" cluster made up of Talos nodes.

Within that cluster I have deployed:

  • a Gateway API (Traefik) with a NodePort service listening on ports 32443 and 32501
  • the Kamaji controller
  • a Kamaji TenantControlPlane named tenant-one with talos-csr-signer as an extra container (as described here)
  • gateway and routes to redirect requests for host api.tenant-one.ingress.example.com on both ports to either the Kamaji control plane or CSR signer for that tenant

I also have:

  • a reserved public IP
  • a wildcard DNS entry for *.ingress.example.com pointing to that IP
  • a load balancer attached to that IP, listening on ports 443 and 50001 and forwarding requests to the admin cluster nodes on the nodeports mentioned previously

I'm now trying to add Talos nodes to that tenant-one Kamaji control plane with the following cluster configuration:

cluster:
  clusterName: tenant-one
  controlPlane:
    endpoint: https://api.tenant-one.ingress.example.com:443

With Talos v1.10.9 (without the proposed changes), when the node is initializing it sends a request to the IP that api.tenant-one.ingress.example.com resolves to, on port 50001. As that request doesn't include the configured host name, the Gateway API controller doesn't present the right certificate and Talos can't validate the CA.

With a custom Talos image with the hard-coded string api.tenant-one.ingress.example.com as authority, the node successfully gets its certificate from the CSR signer.

Even if I haven't tested this setup with more than one IP resolving the host or with Talos control-plane nodes (and not Kamaji's CSR signer), I don't see why it would not work.

Member Author

Can you provide a reproducer: a minimal set of manifests that we can test this with? This might be easier than I initially thought.


It's great that it works, but the problem with this change is that we have multiple addresses passed down to the gRPC client, and it iterates over them, while we set a single authority. I haven't looked into the code, but I think authority should come from the selected endpoint (not sure if it's possible with gRPC even), or this WithAuthority should be a special case when there is just a single endpoint to talk to.
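The single-endpoint special case suggested above could be sketched roughly as follows. This is only an illustration using the Go standard library: `authorityOverride` is a hypothetical helper, not a function from Talos or gRPC, and it assumes endpoints are given as "host", "host:port", or IP literals.

```go
package main

import (
	"fmt"
	"net"
)

// authorityOverride is a hypothetical helper. It returns an authority to
// override with only when there is exactly one endpoint and that endpoint
// is a hostname; for lists or IP literals it reports false, meaning the
// authority should be left to whatever the client derives per endpoint.
func authorityOverride(endpoints []string) (string, bool) {
	if len(endpoints) != 1 {
		return "", false
	}
	host := endpoints[0]
	// Strip an optional port; endpoints without a port are kept as-is.
	if h, _, err := net.SplitHostPort(host); err == nil {
		host = h
	}
	if host == "" || net.ParseIP(host) != nil {
		return "", false
	}
	return host, true
}

func main() {
	fmt.Println(authorityOverride([]string{"api.tenant-one.ingress.example.com:50001"}))
	fmt.Println(authorityOverride([]string{"10.244.1.217", "10.244.3.6"}))
}
```

With more than one endpoint the helper declines to pick an authority, which is the concern raised in this thread: one value cannot describe a mixed list of addresses.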

As it reproduces (with pods in a K8S cluster) a regular setup with 3 control-plane nodes and a bunch of worker nodes all running Talos images, that test actually also shows that it works even if we pass multiple addresses.

I had to run the test again as the pod logs had expired since my initial test, but here are the logs from one of the worker nodes:

[talos] 2026/01/14 15:43:04 Initializing CSR generator {"component": "controller-runtime", "controller": "secrets.APIController", "endpoints": ["10.244.1.217", "10.244.3.6", "10.244.3.7", "9XX.XXX.XXX.6"], "host": "talos-test.ingress.XXXX.net"}
[talos] 2026/01/14 15:43:04 sending CSR {"component": "controller-runtime", "controller": "secrets.APIController", "endpoints": ["10.244.1.217", "10.244.3.6", "10.244.3.7", "9XX.XXX.XXX.6"]}

You can clearly see the control-plane pod IPs and the external LB IP (resolved from the provided cluster endpoint in the Talos config) being passed as endpoints to the gRPC client.

I'm sorry but I don't see how this would be any different with VM or BM nodes.

@smasset-orange smasset-orange Jan 14, 2026

The only restriction for it to work might be that all control-plane nodes need to include the host used as gRPC authority in the machine certSANs

Member

The only restriction for it to work might be that all control-plane nodes need to include the host used as gRPC authority in the machine certSANs

Yes, exactly, and this is not the case for "bare" Talos, this is only the case for your setup.


The only restriction for it to work might be that all control-plane nodes need to include the host used as gRPC authority in the machine certSANs

Yes, exactly, and this is not the case for "bare" Talos, this is only the case for your setup.

I'm afraid this is not just my setup: it was built following Talos documentation guidelines:

When using a TCP loadbalancer, make sure the loadbalancer endpoint is included in the .machine.certSANs list in the machine configuration.

Member

This talks about API access, that is apid, not trustd, and this has nothing to do with what I posted above.

Talos workers connect to the trustd instances directly by IP; that's the primary path.

@shanduur shanduur force-pushed the feat/grpc-authority branch 2 times, most recently from acf4f2b to 3bafbf6 Compare January 8, 2026 13:18
@shanduur shanduur changed the title feat: add authority in GRPC client initialization feat: override authority in GRPC client initialization Jan 13, 2026
Enables support for trustd behind load balancer by providing SNI.

(cherry picked from commit b593bc6)
(cherry picked from commit 33f336d)

Signed-off-by: Sébastien Masset <[email protected]>
Signed-off-by: Mateusz Urbanek <[email protected]>
@shanduur shanduur force-pushed the feat/grpc-authority branch from 3bafbf6 to de86fc3 Compare January 13, 2026 09:38
@smira
Member

smira commented Jan 15, 2026

@shanduur and I did some testing against the Talos API client, which uses the same gRPC flow with endpoints and a load balancer.

The gRPC client sends SNI correctly, provided the list of endpoints looks something like this:

endpoints:
  - some.hostname:port # SNI of some.hostname
  - some.hostname # SNI of some.hostname
  - IP_address # no SNI
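The rule in the list above (hostname endpoints carry SNI, IP endpoints do not) can be modelled with a small standard-library sketch. `sniFor` is a hypothetical name used for illustration only; it mirrors the behaviour described here and is not the gRPC client's actual code.

```go
package main

import (
	"fmt"
	"net"
)

// sniFor returns the server name a TLS client would be expected to send
// for an endpoint: the hostname for "host" or "host:port" endpoints, and
// an empty string for IP literals, since SNI only carries DNS names.
func sniFor(endpoint string) string {
	host := endpoint
	// Strip an optional port; endpoints without a port are kept as-is.
	if h, _, err := net.SplitHostPort(endpoint); err == nil {
		host = h
	}
	if net.ParseIP(host) != nil {
		return ""
	}
	return host
}

func main() {
	for _, ep := range []string{"some.hostname:50001", "some.hostname", "172.20.0.1"} {
		fmt.Printf("%-22s SNI=%q\n", ep, sniFor(ep))
	}
}
```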

So I'm not exactly sure what this PR is trying to address. E.g. in your setup, if the controlplane load balancer for a Talos worker is set to https://example.com:6443, then the Talos worker will try to connect to example.com:50001 for the trustd API.
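As a rough illustration of that mapping, the sketch below derives a trustd address from a control plane endpoint URL by keeping the host and swapping in port 50001. `trustdAddress` is a hypothetical helper written for this example, not the function Talos uses.

```go
package main

import (
	"fmt"
	"net"
	"net/url"
)

// trustdPort is the trustd API port mentioned in this thread.
const trustdPort = "50001"

// trustdAddress is a hypothetical helper: it keeps the host of the
// control plane endpoint URL (hostname or IP) and replaces the port
// with the trustd port.
func trustdAddress(controlPlaneEndpoint string) (string, error) {
	u, err := url.Parse(controlPlaneEndpoint)
	if err != nil {
		return "", err
	}
	if u.Hostname() == "" {
		return "", fmt.Errorf("no host in endpoint %q", controlPlaneEndpoint)
	}
	return net.JoinHostPort(u.Hostname(), trustdPort), nil
}

func main() {
	addr, err := trustdAddress("https://example.com:6443")
	if err != nil {
		panic(err)
	}
	fmt.Println(addr)
}
```

Keeping the hostname rather than a resolved IP is what lets the client send SNI for that endpoint.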

@shanduur
Member Author

Closing.

@shanduur shanduur closed this Jan 15, 2026
@github-project-automation github-project-automation bot moved this from To Do to Done in Planning Jan 15, 2026
@smasset-orange

Hope it's OK to post a comment after the PR has been closed.

if the controlplane loadbalancer for Talos worker is set to https://example.com:6443, then Talos worker will try to connect example.com:50001 for trustd API.

In my experience (maybe this is something that has changed between the main and release-1.10 branches), the Talos worker connects to the trustd API on port 50001 using the IP that example.com resolves to, and not using the example.com hostname from the configured control-plane endpoint.

This is what makes whatever is behind the LB fail to route the request to the correct backend as it receives no SNI but only the IP (which in my case is not dedicated to a single cluster but shared among different clusters each identified by a different hostname but all resolving to the same IP).
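A toy model of that routing decision, assuming a front end that picks a backend purely from the SNI server name. The `sniRouter` type and the backend names are made up for illustration and do not reflect any real Gateway API or Traefik configuration.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// sniRouter maps an SNI server name to a backend. Several clusters share
// one IP, so the server name is the only thing that tells them apart.
type sniRouter map[string]string

// backendFor picks a backend from the SNI server name. An empty name
// (what a client dialing by IP sends) cannot be routed.
func (r sniRouter) backendFor(serverName string) (string, error) {
	if serverName == "" {
		return "", errors.New("no SNI: cannot choose a backend on a shared IP")
	}
	backend, ok := r[strings.ToLower(serverName)]
	if !ok {
		return "", fmt.Errorf("no route for server name %q", serverName)
	}
	return backend, nil
}

func main() {
	// Hypothetical hostnames and backend names.
	r := sniRouter{
		"api.tenant-one.ingress.example.com": "tenant-one-csr-signer",
		"api.tenant-two.ingress.example.com": "tenant-two-csr-signer",
	}
	fmt.Println(r.backendFor("api.tenant-one.ingress.example.com"))
	fmt.Println(r.backendFor("")) // client dialed by IP, no SNI
}
```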

I'll try again with the latest Talos image built from the main branch just to make sure this is not something that has been changed/fixed since v1.10.9.

@smira
Member

smira commented Jan 15, 2026

You can see the endpoints via

$ talosctl -n 172.20.0.5 get endpoints 
NODE         NAMESPACE      TYPE       ID               VERSION   ADDRESSES
172.20.0.5   controlplane   Endpoint   controlplane     1         ["172.20.0.1"]
172.20.0.5   controlplane   Endpoint   discovery        2         ["172.20.0.2","172.20.0.3","172.20.0.4"]
172.20.0.5   controlplane   Endpoint   kube-apiserver   1         ["172.20.0.2","172.20.0.3","172.20.0.4"]

Talos 1.10 is no longer supported and shouldn't be used.

@evgkrsk

evgkrsk commented Jan 15, 2026

We have the same experience as @smasset-orange with Talos v1.12.0 (and talos-csr-signer, yep). But in our case, the Talos worker refuses to join the cluster if the controlplane endpoint cert contains the FQDN and 127.0.0.1 (with the error failed to verify certificate: x509: certificate is valid for 127.0.0.1 not X.X.X.X). So we were forced to add all controlplane IPs to the cert.

@smasset-orange

Same result with v1.13.0-alpha.0 as with the 3-week-old v1.10.9: the worker nodes fail to join the cluster because they call the trustd endpoints by IP (9XX.XXX.XXX.6) and not with the hostname (https://talos-test.ingress.XXXX.net).

$ talosctl get endpoints
NODE           NAMESPACE      TYPE       ID               VERSION   ADDRESSES
10.244.3.9     controlplane   Endpoint   controlplane     1         ["9XX.XXX.XXX.6"]
10.244.3.9     controlplane   Endpoint   discovery        2         ["10.244.3.9"]
10.244.3.9     controlplane   Endpoint   kube-apiserver   3         ["10.244.1.219","10.244.2.50","10.244.3.9"]
10.244.1.219   controlplane   Endpoint   controlplane     1         ["9XX.XXX.XXX.6"]
10.244.1.219   controlplane   Endpoint   discovery        2         ["10.244.1.219"]
10.244.1.219   controlplane   Endpoint   kube-apiserver   3         ["10.244.1.219","10.244.2.50","10.244.3.9"]
10.244.2.50    controlplane   Endpoint   controlplane     1         ["9XX.XXX.XXX.6"]
10.244.2.50    controlplane   Endpoint   discovery        2         ["10.244.2.50"]
10.244.2.50    controlplane   Endpoint   kube-apiserver   3         ["10.244.1.219","10.244.2.50","10.244.3.9"]
2 errors occurred:
        * rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 10.244.1.220:50000: i/o timeout"
        * rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 10.244.2.51:50000: i/o timeout"

Here's the workers' Talos config:

cluster:
    controlPlane:
        endpoint: https://talos-test.ingress.XXXX.net:443
    clusterName: talos-test

And some logs from the workers' attempts to send CSRs:

[talos] 2026/01/15 16:23:53 sending CSR {"component": "controller-runtime", "controller": "secrets.APIController", "endpoints": ["9XX.XXX.XXX.6"]}
[...]
[talos] 2026/01/15 16:24:01 sending CSR {"component": "controller-runtime", "controller": "secrets.APIController", "endpoints": ["10.244.1.219", "10.244.2.50", "10.244.3.9", "9XX.XXX.XXX.6"]}
[...]
[talos] 2026/01/15 16:25:02 sending CSR {"component": "controller-runtime", "controller": "secrets.APIController", "endpoints": ["10.244.1.219", "10.244.2.50", "10.244.3.9", "9XX.XXX.XXX.6"]}
[...]
[talos] 2026/01/15 16:29:05 controller failed {"component": "controller-runtime", "controller": "secrets.APIController", "error": "failed to sign API server CSR: 6 error(s) occurred:\n\trpc error: code = DeadlineExceeded desc = context deadline exceeded while waiting for connections to become ready\n\trpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 10.244.2.50:50001: i/o timeout\"\n\trpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 10.244.3.9:50001: i/o timeout\"\n\trpc error: code = Unavailable desc = connection error: desc = \"transport: authentication handshake failed: tls: failed to verify certificate: x509: cannot validate certificate for 9XX.XXX.XXX.6 because it doesn't contain any IP SANs\"\n\trpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 10.244.1.219:50001: i/o timeout\"\n\ttimeout"}

The cert doesn't contain any IP SANs because it is not the right one: the request is sent to 9XX.XXX.XXX.6:50001 and not to talos-test.ingress.XXXX.net:50001 (or to 9XX.XXX.XXX.6:50001 with talos-test.ingress.XXXX.net as authority).
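The x509 failure in the log can be reproduced in isolation with Go's standard library. The sketch below builds a throwaway self-signed certificate that carries only a DNS SAN and then verifies it against a hostname and against an IP. The hostname and the 203.0.113.6 address are placeholders standing in for the redacted values above, and `newDNSOnlyCert` is a hypothetical helper.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"time"
)

// newDNSOnlyCert builds a throwaway self-signed certificate whose SANs
// contain only the given DNS name and no IP addresses.
func newDNSOnlyCert(dnsName string) (*x509.Certificate, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(1),
		Subject:      pkix.Name{CommonName: dnsName},
		NotBefore:    time.Now().Add(-time.Hour),
		NotAfter:     time.Now().Add(time.Hour),
		DNSNames:     []string{dnsName},
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return nil, err
	}
	return x509.ParseCertificate(der)
}

func main() {
	cert, err := newDNSOnlyCert("talos-test.ingress.example.net")
	if err != nil {
		panic(err)
	}
	// Verifying against the hostname matches the DNS SAN.
	fmt.Println("by hostname:", cert.VerifyHostname("talos-test.ingress.example.net"))
	// Verifying against an IP fails because the certificate has no IP SANs.
	fmt.Println("by IP:      ", cert.VerifyHostname("203.0.113.6"))
}
```

This only demonstrates the verification step. In the setup described here, the client dialing by IP also sends no SNI, so the front end may not even present this certificate.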

@smira
Member

smira commented Jan 15, 2026

Ok, this points towards a problem, but this is certainly not what this PR was trying to do.

@smira
Member

smira commented Jan 15, 2026

Ok, this points towards a problem, but this is certainly not what this PR was trying to do.

This won't be an easy fix, I guess, as Talos internally treats endpoints as IP addresses. I will do the fix, but it will probably land in 1.13 only.

@smasset-orange

Ok, this points towards a problem, but this is certainly not what this PR was trying to do.

I really don't mean to be rude, but this is exactly what this PR (well, that PR) is trying to do.

In short, it turns curl https://9XX.XXX.XXX.6:50001 into curl -H 'Host: talos-test.ingress.XXXX.net' https://9XX.XXX.XXX.6:50001. Or, in gRPC terms, grpcurl '9XX.XXX.XXX.6:50001' into grpcurl -authority 'talos-test.ingress.XXXX.net' '9XX.XXX.XXX.6:50001'.
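The same idea can be sketched with Go's standard library instead of curl: dial the IP, but present the hostname. In net/http the Host header and the TLS server name are set separately, so this hypothetical `overrideTarget` helper sets both. The address and hostname are placeholders for the redacted values, and nothing here is Talos or gRPC code.

```go
package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
)

// overrideTarget is a hypothetical helper. It builds a request addressed
// to ipAddr that carries hostname in the Host header, plus a TLS config
// that sends hostname as SNI and verifies the certificate against it.
func overrideTarget(ipAddr, hostname string) (*http.Request, *tls.Config, error) {
	req, err := http.NewRequest(http.MethodGet, "https://"+ipAddr+"/", nil)
	if err != nil {
		return nil, nil, err
	}
	// Host header override, the analogue of curl -H 'Host: ...'.
	req.Host = hostname
	// SNI and certificate verification name for the TLS handshake.
	cfg := &tls.Config{ServerName: hostname, MinVersion: tls.VersionTLS12}
	return req, cfg, nil
}

func main() {
	req, cfg, err := overrideTarget("203.0.113.6:50001", "talos-test.ingress.example.net")
	if err != nil {
		panic(err)
	}
	fmt.Println("dial:", req.URL.Host, "host header:", req.Host, "sni:", cfg.ServerName)
}
```

To use it, the TLS config would be attached to an `http.Transport` via its `TLSClientConfig` field before sending the request.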

@smira
Member

smira commented Jan 15, 2026

This PR was doing a totally wrong thing, which I pointed out above multiple times: endpoints is a list, not a single hostname or IP.

smira added a commit to smira/talos that referenced this pull request Jan 15, 2026
Populate endpoint coming from the Kubernetes controlplane endpoint with
the hostname (if the endpoint is a hostname).

This should improve cases when hostname is used for the endpoint in
terms of SNI, proper resolving of DNS if it's dynamic.

See siderolabs#12556 (comment)

Signed-off-by: Andrey Smirnov <[email protected]>