Inventory collector: rely less on DNS

Today, collecting inventory starts by [querying DNS for several services we want to interrogate](https://github.com/oxidecomputer/omicron/blob/9e642c9aabb2531e59103473dcc432c378240a86/nexus/src/app/background/tasks/inventory_collection.rs#L138-L242) and then [querying the database for sled-agents to interrogate](https://github.com/oxidecomputer/omicron/blob/9e642c9aabb2531e59103473dcc432c378240a86/nexus/src/app/background/tasks/inventory_collection.rs#L244-L245). We then perform a collection against all of the targets identified. This has a couple of not-so-nice properties:

* For a few of the service types, we're using `lookup_all_socket_and_zone_v6()`, which returns both the socket address and the zone ID from the DNS resolver. It determines the zone ID by parsing the hostname returned by the initial SRV query. This is correct given the way we build DNS today, but feels fragile w.r.t. changes on the DNS side that one might assume would be unrelated; e.g., the SRV record could return IPs or we could change the way we define zone hostnames. (The decision to parse hostnames came out of conversations about [how to tie a service with something the planner can use to identify it in the blueprint](https://github.com/oxidecomputer/omicron/pull/8603#discussion_r2216440186).)
* For each of these service types, we'll only query targets _found in DNS_. If blueprint execution has gone haywire in some way that either affects DNS (e.g., we haven't been able to push updated DNS records in a while) or running zones (e.g., some sled is still running zones that should have been expunged), we'll fail to collect from any services that aren't in DNS. Since inventory's job is to try to present the system as it is, this isn't ideal.
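To make the fragility in the first bullet concrete, here is a minimal sketch of what "parse the zone ID out of the SRV target hostname" amounts to. This is not the actual body of `lookup_all_socket_and_zone_v6()`; the function names and the `<zone-uuid>.host.<suffix>` naming scheme are assumptions made for illustration.

```rust
/// Hypothetical sketch: recover a zone ID from an SRV target hostname,
/// assuming targets look like "<zone-uuid>.host.<dns-suffix>".
/// The naming scheme here is an assumption, not taken from the resolver code.
fn zone_id_from_srv_target(target: &str) -> Option<&str> {
    // Split off the first DNS label; a bare IP literal without dots fails here.
    let (first_label, rest) = target.split_once('.')?;
    // Require the assumed "host" label immediately after the ID.
    if !rest.starts_with("host.") {
        return None;
    }
    // Only accept the label if it has the textual shape of a UUID.
    looks_like_uuid(first_label).then_some(first_label)
}

/// Checks the 8-4-4-4-12 hex-digit layout of a textual UUID.
fn looks_like_uuid(s: &str) -> bool {
    let groups: Vec<&str> = s.split('-').collect();
    let lens = [8, 4, 4, 4, 12];
    groups.len() == lens.len()
        && groups
            .iter()
            .zip(lens)
            .all(|(g, n)| g.len() == n && g.chars().all(|c| c.is_ascii_hexdigit()))
}
```

Under these assumptions, any change to how zone hostnames are built (or an SRV record that hands back an address instead of a name) makes the lookup return `None`, even though the service itself is perfectly reachable.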

A proposed change:

* Don't query DNS for any discretionary services that are contained in the blueprint. (We'll still query DNS for MGS; this is probably fine? We could consider giving MGS the same treatment if that's wrong.)
* Instead, after we collect inventory from each sled-agent, look at the `OmicronSledConfig`s we collected from them, and use that to determine which cockroach / NTP / etc. zones exist in the system instead of DNS; then build clients for them and issue status queries as we do today. (There's a question here of whether to use the ledgered sled config or the most-recently-reconciled sled config or both; I think I'd suggest both to handle cases where we just recently ledgered a different config and it hasn't yet been reconciled?)
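A rough sketch of the second bullet, assuming we take the "both" option. All of the types below (`ZoneKind`, `ZoneConfig`, `SledConfig`, `SledInventory`) are simplified stand-ins invented for this example, not the real `OmicronSledConfig` or inventory types, and the field names are hypothetical.

```rust
use std::collections::BTreeMap;
use std::net::SocketAddrV6;

// Hypothetical, simplified stand-ins for the real omicron types.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum ZoneKind {
    Cockroach,
    Ntp,
    Other,
}

#[derive(Clone, Debug)]
struct ZoneConfig {
    id: String,
    kind: ZoneKind,
    addr: SocketAddrV6,
}

#[derive(Clone, Debug, Default)]
struct SledConfig {
    zones: Vec<ZoneConfig>,
}

/// What we are assumed to have collected from one sled-agent.
struct SledInventory {
    ledgered: Option<SledConfig>,
    last_reconciled: Option<SledConfig>,
}

/// Build the set of targets of one kind from the union of the ledgered and
/// most-recently-reconciled configs, keyed by zone ID so that a zone that
/// appears in both configs is only queried once.
fn zone_targets(
    sleds: &[SledInventory],
    kind: ZoneKind,
) -> BTreeMap<String, SocketAddrV6> {
    let mut targets = BTreeMap::new();
    for sled in sleds {
        // Reconciled config first, so its address wins if the two disagree.
        let configs = sled.last_reconciled.iter().chain(sled.ledgered.iter());
        for zone in configs.flat_map(|c| c.zones.iter()) {
            if zone.kind == kind {
                targets.entry(zone.id.clone()).or_insert(zone.addr);
            }
        }
    }
    targets
}
```

Calling something like `zone_targets(&sleds, ZoneKind::Cockroach)` after the sled-agent pass would then replace the DNS lookup for that service type. Preferring the reconciled address when the two configs disagree is an arbitrary choice in this sketch; the real answer depends on how we resolve the ledgered-vs-reconciled question.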

@davepacheco, @smklein, @plotnick and I chatted about this in today's update watercooler.

I don't think this is a blocker for R17, but it is some tech debt that seems like it will be fairly easy to clean up if we can get to it.

Inventory collector: rely less on DNS #8671
