Inventory collector: rely less on DNS #8671

@jgallagher

Today, collecting inventory starts by querying DNS for several services we want to interrogate, then querying the database for sled-agents to interrogate, and then performing a collection with all of those targets identified. This has a couple of not-so-nice properties:

  • For a few of the service types, we're using lookup_all_socket_and_zone_v6(), which returns both the socket address and the zone ID from the DNS resolver. It determines the zone ID by parsing the hostname returned by the initial SRV query. This is correct given the way we build DNS today, but feels fragile w.r.t. changes on the DNS side that one might assume would be unrelated; e.g., the SRV record could return IPs or we could change the way we define zone hostnames. (The decision to parse hostnames came out of conversations about how to tie a service with something the planner can use to identify it in the blueprint.)
  • For each of these service types, we'll only query targets found in DNS. If blueprint execution has gone haywire in some way that either affects DNS (e.g., we haven't been able to push updated DNS records in a while) or running zones (e.g., some sled is still running zones that should have been expunged), we'll fail to collect from any services that aren't in DNS. Since inventory's job is to try to present the system as it is, this isn't ideal.
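
To make the first concern concrete, here is a minimal sketch of the kind of hostname parsing involved. It is not the actual `lookup_all_socket_and_zone_v6()` implementation: the helper names and the `<zone-uuid>.host.<domain>` hostname shape are assumptions made purely for illustration.

```rust
/// Hypothetical helper (not the real resolver code): extract a zone ID from
/// an SRV target, assuming the target looks like `<zone-uuid>.host.<domain>`.
fn zone_id_from_srv_target(target: &str) -> Option<&str> {
    // Take the first DNS label; a target with no '.' (e.g. a bare IPv6
    // address) yields None here.
    let (label, _rest) = target.split_once('.')?;
    looks_like_uuid(label).then_some(label)
}

/// Checks only the textual 8-4-4-4-12 hex shape of a UUID.
fn looks_like_uuid(s: &str) -> bool {
    s.len() == 36
        && s.bytes().enumerate().all(|(i, b)| match i {
            8 | 13 | 18 | 23 => b == b'-',
            _ => b.is_ascii_hexdigit(),
        })
}
```

Under these assumptions, a target such as `01234567-89ab-cdef-0123-456789abcdef.host.example.internal` yields a zone ID, while an SRV record that returned an IP address, or a renamed hostname like `cockroach-3.host.example.internal`, would yield nothing, even though the service itself is unchanged.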

A proposed change:

  • Don't query DNS for any discretionary services that are contained in the blueprint. (We'll still query DNS for MGS; this is probably fine? We could consider giving MGS the same treatment if that's wrong.)
  • Instead, after we collect inventory from each sled-agent, look at the OmicronSledConfigs we collected from them, and use that to determine which cockroach / NTP / etc. zones exist in the system instead of DNS; then build clients for them and issue status queries as we do today. (There's a question here of whether to use the ledgered sled config or the most-recently-reconciled sled config or both; I think I'd suggest both to handle cases where we just recently ledgered a different config and it hasn't yet been reconciled?)
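
A rough sketch of what the second step might look like, taking the "use both" option for ledgered and most-recently-reconciled configs. All of the types here (`ZoneId`, `ZoneKind`, `ZoneConfig`, `SledInventory`) and the function `targets_for` are hypothetical stand-ins, not the real `OmicronSledConfig` or inventory types.

```rust
use std::collections::BTreeMap;
use std::net::SocketAddrV6;

// Hypothetical stand-ins for illustration; the real types differ.
type ZoneId = u64; // stand-in for a zone UUID

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum ZoneKind {
    Cockroach,
    Ntp,
}

#[derive(Clone, Debug)]
struct ZoneConfig {
    id: ZoneId,
    kind: ZoneKind,
    addr: SocketAddrV6,
}

/// What one sled-agent reported: either config may be absent.
#[derive(Clone, Debug, Default)]
struct SledInventory {
    ledgered: Option<Vec<ZoneConfig>>,
    last_reconciled: Option<Vec<ZoneConfig>>,
}

/// Collect query targets for one zone kind from every sled, taking the union
/// of the ledgered and last-reconciled configs and deduplicating by zone ID.
fn targets_for(
    kind: ZoneKind,
    sleds: &[SledInventory],
) -> BTreeMap<ZoneId, SocketAddrV6> {
    let mut targets = BTreeMap::new();
    for sled in sleds {
        let configs = sled.ledgered.iter().chain(sled.last_reconciled.iter());
        for zone in configs.flatten().filter(|z| z.kind == kind) {
            // If a zone appears in both configs, the first address seen wins.
            targets.entry(zone.id).or_insert(zone.addr);
        }
    }
    targets
}
```

Taking the union means a zone that was just ledgered but not yet reconciled, or one that is still running from the previously reconciled config, both end up as collection targets. A zone that does not answer would then presumably surface as a collection error instead of being silently skipped.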

@davepacheco, @smklein, @plotnick and I chatted about this in today's update watercooler.

I don't think this is a blocker for R17, but it is some tech debt that seems like it will be fairly easy to clean up if we can get to it.
