Planning reports #8631

plotnick · 2025-07-17T21:37:54Z

Fixes #8284, fixes #8548.

Replaces many (but not yet all) planner log messages with entries in a structured planning report. Those reports are composed of "step reports", each of which corresponds to a subroutine of do_plan. The complete report may be formatted (in English) via a Display implementation. If planning from the blueprint_planner background task is successful, a report is attached to the BlueprintPlannerStatus::Targeted variant so that it may be displayed via omdb.
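For readers skimming the thread, here is a minimal sketch of the shape described above: one report made of per-step reports, rendered in English via Display. The type names (ToyPlanningReport, ExpungeStepReport, AddStepReport), the use of String for IDs, and the exact output wording are assumptions for illustration. Only the field names zombie_sleds and sleds_missing_ntp_zone are taken from the diffs quoted later in this conversation.

```rust
use std::collections::BTreeSet;
use std::fmt;

/// Hypothetical step report for an "expunge" subroutine of do_plan.
#[derive(Debug, Default, Clone, PartialEq, Eq)]
pub struct ExpungeStepReport {
    /// Sled IDs, kept as plain strings here for brevity (an assumption).
    pub zombie_sleds: Vec<String>,
}

/// Hypothetical step report for an "add" subroutine of do_plan.
#[derive(Debug, Default, Clone, PartialEq, Eq)]
pub struct AddStepReport {
    pub sleds_missing_ntp_zone: BTreeSet<String>,
}

/// Toy top-level report: one field per planning step.
#[derive(Debug, Default, Clone, PartialEq, Eq)]
pub struct ToyPlanningReport {
    pub blueprint_id: String,
    pub expunge: ExpungeStepReport,
    pub add: AddStepReport,
}

impl fmt::Display for ToyPlanningReport {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        writeln!(f, "planning report for blueprint {}:", self.blueprint_id)?;
        // Each step only prints when it has something to say.
        if !self.expunge.zombie_sleds.is_empty() {
            writeln!(
                f,
                "* zombie sleds: {}",
                self.expunge.zombie_sleds.join(", ")
            )?;
        }
        if !self.add.sleds_missing_ntp_zone.is_empty() {
            let sleds: Vec<&str> = self
                .add
                .sleds_missing_ntp_zone
                .iter()
                .map(String::as_str)
                .collect();
            writeln!(f, "* sleds missing NTP zone: {}", sleds.join(", "))?;
        }
        Ok(())
    }
}
```

A planner step would push into its own step report instead of calling a logging macro, and the caller decides later whether and where to print the result.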

TODO, to be handled as follow-ups:

* Update for the rest of [29/n] [reconfigurator] separate out noop image source decision making #8596, in particular NoopConvertInfo
* Update for [40/n] blueprint planner logic + sled agent code to honor mupdate overrides #8456
* Update for Use timesync status to decide when to upgrade boundary NTP zones #8616
* Update for [reconfigurator] RoT planner support #8421

jgallagher

Thanks Alex; I'm inclined to say we should err on the side of merging this as soon as we're reasonably happy with it as a starting point. We know we're going to want to iterate on it a fair bit, and it'd be good to get more people involved in that.

(Not saying we should merge as-is or without tests, just that it probably doesn't warrant the same level of caution as some other planner changes, and we know it's not going to be perfect and that's okay.)

jgallagher · 2025-07-18T15:30:21Z

nexus/types/src/deployment/planning_report.rs

+        BTreeMap<SledUuid, PlanningNoopImageSourceSkipSledReason>,
+    pub skipped_zones:
+        BTreeMap<OmicronZoneUuid, PlanningNoopImageSourceSkipZoneReason>,
+    pub converted_zones: BTreeMap<SledUuid, (usize, usize)>,


Nit - can we turn (usize, usize) into a struct with field names? Just looking here, I don't know what these values are.

(Same for other (usize, usize) tuples below)

It seems that JsonSchema doesn't really like tuples either, so they were all removed in 3fa1a21.
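To illustrate the nit above, a sketch of what a named struct in place of (usize, usize) could look like. The struct name, its field names, and the meaning assigned to each count are guesses based on the "Noop converting 5/6 ..." report output quoted elsewhere in this PR, not the types that landed in 3fa1a21. Sled IDs are plain strings here for brevity.

```rust
use std::collections::BTreeMap;

/// Hypothetical replacement for a bare `(usize, usize)` value.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct ConvertedZoneCounts {
    /// Zones that can be converted (assumed meaning of the first element).
    pub num_eligible: usize,
    /// All install-dataset zones on the sled (assumed meaning of the second).
    pub num_total: usize,
}

/// Render one line per sled; the field names make the call site
/// self-explanatory where `c.0` and `c.1` would not be.
pub fn describe_converted(
    converted: &BTreeMap<String, ConvertedZoneCounts>,
) -> Vec<String> {
    converted
        .iter()
        .map(|(sled, c)| {
            format!(
                "noop converting {}/{} install-dataset zones to artifact store on sled {}",
                c.num_eligible, c.num_total, sled
            )
        })
        .collect()
}
```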

jgallagher · 2025-07-23T14:51:01Z

dev-tools/reconfigurator-cli/tests/output/cmds-add-sled-no-disks-stdout

 WARN cannot issue more SP updates (no current artifacts)
-INFO all zones up-to-date
-INFO will ensure cockroachdb setting, setting: cluster.preserve_downgrade_option, value: DoNotModify


Since we're losing all these logs, can reconfigurator-cli blueprint-plan emit the planning report? I think being able to see it in the CI diffs would also help a lot with reviewing the Display impls (and might give us ideas for things we should add).

I like that idea. It would also help in this PR to verify that the new reports include the same info that we had in the logs before.

Thanks, f0aabe5 adds the report to the end of blueprint display, and also to the message returned by reconfigurator-cli::cmd_blueprint_plan.

jgallagher · 2025-07-23T14:57:09Z

nexus/reconfigurator/planning/src/planner.rs

        Ok(self.blueprint.build())
    }

-    fn check_input_validity(&self) -> Result<(), Error> {
+    pub fn plan_and_report(


Do we need a separate method for this (as opposed to changing plan() to return the blueprint and report)? do_plan constructs the report anyway, so I'm not sure it's buying us much?

f0aabe5 moves the report into the blueprint itself, so the need for this method goes away.

jgallagher · 2025-07-23T14:59:10Z

nexus/reconfigurator/planning/src/planner.rs

        let sleds = match noop_info {
            NoopConvertInfo::GlobalEligible { sleds } => sleds,
-            NoopConvertInfo::GlobalIneligible { .. } => return Ok(()),
+            NoopConvertInfo::GlobalIneligible { .. } => return Ok(report),


Can the report include some indication that no-op conversion is globally ineligible?

It definitely can and should, but I haven't converted the image_source module yet (that's the second TODO in the description above), and so I'd rather hold off on that and do it all at once as a follow-up.

jgallagher · 2025-07-23T15:02:31Z

nexus/reconfigurator/planning/src/planner.rs

+    /// Attempts to place `num_zones_to_add` new zones of `kind`.
+    ///
+    /// It is not an error if there are too few eligible sleds to start a
+    /// sufficient number of zones; instead, we'll log a warning and start as


Suggested change

/// sufficient number of zones; instead, we'll log a warning and start as

/// sufficient number of zones; instead, we'll report it and start as

Thanks, fixed in 6d5b564.

jgallagher · 2025-07-23T15:05:13Z

nexus/reconfigurator/planning/src/planner.rs

+        // Do not update any zones if we've added any discretionary zones
+        // (e.g., in response to policy changes) ...
+        if add.any_discretionary_zones_placed() {
+            report.waiting_on(ZoneUpdatesWaitingOn::DiscretionaryZones);
+            return Ok(report);
+        }
+
+        // ... or if there are still pending updates for the RoT / SP /
+        // Host OS / etc.
+        if mgs_updates.any_updates_pending() {
+            report.waiting_on(ZoneUpdatesWaitingOn::PendingMgsUpdates);
+            return Ok(report);
+        }


I think I'd move these checks back into do_plan? Definitely a judgment call on clarity, but I'm still swayed by Dave's arguments. I think it'd be nice if "skip this entire step" was obviously visible from do_plan. Even better if that means we can drop passing (some) earlier steps' reports into later steps' implementations (I know we'll still need some of these, but maybe not as many?).

Sure, done in 6d5b564.
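A rough sketch of the hoisting discussed above, with the "skip this entire step" decisions visible in the top-level function rather than buried in the step. ZoneUpdatesWaitingOn and its two variants come from the diff; the step report type, the boolean parameters, and the function names are simplified stand-ins, not the real planner signatures.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ZoneUpdatesWaitingOn {
    DiscretionaryZones,
    PendingMgsUpdates,
}

/// Simplified stand-in for the zone-updates step report.
#[derive(Debug, Default, PartialEq, Eq)]
pub struct ZoneUpdatesStepSketch {
    pub waiting_on: Option<ZoneUpdatesWaitingOn>,
    pub updated_zones: Vec<String>,
}

impl ZoneUpdatesStepSketch {
    /// Report for a step that was skipped entirely.
    pub fn waiting_on(reason: ZoneUpdatesWaitingOn) -> Self {
        Self { waiting_on: Some(reason), updated_zones: Vec::new() }
    }
}

/// Top-level planning: the skip conditions are decided here, so the step
/// itself needs no knowledge of earlier steps' reports.
pub fn do_plan_sketch(
    discretionary_zones_placed: bool,
    mgs_updates_pending: bool,
    out_of_date_zones: &[&str],
) -> ZoneUpdatesStepSketch {
    if discretionary_zones_placed {
        ZoneUpdatesStepSketch::waiting_on(ZoneUpdatesWaitingOn::DiscretionaryZones)
    } else if mgs_updates_pending {
        ZoneUpdatesStepSketch::waiting_on(ZoneUpdatesWaitingOn::PendingMgsUpdates)
    } else {
        do_plan_zone_updates_sketch(out_of_date_zones)
    }
}

/// The step proper: only concerned with updating zones.
fn do_plan_zone_updates_sketch(zones: &[&str]) -> ZoneUpdatesStepSketch {
    ZoneUpdatesStepSketch {
        waiting_on: None,
        updated_zones: zones.iter().map(|z| z.to_string()).collect(),
    }
}
```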

jgallagher · 2025-07-23T15:06:14Z

nexus/reconfigurator/planning/src/planner.rs

+        }
+
+        // We are only interested in non-decommissioned sleds with
+        // running NTP zones (TODO: check time sync).


Is this new behavior, or is it moving something around?

I was confused about #8353; reverted in 6d5b564. I also had some other bugs in my translation of how various NTP edge-cases were handled, which were fixed in bd0b6d5.

davepacheco

I like this! I agree with @jgallagher that it makes sense to land this sooner rather than later and iterate in follow-on PRs. I don't have any blockers here. (I also didn't review the mechanics in planner.rs and planning_report.rs with a fine-tooth comb.)

davepacheco · 2025-07-23T17:24:36Z

dev-tools/reconfigurator-cli/tests/output/cmds-add-sled-no-disks-stdout

 WARN cannot issue more SP updates (no current artifacts)
-INFO all zones up-to-date
-INFO will ensure cockroachdb setting, setting: cluster.preserve_downgrade_option, value: DoNotModify


I like that idea. It would also help in this PR to verify that the new reports include the same info that we had in the logs before.

davepacheco · 2025-07-23T17:26:33Z

dev-tools/reconfigurator-cli/tests/output/cmds-add-sled-no-disks-stdout

-INFO sufficient InternalDns zones exist in plan, desired_count: 3, current_count: 3
-INFO sufficient ExternalDns zones exist in plan, desired_count: 3, current_count: 3
-INFO sufficient Nexus zones exist in plan, desired_count: 3, current_count: 3
-INFO sufficient Oximeter zones exist in plan, desired_count: 0, current_count: 0
 WARN cannot issue more SP updates (no current artifacts)


I see that elsewhere some WARNs got removed, presumably in favor of a note in the report. Should this one too? (Fine if that's a follow-on PR.)

I haven't updated the mgs_updates module for planning reports yet. I think I could take this with the follow-up for #8285.

davepacheco · 2025-07-23T17:32:53Z

nexus/types/src/deployment/planning_report.rs

+            )?;
+        }
+
+        // Very noisy in tests.


Hmm. Is this useful or not? If not, we should just remove it.

It doesn't seem so to me, but I lack context on its importance. Should we remove the commented-out printing only, or the field too?

davepacheco · 2025-07-23T17:44:05Z

nexus/types/src/deployment/planning_report.rs

+
+#[derive(Clone, Debug, Deserialize, Serialize, PartialEq, Eq)]
+pub struct PlanningMgsUpdatesStepReport {
+    pub pending_mgs_updates: PendingMgsUpdates,


I understand that this might be for a follow-up PR, but I'd expect this to include a list of devices that remain to be updated but cannot be updated for some unexpected reason (missing image, missing inventory state for them, etc. -- these are the cases where nexus_reconfigurator_planning::mgs_updates::try_make_update() returns None after emitting a warning). That in turn will let us really fix #8285.

We might also want this to include:

* a list of updates that were determined to be completed
* a list of updates that were determined to be impossible and cleared
* a list of updates that we configured

I say "list" -- currently it can only be one, but the code doesn't assume that because it's a policy choice we've made for safety, not a limitation of the implementation.

This sounds like exactly the stuff @jgallagher suggests could go in a follow on PR (which makes sense to me).

That all sounds right to me, and I agree it should be a follow-up.
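As a starting point for that follow-up, one possible shape for the extra lists suggested above. Every name here is hypothetical and devices are identified by plain strings. This is a sketch of the idea, not the existing PlanningMgsUpdatesStepReport.

```rust
/// Hypothetical record of a device that still needs an update but could
/// not be given one (for example: missing image, missing inventory state).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct BlockedMgsUpdate {
    pub device: String,
    pub reason: String,
}

/// Hypothetical extended MGS updates step report. Each field is a list
/// even though current policy allows at most one update at a time.
#[derive(Debug, Default, Clone, PartialEq, Eq)]
pub struct MgsUpdatesStepSketch {
    pub blocked: Vec<BlockedMgsUpdate>,
    pub completed: Vec<String>,
    pub cleared_as_impossible: Vec<String>,
    pub configured: Vec<String>,
}

impl MgsUpdatesStepSketch {
    /// True when the step has nothing worth displaying.
    pub fn is_empty(&self) -> bool {
        self.blocked.is_empty()
            && self.completed.is_empty()
            && self.cleared_as_impossible.is_empty()
            && self.configured.is_empty()
    }

    /// One line per blocked device, in the spirit of the WARN log messages
    /// this would replace.
    pub fn blocked_lines(&self) -> Vec<String> {
        self.blocked
            .iter()
            .map(|b| format!("cannot update {}: {}", b.device, b.reason))
            .collect()
    }
}
```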

davepacheco · 2025-07-23T17:56:39Z

nexus/reconfigurator/planning/src/planner.rs

-                         SledFilter::Commissioned";
-                        "sled_id" => %sled_id,
-                    );
+                    report.zombie_sleds.push(sled_id);


This is a nice example of the improvement this API brings!

davepacheco · 2025-07-23T17:59:21Z

nexus/reconfigurator/planning/src/planner.rs

-                    "found sled missing NTP zone (will add one)";
-                    "sled_id" => %sled_id
-                );
+                report.sleds_missing_ntp_zone.insert(sled_id);
                self.blueprint.record_operation(Operation::AddZone {


Do we think this new mechanism would eventually replace record_operation()? (I didn't previously know about that.)

Good question; I don't know. But I noticed the similarity, too.

I think this makes sense yeah -- record_operation just records when an operation was affirmatively performed, which seems like a subset of what the report does.

davepacheco · 2025-07-23T18:08:17Z

nexus/reconfigurator/planning/src/reports.rs

+/// when all planning steps are complete, but before the blueprint
+/// has been built (and so we don't yet know its ID).
+#[derive(Debug)]
+pub(crate) struct InterimPlanningReport {


It looks like the blueprint_id is assigned when we call BlueprintBuilder::build(). That can only be called once. I wonder if we could remove this whole type (and this file) by generating the blueprint id when we call BlueprintBuilder::new_based_on() instead and making it available to the caller via a new_blueprint_id() method. Not a big deal either way.

We can indeed! Done in 48dce2a. It had a larger surface area than I would have liked because we now need to pass the PlannerRng in to the builder, and so also to Planner::new_based_on. This touched a lot of tests, but didn't really change anything.
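For context, a toy version of the idea: the builder draws the new blueprint's ID when it is constructed, exposes it through new_blueprint_id(), and reuses it in build(). The method names new_based_on, new_blueprint_id, and build follow the comment above, but the types, the u64 IDs, and the counter-based RNG are stand-ins invented for this sketch.

```rust
/// Toy stand-in for PlannerRng: a deterministic counter.
pub struct ToyRng {
    state: u64,
}

impl ToyRng {
    pub fn new(seed: u64) -> Self {
        Self { state: seed }
    }

    pub fn next_id(&mut self) -> u64 {
        self.state = self.state.wrapping_add(1);
        self.state
    }
}

#[derive(Debug, PartialEq, Eq)]
pub struct ToyBlueprint {
    pub id: u64,
    pub parent_id: u64,
}

pub struct ToyBlueprintBuilder {
    new_blueprint_id: u64,
    parent_id: u64,
}

impl ToyBlueprintBuilder {
    /// The ID is generated here, so the RNG must be passed in up front.
    pub fn new_based_on(parent: &ToyBlueprint, rng: &mut ToyRng) -> Self {
        Self { new_blueprint_id: rng.next_id(), parent_id: parent.id }
    }

    /// Available to callers (such as a report) before `build` is called.
    pub fn new_blueprint_id(&self) -> u64 {
        self.new_blueprint_id
    }

    /// Consumes the builder, so it can only be called once.
    pub fn build(self) -> ToyBlueprint {
        ToyBlueprint { id: self.new_blueprint_id, parent_id: self.parent_id }
    }
}
```

With the ID known early, a report can be stamped with it during planning and no interim report type is needed.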

davepacheco

Looking good! Thanks, and sorry for my slowness here.

davepacheco · 2025-08-05T16:26:04Z

dev-tools/reconfigurator-cli/tests/output/cmds-noop-image-source-stdout

+* Noop converting 6/6 install-dataset zones to artifact store on sled 98e6b7c2-2efa-41ca-b20a-0a4d61102fe6
+* Noop converting 5/6 install-dataset zones to artifact store on sled aff6c093-197d-42c5-ad80-9f10ba051a34
+* 1 pending MGS update:
+  * model0:serial0: Sp { expected_active_version: ArtifactVersion("0.0.1"), expected_inactive_version: NoValidVersion }


At first I thought: "it's probably worth adding the new version here", then I thought "well, this isn't supposed to be a summary of everything done -- that's the blueprint itself. Maybe it should have even less?" I don't have a strong feeling about this.

I think what to put in the report vs just in the blueprint (or its diff) is going to be a continuing tension. Since a report is (currently) tied to a particular planning run, they could be difficult to interpret without their blueprint (diff); on the other hand, we might want them to be more "standalone" so they can be interpreted by a developer or operator without requiring (possibly expunged) blueprints alongside them. We could decide as a matter of policy which way we'd prefer to go here, or just experiment and see what's useful and what's not.

davepacheco · 2025-08-05T16:26:50Z

dev-tools/reconfigurator-cli/tests/output/cmds-set-mgs-updates-stdout

@@ -422,6 +425,9 @@ parent:    ad97e762-7bf1-45a6-a98f-60afb7e491c0
    sled      2      model2        serial2         e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855   1.1.0              Sp { expected_active_version: ArtifactVersion("1.0.0"), expected_inactive_version: Version(ArtifactVersion("1.0.1")) }


+Nothing to report on planning for blueprint cca24b71-09b5-4042-9185-b33e9f2ebba0.


What does this message mean? Does it mean the blueprint is the same as the parent?

edit: I think it might also mean "we loaded this from the database and so have no report for it". Maybe this should say "Empty planning report ..."? That might avoid confusing someone who sees this and thinks there have been no changes, when it might just be that we loaded this from the database.

The intent here is to represent a report that is "empty" in the sense of "devoid of interesting content", and in particular that won't display anything. Previously these would display as just the header:

Planning report for blueprint {blueprint_id}:

I wanted to indicate the empty case explicitly, rather than just not printing anything below the header.

Successful planning runs with nontrivial diffs can result in these if nothing unusual or unexpected occurred. But see above re: tension; it's not clear to me whether the report should be "standalone", in which case we probably don't really want empty reports.

Regardless, "Empty planning report" is indeed more precise, wording updated in dcac3dd.
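The empty-versus-header behavior described above can be sketched like this. SimpleReport and its notes field are hypothetical simplifications, and the exact message text is an assumption, not necessarily the wording in dcac3dd.

```rust
use std::fmt;

/// Hypothetical flattened report: an ID plus zero or more notes.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SimpleReport {
    pub blueprint_id: String,
    pub notes: Vec<String>,
}

impl SimpleReport {
    /// "Empty" means devoid of interesting content, not "no changes".
    pub fn is_empty(&self) -> bool {
        self.notes.is_empty()
    }
}

impl fmt::Display for SimpleReport {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        if self.is_empty() {
            // Say so explicitly instead of printing a bare header.
            return writeln!(
                f,
                "empty planning report for blueprint {}.",
                self.blueprint_id
            );
        }
        writeln!(f, "planning report for blueprint {}:", self.blueprint_id)?;
        for note in &self.notes {
            writeln!(f, "* {}", note)?;
        }
        Ok(())
    }
}
```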

davepacheco · 2025-08-05T16:28:13Z

nexus/db-queries/src/db/datastore/deployment.rs

@@ -1683,6 +1684,9 @@ impl DataStore {
            )?;
        }

+        // FIXME: Once reports are stored in the database, read them out here.


Can you file an issue for this?

Sure: #8788.

davepacheco · 2025-08-05T16:36:57Z

nexus/types/src/deployment/planning_report.rs

 pub enum ZoneUnsafeToShutdown {
-    Cockroachdb(CockroachdbUnsafeToShutdown),
+    Cockroachdb { reason: CockroachdbUnsafeToShutdown },
+    BoundaryNtp { total_boundary_ntp_zones: usize, synchronized_count: usize },


I think InternalDNS might belong here now.

Right, as of #8683. Updated in 370c959.

plotnick · 2025-08-08T01:57:57Z

@jgallagher: You were right (as usual): leaving out a planning step report for do_plan_mupdate_override and just doing what it and do_plan_noop_image_source currently want with the NoopConvertInfo made the merge straightforward. The cleanup effort was, as predicted, tricky, and I will have to come back to it next week.

sunshowers

A few comments but nothing blocking landing this -- can be addressed in followups if you think it makes sense.

sunshowers · 2025-08-12T23:29:15Z

dev-tools/reconfigurator-cli/tests/output/cmds-add-sled-no-disks-stdout

+Planning report for blueprint 8da82a8e-bf97-4fbd-8ddd-9f6462732cf1:
+Chicken switches:
+    add zones with mupdate override:   false
+
+* No zpools in service for NTP zones on sleds: 00320471-945d-413c-85e7-03e091a70b3c
+* Discretionary zone placement waiting for NTP zones on sleds: 00320471-945d-413c-85e7-03e091a70b3c


nit: could you make these start with lower case to match the other outputs we produce?

sure, done in 7822a53.

sunshowers · 2025-08-13T19:26:23Z

nexus/reconfigurator/planning/src/planner.rs

+        let add = if plan_mupdate_override_res.is_empty()
+            || self.input.chicken_switches().add_zones_with_mupdate_override
        {
-            // If do_plan_mupdate_override returns Waiting, we don't plan *any*
-            // additional steps until the system has recovered.
-            if let UpdateStepResult::ContinueToNextStep =
-                self.do_plan_mgs_updates()
-            {
-                self.do_plan_zone_updates()?;
-            }
-        }
+            self.do_plan_add(&mgs_updates)?
+        } else {
+            PlanningAddStepReport::waiting_on(
+                ZoneAddWaitingOn::MupdateOverrides,
+            )
+        };


can we record the results of both plan_mupdate_override_res.is_empty() and the add_zones_with_mupdate_override chicken switch here? one of the things I tried to do was to add logging for when we would ordinarily not have added zones, but the chicken switch made us do it anyway -- it can be confusing to figure out why a decision was made.

no problem, done in d02feaf.
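A small sketch of why recording both inputs helps: with them stored in the step report, the case "zones were added only because the chicken switch was set" can be told apart from the ordinary case. ZoneAddWaitingOn::MupdateOverrides and the switch name come from the diff; the struct, its fields, and the explanation strings are made up for illustration and are not what d02feaf implements.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ZoneAddWaitingOn {
    MupdateOverrides,
}

/// Hypothetical add step report that keeps both decision inputs.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct AddStepSketch {
    pub waiting_on: Option<ZoneAddWaitingOn>,
    pub mupdate_overrides_present: bool,
    pub add_zones_with_mupdate_override: bool,
}

/// Mirrors the condition in the diff: add zones if there are no mupdate
/// overrides, or if the chicken switch says to add them anyway.
pub fn plan_add_sketch(
    mupdate_overrides_present: bool,
    add_zones_with_mupdate_override: bool,
) -> AddStepSketch {
    let proceed = !mupdate_overrides_present || add_zones_with_mupdate_override;
    AddStepSketch {
        waiting_on: if proceed {
            None
        } else {
            Some(ZoneAddWaitingOn::MupdateOverrides)
        },
        mupdate_overrides_present,
        add_zones_with_mupdate_override,
    }
}

impl AddStepSketch {
    /// Explain the decision using the recorded inputs.
    pub fn explain(&self) -> &'static str {
        match (self.waiting_on, self.mupdate_overrides_present) {
            (Some(_), _) => "waiting on mupdate overrides",
            (None, true) => "adding zones despite mupdate overrides (chicken switch set)",
            (None, false) => "adding zones",
        }
    }
}
```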

sunshowers · 2025-08-13T19:28:22Z

nexus/reconfigurator/planning/src/planner.rs

-                    "found sled missing NTP zone (will add one)";
-                    "sled_id" => %sled_id
-                );
+                report.sleds_missing_ntp_zone.insert(sled_id);
                self.blueprint.record_operation(Operation::AddZone {


I think this makes sense yeah -- record_operation just records when an operation was affirmatively performed, which seems like a subset of what the report does.

plotnick added 2 commits July 17, 2025 16:01

Planning reports

8f30bd5

omdb support for planning reports

e300707

plotnick force-pushed the planning-reports branch from a596e95 to e300707 Compare July 17, 2025 22:01

expectorate

2feacef

jgallagher reviewed Jul 23, 2025

davepacheco reviewed Jul 23, 2025

plotnick added 4 commits July 29, 2025 20:30

Kill InterimPlanningReport

48dce2a

Compute the new blueprint ID up front instead. Requires passing a PlannerRng to the initializers.

Store planning reports in blueprints

f0aabe5

OpenAPI compatible planning reports

3fa1a21

Hoist zone update skipping logic into do_plan

6d5b564

plotnick force-pushed the planning-reports branch from 77a100a to 955d5c2 Compare July 31, 2025 01:33

Merge branch 'main' into planning-reports

bd0b6d5

plotnick force-pushed the planning-reports branch from 955d5c2 to bd0b6d5 Compare July 31, 2025 04:20

plotnick marked this pull request as ready for review July 31, 2025 06:44

davepacheco approved these changes Aug 5, 2025

"Empty planning report ..."

dcac3dd

plotnick mentioned this pull request Aug 6, 2025

Should persist blueprint reports #8788

Open

Merge branch 'main' into planning-reports

370c959

plotnick force-pushed the planning-reports branch from 5e1a746 to 370c959 Compare August 8, 2025 01:54

Merge branch 'main' into planning-reports

6b07842

sunshowers approved these changes Aug 13, 2025

plotnick added 3 commits August 13, 2025 15:07

lowercase

7822a53

More MUPdate info in add step report

d02feaf

Merge branch 'main' into planning-reports

6afa047

plotnick enabled auto-merge (squash) August 14, 2025 01:07

plotnick merged commit fc12607 into main Aug 14, 2025
17 checks passed

plotnick deleted the planning-reports branch August 14, 2025 02:51

karencfv mentioned this pull request Aug 14, 2025

[reconfigurator-cli] Populate simulated caboose and test RoT update #8835

Merged
