Skip to content

Conversation

plotnick
Copy link
Contributor

@plotnick plotnick commented Jul 17, 2025

Fixes #8284, fixes #8548.

Replaces many (but not yet all) planner log messages with entries in a structured planning report. Those reports are comprised of "step reports", each of which corresponds to a subroutine of do_plan. The complete report may be formatted (in English) via a Display implementation. If planning from the blueprint_planner background task is successful, a report is attached to the BlueprintPlannerStatus::Targeted variant so that it may be displayed via omdb.

TODO, to be handled as follow-ups:

Copy link
Contributor

@jgallagher jgallagher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Alex; I'm inclined to say we should err on the side of merging this as soon as we're reasonably happy with it as a starting point. We know we're going to want to iterate on it a fair bit, and it'd be good to get more people involved in that.

(Not saying we should merge as-is or without tests, just that it probably doesn't warrant the same level of caution as some other planner changes, and we know it's not going to be perfect and that's okay.)

BTreeMap<SledUuid, PlanningNoopImageSourceSkipSledReason>,
pub skipped_zones:
BTreeMap<OmicronZoneUuid, PlanningNoopImageSourceSkipZoneReason>,
pub converted_zones: BTreeMap<SledUuid, (usize, usize)>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit - can we turn (usize, usize) into a struct with field names? Just looking here, I don't know what these values are.

(Same for other (usize, usize) tuples below)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that JsonSchema doesn't really like tuples either, so they were all removed in 3fa1a21.

WARN cannot issue more SP updates (no current artifacts)
INFO all zones up-to-date
INFO will ensure cockroachdb setting, setting: cluster.preserve_downgrade_option, value: DoNotModify
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we're losing all these logs, can reconfigurator-cli blueprint-plan emit the planning report? I think being able to see it in the CI diffs would also help a lot with reviewing the Display impls (and might give us ideas for thinks we should add).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like that idea. It would also help in this PR to verify that the new reports include the same info that we had in the logs before.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, f0aabe5 adds the report to the end of blueprint display, and also to the message returned by reconfigurator-cli::cmd_blueprint_plan.

Ok(self.blueprint.build())
}

fn check_input_validity(&self) -> Result<(), Error> {
pub fn plan_and_report(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need a separate method for this (as opposed to changing plan() to return the blueprint and report)? do_plan constructs the report anyway, so I'm not sure it's buying us much?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

f0aabe5 moves the report into the blueprint itself, so the need for this method goes away.

let sleds = match noop_info {
NoopConvertInfo::GlobalEligible { sleds } => sleds,
NoopConvertInfo::GlobalIneligible { .. } => return Ok(()),
NoopConvertInfo::GlobalIneligible { .. } => return Ok(report),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can the report include some indication that no-op conversion is globally ineligible?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It definitely can and should, but I haven't converted the image_source module yet (that's the second TODO in the description above), and so I'd rather hold off on that and do it all at once as a follow-up.

/// Attempts to place `num_zones_to_add` new zones of `kind`.
///
/// It is not an error if there are too few eligible sleds to start a
/// sufficient number of zones; instead, we'll log a warning and start as
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// sufficient number of zones; instead, we'll log a warning and start as
/// sufficient number of zones; instead, we'll report it and start as

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, fixed in 6d5b564.

Comment on lines 1125 to 1137
// Do not update any zones if we've added any discretionary zones
// (e.g., in response to policy changes) ...
if add.any_discretionary_zones_placed() {
report.waiting_on(ZoneUpdatesWaitingOn::DiscretionaryZones);
return Ok(report);
}

// ... or if there are still pending updates for the RoT / SP /
// Host OS / etc.
if mgs_updates.any_updates_pending() {
report.waiting_on(ZoneUpdatesWaitingOn::PendingMgsUpdates);
return Ok(report);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'd move these checks back into do_plan? Definitely a judgment call on clarity, but I'm still swayed by Dave's arguments. I think it'd be nice if "skip this entire step" was obviously visible from do_plan. Even better if that means we can drop passing (some) earlier steps' reports into later steps' implementations (I know we'll still need some of these, but maybe not as many?).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, done in 6d5b564.

}

// We are only interested in non-decommissioned sleds with
// running NTP zones (TODO: check time sync).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this new behavior, or is it moving something around?

Copy link
Contributor Author

@plotnick plotnick Jul 31, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was confused about #8353; reverted in 6d5b564. I also had some other bugs in my translation of how various NTP edge-cases were handled, which were fixed in bd0b6d5.

Copy link
Collaborator

@davepacheco davepacheco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this! I agree with @jgallagher that it makes sense to land this sooner rather than later and iterate in follow-on PRs. I don't have any blockers here. (I also didn't review the mechanics in planner.rs and planning_report.rs with a fine-tooth comb.)

WARN cannot issue more SP updates (no current artifacts)
INFO all zones up-to-date
INFO will ensure cockroachdb setting, setting: cluster.preserve_downgrade_option, value: DoNotModify
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like that idea. It would also help in this PR to verify that the new reports include the same info that we had in the logs before.

INFO sufficient InternalDns zones exist in plan, desired_count: 3, current_count: 3
INFO sufficient ExternalDns zones exist in plan, desired_count: 3, current_count: 3
INFO sufficient Nexus zones exist in plan, desired_count: 3, current_count: 3
INFO sufficient Oximeter zones exist in plan, desired_count: 0, current_count: 0
WARN cannot issue more SP updates (no current artifacts)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that elsewhere some WARNs got removed, presumably in favor of a note in the report. Should this one too? (Fine if that's a follow-on PR.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't updated the mgs_updates module for planning reports yet. I think I could take this with the follow-up for #8285.

)?;
}

// Very noisy in tests.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. Is this useful or not? If not, we should just remove it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't seem so to me, but I lack context on its importance. Should we remove the commented-out printing only, or the field too?


#[derive(Clone, Debug, Deserialize, Serialize, PartialEq, Eq)]
pub struct PlanningMgsUpdatesStepReport {
pub pending_mgs_updates: PendingMgsUpdates,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand that this might be for a follow-up PR, I think I'd expect this to include a list of devices that remain to be updated but cannot be updated for some unexpected reason (missing image, missing inventory state for them, etc. -- these are the cases where nexus_reconfigurator_planning::mgs_updates::try_make_update() returns None after emitting a warning). That in turn will let us really fix #8285.

We might also want this to include:

  • a list of updates that were determined to be completed
  • a list of updates that were determined to be impossible and cleared
  • a list of updates that we configured

I say "list" -- currently it can only be one, but the code doesn't assume that because it's a policy choice we've made for safety, not a limitation of the implementation.

This sounds like exactly the stuff @jgallagher suggests could go in a follow on PR (which makes sense to me).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That all sounds right to me, and I agree it should be a follow-up.

SledFilter::Commissioned";
"sled_id" => %sled_id,
);
report.zombie_sleds.push(sled_id);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a nice example of the improvement this API brings!

"found sled missing NTP zone (will add one)";
"sled_id" => %sled_id
);
report.sleds_missing_ntp_zone.insert(sled_id);
self.blueprint.record_operation(Operation::AddZone {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we think this new mechanism would eventually replace record_operation()? (I didn't previously know about that.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question; I don't know. But I noticed the similarity, too.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this makes sense yeah -- record_operation just records when an operation was affirmatively performed, which seems like a subset of what the report does.

/// when all planning steps are complete, but before the blueprint
/// has been built (and so we don't yet know its ID).
#[derive(Debug)]
pub(crate) struct InterimPlanningReport {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like the blueprint_id is assigned when we call BlueprintBuilder::build(). That can only be called once. I wonder if we could remove this whole type (and this file) by generating the blueprint id when we call BlueprintBuilder::new_based_on() instead and making it available to the caller via a new_blueprint_id() method. Not a big deal either way.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can indeed! Done in 48dce2a. It had a larger surface area than I would have liked because we now need to pass the PlannerRng in to the builder, and so also to Planner::new_based_on. This touched a lot of tests, but didn't really change anything.

@plotnick plotnick marked this pull request as ready for review July 31, 2025 06:44
Copy link
Collaborator

@davepacheco davepacheco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good! Thanks, and sorry for my slowness here.

* Noop converting 6/6 install-dataset zones to artifact store on sled 98e6b7c2-2efa-41ca-b20a-0a4d61102fe6
* Noop converting 5/6 install-dataset zones to artifact store on sled aff6c093-197d-42c5-ad80-9f10ba051a34
* 1 pending MGS update:
* model0:serial0: Sp { expected_active_version: ArtifactVersion("0.0.1"), expected_inactive_version: NoValidVersion }
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At first I thought: "it's probably worth adding the new version here", then I thought "well, this isn't supposed to be a summary of everything done -- that's the blueprint itself. Maybe it should have even less?" I don't have a strong feeling about this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think what to put in the report vs just in the blueprint (or its diff) is going to be a continuing tension. Since a report is (currently) tied to a particular planning run, they could be difficult to interpret without their blueprint (diff); on the other hand, we might want them to be more "standalone" so they can be interpreted by a developer or operator without requiring (possibly expunged) blueprints alongside them. We could decide as a matter of policy which way we'd prefer to go here, or just experiment and see what's useful and what's not.

@@ -422,6 +425,9 @@ parent: ad97e762-7bf1-45a6-a98f-60afb7e491c0
sled 2 model2 serial2 e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855 1.1.0 Sp { expected_active_version: ArtifactVersion("1.0.0"), expected_inactive_version: Version(ArtifactVersion("1.0.1")) }


Nothing to report on planning for blueprint cca24b71-09b5-4042-9185-b33e9f2ebba0.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this message mean? Does it mean the blueprint is the same as the parent?

edit: I think it might also mean "we loaded this from the database and so have no report for it". Maybe this should say "Empty planning report ..."? That might avoid confusing someone who sees this and thinks there have been no changes, when it might just be that we loaded this from the database.

Copy link
Contributor Author

@plotnick plotnick Aug 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The intent here is to represent a report that is "empty" in the sense of "devoid of interesting content", and in particular that won't display anything. Previously these would display as just the header:

Planning report for blueprint {blueprint_id}:

I wanted to indicate the empty case explicitly, rather than just not printing anything below the header.

Successful planning runs with nontrivial diffs can result in these if nothing unusual or unexpected occurred. But see above re: tension; it's not clear not me whether the report should be "standalone", in which case we probably don't really want empty reports.

Regardless, "Empty planning report" is indeed more precise, wording updated in dcac3dd.

@@ -1683,6 +1684,9 @@ impl DataStore {
)?;
}

// FIXME: Once reports are stored in the database, read them out here.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you file an issue for this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure: #8788.

pub enum ZoneUnsafeToShutdown {
Cockroachdb(CockroachdbUnsafeToShutdown),
Cockroachdb { reason: CockroachdbUnsafeToShutdown },
BoundaryNtp { total_boundary_ntp_zones: usize, synchronized_count: usize },
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think InternalDNS might belong here now.

Copy link
Contributor Author

@plotnick plotnick Aug 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, as of #8683. Updated in 370c959.

@plotnick
Copy link
Contributor Author

plotnick commented Aug 8, 2025

@jgallagher: You were right (as usual): leaving out a planning step report for do_plan_mupdate_override and just doing what it and do_plan_noop_image_source currently want with the NoopConvertInfo made the merge straightforward. The cleanup effort was, as predicted, tricky, and I will have to come back to it next week.

Copy link
Contributor

@sunshowers sunshowers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few comments but nothing blocking landing this -- can be addressed in followups if you think it makes sense.

Comment on lines 42 to 47
Planning report for blueprint 8da82a8e-bf97-4fbd-8ddd-9f6462732cf1:
Chicken switches:
add zones with mupdate override: false

* No zpools in service for NTP zones on sleds: 00320471-945d-413c-85e7-03e091a70b3c
* Discretionary zone placement waiting for NTP zones on sleds: 00320471-945d-413c-85e7-03e091a70b3c
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: could you make these start with lower case to match the other outputs we produce?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, done in 7822a53.

Comment on lines 196 to 204
let add = if plan_mupdate_override_res.is_empty()
|| self.input.chicken_switches().add_zones_with_mupdate_override
{
// If do_plan_mupdate_override returns Waiting, we don't plan *any*
// additional steps until the system has recovered.
if let UpdateStepResult::ContinueToNextStep =
self.do_plan_mgs_updates()
{
self.do_plan_zone_updates()?;
}
}
self.do_plan_add(&mgs_updates)?
} else {
PlanningAddStepReport::waiting_on(
ZoneAddWaitingOn::MupdateOverrides,
)
};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we record the results of both plan_mupdate_override_res.is_empty() and the add_zones_with_mupdate_override chicken switch here? one of the things I tried to do was to add logging for when we would have ordinarily have not added zones, but the chicken switch made us do it anyway -- it can be confusing to figure out why a decision was made.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no problem, done in d02feaf.

"found sled missing NTP zone (will add one)";
"sled_id" => %sled_id
);
report.sleds_missing_ntp_zone.insert(sled_id);
self.blueprint.record_operation(Operation::AddZone {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this makes sense yeah -- record_operation just records when an operation was affirmatively performed, which seems like a subset of what the report does.

@plotnick plotnick enabled auto-merge (squash) August 14, 2025 01:07
@plotnick plotnick merged commit fc12607 into main Aug 14, 2025
17 checks passed
@plotnick plotnick deleted the planning-reports branch August 14, 2025 02:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

planner: adding new zones needs to know whether all lower-level updates are complete planner needs way to report events that it's waiting for
4 participants