Skip to content

Planning reports #8631

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 8 commits into
base: main
Choose a base branch
from
Open

Planning reports #8631

wants to merge 8 commits into from

Conversation

plotnick
Copy link
Contributor

@plotnick plotnick commented Jul 17, 2025

Fixes #8284, fixes #8548.

Replaces many (but not yet all) planner log messages with entries in a structured planning report. Those reports are comprised of "step reports", each of which corresponds to a subroutine of do_plan. The complete report may be formatted (in English) via a Display implementation. If planning from the blueprint_planner background task is successful, a report is attached to the BlueprintPlannerStatus::Targeted variant so that it may be displayed via omdb.

TODO, to be handled as follow-ups:

Copy link
Contributor

@jgallagher jgallagher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Alex; I'm inclined to say we should err on the side of merging this as soon as we're reasonably happy with it as a starting point. We know we're going to want to iterate on it a fair bit, and it'd be good to get more people involved in that.

(Not saying we should merge as-is or without tests, just that it probably doesn't warrant the same level of caution as some other planner changes, and we know it's not going to be perfect and that's okay.)

BTreeMap<SledUuid, PlanningNoopImageSourceSkipSledReason>,
pub skipped_zones:
BTreeMap<OmicronZoneUuid, PlanningNoopImageSourceSkipZoneReason>,
pub converted_zones: BTreeMap<SledUuid, (usize, usize)>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit - can we turn (usize, usize) into a struct with field names? Just looking here, I don't know what these values are.

(Same for other (usize, usize) tuples below)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that JsonSchema doesn't really like tuples either, so they were all removed in 3fa1a21.

WARN cannot issue more SP updates (no current artifacts)
INFO all zones up-to-date
INFO will ensure cockroachdb setting, setting: cluster.preserve_downgrade_option, value: DoNotModify
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we're losing all these logs, can reconfigurator-cli blueprint-plan emit the planning report? I think being able to see it in the CI diffs would also help a lot with reviewing the Display impls (and might give us ideas for thinks we should add).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like that idea. It would also help in this PR to verify that the new reports include the same info that we had in the logs before.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, f0aabe5 adds the report to the end of blueprint display, and also to the message returned by reconfigurator-cli::cmd_blueprint_plan.

Ok(self.blueprint.build())
}

fn check_input_validity(&self) -> Result<(), Error> {
pub fn plan_and_report(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need a separate method for this (as opposed to changing plan() to return the blueprint and report)? do_plan constructs the report anyway, so I'm not sure it's buying us much?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

f0aabe5 moves the report into the blueprint itself, so the need for this method goes away.

let sleds = match noop_info {
NoopConvertInfo::GlobalEligible { sleds } => sleds,
NoopConvertInfo::GlobalIneligible { .. } => return Ok(()),
NoopConvertInfo::GlobalIneligible { .. } => return Ok(report),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can the report include some indication that no-op conversion is globally ineligible?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It definitely can and should, but I haven't converted the image_source module yet (that's the second TODO in the description above), and so I'd rather hold off on that and do it all at once as a follow-up.

/// Attempts to place `num_zones_to_add` new zones of `kind`.
///
/// It is not an error if there are too few eligible sleds to start a
/// sufficient number of zones; instead, we'll log a warning and start as
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// sufficient number of zones; instead, we'll log a warning and start as
/// sufficient number of zones; instead, we'll report it and start as

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, fixed in 6d5b564.

Comment on lines 1125 to 1137
// Do not update any zones if we've added any discretionary zones
// (e.g., in response to policy changes) ...
if add.any_discretionary_zones_placed() {
report.waiting_on(ZoneUpdatesWaitingOn::DiscretionaryZones);
return Ok(report);
}

// ... or if there are still pending updates for the RoT / SP /
// Host OS / etc.
if mgs_updates.any_updates_pending() {
report.waiting_on(ZoneUpdatesWaitingOn::PendingMgsUpdates);
return Ok(report);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'd move these checks back into do_plan? Definitely a judgment call on clarity, but I'm still swayed by Dave's arguments. I think it'd be nice if "skip this entire step" was obviously visible from do_plan. Even better if that means we can drop passing (some) earlier steps' reports into later steps' implementations (I know we'll still need some of these, but maybe not as many?).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, done in 6d5b564.

}

// We are only interested in non-decommissioned sleds with
// running NTP zones (TODO: check time sync).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this new behavior, or is it moving something around?

Copy link
Contributor Author

@plotnick plotnick Jul 31, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was confused about #8353; reverted in 6d5b564. I also had some other bugs in my translation of how various NTP edge-cases were handled, which were fixed in bd0b6d5.

Copy link
Collaborator

@davepacheco davepacheco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this! I agree with @jgallagher that it makes sense to land this sooner rather than later and iterate in follow-on PRs. I don't have any blockers here. (I also didn't review the mechanics in planner.rs and planning_report.rs with a fine-tooth comb.)

WARN cannot issue more SP updates (no current artifacts)
INFO all zones up-to-date
INFO will ensure cockroachdb setting, setting: cluster.preserve_downgrade_option, value: DoNotModify
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like that idea. It would also help in this PR to verify that the new reports include the same info that we had in the logs before.

INFO sufficient InternalDns zones exist in plan, desired_count: 3, current_count: 3
INFO sufficient ExternalDns zones exist in plan, desired_count: 3, current_count: 3
INFO sufficient Nexus zones exist in plan, desired_count: 3, current_count: 3
INFO sufficient Oximeter zones exist in plan, desired_count: 0, current_count: 0
WARN cannot issue more SP updates (no current artifacts)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that elsewhere some WARNs got removed, presumably in favor of a note in the report. Should this one too? (Fine if that's a follow-on PR.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't updated the mgs_updates module for planning reports yet. I think I could take this with the follow-up for #8285.

)?;
}

// Very noisy in tests.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. Is this useful or not? If not, we should just remove it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't seem so to me, but I lack context on its importance. Should we remove the commented-out printing only, or the field too?


#[derive(Clone, Debug, Deserialize, Serialize, PartialEq, Eq)]
pub struct PlanningMgsUpdatesStepReport {
pub pending_mgs_updates: PendingMgsUpdates,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand that this might be for a follow-up PR, I think I'd expect this to include a list of devices that remain to be updated but cannot be updated for some unexpected reason (missing image, missing inventory state for them, etc. -- these are the cases where nexus_reconfigurator_planning::mgs_updates::try_make_update() returns None after emitting a warning). That in turn will let us really fix #8285.

We might also want this to include:

  • a list of updates that were determined to be completed
  • a list of updates that were determined to be impossible and cleared
  • a list of updates that we configured

I say "list" -- currently it can only be one, but the code doesn't assume that because it's a policy choice we've made for safety, not a limitation of the implementation.

This sounds like exactly the stuff @jgallagher suggests could go in a follow on PR (which makes sense to me).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That all sounds right to me, and I agree it should be a follow-up.

SledFilter::Commissioned";
"sled_id" => %sled_id,
);
report.zombie_sleds.push(sled_id);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a nice example of the improvement this API brings!

"found sled missing NTP zone (will add one)";
"sled_id" => %sled_id
);
report.sleds_missing_ntp_zone.insert(sled_id);
self.blueprint.record_operation(Operation::AddZone {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we think this new mechanism would eventually replace record_operation()? (I didn't previously know about that.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question; I don't know. But I noticed the similarity, too.

/// when all planning steps are complete, but before the blueprint
/// has been built (and so we don't yet know its ID).
#[derive(Debug)]
pub(crate) struct InterimPlanningReport {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like the blueprint_id is assigned when we call BlueprintBuilder::build(). That can only be called once. I wonder if we could remove this whole type (and this file) by generating the blueprint id when we call BlueprintBuilder::new_based_on() instead and making it available to the caller via a new_blueprint_id() method. Not a big deal either way.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can indeed! Done in 48dce2a. It had a larger surface area than I would have liked because we now need to pass the PlannerRng in to the builder, and so also to Planner::new_based_on. This touched a lot of tests, but didn't really change anything.

@plotnick plotnick marked this pull request as ready for review July 31, 2025 06:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

planner: adding new zones needs to know whether all lower-level updates are complete planner needs way to report events that it's waiting for
3 participants