-
Notifications
You must be signed in to change notification settings - Fork 48
Planning reports #8631
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Planning reports #8631
Conversation
a596e95
to
e300707
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Alex; I'm inclined to say we should err on the side of merging this as soon as we're reasonably happy with it as a starting point. We know we're going to want to iterate on it a fair bit, and it'd be good to get more people involved in that.
(Not saying we should merge as-is or without tests, just that it probably doesn't warrant the same level of caution as some other planner changes, and we know it's not going to be perfect and that's okay.)
BTreeMap<SledUuid, PlanningNoopImageSourceSkipSledReason>, | ||
pub skipped_zones: | ||
BTreeMap<OmicronZoneUuid, PlanningNoopImageSourceSkipZoneReason>, | ||
pub converted_zones: BTreeMap<SledUuid, (usize, usize)>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit - can we turn (usize, usize)
into a struct with field names? Just looking here, I don't know what these values are.
(Same for other (usize, usize)
tuples below)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems that JsonSchema
doesn't really like tuples either, so they were all removed in 3fa1a21.
WARN cannot issue more SP updates (no current artifacts) | ||
INFO all zones up-to-date | ||
INFO will ensure cockroachdb setting, setting: cluster.preserve_downgrade_option, value: DoNotModify |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we're losing all these logs, can reconfigurator-cli blueprint-plan
emit the planning report? I think being able to see it in the CI diffs would also help a lot with reviewing the Display
impls (and might give us ideas for thinks we should add).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like that idea. It would also help in this PR to verify that the new reports include the same info that we had in the logs before.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, f0aabe5 adds the report to the end of blueprint display, and also to the message returned by reconfigurator-cli::cmd_blueprint_plan
.
Ok(self.blueprint.build()) | ||
} | ||
|
||
fn check_input_validity(&self) -> Result<(), Error> { | ||
pub fn plan_and_report( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need a separate method for this (as opposed to changing plan()
to return the blueprint and report)? do_plan
constructs the report anyway, so I'm not sure it's buying us much?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
f0aabe5 moves the report into the blueprint itself, so the need for this method goes away.
let sleds = match noop_info { | ||
NoopConvertInfo::GlobalEligible { sleds } => sleds, | ||
NoopConvertInfo::GlobalIneligible { .. } => return Ok(()), | ||
NoopConvertInfo::GlobalIneligible { .. } => return Ok(report), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can the report include some indication that no-op conversion is globally ineligible?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It definitely can and should, but I haven't converted the image_source
module yet (that's the second TODO in the description above), and so I'd rather hold off on that and do it all at once as a follow-up.
/// Attempts to place `num_zones_to_add` new zones of `kind`. | ||
/// | ||
/// It is not an error if there are too few eligible sleds to start a | ||
/// sufficient number of zones; instead, we'll log a warning and start as |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// sufficient number of zones; instead, we'll log a warning and start as | |
/// sufficient number of zones; instead, we'll report it and start as |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, fixed in 6d5b564.
// Do not update any zones if we've added any discretionary zones | ||
// (e.g., in response to policy changes) ... | ||
if add.any_discretionary_zones_placed() { | ||
report.waiting_on(ZoneUpdatesWaitingOn::DiscretionaryZones); | ||
return Ok(report); | ||
} | ||
|
||
// ... or if there are still pending updates for the RoT / SP / | ||
// Host OS / etc. | ||
if mgs_updates.any_updates_pending() { | ||
report.waiting_on(ZoneUpdatesWaitingOn::PendingMgsUpdates); | ||
return Ok(report); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I'd move these checks back into do_plan
? Definitely a judgment call on clarity, but I'm still swayed by Dave's arguments. I think it'd be nice if "skip this entire step" was obviously visible from do_plan
. Even better if that means we can drop passing (some) earlier steps' reports into later steps' implementations (I know we'll still need some of these, but maybe not as many?).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, done in 6d5b564.
} | ||
|
||
// We are only interested in non-decommissioned sleds with | ||
// running NTP zones (TODO: check time sync). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this new behavior, or is it moving something around?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this! I agree with @jgallagher that it makes sense to land this sooner rather than later and iterate in follow-on PRs. I don't have any blockers here. (I also didn't review the mechanics in planner.rs and planning_report.rs with a fine-tooth comb.)
WARN cannot issue more SP updates (no current artifacts) | ||
INFO all zones up-to-date | ||
INFO will ensure cockroachdb setting, setting: cluster.preserve_downgrade_option, value: DoNotModify |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like that idea. It would also help in this PR to verify that the new reports include the same info that we had in the logs before.
INFO sufficient InternalDns zones exist in plan, desired_count: 3, current_count: 3 | ||
INFO sufficient ExternalDns zones exist in plan, desired_count: 3, current_count: 3 | ||
INFO sufficient Nexus zones exist in plan, desired_count: 3, current_count: 3 | ||
INFO sufficient Oximeter zones exist in plan, desired_count: 0, current_count: 0 | ||
WARN cannot issue more SP updates (no current artifacts) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see that elsewhere some WARNs got removed, presumably in favor of a note in the report. Should this one too? (Fine if that's a follow-on PR.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't updated the mgs_updates
module for planning reports yet. I think I could take this with the follow-up for #8285.
)?; | ||
} | ||
|
||
// Very noisy in tests. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm. Is this useful or not? If not, we should just remove it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't seem so to me, but I lack context on its importance. Should we remove the commented-out printing only, or the field too?
|
||
#[derive(Clone, Debug, Deserialize, Serialize, PartialEq, Eq)] | ||
pub struct PlanningMgsUpdatesStepReport { | ||
pub pending_mgs_updates: PendingMgsUpdates, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand that this might be for a follow-up PR, I think I'd expect this to include a list of devices that remain to be updated but cannot be updated for some unexpected reason (missing image, missing inventory state for them, etc. -- these are the cases where nexus_reconfigurator_planning::mgs_updates::try_make_update()
returns None
after emitting a warning). That in turn will let us really fix #8285.
We might also want this to include:
- a list of updates that were determined to be completed
- a list of updates that were determined to be impossible and cleared
- a list of updates that we configured
I say "list" -- currently it can only be one, but the code doesn't assume that because it's a policy choice we've made for safety, not a limitation of the implementation.
This sounds like exactly the stuff @jgallagher suggests could go in a follow on PR (which makes sense to me).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That all sounds right to me, and I agree it should be a follow-up.
SledFilter::Commissioned"; | ||
"sled_id" => %sled_id, | ||
); | ||
report.zombie_sleds.push(sled_id); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a nice example of the improvement this API brings!
"found sled missing NTP zone (will add one)"; | ||
"sled_id" => %sled_id | ||
); | ||
report.sleds_missing_ntp_zone.insert(sled_id); | ||
self.blueprint.record_operation(Operation::AddZone { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we think this new mechanism would eventually replace record_operation()
? (I didn't previously know about that.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question; I don't know. But I noticed the similarity, too.
/// when all planning steps are complete, but before the blueprint | ||
/// has been built (and so we don't yet know its ID). | ||
#[derive(Debug)] | ||
pub(crate) struct InterimPlanningReport { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like the blueprint_id is assigned when we call BlueprintBuilder::build()
. That can only be called once. I wonder if we could remove this whole type (and this file) by generating the blueprint id when we call BlueprintBuilder::new_based_on()
instead and making it available to the caller via a new_blueprint_id()
method. Not a big deal either way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can indeed! Done in 48dce2a. It had a larger surface area than I would have liked because we now need to pass the PlannerRng
in to the builder, and so also to Planner::new_based_on
. This touched a lot of tests, but didn't really change anything.
Compute the new blueprint ID up front instead. Requires passing a PlannerRng to the initializers.
77a100a
to
955d5c2
Compare
955d5c2
to
bd0b6d5
Compare
Fixes #8284, fixes #8548.
Replaces many (but not yet all) planner log messages with entries in a structured planning report. Those reports are comprised of "step reports", each of which corresponds to a subroutine of
do_plan
. The complete report may be formatted (in English) via aDisplay
implementation. If planning from theblueprint_planner
background task is successful, a report is attached to theBlueprintPlannerStatus::Targeted
variant so that it may be displayed viaomdb
.TODO, to be handled as follow-ups:
NoopConvertInfo