-
Notifications
You must be signed in to change notification settings - Fork 53
Planning reports #8631
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Planning reports #8631
Conversation
a596e95
to
e300707
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Alex; I'm inclined to say we should err on the side of merging this as soon as we're reasonably happy with it as a starting point. We know we're going to want to iterate on it a fair bit, and it'd be good to get more people involved in that.
(Not saying we should merge as-is or without tests, just that it probably doesn't warrant the same level of caution as some other planner changes, and we know it's not going to be perfect and that's okay.)
BTreeMap<SledUuid, PlanningNoopImageSourceSkipSledReason>, | ||
pub skipped_zones: | ||
BTreeMap<OmicronZoneUuid, PlanningNoopImageSourceSkipZoneReason>, | ||
pub converted_zones: BTreeMap<SledUuid, (usize, usize)>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit - can we turn (usize, usize)
into a struct with field names? Just looking here, I don't know what these values are.
(Same for other (usize, usize)
tuples below)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems that JsonSchema
doesn't really like tuples either, so they were all removed in 3fa1a21.
WARN cannot issue more SP updates (no current artifacts) | ||
INFO all zones up-to-date | ||
INFO will ensure cockroachdb setting, setting: cluster.preserve_downgrade_option, value: DoNotModify |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we're losing all these logs, can reconfigurator-cli blueprint-plan
emit the planning report? I think being able to see it in the CI diffs would also help a lot with reviewing the Display
impls (and might give us ideas for thinks we should add).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like that idea. It would also help in this PR to verify that the new reports include the same info that we had in the logs before.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, f0aabe5 adds the report to the end of blueprint display, and also to the message returned by reconfigurator-cli::cmd_blueprint_plan
.
Ok(self.blueprint.build()) | ||
} | ||
|
||
fn check_input_validity(&self) -> Result<(), Error> { | ||
pub fn plan_and_report( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need a separate method for this (as opposed to changing plan()
to return the blueprint and report)? do_plan
constructs the report anyway, so I'm not sure it's buying us much?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
f0aabe5 moves the report into the blueprint itself, so the need for this method goes away.
let sleds = match noop_info { | ||
NoopConvertInfo::GlobalEligible { sleds } => sleds, | ||
NoopConvertInfo::GlobalIneligible { .. } => return Ok(()), | ||
NoopConvertInfo::GlobalIneligible { .. } => return Ok(report), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can the report include some indication that no-op conversion is globally ineligible?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It definitely can and should, but I haven't converted the image_source
module yet (that's the second TODO in the description above), and so I'd rather hold off on that and do it all at once as a follow-up.
/// Attempts to place `num_zones_to_add` new zones of `kind`. | ||
/// | ||
/// It is not an error if there are too few eligible sleds to start a | ||
/// sufficient number of zones; instead, we'll log a warning and start as |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// sufficient number of zones; instead, we'll log a warning and start as | |
/// sufficient number of zones; instead, we'll report it and start as |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, fixed in 6d5b564.
// Do not update any zones if we've added any discretionary zones | ||
// (e.g., in response to policy changes) ... | ||
if add.any_discretionary_zones_placed() { | ||
report.waiting_on(ZoneUpdatesWaitingOn::DiscretionaryZones); | ||
return Ok(report); | ||
} | ||
|
||
// ... or if there are still pending updates for the RoT / SP / | ||
// Host OS / etc. | ||
if mgs_updates.any_updates_pending() { | ||
report.waiting_on(ZoneUpdatesWaitingOn::PendingMgsUpdates); | ||
return Ok(report); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I'd move these checks back into do_plan
? Definitely a judgment call on clarity, but I'm still swayed by Dave's arguments. I think it'd be nice if "skip this entire step" was obviously visible from do_plan
. Even better if that means we can drop passing (some) earlier steps' reports into later steps' implementations (I know we'll still need some of these, but maybe not as many?).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, done in 6d5b564.
} | ||
|
||
// We are only interested in non-decommissioned sleds with | ||
// running NTP zones (TODO: check time sync). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this new behavior, or is it moving something around?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this! I agree with @jgallagher that it makes sense to land this sooner rather than later and iterate in follow-on PRs. I don't have any blockers here. (I also didn't review the mechanics in planner.rs and planning_report.rs with a fine-tooth comb.)
WARN cannot issue more SP updates (no current artifacts) | ||
INFO all zones up-to-date | ||
INFO will ensure cockroachdb setting, setting: cluster.preserve_downgrade_option, value: DoNotModify |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like that idea. It would also help in this PR to verify that the new reports include the same info that we had in the logs before.
INFO sufficient InternalDns zones exist in plan, desired_count: 3, current_count: 3 | ||
INFO sufficient ExternalDns zones exist in plan, desired_count: 3, current_count: 3 | ||
INFO sufficient Nexus zones exist in plan, desired_count: 3, current_count: 3 | ||
INFO sufficient Oximeter zones exist in plan, desired_count: 0, current_count: 0 | ||
WARN cannot issue more SP updates (no current artifacts) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see that elsewhere some WARNs got removed, presumably in favor of a note in the report. Should this one too? (Fine if that's a follow-on PR.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't updated the mgs_updates
module for planning reports yet. I think I could take this with the follow-up for #8285.
)?; | ||
} | ||
|
||
// Very noisy in tests. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm. Is this useful or not? If not, we should just remove it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't seem so to me, but I lack context on its importance. Should we remove the commented-out printing only, or the field too?
|
||
#[derive(Clone, Debug, Deserialize, Serialize, PartialEq, Eq)] | ||
pub struct PlanningMgsUpdatesStepReport { | ||
pub pending_mgs_updates: PendingMgsUpdates, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand that this might be for a follow-up PR, I think I'd expect this to include a list of devices that remain to be updated but cannot be updated for some unexpected reason (missing image, missing inventory state for them, etc. -- these are the cases where nexus_reconfigurator_planning::mgs_updates::try_make_update()
returns None
after emitting a warning). That in turn will let us really fix #8285.
We might also want this to include:
- a list of updates that were determined to be completed
- a list of updates that were determined to be impossible and cleared
- a list of updates that we configured
I say "list" -- currently it can only be one, but the code doesn't assume that because it's a policy choice we've made for safety, not a limitation of the implementation.
This sounds like exactly the stuff @jgallagher suggests could go in a follow on PR (which makes sense to me).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That all sounds right to me, and I agree it should be a follow-up.
SledFilter::Commissioned"; | ||
"sled_id" => %sled_id, | ||
); | ||
report.zombie_sleds.push(sled_id); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a nice example of the improvement this API brings!
"found sled missing NTP zone (will add one)"; | ||
"sled_id" => %sled_id | ||
); | ||
report.sleds_missing_ntp_zone.insert(sled_id); | ||
self.blueprint.record_operation(Operation::AddZone { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we think this new mechanism would eventually replace record_operation()
? (I didn't previously know about that.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question; I don't know. But I noticed the similarity, too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this makes sense yeah -- record_operation just records when an operation was affirmatively performed, which seems like a subset of what the report does.
/// when all planning steps are complete, but before the blueprint | ||
/// has been built (and so we don't yet know its ID). | ||
#[derive(Debug)] | ||
pub(crate) struct InterimPlanningReport { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like the blueprint_id is assigned when we call BlueprintBuilder::build()
. That can only be called once. I wonder if we could remove this whole type (and this file) by generating the blueprint id when we call BlueprintBuilder::new_based_on()
instead and making it available to the caller via a new_blueprint_id()
method. Not a big deal either way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can indeed! Done in 48dce2a. It had a larger surface area than I would have liked because we now need to pass the PlannerRng
in to the builder, and so also to Planner::new_based_on
. This touched a lot of tests, but didn't really change anything.
Compute the new blueprint ID up front instead. Requires passing a PlannerRng to the initializers.
77a100a
to
955d5c2
Compare
955d5c2
to
bd0b6d5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good! Thanks, and sorry for my slowness here.
* Noop converting 6/6 install-dataset zones to artifact store on sled 98e6b7c2-2efa-41ca-b20a-0a4d61102fe6 | ||
* Noop converting 5/6 install-dataset zones to artifact store on sled aff6c093-197d-42c5-ad80-9f10ba051a34 | ||
* 1 pending MGS update: | ||
* model0:serial0: Sp { expected_active_version: ArtifactVersion("0.0.1"), expected_inactive_version: NoValidVersion } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At first I thought: "it's probably worth adding the new version here", then I thought "well, this isn't supposed to be a summary of everything done -- that's the blueprint itself. Maybe it should have even less?" I don't have a strong feeling about this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think what to put in the report vs just in the blueprint (or its diff) is going to be a continuing tension. Since a report is (currently) tied to a particular planning run, they could be difficult to interpret without their blueprint (diff); on the other hand, we might want them to be more "standalone" so they can be interpreted by a developer or operator without requiring (possibly expunged) blueprints alongside them. We could decide as a matter of policy which way we'd prefer to go here, or just experiment and see what's useful and what's not.
@@ -422,6 +425,9 @@ parent: ad97e762-7bf1-45a6-a98f-60afb7e491c0 | |||
sled 2 model2 serial2 e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855 1.1.0 Sp { expected_active_version: ArtifactVersion("1.0.0"), expected_inactive_version: Version(ArtifactVersion("1.0.1")) } | |||
|
|||
|
|||
Nothing to report on planning for blueprint cca24b71-09b5-4042-9185-b33e9f2ebba0. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does this message mean? Does it mean the blueprint is the same as the parent?
edit: I think it might also mean "we loaded this from the database and so have no report for it". Maybe this should say "Empty planning report ..."? That might avoid confusing someone who sees this and thinks there have been no changes, when it might just be that we loaded this from the database.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The intent here is to represent a report that is "empty" in the sense of "devoid of interesting content", and in particular that won't display anything. Previously these would display as just the header:
Planning report for blueprint {blueprint_id}:
I wanted to indicate the empty case explicitly, rather than just not printing anything below the header.
Successful planning runs with nontrivial diffs can result in these if nothing unusual or unexpected occurred. But see above re: tension; it's not clear not me whether the report should be "standalone", in which case we probably don't really want empty reports.
Regardless, "Empty planning report" is indeed more precise, wording updated in dcac3dd.
@@ -1683,6 +1684,9 @@ impl DataStore { | |||
)?; | |||
} | |||
|
|||
// FIXME: Once reports are stored in the database, read them out here. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you file an issue for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure: #8788.
pub enum ZoneUnsafeToShutdown { | ||
Cockroachdb(CockroachdbUnsafeToShutdown), | ||
Cockroachdb { reason: CockroachdbUnsafeToShutdown }, | ||
BoundaryNtp { total_boundary_ntp_zones: usize, synchronized_count: usize }, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think InternalDNS might belong here now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
5e1a746
to
370c959
Compare
@jgallagher: You were right (as usual): leaving out a planning step report for |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few comments but nothing blocking landing this -- can be addressed in followups if you think it makes sense.
Planning report for blueprint 8da82a8e-bf97-4fbd-8ddd-9f6462732cf1: | ||
Chicken switches: | ||
add zones with mupdate override: false | ||
|
||
* No zpools in service for NTP zones on sleds: 00320471-945d-413c-85e7-03e091a70b3c | ||
* Discretionary zone placement waiting for NTP zones on sleds: 00320471-945d-413c-85e7-03e091a70b3c |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: could you make these start with lower case to match the other outputs we produce?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure, done in 7822a53.
let add = if plan_mupdate_override_res.is_empty() | ||
|| self.input.chicken_switches().add_zones_with_mupdate_override | ||
{ | ||
// If do_plan_mupdate_override returns Waiting, we don't plan *any* | ||
// additional steps until the system has recovered. | ||
if let UpdateStepResult::ContinueToNextStep = | ||
self.do_plan_mgs_updates() | ||
{ | ||
self.do_plan_zone_updates()?; | ||
} | ||
} | ||
self.do_plan_add(&mgs_updates)? | ||
} else { | ||
PlanningAddStepReport::waiting_on( | ||
ZoneAddWaitingOn::MupdateOverrides, | ||
) | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we record the results of both plan_mupdate_override_res.is_empty()
and the add_zones_with_mupdate_override
chicken switch here? one of the things I tried to do was to add logging for when we would have ordinarily have not added zones, but the chicken switch made us do it anyway -- it can be confusing to figure out why a decision was made.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no problem, done in d02feaf.
"found sled missing NTP zone (will add one)"; | ||
"sled_id" => %sled_id | ||
); | ||
report.sleds_missing_ntp_zone.insert(sled_id); | ||
self.blueprint.record_operation(Operation::AddZone { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this makes sense yeah -- record_operation just records when an operation was affirmatively performed, which seems like a subset of what the report does.
Fixes #8284, fixes #8548.
Replaces many (but not yet all) planner log messages with entries in a structured planning report. Those reports are comprised of "step reports", each of which corresponds to a subroutine of
do_plan
. The complete report may be formatted (in English) via aDisplay
implementation. If planning from theblueprint_planner
background task is successful, a report is attached to theBlueprintPlannerStatus::Targeted
variant so that it may be displayed viaomdb
.TODO, to be handled as follow-ups:
NoopConvertInfo