-
Notifications
You must be signed in to change notification settings - Fork 46
[34/n] sled-agent logic to clear mupdate overrides #8572
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
[34/n] sled-agent logic to clear mupdate overrides #8572
Conversation
Created using spr 1.3.6-beta.1
Created using spr 1.3.6-beta.1 [skip ci]
Created using spr 1.3.6-beta.1 [skip ci]
Created using spr 1.3.6-beta.1
// Reconcile the mupdate override field. This can be done independently | ||
// of the other parts of reconciliation (and this doesn't have to block | ||
// other parts of reconciliation), but the argument for this is somewhat | ||
// non-trivial. Here's an outline: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Worth having a look at this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've moved this to https://rfd.shared.oxide.computer/rfd/0556#sa_reconciler_error_handling.
#8596) While working on #8572, we realized that we need to put in a condition saying that if any zones can't be switched over to Artifact, we must not clear the mupdate override field. This effectively requires information about eligibility to be available in two spots: while updating the mupdate override, and while doing these noop conversions. In order to do that more easily, this PR builds up a decision tree by sled.
Created using spr 1.3.6-beta.1 [skip ci]
Created using spr 1.3.6-beta.1
Created using spr 1.3.6-beta.1 [skip ci]
Created using spr 1.3.6-beta.1
Created using spr 1.3.6-beta.1 [skip ci]
Created using spr 1.3.6-beta.1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, just one nontrivial comment about the current internal disks.
nexus/types/src/inventory/display.rs
Outdated
writeln!( | ||
indent2, | ||
"error reading mupdate override, so sled agent was \ | ||
not instructed to clear it" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"not instructed to" or "didn't attempt to"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"not instructed to" because the instruction comes from the planner.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But an error here comes from sled-agent, right? This is probably fine, it just feels slightly weird that sled-agent is making a claim about the behavior of something else that it kinda has to infer. Could we have a case like:
- current ledgered config has no mupdate override and no clear mupdate override
- sled is mupdated
- sled comes up and fails to read mupdate override
- inventory is collected
- in this display, we claim the planner didn't instruct us to do something, but the planner hasn't even run at all since we've been mupdated; we're still running out of the old ledgered config (in which the planner hadn't instructed us to clear a mupdate override because there wasn't one)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yeah, good point -- will change to "didn't attempt to". It's definitely a bit weird.
let clear_mupdate_override = | ||
if let Some(override_id) = sled_config.remove_mupdate_override { | ||
let internal_disks = | ||
self.internal_disks_rx.wait_for_boot_disk().await; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This feels sketchy; if the boot disk has gone away for some reason this will block until it comes back (maybe forever). Can we instead grab current()
here and have clear_mupdate_override()
return an error if the boot disk isn't in the current set?
Maybe we should also rename this method wait_forever_for_boot_disk()
to make it scarier sounding?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, wait_forever_for_boot_disk
sounds reasonable to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated this.
@@ -181,12 +350,136 @@ fn make_non_boot_info( | |||
} | |||
} | |||
|
|||
fn clear_non_boot_disk( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks fine but oof it seems brutal. Do we already have an issue for "just use M2Slot::A
for all ledgers and ignore boot/non-boot"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Created using spr 1.3.6-beta.1
Created using spr 1.3.6-beta.1
This PR implements logic within sled-agent to clear mupdate overrides. Includes tests, database storage, and displayers.
This logic by itself does not introduce behavior changes, since the code to actually set this field is in #8456.
Depends on: