
Min-quorum reconciliation #1733


Draft: wants to merge 9 commits into main from mkeeter/min-quorum-negotiation

Conversation

Contributor

@mkeeter mkeeter commented Jun 25, 2025

(staged on #1732)

This implements RFD 542 and closes #1690.

Here's the quick version:

  • Once we have two Downstairs in WaitQuorum, we schedule an event to fire after NEGOTIATION_DELAY (currently 500 ms)
  • If the third Downstairs arrives before this event fires, then we do full-quorum reconciliation (our usual path)
  • Otherwise, we enter min-quorum reconciliation, marking the third Downstairs as faulted (so it must rejoin through live-repair); a sketch of this flow follows below
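
To make the timing concrete, here's a minimal sketch of the decision logic, under assumed names (Negotiation, on_wait_quorum, and on_deadline are illustrative; the PR's actual types differ):

use std::time::{Duration, Instant};

const NEGOTIATION_DELAY: Duration = Duration::from_millis(500);

struct Negotiation {
    /// Armed when the second Downstairs reaches WaitQuorum; cleared
    /// when the third arrives or the deadline fires.
    min_quorum_deadline: Option<Instant>,
}

impl Negotiation {
    /// Called whenever a Downstairs reaches WaitQuorum.
    fn on_wait_quorum(&mut self, ready_count: usize) {
        match ready_count {
            // Two of three ready: arm the min-quorum timer. (If we
            // later drop to one and come back to two, this re-arms it.)
            2 => {
                self.min_quorum_deadline =
                    Some(Instant::now() + NEGOTIATION_DELAY);
            }
            // The third arrived in time: cancel the timer and take the
            // usual full-quorum reconciliation path.
            3 => {
                self.min_quorum_deadline = None;
                // ... full-quorum reconciliation ...
            }
            _ => (),
        }
    }

    /// Called when the deadline fires.
    fn on_deadline(&mut self, ready_count: usize) {
        self.min_quorum_deadline = None;
        if ready_count == 2 {
            // Min-quorum reconciliation: proceed with two Downstairs
            // and mark the third Faulted, so it rejoins via live-repair.
        }
    }
}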

I'm opening this as a draft because I want to see how the CI tests go. The PR includes integration tests for common and uncommon orderings of events, but it's hard to hit every possible path due to specific timing requirements; I'm very open to suggestions for other tests.

Before merging, we also need to figure out #1661 (comment)

@mkeeter mkeeter requested review from jmpesp and leftwo June 25, 2025 19:23
Contributor Author

mkeeter commented Jun 25, 2025

Quick notes on failing CI tests:

test-repair

EXT  BLOCKS GEN0 GEN1 GEN2  FL0 FL1 FL2  D0 D1 D2 DIFF
  0 000-019    5    5    2   10  10   5   F  F  T <---
  1 020-039    5    5    1   10  10   1   F  F  T <---
  2 040-059    5    5    1   10  10   1   F  F  T <---
  3 060-079    1    1    1    1   1   1   F  F  F
  4 080-099    4    4    4    9   9   9   F  F  F
  5 100-119    5    5    1   10  10   1   F  F  T <---
  6 120-139    5    5    1   10  10   1   F  F  T <---
  7 140-159    5    5    2   10  10   6   F  F  T <---
  8 160-179    2    2    2    3   3   3   F  F  F
  9 180-199    5    5    1   10  10   1   F  F  T <---
 10 200-219    5    5    1   10  10   1   F  F  T <---
 11 220-239    5    5    4   10  10   9   F  F  T <---
 12 240-259    5    5    1   10  10   1   F  F  T <---
 13 260-279    5    5    1   10  10   1   F  F  T <---
 14 280-299    5    5    1   10  10   1   F  F  T <---
 15 300-319    5    5    1   10  10   1   F  F  T <---
 16 320-339    5    5    2   10  10   6   F  F  T <---
 17 340-359    1    1    1    1   1   1   F  F  F
 18 360-379    5    5    1   10  10   1   F  F  T <---
 19 380-399    5    5    2   10  10   6   F  F  T <---
 20 400-419    2    2    2    4   4   4   F  F  F
 21 420-439    1    1    1    1   1   1   F  F  F
 22 440-459    1    1    1    1   1   1   F  F  F
 23 460-479    1    1    1    1   1   1   F  F  F
 24 480-499    5    5    2   10  10   6   F  F  T <---
 25 500-519    5    5    2   10  10   6   F  F  T <---
 26 520-539    5    5    1   10  10   1   F  F  T <---
 27 540-559    3    3    3    7   7   7   F  F  F
 28 560-579    1    1    1    1   1   1   F  F  F
 29 580-599    2    2    2    3   3   3   F  F  F
Max gen: 5,  Max flush: 10
Error: Difference in extent metadata found!

This is probably because it's now easier to leave the Downstairs in mismatched states: activation no longer guarantees that all three are identical.

test-up-*

thread 'main' panicked at crutest/src/main.rs:2559:9:
assertion failed: !is_active
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
{"msg":"[1] activate should now be waiting true","v":0,"name":"crucible","level":30,"time":"2025-06-25T19:31:21.840999142Z","hostname":"w-01JYM8FQQ54H9R8BDCCJ9EAVVD","pid":1304,"task":"crutest"}
Failed crutest replace-before-active

crutest assumes that activation will wait until all three Downstairs are available, which is no longer true.

@mkeeter mkeeter force-pushed the mkeeter/min-quorum-negotiation branch from de311a2 to 1c75dc7 Compare June 30, 2025 13:11
@mkeeter mkeeter force-pushed the mkeeter/collate-cleanup branch from 21a03f8 to 1b91f6c Compare June 30, 2025 13:11
Base automatically changed from mkeeter/collate-cleanup to main June 30, 2025 13:51
@mkeeter mkeeter force-pushed the mkeeter/min-quorum-negotiation branch from 1c75dc7 to ac5c71a Compare June 30, 2025 13:52
@mkeeter mkeeter marked this pull request as draft June 30, 2025 13:52
Contributor Author

mkeeter commented Jun 30, 2025

I've fixed the replace-before-active test in 8e21369 by deactivating two downstairs to prevent activation (previously, we had deactivated a single downstairs, but that's no longer sufficient).
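
As a self-contained illustration of why two stops are now required (TestHarness, stop, and can_activate are hypothetical stand-ins, not crutest's actual API):

struct TestHarness {
    /// One region set of three Downstairs; true means running.
    running: [bool; 3],
}

impl TestHarness {
    fn stop(&mut self, i: usize) {
        self.running[i] = false;
    }

    /// With min-quorum reconciliation, two running Downstairs are
    /// enough to activate (after NEGOTIATION_DELAY).
    fn can_activate(&self) -> bool {
        self.running.iter().filter(|r| **r).count() >= 2
    }
}

fn main() {
    let mut h = TestHarness { running: [true; 3] };
    h.stop(0);
    assert!(h.can_activate()); // one stop no longer blocks activation
    h.stop(1);
    assert!(!h.can_activate()); // two stops in the same set do
}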

Contributor Author

mkeeter commented Jun 30, 2025

I'm hopefully fixing the verification test in cc512f4, which waits for the third Downstairs to start (so that we can do full-quorum reconciliation before verifying on-disk data).

Contributor

@leftwo leftwo left a comment


A few comments to start, I'm still looking though.

@@ -2622,12 +2631,13 @@ async fn replace_before_active(
         tokio::time::sleep(tokio::time::Duration::from_secs(4)).await;
     }

-    old_ds = (old_ds + 1) % (ds_total as u32 + 1);
+    old_ds_a = (old_ds_a + 1) % (ds_total as u32 + 1);
Contributor


Hmmm... Will this test pass if you do 2 region sets?
I think this logic can get you 1 downstairs in one region set, and a 2nd in a different region set, which is not the behavior we want here. Also, if we don't have multiple region set testing turned on by default, I'm going to go do that everywhere.

Contributor Author


Good question! I notice a CI failure in test-up-2region-encrypted, so I suspect that you're correct 😄

Contributor Author

@mkeeter mkeeter Jun 30, 2025


I think this is fixed in d2d168e (since force-pushed as bdcfc08).

It's somewhat subtle: the unused downstairs is always right before the downstairs we're about to replace, but the second downstairs in that set could either be before or after (and shifts over time). That commit tracks which downstairs is in which region set, so that we can always disable 2 in the same region set (and prevent activation, which is the whole point).
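
A minimal sketch of that bookkeeping, assuming Downstairs indices are assigned contiguously to region sets of three (an illustrative assumption, not the commit's actual data structures):

/// Downstairs 0-2 form region set 0, 3-5 form set 1, and so on.
fn region_set(ds_index: u32) -> u32 {
    ds_index / 3
}

/// Pick two Downstairs guaranteed to share a region set, so that
/// stopping both prevents that set from activating.
fn two_in_same_set(set: u32) -> (u32, u32) {
    (set * 3, set * 3 + 1)
}

fn main() {
    assert_eq!(region_set(4), 1);
    assert_eq!(two_in_same_set(1), (3, 4));
}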

@mkeeter mkeeter force-pushed the mkeeter/min-quorum-negotiation branch from 4d632b5 to cf1e0f8 Compare June 30, 2025 20:33
@mkeeter mkeeter force-pushed the mkeeter/min-quorum-negotiation branch from d2d168e to bdcfc08 Compare July 1, 2025 13:24
Contributor

@jmpesp jmpesp left a comment


Some comments, and also: there should be some additional downstairs tests for the non-"all clients are participating" cases.

Comment on lines +443 to +447
if let DsStateData::Connecting {
state,
mode: ConnectionMode::New,
..
} = &mut self.state
Contributor


It seems wrong that begin_reconcile now doesn't panic or return an error if the state is incorrect?

Contributor Author


This was an intentional behavior change: we are using begin_reconcile to also detect which members were participating in reconciliation.

However, we also check this when building the mend list, so I've moved stuff around: mismatch_list now returns both the mend list and the participating downstairs, and this function is back to panicking if you call it on a Downstairs in the wrong state.

29dc1af
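
Roughly the shape described above, with assumed type and field names (ClientId, MendWork, and the tuple return are illustrative, not the PR's actual signatures):

#[derive(Clone, Copy, Debug, PartialEq)]
struct ClientId(u8);

/// One unit of repair work for a mismatched extent.
struct MendWork {
    extent: u64,
    source: ClientId, // the client whose copy wins
}

/// Returns both the repair work and which clients took part, so
/// callers no longer infer participation from begin_reconcile (which
/// can then go back to panicking on an unexpected state).
fn mismatch_list() -> Option<(Vec<MendWork>, Vec<ClientId>)> {
    // ... compare extent metadata across connecting clients ...
    None
}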

Comment on lines 1890 to 1891
/// Returns `false` if we aren't ready, or if things failed. If there's a
/// failure, then we also update the client state.
Contributor


I'd vote for returning Result<bool, some error> instead - having false cover both cases seems wrong

Contributor Author

@mkeeter mkeeter Jul 11, 2025


Counterproposal: let's remove the return value entirely! It's only ever used in one unit test, so I've updated that test to look at states instead.

fb2b632
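
The test pattern then looks something like this sketch (DsState and the expected states here are illustrative, not the actual unit test):

#[derive(Debug, PartialEq)]
enum DsState {
    Reconcile,
    Faulted,
}

fn main() {
    // Instead of asserting on a returned bool, assert directly on the
    // client states that the call should have produced.
    let states = [DsState::Reconcile, DsState::Reconcile, DsState::Faulted];
    assert_eq!(states[0], DsState::Reconcile);
    assert_eq!(states[1], DsState::Reconcile);
    assert_eq!(states[2], DsState::Faulted);
}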

return;
};
assert!(min_quorum_deadline.is_some());
*min_quorum_deadline = None;
Contributor


I think clearing min_quorum_deadline here means that if ready_count != 2 (checked below), control will leave this function and not attempt min quorum activation again?

Contributor Author


I agree, but I think this is fine: the next time we end up with 2 ready Downstairs, we'll reschedule min_quorum_deadline.

..
} => {
assert!(!did_work);
c.set_active();
Contributor


not faulted?

Contributor Author


Nope, this is for the case where we've found that no reconciliation is necessary, so we go straight from WaitQuorum without passing through Reconcile (equivalent to this conditional in the old code).
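
A self-contained sketch of that shortcut (DsState and finish_quorum are illustrative names, not the crate's types):

#[derive(Debug, PartialEq)]
enum DsState {
    WaitQuorum,
    Reconcile,
    Active,
}

/// After comparing extent metadata: if the mend list is empty, every
/// participating client skips Reconcile and goes straight to Active.
fn finish_quorum(states: &mut [DsState], mend_is_empty: bool) {
    for s in states.iter_mut().filter(|s| **s == DsState::WaitQuorum) {
        *s = if mend_is_empty {
            DsState::Active
        } else {
            DsState::Reconcile
        };
    }
}

fn main() {
    let mut states = [DsState::WaitQuorum, DsState::WaitQuorum];
    finish_quorum(&mut states, true);
    assert_eq!(states, [DsState::Active, DsState::Active]);
}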

Successfully merging this pull request may close issue #1690: "Support activation with 2/3 read/write downstairs present."