[fix][broker]Data lost due to conflict loaded up a topic for two brokers, when enabled ServiceUnitStateMetadataStoreTableViewImpl #24478
Conversation
@poorbarcode Do you know why Ledger 1049720 got deleted? Or is it expected to be deleted?
I am not sure of the exact details, but I can guess at one possibility: since two brokers each assumed they were the owner, they overwrote the cursor's metadata, and we ended up with a wrong cursor.
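To make that hypothesized failure concrete, here is a minimal, hypothetical sketch (not Pulsar's actual cursor code) of how two writers that both believe they own a cursor can clobber each other's metadata unless the write is guarded by an expected version:

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical stand-in for versioned cursor metadata kept in a metadata store.
class VersionedCursorMeta {
    final AtomicLong version = new AtomicLong(0);
    final AtomicReference<String> markDeletePosition = new AtomicReference<>("0:0");

    // Unguarded write: last writer wins, even if its view of the cursor is stale.
    void putBlindly(String position) {
        markDeletePosition.set(position);
        version.incrementAndGet();
    }

    // Guarded write: rejected when the caller's view is outdated,
    // which is how a second "owner" would be fenced out of the update.
    boolean putWithExpectedVersion(String position, long expectedVersion) {
        if (!version.compareAndSet(expectedVersion, expectedVersion + 1)) {
            return false; // conflict detected; caller must reload and back off
        }
        markDeletePosition.set(position);
        return true;
    }
}
```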
Codecov Report

@@             Coverage Diff              @@
##             master    #24478      +/-  ##
============================================
+ Coverage     73.57%    74.22%    +0.65%
- Complexity    32624     32819      +195
============================================
  Files          1877      1868        -9
  Lines        139502    145823     +6321
  Branches      15299     16717     +1418
============================================
+ Hits         102638    108244     +5606
- Misses        28908     28991       +83
- Partials      7956      8588       +632
LGTM
Hi guys! I think this issue might have revealed a bigger problem. By design, Pulsar brokers can see temporarily inconsistent ownership views because of network delays or ZK session issues. In that case, brokers may see conflicting ownership views, but by the Pulsar fencing logic only one broker should serve (write to) the topic, and the other brokers should close the topic/ledger upon the fenced-topic / metadata-conflict exception. Can we confirm whether the Pulsar fencing logic works fine here?
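As a rough illustration of that fencing expectation (hypothetical names only, not the broker's real write path): the non-owner should close the topic when a fenced/conflict error surfaces instead of retrying the write:

```java
// Hypothetical sketch of the expected fencing behavior: a broker that hits a
// fenced/metadata-conflict error on write should stop serving and close the
// topic rather than keep writing.
class FencedTopicException extends RuntimeException {}
class MetadataConflictException extends RuntimeException {}

class TopicWriteHandler {
    interface HypotheticalTopic {
        void close();
        void scheduleRetry();
    }

    private final HypotheticalTopic topic;

    TopicWriteHandler(HypotheticalTopic topic) {
        this.topic = topic;
    }

    void onWriteFailure(Throwable error) {
        if (error instanceof FencedTopicException
                || error instanceof MetadataConflictException) {
            // Another broker owns the topic now; give up ownership locally.
            topic.close();
        } else {
            // Transient errors can be retried by the normal write path.
            topic.scheduleRetry();
        }
    }
}
```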
Regarding this warning: I think we can consider failing new lookups/ownership assignments in this situation, when the broker has lost its metadata connection. Returning the current candidate "owner" from the local info was added to improve availability, but it appears we had better optimize for consistency.
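A minimal sketch of that consistency-first option, with hypothetical names (the real lookup path is more involved): reject ownership lookups while the metadata session is down instead of answering from local state:

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: fail lookups while disconnected from the metadata store,
// rather than returning the possibly stale local "candidate owner".
class OwnerLookup {
    private final Map<String, String> localOwners = new ConcurrentHashMap<>();
    private volatile boolean metadataConnected = true;

    CompletableFuture<Optional<String>> lookupOwner(String bundle) {
        if (!metadataConnected) {
            // Consistency over availability: let the client retry later.
            return CompletableFuture.failedFuture(new IllegalStateException(
                    "Metadata store session lost; cannot resolve owner for " + bundle));
        }
        return CompletableFuture.completedFuture(Optional.ofNullable(localOwners.get(bundle)));
    }

    void onMetadataSessionLost() { metadataConnected = false; }
    void onMetadataSessionReestablished() { metadataConnected = true; }
}
```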
Now I think this metadata-store / metadata-cache / metadata-tableview sync logic makes the code too complicated and fragile. We could consider removing this shadow hashmap (the hashmap in the tableview) and directly using the metadata cache for the metadata-store tableview. (This requires handling CompletableFuture more gracefully.)
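For a sense of what that could look like, here is a rough sketch with hypothetical interfaces (not the existing MetadataCache API): reads go straight to an async cache and callers compose the returned future, so there is no shadow map that can drift out of sync:

```java
import java.util.Optional;
import java.util.concurrent.CompletableFuture;

// Hypothetical shape of a tableview that delegates reads to an async metadata
// cache instead of maintaining its own shadow HashMap.
class CacheBackedTableView<T> {
    interface AsyncMetadataCache<T> {
        CompletableFuture<Optional<T>> get(String path);
    }

    private final AsyncMetadataCache<T> cache;
    private final String pathPrefix;

    CacheBackedTableView(AsyncMetadataCache<T> cache, String pathPrefix) {
        this.cache = cache;
        this.pathPrefix = pathPrefix;
    }

    // Callers compose on the returned future; there is no second copy of the
    // data that could diverge from the metadata store.
    CompletableFuture<Optional<T>> get(String key) {
        return cache.get(pathPrefix + "/" + key);
    }
}
```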
Thanks @heesung-sn
It does not work fine, because the first owner lost the notifications due to the session loss.
Good pointer, the current PR has fixed it. See also https://github.com/apache/pulsar/pull/24478/files#diff-191ed1df4f5804c8fd6cdaf909cca01d071a110c3e701bdc98352f99e7a92e8eR469
The map has another useful case: we need to unload owned topics once the broker's state is not fine, and the map maintains which bundles are owned by the current broker.
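A small sketch of that second use of the map, with hypothetical names: when the broker detects it is in a bad state, it walks the bundles it believes it owns and unloads them:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

// Hypothetical sketch: the in-memory map doubles as the list of bundles that
// must be unloaded when this broker can no longer trust its own state.
class OwnedBundleTracker {
    private final String localBroker;
    private final Map<String, String> bundleToOwner = new ConcurrentHashMap<>();

    OwnedBundleTracker(String localBroker) {
        this.localBroker = localBroker;
    }

    void record(String bundle, String owner) {
        bundleToOwner.put(bundle, owner);
    }

    // Called when the broker's state is "not fine" (e.g. metadata session lost).
    void unloadLocallyOwnedBundles(Consumer<String> unloader) {
        bundleToOwner.forEach((bundle, owner) -> {
            if (localBroker.equals(owner)) {
                unloader.accept(bundle);
            }
        });
    }
}
```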
Interesting. I thought the fencing logic was guarded by the BookKeeper quorum, regardless of ZK.
In fact, the original design was that when the metadata store is unstable, Pulsar should behave as-is on a best-effort basis (minimal impact on existing clients). This "unload all topics when the ZK connection is lost" seems to be a new behavior.
OK. One brute-force way to sync the cache and the tableview in this case is to re-initialize the tableview (refresh the tableview again when the connection is re-established). I think you already added this logic in this PR.
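A minimal sketch of that brute-force resync, with hypothetical names (the PR's actual refresh path may differ): a session listener that discards the local view and rebuilds it from the store once the connection comes back:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical sketch: rebuild the local tableview from the metadata store
// whenever the session is re-established, so the cache and the view cannot drift.
class ResyncingTableView {
    private final Map<String, byte[]> view = new ConcurrentHashMap<>();
    private final Supplier<Map<String, byte[]>> loadAllFromStore;

    ResyncingTableView(Supplier<Map<String, byte[]>> loadAllFromStore) {
        this.loadAllFromStore = loadAllFromStore;
    }

    // Hook this to the metadata store's session-reestablished notification.
    void onSessionReestablished() {
        Map<String, byte[]> fresh = loadAllFromStore.get();
        view.clear();
        view.putAll(fresh);
    }
}
```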
@poorbarcode I raised this PR to print a warn log when they are out of sync: #24513
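Roughly, such a check could look like this (hypothetical sketch, not the actual code in #24513): compare the two views and warn on any divergence:

```java
import java.util.Map;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Hypothetical sketch of an out-of-sync check between the metadata cache view
// and the tableview's map, emitting a warn log on divergence.
class ConsistencyChecker {
    private static final Logger log = LoggerFactory.getLogger(ConsistencyChecker.class);

    void check(Map<String, String> cacheView, Map<String, String> tableView) {
        cacheView.forEach((key, value) -> {
            String other = tableView.get(key);
            if (!value.equals(other)) {
                log.warn("Tableview out of sync for {}: cache={}, tableview={}", key, value, other);
            }
        });
        tableView.keySet().stream()
                .filter(key -> !cacheView.containsKey(key))
                .forEach(key -> log.warn("Tableview has {} but metadata cache does not", key));
    }
}
```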
Motivation
We encountered an issue where the ledger of a topic was lost after restarting the ZooKeeper nodes.
After checking the brokers' logs, we found that both brokers loaded up the same topic.
We also found the following logs printed by broker-0.
Root cause
- broker-0 lost the ZK session.
- The bundle owner switched from broker-0 to broker-2.
- broker-0 missed the bundle-owner-switch events because it had lost the session.
- broker-0 still assumed it was the topic's owner.

Modifications
- MetadataStoreTableViewImpl: added itemOutdatedListeners, which guarantees compatibility.
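A rough sketch of what such a listener hook could look like (hypothetical shape; see the PR diff for the actual itemOutdatedListeners wiring): the view notifies registered callbacks when it learns that an item it holds is outdated, so the owner can react, e.g. by unloading the bundle:

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.BiConsumer;

// Hypothetical sketch of an "item outdated" callback: existing listeners keep
// working unchanged, and interested components can additionally register here.
class OutdatedItemNotifier<T> {
    private final List<BiConsumer<String, T>> itemOutdatedListeners = new CopyOnWriteArrayList<>();

    void registerItemOutdatedListener(BiConsumer<String, T> listener) {
        itemOutdatedListeners.add(listener);
    }

    // Invoked when a refresh reveals that the locally cached value for a key
    // no longer matches the metadata store (e.g. after a session loss).
    void notifyItemOutdated(String key, T staleValue) {
        itemOutdatedListeners.forEach(listener -> listener.accept(key, staleValue));
    }
}
```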
Documentation
- doc
- doc-required
- doc-not-needed
- doc-complete
Matching PR in forked repository
PR in forked repository: x