feat(shard-distributor): implement WatchNamespaceState streaming RPC #7432
Conversation
Signed-off-by: Jakob Haahr Taankvist <[email protected]>
// Stream subsequent updates
for {
	select {
	case <-server.Context().Done():
What if we stop the shard distributor? Is it implicitly handled with the server context?
Good question - I assume so - it's the only context available, at least.
I think we can test this by shutting down the shard distributor and checking that the canaries are not left hanging but connect to a new stream :)
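For reference, a minimal sketch of the shutdown path being discussed, under the assumption that stopping the server also cancels open stream contexts; `watchStream`, `watchLoop`, and the `updates` channel are illustrative names, not the actual implementation:

```go
import "context"

// Hypothetical, trimmed-down view of the streaming server used only for
// this sketch.
type watchStream interface {
	Context() context.Context
	Send(state map[string][]string) error
}

// watchLoop forwards pub/sub updates to the stream. If shutting down the
// shard distributor cancels open stream contexts (the question above),
// the loop exits via the first case instead of hanging.
func watchLoop(server watchStream, updates <-chan map[string][]string) error {
	for {
		select {
		case <-server.Context().Done():
			// Client disconnected or the server is shutting down.
			return server.Context().Err()
		case state := <-updates:
			if err := server.Send(state); err != nil {
				return err
			}
		}
	}
}
```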
select {
case sub <- state:
default:
	// Subscriber is not reading fast enough, skip this update
Should we retry? We call refresh and then publish only when there are changes. Say no changes happen: some subscribers will have stale info until the next change. Are we fine with that?
I think that is a good point - maybe we can send a reconciliation message every 1s?
yes, this is a good idea
Will do a follow-up PR.
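A rough shape for that follow-up, as a sketch only: the one-second interval and the `publish`/`currentState` helpers are assumptions, not part of this PR, and the method assumes the `context` and `time` packages are imported.

```go
// Sketch: periodically re-publish the current state so subscribers that
// skipped an update (via the non-blocking send) converge within roughly
// one second, even when no real change triggers a publish.
func (p *executorStatePubSub) reconcileLoop(ctx context.Context) {
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			p.publish(p.currentState()) // hypothetical helpers
		}
	}
}
```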
// Subscribe returns a channel that receives executor state updates.
func (p *executorStatePubSub) subscribe(ctx context.Context) (<-chan map[*store.ShardOwner][]string, func()) {
	ch := make(chan map[*store.ShardOwner][]string)
	uniqueID := uuid.New().String()
Thinking out loud: should we return the subscription ID for debug purposes?
I don't see the value, but maybe you can elaborate a bit?
In case of an issue with a subscription, we only have the subscription ID stored on the SD side, so we don't know which instance is not receiving updates. We can tell which namespace is impacted, but that may be too wide. I am wondering whether we should prepend the caller instance to this UID, for example.
I'll do a follow-up PR to add a spectator ID so we can make this connection.
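As a rough illustration of that follow-up, subscribe could take a caller-supplied spectator ID and embed it in the subscription key. This is a sketch: the `spectatorID` parameter, the key format, and the `mu`/`subscribers` fields are assumptions about the pub/sub struct, and it reuses the `context`, `fmt`, `uuid`, and `store` imports from the surrounding file.

```go
// Sketch: embed the caller's identity in the subscription key so a stuck
// subscription logged on the SD side can be traced back to the instance
// that opened it.
func (p *executorStatePubSub) subscribe(ctx context.Context, spectatorID string) (<-chan map[*store.ShardOwner][]string, func()) {
	ch := make(chan map[*store.ShardOwner][]string)
	subscriptionID := fmt.Sprintf("%s-%s", spectatorID, uuid.New().String())

	p.mu.Lock()
	p.subscribers[subscriptionID] = ch
	p.mu.Unlock()

	unsubscribe := func() {
		p.mu.Lock()
		delete(p.subscribers, subscriptionID)
		p.mu.Unlock()
	}
	return ch, unsubscribe
}
```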
What changed?
Implemented the WatchNamespaceState streaming RPC endpoint for the shard distributor service, including a pub/sub mechanism for real-time assignment change notifications.

Why?
The WatchNamespaceState endpoint was previously unimplemented. This enables executors and spectators to receive real-time updates about shard assignment changes without polling, improving responsiveness and reducing load on the storage layer.

How did you test it?
Added unit tests for the handler's streaming behavior and the pub/sub mechanism.
Potential risks
Low - this is a new feature in an experimental service. The pub/sub implementation uses a non-blocking publish to prevent slow subscribers from blocking the system.
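For context, the non-blocking publish mentioned above roughly follows the pattern in the review excerpt earlier. This is a sketch under the assumption of a mutex-guarded subscriber map; the `mu` and `subscribers` fields are illustrative names.

```go
// Sketch: deliver the latest state to every subscriber that is ready to
// receive; a slow subscriber simply misses this update instead of
// blocking the publisher.
func (p *executorStatePubSub) publish(state map[*store.ShardOwner][]string) {
	p.mu.RLock()
	defer p.mu.RUnlock()

	for _, sub := range p.subscribers {
		select {
		case sub <- state:
		default:
			// Subscriber is not reading fast enough, skip this update.
		}
	}
}
```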
Release notes
N/A - shard distributor is experimental
Documentation Changes
None required