-
Notifications
You must be signed in to change notification settings - Fork 93
docs: fatal codes, re-init, and retry policy #1818
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: Todd Baert <[email protected]>
✅ Deploy Preview for polite-licorice-3db33c ready!
To edit notification comments on pull requests, go to your Netlify project configuration. |
Summary of ChangesHello @toddbaert, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances the robustness and predictability of flagd provider behavior by formalizing how providers handle stream health, gRPC retry mechanisms, and fatal error conditions. It introduces a standardized retry policy for transient network issues and a critical new feature to recognize and react to non-transient (fatal) gRPC errors, preventing endless reconnection loops. Additionally, the documentation now explicitly outlines provider re-initialization capabilities, ensuring clearer and more reliable provider operations. Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
| STALE --> NOT_READY: shutdown | ||
| ERROR --> READY: reconnected | ||
| ERROR --> [*]: shutdown | ||
| ERROR --> NOT_READY: shutdown | ||
| ERROR --> [*]: Error code == PROVIDER_FATAL | ||
| note right of STALE | ||
| note left of STALE |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request updates the provider specification to clarify behavior around stream health, gRPC retry policies, and fatal error codes. The changes include updating the state diagram, defining a gRPC retry policy, and introducing the concept of fatal status codes that stop reconnection attempts. The documentation is clearer as a result. I've found a few issues: an invalid JSON example for the retry policy, an inconsistency in the number of retries described, and a minor stylistic point.
| While the provider is in state `STALE` the provider resolves values from its cache or stored flag set rules, depending on its resolver mode. | ||
| When the time since the last disconnect first exceeds `retryGracePeriod`, the provider emits `ERROR`. | ||
| The provider attempts to reconnect indefinitely, with a maximum interval of `retryBackoffMaxMs`. | ||
| ```json |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is standard retryPolicy, accepted in this JSON format by most gRPC implementations.
| | offlineFlagSourcePath | FLAGD_OFFLINE_FLAG_SOURCE_PATH | offline, file-based flag definitions, overrides host/port/targetUri | string | null | file | | ||
| | offlinePollIntervalMs | FLAGD_OFFLINE_POLL_MS | poll interval for reading offlineFlagSourcePath | int | 5000 | file | | ||
| | contextEnricher | - | sync-metadata to evaluation context mapping function | function | identity function | in-process | | ||
| | fatalStatusCodes | - | a list of gRPC status codes, which will cause streams to give up and put the provider in a PROVIDER_FATAL state | array | [] | rpc & in-process | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the only new option - the other changes are just whitespace.
Signed-off-by: Todd Baert <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Todd Baert <[email protected]>
Signed-off-by: Todd Baert <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Todd Baert <[email protected]>
|
| When either stream (sync or event) disconnects, whether due to the associated deadline being exceeded, network error or any other cause, the provider attempts to re-establish the stream immediately, and then retries with an exponential back-off. | ||
| We always rely on the [integrated functionality of GRPC for reconnection](https://github.com/grpc/grpc/blob/master/doc/connection-backoff.md) and utilize [Wait-for-Ready](https://grpc.io/docs/guides/wait-for-ready/) to re-establish the stream. | ||
| We are configuring the underlying reconnection mechanism whenever we can, based on our configuration. (not all GRPC implementations support this) | ||
| When either stream (sync or event) disconnects, whether due to the associated deadline being exceeded, network error or any other cause, the provider attempts to re-establish the stream immediately. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Take this with a salt of grain (as i am not sure i understood it correctly), but there are two different things, reconnection and retry. But my knowledge might be off here, so the reconnect is happening on the channel versus the retry is for the stream. So i do think that this table might be interesting for people to see how our reconnection attempt on a lost channel looks like.
| No other status codes are retried. | ||
| The flagd gRPC retry policy is specified below: | ||
|
|
||
| When disconnected, if the time since disconnection is less than `retryGracePeriod`, the provider emits `STALE` when it disconnects. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure if just adding this to the overview in the mermaid chart is sufficient enough, i think this should be also explicitly mentioned
| | offlineFlagSourcePath | FLAGD_OFFLINE_FLAG_SOURCE_PATH | offline, file-based flag definitions, overrides host/port/targetUri | string | null | file | | ||
| | offlinePollIntervalMs | FLAGD_OFFLINE_POLL_MS | poll interval for reading offlineFlagSourcePath | int | 5000 | file | | ||
| | contextEnricher | - | sync-metadata to evaluation context mapping function | function | identity function | in-process | | ||
| | fatalStatusCodes | - | a list of gRPC status codes, which will cause streams to give up and put the provider in a PROVIDER_FATAL state | array | [] | rpc & in-process | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| | fatalStatusCodes | - | a list of gRPC status codes, which will cause streams to give up and put the provider in a PROVIDER_FATAL state | array | [] | rpc & in-process | | |
| | fatalStatusCodes | FLAGD_FATAL_STATUS_CODES | a list of gRPC status codes, which will cause streams to give up and put the provider in a PROVIDER_FATAL state | array | [] | rpc & in-process | |
@toddbaert @aepfli We should probably make sure the change is consistent with:
https://github.com/open-feature/flagd-testbed/pull/311/files#diff-2f3b6fc7d0eec288e7349f23f8f56b197eecf05fef9320930ee266dda60fd6e7R25





This PR specifies some provider behavior, specifically around stream health, gRPC retry policy, and FATAL codes.
Specifically, it: