@toddbaert
Member

This PR specifies some provider behavior, specifically around stream health, gRPC retry policy, and FATAL codes.

Specifically, it:

@toddbaert toddbaert requested review from a team as code owners October 30, 2025 16:23
@netlify

netlify bot commented Oct 30, 2025

Deploy Preview for polite-licorice-3db33c ready!

Name Link
🔨 Latest commit 48a46ea
🔍 Latest deploy log https://app.netlify.com/projects/polite-licorice-3db33c/deploys/6903943709115b0008a4908d
😎 Deploy Preview https://deploy-preview-1818--polite-licorice-3db33c.netlify.app

@dosubot dosubot bot added the size:L This PR changes 100-499 lines, ignoring generated files. label Oct 30, 2025
@gemini-code-assist
Contributor

Summary of Changes

Hello @toddbaert, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the robustness and predictability of flagd provider behavior by formalizing how providers handle stream health, gRPC retry mechanisms, and fatal error conditions. It introduces a standardized retry policy for transient network issues and a critical new feature to recognize and react to non-transient (fatal) gRPC errors, preventing endless reconnection loops. Additionally, the documentation now explicitly outlines provider re-initialization capabilities, ensuring clearer and more reliable provider operations.

Highlights

  • Standardized Retry Policy: A clear gRPC retry policy has been published, to be adopted by all flagd providers, specifically for UNAVAILABLE and UNKNOWN status codes.
  • Fatal Error Handling: A new mechanism has been introduced to mark certain gRPC status codes as FATAL, which will cause providers to cease reconnection attempts for non-transient errors.
  • Provider Re-initialization: The state diagram has been updated to explicitly clarify that flagd providers should support re-initialization, provided they are not in a FATAL state.

Comment on lines +69 to +74
STALE --> NOT_READY: shutdown
ERROR --> READY: reconnected
ERROR --> [*]: shutdown
ERROR --> NOT_READY: shutdown
ERROR --> [*]: Error code == PROVIDER_FATAL
note right of STALE
note left of STALE

Member Author

@toddbaert toddbaert Oct 30, 2025

old:

Image

new:

Image

The main difference is that we make it clear that transitions are possible from a non-fatal ERROR back to NOT_READY. Many implementations already support this, but not all.
I think it makes sense to specify this so we can be consistent.
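
As a rough illustration of the transitions quoted in the excerpt above, here is a minimal sketch of a lookup table. It covers only the transitions visible in this thread, reads `ERROR --> NOT_READY: shutdown` as the replacement for `ERROR --> [*]: shutdown`, and uses hypothetical names (`State`, `TERMINATED`, the trigger strings) that do not come from the spec.

```python
from enum import Enum


class State(Enum):
    NOT_READY = "NOT_READY"
    READY = "READY"
    STALE = "STALE"
    ERROR = "ERROR"
    TERMINATED = "[*]"  # hypothetical name for the diagram's terminal node


# (current state, trigger) -> next state.
# Only the transitions quoted in the diff excerpt are listed here.
TRANSITIONS = {
    (State.STALE, "shutdown"): State.NOT_READY,
    (State.ERROR, "reconnected"): State.READY,
    (State.ERROR, "shutdown"): State.NOT_READY,
    (State.ERROR, "provider_fatal"): State.TERMINATED,
}


def next_state(state: State, trigger: str) -> State:
    """Return the next state, or raise if the excerpt shows no such transition."""
    try:
        return TRANSITIONS[(state, trigger)]
    except KeyError:
        raise ValueError(f"no transition from {state.name} on {trigger!r}") from None
```

With this table, a non-fatal `ERROR` followed by a shutdown lands in `NOT_READY`, from which re-initialization remains possible, while a fatal error ends the lifecycle.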

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request updates the provider specification to clarify behavior around stream health, gRPC retry policies, and fatal error codes. The changes include updating the state diagram, defining a gRPC retry policy, and introducing the concept of fatal status codes that stop reconnection attempts. The documentation is clearer as a result. I've found a few issues: an invalid JSON example for the retry policy, an inconsistency in the number of retries described, and a minor stylistic point.

While the provider is in state `STALE` the provider resolves values from its cache or stored flag set rules, depending on its resolver mode.
When the time since the last disconnect first exceeds `retryGracePeriod`, the provider emits `ERROR`.
The provider attempts to reconnect indefinitely, with a maximum interval of `retryBackoffMaxMs`.
`` ```json ``

Member Author

This is standard retryPolicy, accepted in this JSON format by most gRPC implementations.
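
For readers unfamiliar with that format, the sketch below builds a gRPC service config containing a `retryPolicy` for the `UNAVAILABLE` and `UNKNOWN` status codes mentioned in the summary. The numeric values, the function name, and the service name are placeholders chosen for illustration; they are not the values from this PR.

```python
import json


def build_retry_service_config(service, initial_backoff_ms, max_backoff_ms,
                               max_attempts=4, multiplier=2.0):
    """Return a gRPC service config (JSON string) with a retryPolicy.

    All parameter values are illustrative assumptions, not the spec's values.
    """
    def seconds(ms):
        # Service config durations are strings of seconds, e.g. "1.000s".
        return f"{ms / 1000:.3f}s"

    config = {
        "methodConfig": [
            {
                "name": [{"service": service}],
                "retryPolicy": {
                    "maxAttempts": max_attempts,
                    "initialBackoff": seconds(initial_backoff_ms),
                    "maxBackoff": seconds(max_backoff_ms),
                    "backoffMultiplier": multiplier,
                    "retryableStatusCodes": ["UNAVAILABLE", "UNKNOWN"],
                },
            }
        ]
    }
    return json.dumps(config)


# With grpcio, a string like this is typically passed as a channel option,
# e.g. options=[("grpc.service_config", config_json)].
config_json = build_retry_service_config("example.v1.ExampleService", 1000, 120000)
```

Any status code not listed in `retryableStatusCodes` is not retried by the gRPC retry machinery, which matches the "no other status codes are retried" wording quoted later in this thread.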

| offlineFlagSourcePath | FLAGD_OFFLINE_FLAG_SOURCE_PATH | offline, file-based flag definitions, overrides host/port/targetUri | string | null | file |
| offlinePollIntervalMs | FLAGD_OFFLINE_POLL_MS | poll interval for reading offlineFlagSourcePath | int | 5000 | file |
| contextEnricher | - | sync-metadata to evaluation context mapping function | function | identity function | in-process |
| fatalStatusCodes | - | a list of gRPC status codes, which will cause streams to give up and put the provider in a PROVIDER_FATAL state | array | [] | rpc & in-process |

Member Author

This is the only new option - the other changes are just whitespace.
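
To make the intent of `fatalStatusCodes` concrete, here is a minimal sketch of how a provider might decide what to do when a stream fails. The class name, method name, the `"RECONNECT"` return value, and the example status codes are hypothetical; only the option name, its empty default, and the `PROVIDER_FATAL` outcome come from the table row above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StreamErrorPolicy:
    # Mirrors the documented default of [] (no status code is fatal).
    fatal_status_codes: frozenset = frozenset()

    def on_stream_error(self, status_code: str) -> str:
        """Classify a gRPC status code name reported by a failed stream."""
        if status_code in self.fatal_status_codes:
            # Non-transient error: give up on the stream entirely.
            return "PROVIDER_FATAL"
        # Anything else is treated as transient: keep reconnecting.
        return "RECONNECT"


# Example configuration; these two codes are illustrative choices only.
policy = StreamErrorPolicy(frozenset({"UNAUTHENTICATED", "PERMISSION_DENIED"}))
```

With the default empty set, every error leads to a reconnect attempt, so existing behavior would be unchanged unless the option is set.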

toddbaert and others added 4 commits October 30, 2025 12:32
Signed-off-by: Todd Baert <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Todd Baert <[email protected]>
Signed-off-by: Todd Baert <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Todd Baert <[email protected]>

When either stream (sync or event) disconnects, whether due to the associated deadline being exceeded, network error or any other cause, the provider attempts to re-establish the stream immediately, and then retries with an exponential back-off.
We always rely on the [integrated functionality of GRPC for reconnection](https://github.com/grpc/grpc/blob/master/doc/connection-backoff.md) and utilize [Wait-for-Ready](https://grpc.io/docs/guides/wait-for-ready/) to re-establish the stream.
We are configuring the underlying reconnection mechanism whenever we can, based on our configuration. (not all GRPC implementations support this)
When either stream (sync or event) disconnects, whether due to the associated deadline being exceeded, network error or any other cause, the provider attempts to re-establish the stream immediately.

Member

@aepfli aepfli Oct 31, 2025

Take this with a grain of salt (I am not sure I understood it correctly), but there are two different things here: reconnection and retry. My knowledge might be off, but as I understand it, reconnection happens on the channel, whereas retry applies to the stream. So I do think this table might be interesting for people, to see what our reconnection attempts on a lost channel look like.
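
To picture the reconnection side of that distinction, the sketch below generates the sequence of delays for "reconnect immediately, then back off exponentially up to a maximum interval", as described in the quoted text. It is a simplification: the function name, the `initial_ms` parameter, and the multiplier are assumptions, and real gRPC implementations also add jitter to each delay.

```python
from itertools import islice


def reconnect_delays_ms(initial_ms: int, max_ms: int, multiplier: float = 2.0):
    """Yield delays (ms) before each reconnect attempt, without jitter."""
    yield 0  # the first attempt happens immediately
    delay = initial_ms
    while True:
        yield min(delay, max_ms)
        # Grow the delay, but never beyond the configured maximum interval.
        delay = min(int(delay * multiplier), max_ms)


# First six delays for a 1 s initial back-off capped at 5 s.
first_delays = list(islice(reconnect_delays_ms(1000, 5000), 6))
# first_delays == [0, 1000, 2000, 4000, 5000, 5000]
```

The generator never terminates on its own, which reflects the "attempts to reconnect indefinitely" wording; stopping is a separate decision, for example when a fatal status code is seen.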

No other status codes are retried.
The flagd gRPC retry policy is specified below:

When disconnected, if the time since disconnection is less than `retryGracePeriod`, the provider emits `STALE` when it disconnects.

Member

I am not sure that just adding this to the overview in the mermaid chart is sufficient; I think this should also be mentioned explicitly.
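
One way to spell out the `retryGracePeriod` rule from the quoted lines is as a small function. This is only a sketch: the function name and the use of seconds are assumptions, and the behavior at exactly `retryGracePeriod` is not defined by the quoted text, so treating it as `ERROR` here is an arbitrary choice.

```python
def state_while_disconnected(seconds_since_disconnect: float,
                             retry_grace_period: float) -> str:
    """Return the state a disconnected provider would report.

    Within the grace period the provider is STALE and keeps serving values
    from its cache or stored flag set; afterwards it reports ERROR.
    """
    if seconds_since_disconnect < retry_grace_period:
        return "STALE"
    return "ERROR"
```

For example, with a grace period of 5 seconds, a provider disconnected for 2 seconds would report `STALE`, and one disconnected for 10 seconds would report `ERROR`.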

| offlineFlagSourcePath | FLAGD_OFFLINE_FLAG_SOURCE_PATH | offline, file-based flag definitions, overrides host/port/targetUri | string | null | file |
| offlinePollIntervalMs | FLAGD_OFFLINE_POLL_MS | poll interval for reading offlineFlagSourcePath | int | 5000 | file |
| contextEnricher | - | sync-metadata to evaluation context mapping function | function | identity function | in-process |
| fatalStatusCodes | - | a list of gRPC status codes, which will cause streams to give up and put the provider in a PROVIDER_FATAL state | array | [] | rpc & in-process |

Contributor

@alexandraoberaigner alexandraoberaigner Nov 10, 2025

Suggested change
| fatalStatusCodes | - | a list of gRPC status codes, which will cause streams to give up and put the provider in a PROVIDER_FATAL state | array | [] | rpc & in-process |
| fatalStatusCodes | FLAGD_FATAL_STATUS_CODES | a list of gRPC status codes, which will cause streams to give up and put the provider in a PROVIDER_FATAL state | array | [] | rpc & in-process |

@toddbaert @aepfli We should probably make sure the change is consistent with:
https://github.com/open-feature/flagd-testbed/pull/311/files#diff-2f3b6fc7d0eec288e7349f23f8f56b197eecf05fef9320930ee266dda60fd6e7R25

4 participants