docs: fatal codes, re-init, and retry policy #1818

toddbaert · 2025-10-30T16:23:06Z

This PR specifies provider behavior around stream health, gRPC retry policy, and FATAL codes.

Specifically, it:

- publishes a retry policy that shall be used by all flagd providers
- specifies a new option for marking some gRPC status codes as FATAL, which will cause the provider to stop attempting to reconnect (generally useful, and requested in "Infinite retry to establish connection to FlagSyncService in Flagd golang provider", go-sdk-contrib#756)
- makes clear via the state diagram that flagd providers should support re-initialization (if not in the FATAL state)

Signed-off-by: Todd Baert <[email protected]>

netlify · 2025-10-30T16:23:12Z

✅ Deploy Preview for polite-licorice-3db33c ready!

Name	Link
🔨 Latest commit	`48a46ea`
🔍 Latest deploy log	https://app.netlify.com/projects/polite-licorice-3db33c/deploys/6903943709115b0008a4908d
😎 Deploy Preview	https://deploy-preview-1818--polite-licorice-3db33c.netlify.app

gemini-code-assist · 2025-10-30T16:23:23Z

Summary of Changes

Hello @toddbaert, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the robustness and predictability of flagd provider behavior by formalizing how providers handle stream health, gRPC retry mechanisms, and fatal error conditions. It introduces a standardized retry policy for transient network issues and a critical new feature to recognize and react to non-transient (fatal) gRPC errors, preventing endless reconnection loops. Additionally, the documentation now explicitly outlines provider re-initialization capabilities, ensuring clearer and more reliable provider operations.

Highlights

- Standardized Retry Policy: A clear gRPC retry policy has been published, to be adopted by all flagd providers, specifically for UNAVAILABLE and UNKNOWN status codes.
- Fatal Error Handling: A new mechanism has been introduced to mark certain gRPC status codes as FATAL, which will cause providers to cease reconnection attempts for non-transient errors.
- Provider Re-initialization: The state diagram has been updated to explicitly clarify that flagd providers should support re-initialization, provided they are not in a FATAL state.

toddbaert · 2025-10-30T16:25:21Z

docs/reference/specifications/providers.md

+    STALE --> NOT_READY: shutdown
    ERROR --> READY: reconnected
-    ERROR --> [*]: shutdown
+    ERROR --> NOT_READY: shutdown
+    ERROR --> [*]: Error code == PROVIDER_FATAL

-    note right of STALE
+    note left of STALE


[Images: old and new state diagrams]

The main difference is that we make it clear that transitions are possible from a non-fatal ERROR back to NOT_READY. Many implementations already support this, but not all.
I think it makes sense to specify this so we can be consistent.
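To make the changed edges concrete, here is a minimal sketch of the transitions visible in the diff excerpt above, written as a lookup table. It covers only those four edges, not the full diagram. The event names (`shutdown`, `reconnected`, `provider_fatal`), the `FATAL` member standing in for the diagram's terminal `[*]` node, and both helper functions are hypothetical names chosen for illustration.

```python
from enum import Enum


class State(Enum):
    NOT_READY = "NOT_READY"
    READY = "READY"
    STALE = "STALE"
    ERROR = "ERROR"
    FATAL = "FATAL"  # assumption: stands in for the terminal [*] node


# Only the edges shown in the diff excerpt; the real diagram has more.
TRANSITIONS = {
    (State.STALE, "shutdown"): State.NOT_READY,
    (State.ERROR, "reconnected"): State.READY,
    (State.ERROR, "shutdown"): State.NOT_READY,
    (State.ERROR, "provider_fatal"): State.FATAL,
}


def next_state(current, event):
    """Return the state after `event`, or `current` if no edge is listed here."""
    return TRANSITIONS.get((current, event), current)


def can_reinitialize(state):
    """Re-initialization is ruled out only for the terminal (fatal) state."""
    return state is not State.FATAL
```

With this table, a non-fatal ERROR followed by `shutdown` lands in NOT_READY, from which the provider can be initialized again, while a fatal error ends the lifecycle.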

gemini-code-assist

Code Review

This pull request updates the provider specification to clarify behavior around stream health, gRPC retry policies, and fatal error codes. The changes include updating the state diagram, defining a gRPC retry policy, and introducing the concept of fatal status codes that stop reconnection attempts. The documentation is clearer as a result. I've found a few issues: an invalid JSON example for the retry policy, an inconsistency in the number of retries described, and a minor stylistic point.

docs/reference/specifications/providers.md

toddbaert · 2025-10-30T16:27:20Z

docs/reference/specifications/providers.md

-While the provider is in state `STALE` the provider resolves values from its cache or stored flag set rules, depending on its resolver mode.
-When the time since the last disconnect first exceeds `retryGracePeriod`, the provider emits `ERROR`.
-The provider attempts to reconnect indefinitely, with a maximum interval of `retryBackoffMaxMs`.
+```json


This is a standard gRPC retryPolicy, accepted in this JSON format by most gRPC implementations.
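For readers unfamiliar with the format, the sketch below builds a gRPC service config containing a `retryPolicy` and serializes it to JSON. The retryable status codes (UNAVAILABLE and UNKNOWN) come from the summary earlier in this thread. Everything else is an assumption for illustration: the numeric values are placeholders rather than the values in this PR, the service name is assumed, and `build_service_config` is a hypothetical helper.

```python
import json


def build_service_config(service, max_attempts, initial_backoff_s,
                         max_backoff_s, multiplier):
    """Serialize a gRPC service config with a retryPolicy for one service."""
    retry_policy = {
        "maxAttempts": max_attempts,
        "initialBackoff": f"{initial_backoff_s}s",  # durations are strings such as "1s"
        "maxBackoff": f"{max_backoff_s}s",
        "backoffMultiplier": multiplier,
        # Per the summary in this thread, only these two codes are retried.
        "retryableStatusCodes": ["UNAVAILABLE", "UNKNOWN"],
    }
    config = {
        "methodConfig": [
            {"name": [{"service": service}], "retryPolicy": retry_policy}
        ]
    }
    return json.dumps(config)


# Placeholder values and an assumed service name, not the ones specified in the PR.
EXAMPLE_CONFIG = build_service_config(
    "flagd.sync.v1.FlagSyncService", 3, 1, 5, 2.0
)
```

In the Python gRPC library, a string like this is typically handed to the channel through the `grpc.service_config` channel option. Other implementations accept the same JSON through their own configuration hooks.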

toddbaert · 2025-10-30T16:27:43Z

docs/reference/specifications/providers.md

+| offlineFlagSourcePath | FLAGD_OFFLINE_FLAG_SOURCE_PATH | offline, file-based flag definitions, overrides host/port/targetUri                                             | string                       | null                          | file                    |
+| offlinePollIntervalMs | FLAGD_OFFLINE_POLL_MS          | poll interval for reading offlineFlagSourcePath                                                                 | int                          | 5000                          | file                    |
+| contextEnricher       | -                              | sync-metadata to evaluation context mapping function                                                            | function                     | identity function             | in-process              |
+| fatalStatusCodes      | -                              | a list of gRPC status codes, which will cause streams to give up and put the provider in a PROVIDER_FATAL state | array                        | []                            | rpc & in-process        |


fatalStatusCodes is the only new option; the other changes are just whitespace.
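As a rough illustration of how a provider might consult `fatalStatusCodes`, the sketch below classifies a gRPC status code name into one of three outcomes. The function name, the outcome labels, and the rule that a fatal code takes precedence over a retryable one are assumptions, not something this PR states.

```python
# Codes covered by the retry policy, per the summary in this thread.
RETRYABLE_CODES = frozenset({"UNAVAILABLE", "UNKNOWN"})


def classify_status(code, fatal_status_codes=()):
    """Classify a gRPC status code name.

    Returns "fatal" when the code is configured in fatalStatusCodes
    (provider enters PROVIDER_FATAL and stops reconnecting), "retry"
    when the retry policy applies, and "surface" when the error is
    handed to the provider's normal stream handling.
    """
    name = code.strip().upper()
    fatal = {c.strip().upper() for c in fatal_status_codes}
    if name in fatal:
        return "fatal"  # assumption: fatal wins over retryable
    if name in RETRYABLE_CODES:
        return "retry"
    return "surface"
```

With the default empty list, no code is ever fatal, which matches the `[]` default in the table row above.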

sonarqubecloud · 2025-10-30T16:37:43Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

aepfli · 2025-10-31T10:51:02Z

docs/reference/specifications/providers.md

-When either stream (sync or event) disconnects, whether due to the associated deadline being exceeded, network error or any other cause, the provider attempts to re-establish the stream immediately, and then retries with an exponential back-off.
-We always rely on the [integrated functionality of GRPC for reconnection](https://github.com/grpc/grpc/blob/master/doc/connection-backoff.md) and utilize [Wait-for-Ready](https://grpc.io/docs/guides/wait-for-ready/) to re-establish the stream.
-We are configuring the underlying reconnection mechanism whenever we can, based on our configuration. (not all GRPC implementations support this)
+When either stream (sync or event) disconnects, whether due to the associated deadline being exceeded, network error or any other cause, the provider attempts to re-establish the stream immediately.


Take this with a grain of salt (I am not sure I understood it correctly), but reconnection and retry are two different things. My knowledge might be off here, but as I understand it, reconnection happens on the channel, whereas retry applies to the stream. So I do think this table might be interesting for people who want to see what our reconnection attempts on a lost channel look like.
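To illustrate the channel-level side of that distinction, here is a small sketch of an exponential reconnection back-off schedule. The default values are the ones suggested in the gRPC connection back-off document linked in the removed text (1 s initial back-off, multiplier 1.6, 120 s cap); jitter is left out for readability, and individual gRPC implementations or flagd providers may configure different values. `backoff_schedule` is a hypothetical helper.

```python
def backoff_schedule(attempts, initial_s=1.0, multiplier=1.6, max_s=120.0):
    """Return the wait (in seconds) before each of `attempts` reconnection tries.

    Jitter is intentionally omitted; real implementations randomize each delay.
    """
    delays = []
    current = initial_s
    for _ in range(attempts):
        delays.append(current)
        # Grow the delay geometrically, but never beyond the cap.
        current = min(current * multiplier, max_s)
    return delays


if __name__ == "__main__":
    for attempt, delay in enumerate(backoff_schedule(15), start=1):
        print(f"attempt {attempt}: wait {delay:.2f}s")
```

This schedule governs how often the channel tries to reconnect. The retry policy discussed elsewhere in this PR is a separate mechanism that applies to calls on the stream.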

aepfli · 2025-10-31T10:53:52Z

docs/reference/specifications/providers.md

+No other status codes are retried.
+The flagd gRPC retry policy is specified below:

-When disconnected, if the time since disconnection is less than `retryGracePeriod`, the provider emits `STALE` when it disconnects.


I am not sure that just adding this to the overview in the mermaid chart is sufficient; I think it should also be mentioned explicitly.
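For reference, the behavior described by the removed sentences can be sketched as a single comparison against `retryGracePeriod`. The function name is hypothetical, the unit is assumed to be seconds, and treating the exact boundary as still STALE is an assumption, since the text only says "less than" and "exceeds".

```python
def event_after_disconnect(seconds_since_disconnect, retry_grace_period):
    """Return the event implied by the time elapsed since the stream was lost.

    Within the grace period the provider is STALE and keeps resolving from
    its cache or stored rules; once the period is exceeded it is in ERROR.
    """
    if seconds_since_disconnect > retry_grace_period:
        return "ERROR"
    return "STALE"  # assumption: the exact boundary still counts as STALE
```

A provider would evaluate this when the stream drops and again when the grace period elapses, emitting each event once on the corresponding transition.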

alexandraoberaigner · 2025-11-10T13:10:50Z

docs/reference/specifications/providers.md

+| offlineFlagSourcePath | FLAGD_OFFLINE_FLAG_SOURCE_PATH | offline, file-based flag definitions, overrides host/port/targetUri                                             | string                       | null                          | file                    |
+| offlinePollIntervalMs | FLAGD_OFFLINE_POLL_MS          | poll interval for reading offlineFlagSourcePath                                                                 | int                          | 5000                          | file                    |
+| contextEnricher       | -                              | sync-metadata to evaluation context mapping function                                                            | function                     | identity function             | in-process              |
+| fatalStatusCodes      | -                              | a list of gRPC status codes, which will cause streams to give up and put the provider in a PROVIDER_FATAL state | array                        | []                            | rpc & in-process        |


docs: fatal codes, re-init, and retry policy

b4cc836

Signed-off-by: Todd Baert <[email protected]>

toddbaert requested review from a team as code owners October 30, 2025 16:23

dosubot bot added the size:L label (this PR changes 100-499 lines, ignoring generated files) Oct 30, 2025

toddbaert commented Oct 30, 2025

gemini-code-assist bot reviewed Oct 30, 2025

toddbaert commented Oct 30, 2025

toddbaert and others added 4 commits October 30, 2025 12:32

fixup: json

8a0b6f1

Signed-off-by: Todd Baert <[email protected]>

Update docs/reference/specifications/providers.md

f749674

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Todd Baert <[email protected]>

fixup: typo

18363a9

Signed-off-by: Todd Baert <[email protected]>

Update docs/reference/specifications/providers.md

48a46ea

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Todd Baert <[email protected]>

aepfli reviewed Oct 31, 2025

alexandraoberaigner reviewed Nov 10, 2025

aepfli mentioned this pull request Nov 11, 2025

feat: add missing steps for config and improve wording open-feature/flagd-testbed#311

Open

alexandraoberaigner mentioned this pull request Nov 17, 2025

Infinite retry to establish connection to FlagSyncService in Flagd golang provider open-feature/go-sdk-contrib#756

Open

-| fatalStatusCodes | - | a list of gRPC status codes, which will cause streams to give up and put the provider in a PROVIDER_FATAL state | array | [] | rpc & in-process |
+| fatalStatusCodes | FLAGD_FATAL_STATUS_CODES | a list of gRPC status codes, which will cause streams to give up and put the provider in a PROVIDER_FATAL state | array | [] | rpc & in-process |
