112 changes: 70 additions & 42 deletions docs/reference/specifications/providers.md
@@ -64,18 +64,21 @@ stateDiagram-v2
NOT_READY --> ERROR: initialize
READY --> ERROR: disconnected, disconnected period == 0
READY --> STALE: disconnected, disconnect period < retry grace period
READY --> NOT_READY: shutdown
STALE --> ERROR: disconnect period >= retry grace period
STALE --> NOT_READY: shutdown
ERROR --> READY: reconnected
ERROR --> [*]: shutdown
ERROR --> NOT_READY: shutdown
ERROR --> [*]: Error code == PROVIDER_FATAL

note right of STALE
note left of STALE
Comment on lines +69 to +74
Member Author

@toddbaert toddbaert Oct 30, 2025

old:

Image

new:

Image

The main difference is that we make it clear that transitions are possible from a non-fatal ERROR back to NOT_READY. Many implementations already support this, but not all.
I think it makes sense to specify this so we can be consistent.

stream disconnected, attempting to reconnect,
resolve from cache*
resolve from flag set rules**
STALE emitted
end note

note right of READY
note left of READY
stream connected,
evaluation cache active*,
flag set rules stored**,
@@ -84,7 +87,7 @@ stateDiagram-v2
CHANGE emitted with stream messages
end note

note right of ERROR
note left of ERROR
stream disconnected, attempting to reconnect,
evaluation cache purged*,
ERROR emitted
@@ -101,25 +104,49 @@ stateDiagram-v2

### Stream Reconnection

When either stream (sync or event) disconnects, whether due to the associated deadline being exceeded, network error or any other cause, the provider attempts to re-establish the stream immediately, and then retries with an exponential back-off.
We always rely on the [integrated functionality of GRPC for reconnection](https://github.com/grpc/grpc/blob/master/doc/connection-backoff.md) and utilize [Wait-for-Ready](https://grpc.io/docs/guides/wait-for-ready/) to re-establish the stream.
We are configuring the underlying reconnection mechanism whenever we can, based on our configuration. (not all GRPC implementations support this)
When either stream (sync or event) disconnects, whether due to the associated deadline being exceeded, network error or any other cause, the provider attempts to re-establish the stream immediately.
Member

@aepfli aepfli Oct 31, 2025

Take this with a grain of salt (I am not sure I understood it correctly), but reconnection and retry are two different things. My knowledge might be off here, but I believe the reconnect happens on the channel, whereas the retry applies to the stream. So I do think this table might be interesting for people, to show what our reconnection attempts on a lost channel look like.

Both the RPC and sync streams will forever attempt to reconnect unless the stream response indicates a [fatal status code](#fatal-status-codes).
This is distinct from the [gRPC retry-policy](#grpc-retry-policy), which automatically retries *all RPCs* (streams or otherwise) a limited number of times to make the provider resilient to transient errors.
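
The reconnection behavior described above can be sketched roughly as follows. This is a minimal illustration, not the provider implementation: `StreamError`, `open_stream`, and the `wait` pacing hook are hypothetical names introduced only for this example.

```python
from typing import Callable, Iterable


class StreamError(Exception):
    """Hypothetical error carrying the gRPC status code name of a failed stream."""

    def __init__(self, status: str) -> None:
        super().__init__(status)
        self.status = status


def reconnect_until_fatal(
    open_stream: Callable[[], None],
    fatal_status_codes: Iterable[str],
    wait: Callable[[], None] = lambda: None,
) -> str:
    """Re-open the stream until it fails with a fatal status code, then return that code."""
    fatal = set(fatal_status_codes)
    while True:
        try:
            open_stream()  # hypothetical: blocks while the stream is healthy
        except StreamError as err:
            if err.status in fatal:
                return err.status  # non-transient: stop reconnecting
        wait()  # pacing hook (assumption); the default reconnects immediately
```

Any status code outside the fatal set, as well as a stream that simply ends, leads to another connection attempt.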

| language/property | min connect timeout | max backoff | initial backoff | jitter | multiplier |
|-------------------|-----------------------------------|--------------------------|--------------------------|--------|------------|
| GRPC property | grpc.initial_reconnect_backoff_ms | max_reconnect_backoff_ms | min_reconnect_backoff_ms | 0.2 | 1.6 |
| Flagd property | deadlineMs | retryBackoffMaxMs | retryBackoffMs | 0.2 | 1.6 |
| --- | --- | --- | --- | --- | --- |
| default [^1] | ✅ | ✅ | ✅ | 0.2 | 1.6 |
| js | ✅ | ✅ | ❌ | 0.2 | 1.6 |
| java | ❌ | ❌ | ❌ | 0.2 | 1.6 |
## gRPC Retry Policy

[^1] : C++, Python, Ruby, Objective-C, PHP, C#, js(deprecated)
flagd leverages gRPC's built-in retry mechanism for all RPCs.
In short, the retry policy retries any RPC that returns an `UNAVAILABLE` or `UNKNOWN` status code up to 3 times, with backoffs of 1s, 2s, and 4s respectively.
No other status codes are retried.
The flagd gRPC retry policy is specified below:

When disconnected, if the time since disconnection is less than `retryGracePeriod`, the provider emits `STALE` when it disconnects.
Member

I am not sure that adding this only to the overview in the mermaid chart is sufficient; I think it should also be mentioned explicitly.

While the provider is in state `STALE` the provider resolves values from its cache or stored flag set rules, depending on its resolver mode.
When the time since the last disconnect first exceeds `retryGracePeriod`, the provider emits `ERROR`.
The provider attempts to reconnect indefinitely, with a maximum interval of `retryBackoffMaxMs`.
Member Author

This is the standard retryPolicy, accepted in this JSON format by most gRPC implementations.

```json
{
  "methodConfig": [
    {
      "name": [
        {
          "service": "flagd.evaluation.v1.Service"
        },
        {
          "service": "flagd.sync.v1.FlagSyncService"
        }
      ],
      "retryPolicy": {
        "MaxAttempts": 4,
        "InitialBackoff": "1s",
        "MaxBackoff": $FLAGD_RETRY_BACKOFF_MAX_MS, // from provider options
        "BackoffMultiplier": 2.0,
        "RetryableStatusCodes": [
          "UNAVAILABLE",
          "UNKNOWN"
        ]
      }
    }
  ]
}
```
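
To see how the policy values relate to the "1s, 2s, 4s" schedule, the nominal backoffs can be computed as in the sketch below. The function name `retry_backoffs` is hypothetical, and the 120-second cap assumes the default `retryBackoffMaxMs` of 120000. gRPC implementations typically randomize the actual delay, so these are nominal values rather than exact wait times.

```python
from typing import List


def retry_backoffs(
    max_attempts: int,
    initial_backoff_s: float,
    max_backoff_s: float,
    multiplier: float,
) -> List[float]:
    """Nominal delay before each retry; the first attempt is not a retry."""
    delays = []
    delay = initial_backoff_s
    for _ in range(max_attempts - 1):
        delays.append(min(delay, max_backoff_s))  # cap each delay at the maximum
        delay *= multiplier
    return delays


# 4 attempts total means 3 retries.
print(retry_backoffs(4, 1.0, 120.0, 2.0))  # [1.0, 2.0, 4.0]
```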

## Fatal Status Codes

Providers accept an option for defining fatal gRPC status codes which, when received in the RPC or sync streams, transition the provider to the PROVIDER_FATAL state.
This configuration is useful for situations wherein these codes indicate to a client that their configuration is invalid and must be changed (i.e., the error is non-transient).
Examples include status codes such as `UNAUTHENTICATED` or `PERMISSION_DENIED`.
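
As a rough illustration of how fatal status codes interact with the grace-period transitions in the state diagram, the decision could be sketched like this. The function `state_after_disconnect` and its parameter names are assumptions made for this example, not part of any provider API.

```python
from typing import Iterable


def state_after_disconnect(
    status_code: str,
    disconnected_s: float,
    retry_grace_period_s: float,
    fatal_status_codes: Iterable[str] = (),
) -> str:
    """Pick the provider state for a stream that is currently disconnected."""
    if status_code in set(fatal_status_codes):
        return "PROVIDER_FATAL"  # non-transient: the provider gives up
    if disconnected_s >= retry_grace_period_s:
        return "ERROR"  # grace period elapsed (or is 0)
    return "STALE"  # still within the grace period
```

With a grace period of 0, a disconnect moves straight to `ERROR`, matching the `READY --> ERROR` transition in the diagram.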

## RPC Resolver

@@ -262,28 +289,29 @@ precedence.

Below are the supported configuration parameters (note that not all apply to both resolver modes):

| Option name | Environment variable name | Explanation | Type & Values | Default | Compatible resolver |
| --------------------- | ------------------------------ | ---------------------------------------------------------------------- | ---------------------------- | ----------------------------- | ----------------------- |
| resolver | FLAGD_RESOLVER | mode of operation | String - `rpc`, `in-process` | rpc | rpc & in-process |
| host | FLAGD_HOST | remote host | String | localhost | rpc & in-process |
| port | FLAGD_PORT | remote port | int | 8013 (rpc), 8015 (in-process) | rpc & in-process |
| targetUri | FLAGD_TARGET_URI | alternative to host/port, supporting custom name resolution | string | null | rpc & in-process |
| tls | FLAGD_TLS | connection encryption | boolean | false | rpc & in-process |
| socketPath | FLAGD_SOCKET_PATH | alternative to host port, unix socket | String | null | rpc & in-process |
| certPath | FLAGD_SERVER_CERT_PATH | tls cert path | String | null | rpc & in-process |
| deadlineMs | FLAGD_DEADLINE_MS | deadline for unary calls, and timeout for initialization | int | 500 | rpc & in-process & file |
| streamDeadlineMs | FLAGD_STREAM_DEADLINE_MS | deadline for streaming calls, useful as an application-layer keepalive | int | 600000 | rpc & in-process |
| retryBackoffMs | FLAGD_RETRY_BACKOFF_MS | initial backoff for stream retry | int | 1000 | rpc & in-process |
| retryBackoffMaxMs | FLAGD_RETRY_BACKOFF_MAX_MS | maximum backoff for stream retry | int | 120000 | rpc & in-process |
| retryGracePeriod | FLAGD_RETRY_GRACE_PERIOD | period in seconds before provider moves from STALE to ERROR state | int | 5 | rpc & in-process & file |
| keepAliveTime | FLAGD_KEEP_ALIVE_TIME_MS | http 2 keepalive | long | 0 | rpc & in-process |
| cache | FLAGD_CACHE | enable cache of static flags | String - `lru`, `disabled` | lru | rpc |
| maxCacheSize | FLAGD_MAX_CACHE_SIZE | max size of static flag cache | int | 1000 | rpc |
| selector | FLAGD_SOURCE_SELECTOR | selects a single sync source to retrieve flags from only that source | string | null | in-process |
| providerId | FLAGD_PROVIDER_ID | A unique identifier for flagd(grpc client) initiating the request. | string | null | in-process |
| offlineFlagSourcePath | FLAGD_OFFLINE_FLAG_SOURCE_PATH | offline, file-based flag definitions, overrides host/port/targetUri | string | null | file |
| offlinePollIntervalMs | FLAGD_OFFLINE_POLL_MS | poll interval for reading offlineFlagSourcePath | int | 5000 | file |
| contextEnricher | - | sync-metadata to evaluation context mapping function | function | identity function | in-process |
| Option name | Environment variable name | Explanation | Type & Values | Default | Compatible resolver |
| --------------------- | ------------------------------ | --------------------------------------------------------------------------------------------------------------- | ---------------------------- | ----------------------------- | ----------------------- |
| resolver | FLAGD_RESOLVER | mode of operation | string - `rpc`, `in-process` | rpc | rpc & in-process |
| host | FLAGD_HOST | remote host | string | localhost | rpc & in-process |
| port | FLAGD_PORT | remote port | int | 8013 (rpc), 8015 (in-process) | rpc & in-process |
| targetUri | FLAGD_TARGET_URI | alternative to host/port, supporting custom name resolution | string | null | rpc & in-process |
| tls | FLAGD_TLS | connection encryption | boolean | false | rpc & in-process |
| socketPath | FLAGD_SOCKET_PATH | alternative to host port, unix socket | string | null | rpc & in-process |
| certPath | FLAGD_SERVER_CERT_PATH | tls cert path | string | null | rpc & in-process |
| deadlineMs | FLAGD_DEADLINE_MS | deadline for unary calls, and timeout for initialization | int | 500 | rpc & in-process & file |
| streamDeadlineMs | FLAGD_STREAM_DEADLINE_MS | deadline for streaming calls, useful as an application-layer keepalive | int | 600000 | rpc & in-process |
| retryBackoffMs | FLAGD_RETRY_BACKOFF_MS | initial backoff for stream retry | int | 1000 | rpc & in-process |
| retryBackoffMaxMs | FLAGD_RETRY_BACKOFF_MAX_MS | maximum backoff for stream retry | int | 120000 | rpc & in-process |
| retryGracePeriod | FLAGD_RETRY_GRACE_PERIOD | period in seconds before provider moves from STALE to ERROR state | int | 5 | rpc & in-process & file |
| keepAliveTime | FLAGD_KEEP_ALIVE_TIME_MS | http 2 keepalive | long | 0 | rpc & in-process |
| cache | FLAGD_CACHE | enable cache of static flags | string - `lru`, `disabled` | lru | rpc |
| maxCacheSize | FLAGD_MAX_CACHE_SIZE | max size of static flag cache | int | 1000 | rpc |
| selector | FLAGD_SOURCE_SELECTOR | selects a single sync source to retrieve flags from only that source | string | null | in-process |
| providerId | FLAGD_PROVIDER_ID | A unique identifier for flagd(grpc client) initiating the request. | string | null | in-process |
| offlineFlagSourcePath | FLAGD_OFFLINE_FLAG_SOURCE_PATH | offline, file-based flag definitions, overrides host/port/targetUri | string | null | file |
| offlinePollIntervalMs | FLAGD_OFFLINE_POLL_MS | poll interval for reading offlineFlagSourcePath | int | 5000 | file |
| contextEnricher | - | sync-metadata to evaluation context mapping function | function | identity function | in-process |
| fatalStatusCodes | - | a list of gRPC status codes, which will cause streams to give up and put the provider in a PROVIDER_FATAL state | array | [] | rpc & in-process |
Contributor

@alexandraoberaigner alexandraoberaigner Nov 10, 2025

Suggested change
| fatalStatusCodes | - | a list of gRPC status codes, which will cause streams to give up and put the provider in a PROVIDER_FATAL state | array | [] | rpc & in-process |
| fatalStatusCodes | FLAGD_FATAL_STATUS_CODES | a list of gRPC status codes, which will cause streams to give up and put the provider in a PROVIDER_FATAL state | array | [] | rpc & in-process |

@toddbaert @aepfli We should probably make sure the change is consistent with:
https://github.com/open-feature/flagd-testbed/pull/311/files#diff-2f3b6fc7d0eec288e7349f23f8f56b197eecf05fef9320930ee266dda60fd6e7R25
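
For illustration, the retry-related options from the table could be read from the environment roughly as follows. This is only a sketch: `load_retry_options` is a hypothetical helper, `FLAGD_FATAL_STATUS_CODES` is the variable name proposed in the suggestion above (not yet part of the table), and the comma-separated format for it is an assumption.

```python
import os
from typing import Mapping


def load_retry_options(env: Mapping[str, str] = os.environ) -> dict:
    """Read retry-related options, falling back to the defaults listed in the table."""
    raw_codes = env.get("FLAGD_FATAL_STATUS_CODES", "")  # proposed name, assumed CSV
    return {
        "retryBackoffMs": int(env.get("FLAGD_RETRY_BACKOFF_MS", "1000")),
        "retryBackoffMaxMs": int(env.get("FLAGD_RETRY_BACKOFF_MAX_MS", "120000")),
        "retryGracePeriod": int(env.get("FLAGD_RETRY_GRACE_PERIOD", "5")),
        "fatalStatusCodes": [c.strip() for c in raw_codes.split(",") if c.strip()],
    }
```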


### Custom Name Resolution
