docs: fatal codes, re-init, and retry policy #1818
base: main
Changes from all commits
b4cc836
8a0b6f1
f749674
18363a9
48a46ea
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
|
|
@@ -64,18 +64,21 @@ stateDiagram-v2 | |||||
| NOT_READY --> ERROR: initialize | ||||||
| READY --> ERROR: disconnected, disconnected period == 0 | ||||||
| READY --> STALE: disconnected, disconnect period < retry grace period | ||||||
| READY --> NOT_READY: shutdown | ||||||
| STALE --> ERROR: disconnect period >= retry grace period | ||||||
| STALE --> NOT_READY: shutdown | ||||||
| ERROR --> READY: reconnected | ||||||
| ERROR --> [*]: shutdown | ||||||
| ERROR --> NOT_READY: shutdown | ||||||
| ERROR --> [*]: Error code == PROVIDER_FATAL | ||||||
|
|
||||||
| note right of STALE | ||||||
| note left of STALE | ||||||
| stream disconnected, attempting to reconnect, | ||||||
| resolve from cache* | ||||||
| resolve from flag set rules** | ||||||
| STALE emitted | ||||||
| end note | ||||||
|
|
||||||
| note right of READY | ||||||
| note left of READY | ||||||
| stream connected, | ||||||
| evaluation cache active*, | ||||||
| flag set rules stored**, | ||||||
|
|
@@ -84,7 +87,7 @@ stateDiagram-v2 | |||||
| CHANGE emitted with stream messages | ||||||
| end note | ||||||
|
|
||||||
| note right of ERROR | ||||||
| note left of ERROR | ||||||
| stream disconnected, attempting to reconnect, | ||||||
| evaluation cache purged*, | ||||||
| ERROR emitted | ||||||
|
|
@@ -101,25 +104,49 @@ stateDiagram-v2 | |||||
|
|
||||||
| ### Stream Reconnection | ||||||
|
|
||||||
| When either stream (sync or event) disconnects, whether due to the associated deadline being exceeded, network error or any other cause, the provider attempts to re-establish the stream immediately, and then retries with an exponential back-off. | ||||||
| We always rely on the [integrated functionality of GRPC for reconnection](https://github.com/grpc/grpc/blob/master/doc/connection-backoff.md) and utilize [Wait-for-Ready](https://grpc.io/docs/guides/wait-for-ready/) to re-establish the stream. | ||||||
| We are configuring the underlying reconnection mechanism whenever we can, based on our configuration. (not all GRPC implementations support this) | ||||||
| When either stream (sync or event) disconnects, whether due to the associated deadline being exceeded, network error or any other cause, the provider attempts to re-establish the stream immediately. | ||||||
|
Member
Take this with a grain of salt, as I am not sure I understood it correctly, but reconnection and retry are two different things. My knowledge might be off here, but as I understand it, the reconnect happens on the channel, whereas the retry applies to the stream. So I do think this table might be interesting for people, to show what our reconnection attempts on a lost channel look like. |
||||||
| Both the RPC and sync streams will forever attempt to reconnect unless the stream response indicates a [fatal status code](#fatal-status-codes). | ||||||
| This is distinct from the [gRPC retry-policy](#grpc-retry-policy), which automatically retries *all RPCs* (streams or otherwise) a limited number of times to make the provider resilient to transient errors. | ||||||
|
|
||||||
| | language/property | min connect timeout | max backoff | initial backoff | jitter | multiplier | | ||||||
| |-------------------|-----------------------------------|--------------------------|--------------------------|--------|------------| | ||||||
| | GRPC property | grpc.initial_reconnect_backoff_ms | max_reconnect_backoff_ms | min_reconnect_backoff_ms | 0.2 | 1.6 | | ||||||
| | Flagd property | deadlineMs | retryBackoffMaxMs | retryBackoffMs | 0.2 | 1.6 | | ||||||
| | --- | --- | --- | --- | --- | --- | | ||||||
| | default [^1] | ✅ | ✅ | ✅ | 0.2 | 1.6 | | ||||||
| | js | ✅ | ✅ | ❌ | 0.2 | 1.6 | | ||||||
| | java | ❌ | ❌ | ❌ | 0.2 | 1.6 | | ||||||
| ## gRPC Retry Policy | ||||||
|
|
||||||
| [^1] : C++, Python, Ruby, Objective-C, PHP, C#, js(deprecated) | ||||||
| flagd leverages gRPC built-in retry mechanism for all RPCs. | ||||||
| In short, the retry policy attempts to retry all RPCs which return `UNAVAILABLE` or `UNKNOWN` status codes 3 times, with a 1s, 2s, 4s, backoff respectively. | ||||||
| No other status codes are retried. | ||||||
| The flagd gRPC retry policy is specified below: | ||||||
|
|
||||||
| When disconnected, if the time since disconnection is less than `retryGracePeriod`, the provider emits `STALE` when it disconnects. | ||||||
|
Member
I am not sure that just adding this to the overview in the Mermaid chart is sufficient; I think this should also be mentioned explicitly. |
||||||
| While the provider is in state `STALE` the provider resolves values from its cache or stored flag set rules, depending on its resolver mode. | ||||||
| When the time since the last disconnect first exceeds `retryGracePeriod`, the provider emits `ERROR`. | ||||||
| The provider attempts to reconnect indefinitely, with a maximum interval of `retryBackoffMaxMs`. | ||||||
| ```json | ||||||
|
Member
Author
This is the standard retryPolicy, accepted in this JSON format by most gRPC implementations. |
||||||
| { | ||||||
| "methodConfig": [ | ||||||
| { | ||||||
| "name": [ | ||||||
| { | ||||||
| "service": "flagd.evaluation.v1.Service" | ||||||
| }, | ||||||
| { | ||||||
| "service": "flagd.sync.v1.FlagSyncService" | ||||||
| } | ||||||
| ], | ||||||
| "retryPolicy": { | ||||||
| "MaxAttempts": 4, | ||||||
| "InitialBackoff": "1s", | ||||||
| "MaxBackoff": $FLAGD_RETRY_BACKOFF_MAX_MS, // from provider options | ||||||
| "BackoffMultiplier": 2.0, | ||||||
| "RetryableStatusCodes": [ | ||||||
| "UNAVAILABLE", | ||||||
| "UNKNOWN" | ||||||
| ] | ||||||
| } | ||||||
| } | ||||||
| ] | ||||||
| } | ||||||
| ``` | ||||||
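As a rough check on the "1s, 2s, 4s" description, the nominal backoff schedule implied by the policy above can be computed as follows. This is only a sketch: `nominal_backoffs` is a hypothetical helper, the 120 s cap assumes the default `retryBackoffMaxMs` of 120000, and gRPC implementations randomize the actual delays, so real timings will differ.

```python
def nominal_backoffs(max_attempts, initial_backoff_s, max_backoff_s, multiplier):
    """Return the nominal (un-jittered) delay, in seconds, before each retry.

    Hypothetical helper for illustration; not part of flagd or gRPC.
    """
    retries = max_attempts - 1  # the first attempt is not a retry
    return [
        min(initial_backoff_s * multiplier ** n, max_backoff_s)
        for n in range(retries)
    ]


# MaxAttempts=4, InitialBackoff=1s, BackoffMultiplier=2.0 come from the policy;
# the 120 s cap is an assumption based on the default retryBackoffMaxMs.
schedule = nominal_backoffs(4, 1.0, 120.0, 2.0)
print(schedule)  # [1.0, 2.0, 4.0]
```

With `MaxAttempts` of 4, the original call plus three retries are made, which matches the "3 times" wording in the text.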
|
|
||||||
| ## Fatal Status Codes | ||||||
|
|
||||||
| Providers accept an option for defining fatal gRPC status codes which, when received in the RPC or sync streams, transition the provider to the PROVIDER_FATAL state. | ||||||
| This configuration is useful for situations wherein these codes indicate to a client that their configuration is invalid and must be changed (i.e., the error is non-transient). | ||||||
| Examples for this include status codes such as `UNAUTHENTICATED` or `PERMISSION_DENIED`. | ||||||
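A minimal sketch of how a stream loop might act on this option is shown here. The names `StreamAction` and `on_stream_error` are hypothetical, and status codes are represented as plain strings rather than any particular gRPC library's types; actual providers may structure this differently.

```python
from enum import Enum
from typing import Iterable


class StreamAction(Enum):
    RECONNECT = "reconnect"  # keep trying to re-establish the stream
    GIVE_UP = "give_up"      # stop reconnecting; provider becomes PROVIDER_FATAL


def on_stream_error(status_code: str, fatal_status_codes: Iterable[str]) -> StreamAction:
    """Decide what to do when a stream ends with the given gRPC status code.

    Hypothetical helper: codes listed in fatal_status_codes are treated as
    non-transient, everything else leads to another reconnection attempt.
    """
    if status_code in set(fatal_status_codes):
        return StreamAction.GIVE_UP
    return StreamAction.RECONNECT


fatal = ["UNAUTHENTICATED", "PERMISSION_DENIED"]
print(on_stream_error("UNAUTHENTICATED", fatal))  # StreamAction.GIVE_UP
print(on_stream_error("UNAVAILABLE", fatal))      # StreamAction.RECONNECT
```

With the default empty list, no status code is fatal and the streams keep reconnecting indefinitely.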
|
|
||||||
| ## RPC Resolver | ||||||
|
|
||||||
|
|
@@ -262,28 +289,29 @@ precedence. | |||||
|
|
||||||
| Below are the supported configuration parameters (note that not all apply to both resolver modes): | ||||||
|
|
||||||
| | Option name | Environment variable name | Explanation | Type & Values | Default | Compatible resolver | | ||||||
| | --------------------- | ------------------------------ | ---------------------------------------------------------------------- | ---------------------------- | ----------------------------- | ----------------------- | | ||||||
| | resolver | FLAGD_RESOLVER | mode of operation | String - `rpc`, `in-process` | rpc | rpc & in-process | | ||||||
| | host | FLAGD_HOST | remote host | String | localhost | rpc & in-process | | ||||||
| | port | FLAGD_PORT | remote port | int | 8013 (rpc), 8015 (in-process) | rpc & in-process | | ||||||
| | targetUri | FLAGD_TARGET_URI | alternative to host/port, supporting custom name resolution | string | null | rpc & in-process | | ||||||
| | tls | FLAGD_TLS | connection encryption | boolean | false | rpc & in-process | | ||||||
| | socketPath | FLAGD_SOCKET_PATH | alternative to host port, unix socket | String | null | rpc & in-process | | ||||||
| | certPath | FLAGD_SERVER_CERT_PATH | tls cert path | String | null | rpc & in-process | | ||||||
| | deadlineMs | FLAGD_DEADLINE_MS | deadline for unary calls, and timeout for initialization | int | 500 | rpc & in-process & file | | ||||||
| | streamDeadlineMs | FLAGD_STREAM_DEADLINE_MS | deadline for streaming calls, useful as an application-layer keepalive | int | 600000 | rpc & in-process | | ||||||
| | retryBackoffMs | FLAGD_RETRY_BACKOFF_MS | initial backoff for stream retry | int | 1000 | rpc & in-process | | ||||||
| | retryBackoffMaxMs | FLAGD_RETRY_BACKOFF_MAX_MS | maximum backoff for stream retry | int | 120000 | rpc & in-process | | ||||||
| | retryGracePeriod | FLAGD_RETRY_GRACE_PERIOD | period in seconds before provider moves from STALE to ERROR state | int | 5 | rpc & in-process & file | | ||||||
| | keepAliveTime | FLAGD_KEEP_ALIVE_TIME_MS | http 2 keepalive | long | 0 | rpc & in-process | | ||||||
| | cache | FLAGD_CACHE | enable cache of static flags | String - `lru`, `disabled` | lru | rpc | | ||||||
| | maxCacheSize | FLAGD_MAX_CACHE_SIZE | max size of static flag cache | int | 1000 | rpc | | ||||||
| | selector | FLAGD_SOURCE_SELECTOR | selects a single sync source to retrieve flags from only that source | string | null | in-process | | ||||||
| | providerId | FLAGD_PROVIDER_ID | A unique identifier for flagd(grpc client) initiating the request. | string | null | in-process | | ||||||
| | offlineFlagSourcePath | FLAGD_OFFLINE_FLAG_SOURCE_PATH | offline, file-based flag definitions, overrides host/port/targetUri | string | null | file | | ||||||
| | offlinePollIntervalMs | FLAGD_OFFLINE_POLL_MS | poll interval for reading offlineFlagSourcePath | int | 5000 | file | | ||||||
| | contextEnricher | - | sync-metadata to evaluation context mapping function | function | identity function | in-process | | ||||||
| | Option name | Environment variable name | Explanation | Type & Values | Default | Compatible resolver | | ||||||
| | --------------------- | ------------------------------ | --------------------------------------------------------------------------------------------------------------- | ---------------------------- | ----------------------------- | ----------------------- | | ||||||
| | resolver | FLAGD_RESOLVER | mode of operation | string - `rpc`, `in-process` | rpc | rpc & in-process | | ||||||
| | host | FLAGD_HOST | remote host | string | localhost | rpc & in-process | | ||||||
| | port | FLAGD_PORT | remote port | int | 8013 (rpc), 8015 (in-process) | rpc & in-process | | ||||||
| | targetUri | FLAGD_TARGET_URI | alternative to host/port, supporting custom name resolution | string | null | rpc & in-process | | ||||||
| | tls | FLAGD_TLS | connection encryption | boolean | false | rpc & in-process | | ||||||
| | socketPath | FLAGD_SOCKET_PATH | alternative to host port, unix socket | string | null | rpc & in-process | | ||||||
| | certPath | FLAGD_SERVER_CERT_PATH | tls cert path | string | null | rpc & in-process | | ||||||
| | deadlineMs | FLAGD_DEADLINE_MS | deadline for unary calls, and timeout for initialization | int | 500 | rpc & in-process & file | | ||||||
| | streamDeadlineMs | FLAGD_STREAM_DEADLINE_MS | deadline for streaming calls, useful as an application-layer keepalive | int | 600000 | rpc & in-process | | ||||||
| | retryBackoffMs | FLAGD_RETRY_BACKOFF_MS | initial backoff for stream retry | int | 1000 | rpc & in-process | | ||||||
| | retryBackoffMaxMs | FLAGD_RETRY_BACKOFF_MAX_MS | maximum backoff for stream retry | int | 120000 | rpc & in-process | | ||||||
| | retryGracePeriod | FLAGD_RETRY_GRACE_PERIOD | period in seconds before provider moves from STALE to ERROR state | int | 5 | rpc & in-process & file | | ||||||
| | keepAliveTime | FLAGD_KEEP_ALIVE_TIME_MS | http 2 keepalive | long | 0 | rpc & in-process | | ||||||
| | cache | FLAGD_CACHE | enable cache of static flags | string - `lru`, `disabled` | lru | rpc | | ||||||
| | maxCacheSize | FLAGD_MAX_CACHE_SIZE | max size of static flag cache | int | 1000 | rpc | | ||||||
| | selector | FLAGD_SOURCE_SELECTOR | selects a single sync source to retrieve flags from only that source | string | null | in-process | | ||||||
| | providerId | FLAGD_PROVIDER_ID | A unique identifier for flagd(grpc client) initiating the request. | string | null | in-process | | ||||||
| | offlineFlagSourcePath | FLAGD_OFFLINE_FLAG_SOURCE_PATH | offline, file-based flag definitions, overrides host/port/targetUri | string | null | file | | ||||||
| | offlinePollIntervalMs | FLAGD_OFFLINE_POLL_MS | poll interval for reading offlineFlagSourcePath | int | 5000 | file | | ||||||
| | contextEnricher | - | sync-metadata to evaluation context mapping function | function | identity function | in-process | | ||||||
| | fatalStatusCodes | - | a list of gRPC status codes, which will cause streams to give up and put the provider in a PROVIDER_FATAL state | array | [] | rpc & in-process | | ||||||
|
Contributor
Suggested change
@toddbaert @aepfli We should probably make sure the change is consistent with: |
||||||
|
|
||||||
| ### Custom Name Resolution | ||||||
|
|
||||||
|
|
||||||
old:
new:
The main difference is that we make it clear that transitions are possible from a non-fatal
ERROR back to NOT_READY. Many implementations already support this, but not all. I think it makes sense to specify this so we can be consistent.
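To make the intended behavior concrete, here is a small sketch of a subset of the transitions from the updated diagram. The event names, the `FATAL` member (standing in for the diagram's terminal `[*]` state), and the `next_state` helper are all hypothetical; only the state names and the transitions themselves come from the diff.

```python
from enum import Enum, auto


class State(Enum):
    NOT_READY = auto()
    READY = auto()
    STALE = auto()
    ERROR = auto()
    FATAL = auto()  # assumption: represents the terminal [*] state


# Subset of the transitions visible in the diff; event names are illustrative.
TRANSITIONS = {
    (State.READY, "shutdown"): State.NOT_READY,
    (State.STALE, "shutdown"): State.NOT_READY,
    (State.ERROR, "shutdown"): State.NOT_READY,  # newly specified in this PR
    (State.ERROR, "reconnected"): State.READY,
    (State.ERROR, "fatal_code"): State.FATAL,    # error code == PROVIDER_FATAL
    (State.STALE, "grace_period_exceeded"): State.ERROR,
}


def next_state(state: State, event: str) -> State:
    """Return the state after the event, or the same state if no rule applies."""
    return TRANSITIONS.get((state, event), state)


print(next_state(State.ERROR, "shutdown"))    # State.NOT_READY
print(next_state(State.ERROR, "fatal_code"))  # State.FATAL
print(next_state(State.FATAL, "shutdown"))    # State.FATAL
```

In this sketch a non-fatal `ERROR` can be shut down and re-initialized, while `FATAL` has no outgoing transitions.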