fix: OIDC token refresh recovery after network disruption#991
fix: OIDC token refresh recovery after network disruption#991dmuensterer wants to merge 7 commits intoopenziti:mainfrom
Conversation
A brief network disruption permanently breaks the OIDC tlsuv_http_t handle. Subsequent refresh attempts retry on the same broken handle and fail indefinitely. With only a 30-second refresh margin, retries are exhausted before the token expires, leading to UNAUTHORIZED errors that tear down all channels and require a full restart. Three changes: 1. Reset broken HTTP handle after consecutive failures Add a refresh_failures counter. After 3 consecutive connection failures (code < 0), call tlsuv_http_cancel_all() to close the transport and reset the handle to Disconnected state, allowing tlsuv's auto-reconnect to establish a fresh connection. Also fixes a bug where the if-chain in refresh_cb used non-exclusive conditions, causing UV_EOF to trigger both "restart auth" and "5s retry" simultaneously. 2. Add 15s connect timeout to OIDC client The OIDC HTTP client was missing tlsuv_http_connect_timeout(), inheriting the kernel default (~130s). A single timeout could consume the entire refresh window. Now matches the controller client's 15s timeout. 3. Use half-lifetime refresh margin instead of fixed 30s Change token refresh scheduling from (expires_in - 30) to (expires_in / 2). For a typical 300s token, this gives 150s of retry time (~30 attempts) instead of 30s (~6 attempts).
The refresh_cb retry loop resets refresh_failures to 0 after every 3-failure handle reset cycle, so it never escapes the retry loop even when the server is persistently unreachable. Add total_refresh_failures counter that accumulates across reset cycles. After 60 total failures (~5 min at 5s intervals), discard tokens and restart full OIDC auth via oidc_client_start(), giving the best chance of recovery once the server is healthy again.
library/oidc.c
Outdated
| OIDC_LOG(WARN, "OIDC token refresh failed (%d/%s), attempt %d (total: %d)", | ||
| http_resp->code, err, clt->refresh_failures, clt->total_refresh_failures); | ||
|
|
||
| // after sustained failure (5 min at 5s intervals), give up on refresh and restart full auth |
There was a problem hiding this comment.
This looks good.
Just a couple of things here:
- at default expiration window -- it would give up a third (10 minutes) of token lifetime (even more if window is higher). This may lead to unwanted downstream effects -- like channels/connections may be dropped prematurely.
- the better way would be to try to utilize the full refresh window with randomized exponential backoff to avoid stampeding controller after a network outage.
There was a problem hiding this comment.
Good point. I replaced the fixed 60-failure threshold with token-expiry-based escalation. It now retries with randomized exponential backoff (5s→10s→20s→40s→60s cap, jittered to [delay/2, delay]) for the entire refresh window until the token actually expires. Added token_expiry field to track the absolute expiry time. Full re-auth only triggers when uv_now() >= token_expiry.
library/oidc.c
Outdated
| return; | ||
| } | ||
|
|
||
| // http_resp->code >= 0 but not 200: server-side rejection (e.g. EOF, 401, etc.) |
There was a problem hiding this comment.
EOF(UV_EOF) is negative, so it will be covered in the if(http_resp->code < 0) clause above
There was a problem hiding this comment.
Ofc you are right, I removed it. I fixed the comment on the remaining branch to say "e.g. 401, 403" instead of "e.g. EOF, 401" since EOF won't reach this clause here?
ekoby
left a comment
There was a problem hiding this comment.
looks good
just a couple of possible improvements -- let me know if you want to take a crack at them, if not I can put them on top of your changes
|
BTW, add yourself to CONTRIBUTORS list if you so wish |
Replace fixed 5s retry interval and 60-failure threshold with randomized exponential backoff (5s-60s) that utilizes the full token refresh window. Escalate to full re-auth only when the token actually expires, not after a fixed number of attempts.
|
All contributors have signed the CLA ✍️ ✅ |
|
I have read the CLA Document and I hereby sign the CLA |
library/oidc.c
Outdated
| return; | ||
| } | ||
|
|
||
| if (clt->refresh_failures >= 3) { |
There was a problem hiding this comment.
this should be removed now, right?
| bool need_refresh; | ||
| struct auth_req *request; | ||
| tlsuv_http_req_t *refresh_req; | ||
| int refresh_failures; |
There was a problem hiding this comment.
I think you only need one counter. I see they are set and incremented at the same time
Merge total_refresh_failures into refresh_failures since both were always incremented and reset together. Remove the periodic 3-failure connection reset, which is no longer needed with exponential backoff.
Guard against missing or invalid Location headers, query strings, and expected query parameters in code_cb and auth_cb. Also truncate authRequestID at '&' to avoid capturing trailing query parameters.
| } | ||
| p += strlen("authRequestID="); | ||
| req->id = strdup(p); | ||
| char *end = strchr(p, '&'); // stop at next query param to avoid capturing trailing params |
There was a problem hiding this comment.
If the URL looks e.g. like: /oidc/authorize?authRequestID=abc123&state=xyz
Then req->id becomes abc123&state=xyz instead of just abc123. That polluted ID then gets used in:
snprintf(path, sizeof(path), "/oidc/login/cert?id=%s", req->id);
|
recheck |
|
recheck cla again please |
|
recheck cla again... |
|
@dmuensterer we'd like to merge this PR. don't worry about fixing windows build. I can take care of it after |
A brief network disruption permanently breaks the OIDC tlsuv_http_t handle. Subsequent refresh attempts retry on the same broken handle and fail indefinitely. With only a 30-second refresh margin, retries are exhausted before the token expires, leading to UNAUTHORIZED errors that tear down all channels and require a full restart.
Three changes:
Reset broken HTTP handle after consecutive failures Add a refresh_failures counter. After 3 consecutive connection failures (code < 0), call tlsuv_http_cancel_all() to close the transport and reset the handle to Disconnected state, allowing tlsuv's auto-reconnect to establish a fresh connection. Also fixes a bug where the if-chain in refresh_cb used non-exclusive conditions, causing UV_EOF to trigger both "restart auth" and "5s retry" simultaneously.
Add 15s connect timeout to OIDC client The OIDC HTTP client was missing tlsuv_http_connect_timeout(), inheriting the kernel default (~130s). A single timeout could consume the entire refresh window. Now matches the controller client's 15s timeout.
Use half-lifetime refresh margin instead of fixed 30s Change token refresh scheduling from (expires_in - 30) to (expires_in / 2). For a typical 1800s token, this gives 900s of retry time (~120 attempts) instead of 30s (~6 attempts).