Skip to content

fix: OIDC token refresh recovery after network disruption#991

Open
dmuensterer wants to merge 7 commits intoopenziti:mainfrom
dmuensterer:fix/oidc-reconnect-refresh
Open

fix: OIDC token refresh recovery after network disruption#991
dmuensterer wants to merge 7 commits intoopenziti:mainfrom
dmuensterer:fix/oidc-reconnect-refresh

Conversation

@dmuensterer
Copy link

@dmuensterer dmuensterer commented Feb 11, 2026

A brief network disruption permanently breaks the OIDC tlsuv_http_t handle. Subsequent refresh attempts retry on the same broken handle and fail indefinitely. With only a 30-second refresh margin, retries are exhausted before the token expires, leading to UNAUTHORIZED errors that tear down all channels and require a full restart.

Three changes:

  1. Reset broken HTTP handle after consecutive failures Add a refresh_failures counter. After 3 consecutive connection failures (code < 0), call tlsuv_http_cancel_all() to close the transport and reset the handle to Disconnected state, allowing tlsuv's auto-reconnect to establish a fresh connection. Also fixes a bug where the if-chain in refresh_cb used non-exclusive conditions, causing UV_EOF to trigger both "restart auth" and "5s retry" simultaneously.

  2. Add 15s connect timeout to OIDC client The OIDC HTTP client was missing tlsuv_http_connect_timeout(), inheriting the kernel default (~130s). A single timeout could consume the entire refresh window. Now matches the controller client's 15s timeout.

  3. Use half-lifetime refresh margin instead of fixed 30s Change token refresh scheduling from (expires_in - 30) to (expires_in / 2). For a typical 1800s token, this gives 900s of retry time (~120 attempts) instead of 30s (~6 attempts).

A brief network disruption permanently breaks the OIDC tlsuv_http_t
handle. Subsequent refresh attempts retry on the same broken handle and
fail indefinitely. With only a 30-second refresh margin, retries are
exhausted before the token expires, leading to UNAUTHORIZED errors that
tear down all channels and require a full restart.

Three changes:

1. Reset broken HTTP handle after consecutive failures
   Add a refresh_failures counter. After 3 consecutive connection
   failures (code < 0), call tlsuv_http_cancel_all() to close the
   transport and reset the handle to Disconnected state, allowing
   tlsuv's auto-reconnect to establish a fresh connection.
   Also fixes a bug where the if-chain in refresh_cb used non-exclusive
   conditions, causing UV_EOF to trigger both "restart auth" and
   "5s retry" simultaneously.

2. Add 15s connect timeout to OIDC client
   The OIDC HTTP client was missing tlsuv_http_connect_timeout(),
   inheriting the kernel default (~130s). A single timeout could
   consume the entire refresh window. Now matches the controller
   client's 15s timeout.

3. Use half-lifetime refresh margin instead of fixed 30s
   Change token refresh scheduling from (expires_in - 30) to
   (expires_in / 2). For a typical 300s token, this gives 150s of
   retry time (~30 attempts) instead of 30s (~6 attempts).
The refresh_cb retry loop resets refresh_failures to 0 after every
3-failure handle reset cycle, so it never escapes the retry loop even
when the server is persistently unreachable. Add total_refresh_failures
counter that accumulates across reset cycles. After 60 total failures
(~5 min at 5s intervals), discard tokens and restart full OIDC auth
via oidc_client_start(), giving the best chance of recovery once the
server is healthy again.
library/oidc.c Outdated
OIDC_LOG(WARN, "OIDC token refresh failed (%d/%s), attempt %d (total: %d)",
http_resp->code, err, clt->refresh_failures, clt->total_refresh_failures);

// after sustained failure (5 min at 5s intervals), give up on refresh and restart full auth
Copy link
Member

@ekoby ekoby Feb 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good.
Just a couple of things here:

  • at default expiration window -- it would give up a third (10 minutes) of token lifetime (even more if window is higher). This may lead to unwanted downstream effects -- like channels/connections may be dropped prematurely.
  • the better way would be to try to utilize the full refresh window with randomized exponential backoff to avoid stampeding controller after a network outage.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I replaced the fixed 60-failure threshold with token-expiry-based escalation. It now retries with randomized exponential backoff (5s→10s→20s→40s→60s cap, jittered to [delay/2, delay]) for the entire refresh window until the token actually expires. Added token_expiry field to track the absolute expiry time. Full re-auth only triggers when uv_now() >= token_expiry.

library/oidc.c Outdated
return;
}

// http_resp->code >= 0 but not 200: server-side rejection (e.g. EOF, 401, etc.)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

EOF(UV_EOF) is negative, so it will be covered in the if(http_resp->code < 0) clause above

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ofc you are right, I removed it. I fixed the comment on the remaining branch to say "e.g. 401, 403" instead of "e.g. EOF, 401" since EOF won't reach this clause here?

Copy link
Member

@ekoby ekoby left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good

just a couple of possible improvements -- let me know if you want to take a crack at them, if not I can put them on top of your changes

@ekoby
Copy link
Member

ekoby commented Feb 11, 2026

BTW, add yourself to CONTRIBUTORS list if you so wish

Replace fixed 5s retry interval and 60-failure threshold with
randomized exponential backoff (5s-60s) that utilizes the full
token refresh window. Escalate to full re-auth only when the
token actually expires, not after a fixed number of attempts.
@github-actions
Copy link

github-actions bot commented Feb 11, 2026

All contributors have signed the CLA ✍️ ✅
Posted by the CLA Assistant Lite bot.

@dmuensterer
Copy link
Author

I have read the CLA Document and I hereby sign the CLA

netfoundry-cla-bot bot added a commit to netfoundry/cla that referenced this pull request Feb 11, 2026
library/oidc.c Outdated
return;
}

if (clt->refresh_failures >= 3) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be removed now, right?

bool need_refresh;
struct auth_req *request;
tlsuv_http_req_t *refresh_req;
int refresh_failures;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you only need one counter. I see they are set and incremented at the same time

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, fixed!

Merge total_refresh_failures into refresh_failures since both were
always incremented and reset together. Remove the periodic 3-failure
connection reset, which is no longer needed with exponential backoff.
Guard against missing or invalid Location headers, query strings,
and expected query parameters in code_cb and auth_cb. Also truncate
authRequestID at '&' to avoid capturing trailing query parameters.
}
p += strlen("authRequestID=");
req->id = strdup(p);
char *end = strchr(p, '&'); // stop at next query param to avoid capturing trailing params
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the URL looks e.g. like: /oidc/authorize?authRequestID=abc123&state=xyz

Then req->id becomes abc123&state=xyz instead of just abc123. That polluted ID then gets used in:

snprintf(path, sizeof(path), "/oidc/login/cert?id=%s", req->id);

@dovholuknf
Copy link
Member

recheck

@dovholuknf
Copy link
Member

recheck cla again please

@dovholuknf
Copy link
Member

recheck cla again...

@ekoby
Copy link
Member

ekoby commented Feb 24, 2026

@dmuensterer we'd like to merge this PR.
You need to sign your commits. see here https://docs.github.com/en/authentication/managing-commit-signature-verification/signing-commits
once you set up signing on your local repo you can rebase against latest and that will force the commits to be signed.

don't worry about fixing windows build. I can take care of it after

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants