fix: OIDC token refresh recovery after network disruption by dmuensterer · Pull Request #991 · openziti/ziti-sdk-c

dmuensterer · 2026-02-11T07:45:14Z

A brief network disruption permanently breaks the OIDC tlsuv_http_t handle. Subsequent refresh attempts retry on the same broken handle and fail indefinitely. With only a 30-second refresh margin, retries are exhausted before the token expires, leading to UNAUTHORIZED errors that tear down all channels and require a full restart.

Three changes:

1. Reset broken HTTP handle after consecutive failures: Add a refresh_failures counter. After 3 consecutive connection failures (code < 0), call tlsuv_http_cancel_all() to close the transport and reset the handle to the Disconnected state, allowing tlsuv's auto-reconnect to establish a fresh connection. This also fixes a bug where the if-chain in refresh_cb used non-exclusive conditions, causing UV_EOF to trigger both "restart auth" and "5s retry" simultaneously.
2. Add 15s connect timeout to OIDC client: The OIDC HTTP client never called tlsuv_http_connect_timeout(), so it inherited the kernel default (~130s). A single timed-out connect could consume the entire refresh window. It now matches the controller client's 15s timeout.
3. Use half-lifetime refresh margin instead of fixed 30s: Change token refresh scheduling from (expires_in - 30) to (expires_in / 2). For a typical 1800s token, this gives 900s of retry time (~180 attempts at 5s intervals) instead of 30s (~6 attempts).
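The scheduling change can be checked with a small standalone sketch. The function names and constants below are hypothetical and are not identifiers from the SDK; the sketch only assumes the fixed 5s retry interval and the two scheduling formulas quoted above.

```c
#include <assert.h>

/* Hypothetical constants for illustration; not taken from the SDK. */
#define FIXED_MARGIN_S   30
#define RETRY_INTERVAL_S 5

/* Old scheduling: start refreshing 30s before the token expires. */
static long refresh_delay_fixed(long expires_in) {
    assert(expires_in >= 0);
    return expires_in > FIXED_MARGIN_S ? expires_in - FIXED_MARGIN_S : 0;
}

/* New scheduling: start refreshing at half of the token lifetime. */
static long refresh_delay_half(long expires_in) {
    assert(expires_in >= 0);
    return expires_in / 2;
}

/* How many fixed-interval retries fit between the first refresh
 * attempt and the moment the token expires. */
static long retry_budget(long expires_in, long refresh_delay) {
    assert(refresh_delay <= expires_in);
    return (expires_in - refresh_delay) / RETRY_INTERVAL_S;
}
```

Calling retry_budget(expires_in, refresh_delay_half(expires_in)) and comparing it with the same call using refresh_delay_fixed() shows how much larger the retry window becomes for a given token lifetime.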

A brief network disruption permanently breaks the OIDC tlsuv_http_t handle. Subsequent refresh attempts retry on the same broken handle and fail indefinitely. With only a 30-second refresh margin, retries are exhausted before the token expires, leading to UNAUTHORIZED errors that tear down all channels and require a full restart.

Three changes:

1. Reset broken HTTP handle after consecutive failures: Add a refresh_failures counter. After 3 consecutive connection failures (code < 0), call tlsuv_http_cancel_all() to close the transport and reset the handle to Disconnected state, allowing tlsuv's auto-reconnect to establish a fresh connection. Also fixes a bug where the if-chain in refresh_cb used non-exclusive conditions, causing UV_EOF to trigger both "restart auth" and "5s retry" simultaneously.
2. Add 15s connect timeout to OIDC client: The OIDC HTTP client was missing tlsuv_http_connect_timeout(), inheriting the kernel default (~130s). A single timeout could consume the entire refresh window. Now matches the controller client's 15s timeout.
3. Use half-lifetime refresh margin instead of fixed 30s: Change token refresh scheduling from (expires_in - 30) to (expires_in / 2). For a typical 300s token, this gives 150s of retry time (~30 attempts) instead of 30s (~6 attempts).

The refresh_cb retry loop resets refresh_failures to 0 after every 3-failure handle reset cycle, so it never escapes the retry loop even when the server is persistently unreachable. Add total_refresh_failures counter that accumulates across reset cycles. After 60 total failures (~5 min at 5s intervals), discard tokens and restart full OIDC auth via oidc_client_start(), giving the best chance of recovery once the server is healthy again.

ekoby · 2026-02-11T15:45:30Z

library/oidc.c

+        OIDC_LOG(WARN, "OIDC token refresh failed (%d/%s), attempt %d (total: %d)",
+                 http_resp->code, err, clt->refresh_failures, clt->total_refresh_failures);
+
+        // after sustained failure (5 min at 5s intervals), give up on refresh and restart full auth


This looks good.
Just a couple of things here:

At the default expiration window, it would give up a third (10 minutes) of the token lifetime (even more if the window is longer). This may lead to unwanted downstream effects, such as channels/connections being dropped prematurely.

A better way would be to use the full refresh window with randomized exponential backoff, to avoid stampeding the controller after a network outage.

Good point. I replaced the fixed 60-failure threshold with token-expiry-based escalation. It now retries with randomized exponential backoff (5s→10s→20s→40s→60s cap, jittered to [delay/2, delay]) for the entire refresh window until the token actually expires. Added token_expiry field to track the absolute expiry time. Full re-auth only triggers when uv_now() >= token_expiry.
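A minimal sketch of the backoff described in this reply, assuming the stated schedule (5s doubling to a 60s cap, jittered to [delay/2, delay]) and expiry-based escalation. All names are hypothetical, rand() stands in for whatever random source the SDK actually uses, and times are plain millisecond values rather than anything read from a uv loop.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical constants matching the schedule quoted above. */
#define BACKOFF_BASE_MS 5000u
#define BACKOFF_CAP_MS  60000u

/* Un-jittered delay for a 1-based attempt number:
 * 5s, 10s, 20s, 40s, then capped at 60s. */
static uint64_t backoff_ceiling_ms(unsigned attempt) {
    assert(attempt >= 1);
    uint64_t delay = BACKOFF_BASE_MS;
    for (unsigned i = 1; i < attempt && delay < BACKOFF_CAP_MS; i++) {
        delay *= 2;
    }
    return delay < BACKOFF_CAP_MS ? delay : BACKOFF_CAP_MS;
}

/* Jittered delay in [ceiling/2, ceiling]. The modulo introduces a
 * small bias, which is acceptable for spreading out retries. */
static uint64_t jittered_delay_ms(unsigned attempt) {
    uint64_t ceiling = backoff_ceiling_ms(attempt);
    uint64_t floor_ms = ceiling / 2;
    return floor_ms + (uint64_t)rand() % (ceiling - floor_ms + 1);
}

typedef enum { RETRY_REFRESH, RESTART_AUTH } refresh_action;

/* Keep retrying the refresh until the token has actually expired;
 * only then fall back to full re-authentication. */
static refresh_action next_action(uint64_t now_ms, uint64_t token_expiry_ms) {
    return now_ms >= token_expiry_ms ? RESTART_AUTH : RETRY_REFRESH;
}
```

The jitter is what keeps many clients that lost connectivity at the same moment from retrying against the controller in lockstep once the network returns.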

ekoby · 2026-02-11T15:50:12Z

library/oidc.c

        return;
    }
+
+    // http_resp->code >= 0 but not 200: server-side rejection (e.g. EOF, 401, etc.)


EOF (UV_EOF) is negative, so it is already covered by the if (http_resp->code < 0) clause above.

Of course, you are right; I removed it. I also changed the comment on the remaining branch to say "e.g. 401, 403" instead of "e.g. EOF, 401", since EOF won't reach this clause.
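The point of this exchange can be illustrated with a mutually exclusive branch chain. This is only a sketch: the enum, the function name, and the mapping of each branch to an outcome are assumptions for illustration, and SKETCH_UV_EOF is a local stand-in for libuv's UV_EOF so the snippet compiles without uv.h.

```c
#include <assert.h>

/* Local stand-in for libuv's UV_EOF; libuv error codes are negative. */
#define SKETCH_UV_EOF (-4095)
static_assert(SKETCH_UV_EOF < 0, "transport errors must be negative");

typedef enum {
    REFRESH_OK,           /* tokens were refreshed */
    REFRESH_RETRY_LATER,  /* transport problem, try the refresh again */
    REFRESH_RESTART_AUTH  /* server rejected the refresh */
} refresh_outcome;

/* Exactly one branch is taken for any code, so a negative code such
 * as EOF can never fall through into the server-rejection branch. */
static refresh_outcome classify_refresh(int code) {
    if (code == 200) {
        return REFRESH_OK;
    } else if (code < 0) {
        /* connection failure, timeout, EOF */
        return REFRESH_RETRY_LATER;
    } else {
        /* code >= 0 but not 200: server-side rejection (e.g. 401, 403) */
        return REFRESH_RESTART_AUTH;
    }
}
```

With independent if statements instead of an else-if chain, a single response code could match more than one condition, which is the kind of double handling the first change in this PR set out to remove.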

ekoby

Looks good.

Just a couple of possible improvements -- let me know if you want to take a crack at them; if not, I can put them on top of your changes.

ekoby · 2026-02-11T17:00:33Z

BTW, add yourself to CONTRIBUTORS list if you so wish

Replace fixed 5s retry interval and 60-failure threshold with randomized exponential backoff (5s-60s) that utilizes the full token refresh window. Escalate to full re-auth only when the token actually expires, not after a fixed number of attempts.

github-actions · 2026-02-11T18:44:41Z

All contributors have signed the CLA ✍️ ✅
Posted by the CLA Assistant Lite bot.

dmuensterer · 2026-02-11T18:54:55Z

I have read the CLA Document and I hereby sign the CLA

ekoby · 2026-02-11T20:26:01Z

library/oidc.c

+            return;
+        }
+
+        if (clt->refresh_failures >= 3) {


this should be removed now, right?

ekoby · 2026-02-11T20:33:52Z

inc_internal/oidc.h

    bool need_refresh;
    struct auth_req *request;
    tlsuv_http_req_t *refresh_req;
+    int refresh_failures;


I think you only need one counter. I see they are set and incremented at the same time

Yup, fixed!

Merge total_refresh_failures into refresh_failures since both were always incremented and reset together. Remove the periodic 3-failure connection reset, which is no longer needed with exponential backoff.

Guard against missing or invalid Location headers, query strings, and expected query parameters in code_cb and auth_cb. Also truncate authRequestID at '&' to avoid capturing trailing query parameters.

dmuensterer · 2026-02-12T08:53:14Z

library/oidc.c

+        }
        p += strlen("authRequestID=");
-        req->id = strdup(p);
+        char *end = strchr(p, '&'); // stop at next query param to avoid capturing trailing params


Without this truncation, if the URL looks like, e.g., /oidc/authorize?authRequestID=abc123&state=xyz

then req->id becomes abc123&state=xyz instead of just abc123. That polluted ID then gets used in:

snprintf(path, sizeof(path), "/oidc/login/cert?id=%s", req->id);
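The truncation can be sketched in isolation as follows. The helper name is hypothetical and this is not the code from the PR; it only demonstrates stopping at the next '&' and returning NULL when the parameter is missing. For brevity it searches for the key anywhere in the string, which a stricter parser would not do.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Returns a heap-allocated copy of the authRequestID value, or NULL
 * if the input is NULL, the parameter is absent, or allocation fails.
 * The caller owns the result and must free() it. */
static char *extract_auth_request_id(const char *location) {
    static const char key[] = "authRequestID=";
    if (location == NULL) return NULL;

    const char *p = strstr(location, key);
    if (p == NULL) return NULL;
    p += strlen(key);

    /* Length of the value: up to the next '&' or the end of the string. */
    size_t len = strcspn(p, "&");
    char *id = malloc(len + 1);
    if (id == NULL) return NULL;
    memcpy(id, p, len);
    id[len] = '\0';
    return id;
}

/* Example use with the URL from the comment above. */
void extract_auth_request_id_demo(void) {
    char *id = extract_auth_request_id(
        "/oidc/authorize?authRequestID=abc123&state=xyz");
    assert(id != NULL && strcmp(id, "abc123") == 0);
    free(id);
}
```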

dovholuknf · 2026-02-13T17:00:53Z

recheck

dovholuknf · 2026-02-13T17:14:28Z

recheck cla again please

dovholuknf · 2026-02-13T17:20:34Z

recheck cla again...

ekoby · 2026-02-24T13:15:44Z

@dmuensterer we'd like to merge this PR.
You need to sign your commits; see https://docs.github.com/en/authentication/managing-commit-signature-verification/signing-commits
Once you set up signing on your local repo, you can rebase against latest, and that will force the commits to be signed.

Don't worry about fixing the Windows build. I can take care of it afterwards.

dmuensterer requested a review from a team as a code owner February 11, 2026 07:45

dmuensterer mentioned this pull request Feb 11, 2026

OIDC token refresh never recovers from network disruption, permanent UNAUTHORIZED state requires restart #990

Open

ekoby reviewed Feb 11, 2026

netfoundry-cla-bot bot added a commit to netfoundry/cla that referenced this pull request Feb 11, 2026

@dmuensterer has signed the CLA in openziti/ziti-sdk-c#991

27df2b2

ekoby reviewed Feb 11, 2026

ekoby reviewed Feb 11, 2026

dmuensterer added 2 commits February 12, 2026 10:18

Consolidate refresh failure counters and remove 3-failure reset

53c9e00

Add NULL checks for redirect parsing in OIDC auth flow

b0f3edb

dmuensterer commented Feb 12, 2026

Merge branch 'main' into fix/oidc-reconnect-refresh

8060597

Merge branch 'main' into fix/oidc-reconnect-refresh

fbee24d
