
NAS-140180 / 27.0.0-BETA.1 / Update NFS CI for resilience#18397

Open
mgrimesix wants to merge 1 commit into master from update-nfs-ci

Conversation


@mgrimesix mgrimesix commented Mar 6, 2026

The NFS CI test module needs updating, cleanup, and refactoring; however, there are CI test failures that should be addressed first.

The NFS CI tests have been failing often due to timing issues and, more recently, websocket instability.

This PR adds resilience to timing and websocket issues for the NFS CI test module.

With these changes, the NFS CI test module, test_300_nfs.py, was run 10 times in a row without errors, against both a Jenkins VM and a local VM.

This is safe for 26 beta 2.


Adding WIP label to investigate websocket instability.

…ing the CI tests.

A good part of this is handling the 'websocket' disconnect.
@mgrimesix mgrimesix requested a review from a team March 6, 2026 22:15
@bugclerk bugclerk changed the title Update NFS CI for resilience NAS-140180 / 27.0.0-BETA.1 / Update NFS CI for resilience Mar 6, 2026

bugclerk commented Mar 6, 2026


@anodos325
Contributor

I'm not sure we should be papering over websocket disconnections. If this is readily reproducible we should get to the bottom of it.

@mgrimesix mgrimesix added the WIP label Mar 6, 2026
Comment on lines +326 to +338
    for attempt in range(3):
        # Allow half a sec and retries
        sleep(0.5)
        try:
            result = int(ssh("cat /proc/fs/nfsd/threads"))
        except ClientException as e:
            # websocket loss resilience
            if not _is_ws_disconnect(e) or attempt >= 2:
                raise
        if result == expected:
            break

    assert result == expected, result

🔴 Bug: In confirm_nfsd_processes, when ssh() raises a ClientException that is a WebSocket disconnect on the first attempt, result is never assigned (the assignment is inside the try block that raised), but execution falls through to if result == expected: on line 335, causing UnboundLocalError. Add continue after the exception is caught and not re-raised.

Extended reasoning...

What the bug is

The confirm_nfsd_processes function (lines 326–338) has a missing continue statement in its exception handler. When ssh("cat /proc/fs/nfsd/threads") raises a ClientException that is identified as a transient WebSocket disconnect (and attempt < 2), the except block catches the exception but does not re-raise it. However, without a continue, control flow falls through to if result == expected: on line 335.

Why it fails

The variable result is assigned inside the try block: result = int(ssh(...)). When the ssh() call raises, this assignment never executes. On the first iteration (attempt=0), result has never been set in any prior iteration either, so it is completely unbound. Accessing it on line 335 raises UnboundLocalError: local variable result referenced before assignment.

Step-by-step proof

  1. confirm_nfsd_processes(expected=8) is called.
  2. Loop begins with attempt=0.
  3. sleep(0.5) executes.
  4. Inside the try block, ssh("cat /proc/fs/nfsd/threads") raises a ClientException with errno 103 (WebSocket closed).
  5. The except ClientException handler fires. _is_ws_disconnect(e) returns True and attempt is 0 (which is < 2), so the raise is skipped.
  6. With no continue or return, execution proceeds past the except block to if result == expected: on line 335.
  7. result was never assigned (the assignment on line 332 was inside the try that raised), so Python raises UnboundLocalError.
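The fall-through can be reproduced standalone. In this sketch, RuntimeError stands in for the middleware's ClientException and flaky() for the failing ssh() call; the loop has the same shape as the code under review, including the missing continue:

```python
def flaky():
    # Stand-in for ssh() hitting a websocket disconnect.
    raise RuntimeError("websocket closed")

def broken_loop(expected=8):
    # Same control flow as the loop under review.
    for attempt in range(3):
        try:
            result = int(flaky())
        except RuntimeError:
            if attempt >= 2:
                raise
            # BUG: no `continue` here -- falls through to the check below
        if result == expected:  # UnboundLocalError on the first iteration
            break
    return result
```

Calling broken_loop() raises UnboundLocalError rather than retrying, because the except block exits without either re-raising or continuing.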

Secondary issue on subsequent iterations

Even if a previous iteration successfully assigned result, the missing continue means the code checks a stale result from the prior iteration rather than retrying the SSH call. This could cause the function to incorrectly break out of the retry loop with an outdated value.

Impact

This bug turns a transient, recoverable WebSocket disconnection into an unhandled UnboundLocalError, crashing the test. This directly undermines the stated goal of the PR (adding resilience to WebSocket instability). The other retry helpers in this PR (resilient_nfs_control, resilient_nfs_update) all correctly handle this pattern.

Fix

Add continue after the guarded raise in the except block:

except ClientException as e:
    if not _is_ws_disconnect(e) or attempt >= 2:
        raise
    continue  # retry the ssh call
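Putting the fix together, the whole helper would look roughly like this. ClientException, _is_ws_disconnect, and the errno-103 convention are sketched stand-ins for the test module's real helpers, and ssh is passed in as a parameter only to keep the example self-contained:

```python
from time import sleep

class ClientException(Exception):
    # Stand-in for the middleware client's exception type.
    def __init__(self, errno=None):
        super().__init__(errno)
        self.errno = errno

def _is_ws_disconnect(exc):
    # Assumed predicate: errno 103 marks a transient websocket loss.
    return getattr(exc, "errno", None) == 103

def confirm_nfsd_processes(expected, ssh):
    """Retry reading the nfsd thread count, tolerating up to two
    transient websocket disconnects."""
    for attempt in range(3):
        sleep(0.5)
        try:
            result = int(ssh("cat /proc/fs/nfsd/threads"))
        except ClientException as e:
            if not _is_ws_disconnect(e) or attempt >= 2:
                raise
            continue  # retry: `result` was never assigned this pass
        if result == expected:
            break
    assert result == expected, result
    return result
```

With the continue in place, a disconnect on the first attempt simply triggers a fresh ssh() call instead of reading an unbound result.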

Comment on lines +255 to +264
current = call("nfs.config")
if all(current.get(k) == v for k, v in payload.items()):
# Config was applied but the restart may still be in progress.
# Poll until the service is RUNNING so that callers can safely
# query rpcbind or /etc/nfs.conf without racing.
for _ in range(30):
if get_nfs_service_state() == 'RUNNING':
break
sleep(1)
return current

🟡 resilient_nfs_update config-applied check at line 258 produces a false negative when payload={"servers": None}. The NFS API (nfs_extend()) transforms servers=None into a computed integer (e.g. 12), so nfs.config() returns {"servers": 12} and the comparison 12 == None is False, triggering an unnecessary retry and NFS restart. Consider filtering None values from the payload comparison or reading the pre-transform DB value.

Extended reasoning...

What the bug is

The resilient_nfs_update function (lines 249–268) retries nfs.update on transient WebSocket disconnections. After a failed call, it reads back the current config via call("nfs.config") and checks whether the update was already applied:

current = call("nfs.config")
if all(current.get(k) == v for k, v in payload.items()):

This comparison assumes that the values returned by nfs.config() are identical to the values passed in payload. However, the NFS service extend logic (nfs_extend() in nfs.py:170–176) transforms certain input values before storing/returning them. Specifically, servers=None is replaced with min(max(core_count, 1), 32) — a computed integer like 12.
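The transform is easy to mirror. Under the review's reading of nfs_extend(), a servers value of None becomes a core-count default clamped to [1, 32]; this is a sketch of the described behavior, not the middleware source:

```python
def computed_servers(core_count):
    # servers=None is replaced with the core count, clamped to [1, 32],
    # per the nfs_extend() behavior described above.
    return min(max(core_count, 1), 32)
```

So on a 12-core machine nfs.config() reports servers=12 after an update with servers=None, and a naive equality check against the payload fails.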

How it manifests

When resilient_nfs_update({"servers": None}) is called and a WebSocket disconnect occurs:

  1. The first nfs.update({"servers": None}) call succeeds on the server side (servers is set to auto-managed in the DB), but the client gets a ClientException.
  2. The retry logic calls nfs.config(), which returns {"servers": 12, ...} (the computed value).
  3. The check evaluates 12 == None → False.
  4. The function concludes the update was NOT applied and retries, calling nfs.update({"servers": None}) again — causing an unnecessary NFS restart.
  5. If all 3 attempts hit WebSocket disconnects, the function raises despite the update having succeeded on the very first call.

Step-by-step proof with test_service_update

test_service_update (line 883) parametrizes with nfsd=None and calls resilient_nfs_update({"servers": None}) at line 895. With cores=4 mocked:

  • Call 1: nfs.update({"servers": None}) → server sets servers=4 internally → WS disconnect → ClientException raised.
  • Config check: call("nfs.config") returns {"servers": 4, ...}.
  • Comparison: 4 == None → False → retry triggered.
  • Call 2: nfs.update({"servers": None}) again → unnecessary NFS restart → if WS disconnect again, one more retry.
  • Call 3: if WS disconnect again → raise → test fails despite the update having succeeded on attempt 1.

Impact and severity

The practical impact is limited: (1) The retry is idempotent — it just causes an unnecessary NFS service restart. (2) Test failure only occurs if all 3 attempts hit WS disconnects, which is unlikely. (3) This is CI test infrastructure code, not production code. This makes it a nit-level issue.

How to fix

Filter out None values from the payload before comparison, since None means "let the server compute the value":

filtered = {k: v for k, v in payload.items() if v is not None}
if all(current.get(k) == v for k, v in filtered.items()):

Alternatively, for None values, just check that the key exists in the config (any non-None value means the server accepted and transformed it).
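As a sketch of the suggested fix, the applied-check could live in a small helper; update_applied is a hypothetical name, not part of the module:

```python
def update_applied(current, payload):
    """Return True if every concrete key/value in `payload` is
    reflected in `current`. None values mean "let the server compute
    the value", so they are excluded from the comparison."""
    filtered = {k: v for k, v in payload.items() if v is not None}
    return all(current.get(k) == v for k, v in filtered.items())
```

With this helper, resilient_nfs_update({"servers": None}) would treat any returned servers value as confirmation that the update landed, avoiding the spurious retry and restart.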
