
NAS-140180 / 27.0.0-BETA.1 / Update NFS CI for resilience#18397

Open
mgrimesix wants to merge 1 commit into master from update-nfs-ci

Conversation


@mgrimesix mgrimesix commented Mar 6, 2026

The NFS CI test module needs updating, cleanup, and refactoring; however, there are CI test failures that should be addressed first.

The NFS CI tests have been failing often due to timing issues and, more recently, websocket instability.

This PR adds resilience to timing and websocket issues for the NFS CI test module.

With these changes, the NFS CI test module, test_300_nfs.py, was run 10 times in a row without errors, against both a Jenkins VM and a local VM.

This is safe for 26 beta 2.


Adding WIP label to investigate websocket instability.

…ing the CI tests.

A good part of this is handling the 'websocket' disconnect.
@mgrimesix mgrimesix requested a review from a team March 6, 2026 22:15
@bugclerk bugclerk changed the title Update NFS CI for resilience NAS-140180 / 27.0.0-BETA.1 / Update NFS CI for resilience Mar 6, 2026

bugclerk commented Mar 6, 2026


@anodos325
Contributor

I'm not sure we should be papering over websocket disconnections. If this is readily reproducible we should get to the bottom of it.

@mgrimesix mgrimesix added the WIP label Mar 6, 2026
Comment on lines +326 to +338
    for attempt in range(3):
        # Allow half a sec and retries
        sleep(0.5)
        try:
            result = int(ssh("cat /proc/fs/nfsd/threads"))
        except ClientException as e:
            # websocket loss resilience
            if not _is_ws_disconnect(e) or attempt >= 2:
                raise
        if result == expected:
            break

    assert result == expected, result

🔴 Bug: In confirm_nfsd_processes, when ssh() raises a ClientException that is a WebSocket disconnect on the first attempt, result is never assigned (the assignment is inside the try block that raised), but execution falls through to if result == expected: on line 335, causing UnboundLocalError. Add continue after the exception is caught and not re-raised.

Extended reasoning...

What the bug is

The confirm_nfsd_processes function (lines 326–338) has a missing continue statement in its exception handler. When ssh("cat /proc/fs/nfsd/threads") raises a ClientException that is identified as a transient WebSocket disconnect (and attempt < 2), the except block catches the exception but does not re-raise it. However, without a continue, control flow falls through to if result == expected: on line 335.

Why it fails

The variable result is assigned inside the try block: result = int(ssh(...)). When the ssh() call raises, this assignment never executes. On the first iteration (attempt=0), result has never been set in any prior iteration either, so it is completely unbound. Accessing it on line 335 raises UnboundLocalError: local variable result referenced before assignment.

Step-by-step proof

  1. confirm_nfsd_processes(expected=8) is called.
  2. Loop begins with attempt=0.
  3. sleep(0.5) executes.
  4. Inside the try block, ssh("cat /proc/fs/nfsd/threads") raises a ClientException with errno 103 (WebSocket closed).
  5. The except ClientException handler fires. _is_ws_disconnect(e) returns True and attempt is 0 (which is < 2), so the raise is skipped.
  6. With no continue or return, execution proceeds past the except block to if result == expected: on line 335.
  7. result was never assigned (the assignment on line 332 was inside the try that raised), so Python raises UnboundLocalError.
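The fall-through can be reproduced standalone. In this sketch, RuntimeError stands in for the middleware's ClientException and flaky() for the failing ssh() call; the loop has the same shape as the code under review, including the missing continue:

```python
def flaky():
    # Stand-in for ssh() hitting a websocket disconnect.
    raise RuntimeError("websocket closed")

def broken_loop(expected=8):
    # Same control flow as the loop under review.
    for attempt in range(3):
        try:
            result = int(flaky())
        except RuntimeError:
            if attempt >= 2:
                raise
            # BUG: no `continue` here -- falls through to the check below
        if result == expected:  # UnboundLocalError on the first iteration
            break
    return result
```

Calling broken_loop() raises UnboundLocalError rather than retrying, because the except block exits without either re-raising or continuing.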

Secondary issue on subsequent iterations

Even if a previous iteration successfully assigned result, the missing continue means the code checks a stale result from the prior iteration rather than retrying the SSH call. This could cause the function to incorrectly break out of the retry loop with an outdated value.

Impact

This bug turns a transient, recoverable WebSocket disconnection into an unhandled UnboundLocalError, crashing the test. This directly undermines the stated goal of the PR (adding resilience to WebSocket instability). The other retry helpers in this PR (resilient_nfs_control, resilient_nfs_update) all correctly handle this pattern.

Fix

Add continue after the guarded raise in the except block:

except ClientException as e:
    if not _is_ws_disconnect(e) or attempt >= 2:
        raise
    continue  # retry the ssh call
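Putting the fix together, the whole helper would look roughly like this. ClientException, _is_ws_disconnect, and the errno-103 convention are sketched stand-ins for the test module's real helpers, and ssh is passed in as a parameter only to keep the example self-contained:

```python
from time import sleep

class ClientException(Exception):
    # Stand-in for the middleware client's exception type.
    def __init__(self, errno=None):
        super().__init__(errno)
        self.errno = errno

def _is_ws_disconnect(exc):
    # Assumed predicate: errno 103 marks a transient websocket loss.
    return getattr(exc, "errno", None) == 103

def confirm_nfsd_processes(expected, ssh):
    """Retry reading the nfsd thread count, tolerating up to two
    transient websocket disconnects."""
    for attempt in range(3):
        sleep(0.5)
        try:
            result = int(ssh("cat /proc/fs/nfsd/threads"))
        except ClientException as e:
            if not _is_ws_disconnect(e) or attempt >= 2:
                raise
            continue  # retry: `result` was never assigned this pass
        if result == expected:
            break
    assert result == expected, result
    return result
```

With the continue in place, a disconnect on the first attempt simply triggers a fresh ssh() call instead of reading an unbound result.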

Comment on lines +255 to +264
current = call("nfs.config")
if all(current.get(k) == v for k, v in payload.items()):
# Config was applied but the restart may still be in progress.
# Poll until the service is RUNNING so that callers can safely
# query rpcbind or /etc/nfs.conf without racing.
for _ in range(30):
if get_nfs_service_state() == 'RUNNING':
break
sleep(1)
return current

🟡 resilient_nfs_update config-applied check at line 258 produces a false negative when payload={"servers": None}. The NFS API (nfs_extend()) transforms servers=None into a computed integer (e.g. 12), so nfs.config() returns {"servers": 12} and the comparison 12 == None is False, triggering an unnecessary retry and NFS restart. Consider filtering None values from the payload comparison or reading the pre-transform DB value.

Extended reasoning...

What the bug is

The resilient_nfs_update function (lines 249–268) retries nfs.update on transient WebSocket disconnections. After a failed call, it reads back the current config via call("nfs.config") and checks whether the update was already applied:

current = call("nfs.config")
if all(current.get(k) == v for k, v in payload.items()):

This comparison assumes that the values returned by nfs.config() are identical to the values passed in payload. However, the NFS service extend logic (nfs_extend() in nfs.py:170–176) transforms certain input values before storing/returning them. Specifically, servers=None is replaced with min(max(core_count, 1), 32) — a computed integer like 12.
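The transform is easy to mirror. Under the review's reading of nfs_extend(), a servers value of None becomes a core-count default clamped to [1, 32]; this is a sketch of the described behavior, not the middleware source:

```python
def computed_servers(core_count):
    # servers=None is replaced with the core count, clamped to [1, 32],
    # per the nfs_extend() behavior described above.
    return min(max(core_count, 1), 32)
```

So on a 12-core machine nfs.config() reports servers=12 after an update with servers=None, and a naive equality check against the payload fails.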

How it manifests

When resilient_nfs_update({"servers": None}) is called and a WebSocket disconnect occurs:

  1. The first nfs.update({"servers": None}) call succeeds on the server side (servers is set to auto-managed in the DB), but the client gets a ClientException.
  2. The retry logic calls nfs.config(), which returns {"servers": 12, ...} (the computed value).
  3. The check evaluates 12 == None → False.
  4. The function concludes the update was NOT applied and retries, calling nfs.update({"servers": None}) again — causing an unnecessary NFS restart.
  5. If all 3 attempts hit WebSocket disconnects, the function raises despite the update having succeeded on the very first call.

Step-by-step proof with test_service_update

test_service_update (line 883) parametrizes with nfsd=None and calls resilient_nfs_update({"servers": None}) at line 895. With cores=4 mocked:

  • Call 1: nfs.update({"servers": None}) → server sets servers=4 internally → WS disconnect → ClientException raised.
  • Config check: call("nfs.config") returns {"servers": 4, ...}.
  • Comparison: 4 == None → False → retry triggered.
  • Call 2: nfs.update({"servers": None}) again → unnecessary NFS restart → if WS disconnect again, one more retry.
  • Call 3: if WS disconnect again → raise → test fails despite the update having succeeded on attempt 1.

Impact and severity

The practical impact is limited: (1) The retry is idempotent — it just causes an unnecessary NFS service restart. (2) Test failure only occurs if all 3 attempts hit WS disconnects, which is unlikely. (3) This is CI test infrastructure code, not production code. This makes it a nit-level issue.

How to fix

Filter out None values from the payload before comparison, since None means "let the server compute the value":

filtered = {k: v for k, v in payload.items() if v is not None}
if all(current.get(k) == v for k, v in filtered.items()):

Alternatively, for None values, just check that the key exists in the config (any non-None value means the server accepted and transformed it).
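As a sketch of the suggested fix, the applied-check could live in a small helper; update_applied is a hypothetical name, not part of the module:

```python
def update_applied(current, payload):
    """Return True if every concrete key/value in `payload` is
    reflected in `current`. None values mean "let the server compute
    the value", so they are excluded from the comparison."""
    filtered = {k: v for k, v in payload.items() if v is not None}
    return all(current.get(k) == v for k, v in filtered.items())
```

With this helper, resilient_nfs_update({"servers": None}) would treat any returned servers value as confirmation that the update landed, avoiding the spurious retry and restart.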
