NAS-140180 / 27.0.0-BETA.1 / Update NFS CI for resilience#18397
Conversation
…ing the CI tests. A good part of this is handling the 'websocket' disconnect.
I'm not sure we should be papering over websocket disconnections. If this is readily reproducible we should get to the bottom of it.
```python
for attempt in range(3):
    # Allow half a sec and retries
    sleep(0.5)
    try:
        result = int(ssh("cat /proc/fs/nfsd/threads"))
    except ClientException as e:
        # websocket loss resilience
        if not _is_ws_disconnect(e) or attempt >= 2:
            raise
    if result == expected:
        break

assert result == expected, result
```
🔴 Bug: In `confirm_nfsd_processes`, when `ssh()` raises a `ClientException` that is a WebSocket disconnect on the first attempt, `result` is never assigned (the assignment is inside the `try` block that raised), but execution falls through to `if result == expected:` on line 335, causing `UnboundLocalError`. Add `continue` after the exception is caught and not re-raised.
Extended reasoning
What the bug is
The `confirm_nfsd_processes` function (lines 326–338) is missing a `continue` statement in its exception handler. When `ssh("cat /proc/fs/nfsd/threads")` raises a `ClientException` that is identified as a transient WebSocket disconnect (and `attempt < 2`), the `except` block catches the exception but does not re-raise it. However, without a `continue`, control flow falls through to `if result == expected:` on line 335.
Why it fails
The variable `result` is assigned inside the `try` block: `result = int(ssh(...))`. When the `ssh()` call raises, this assignment never executes. On the first iteration (`attempt=0`), `result` has never been set in any prior iteration either, so it is completely unbound. Accessing it on line 335 raises `UnboundLocalError: local variable 'result' referenced before assignment`.
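A minimal, self-contained sketch of the failure mode; the hypothetical `flaky_read` stands in for `ssh()`, and `ConnectionError` stands in for the WebSocket `ClientException`:

```python
def flaky_read(attempt):
    # Hypothetical stand-in for ssh(): fails on the first attempt only.
    if attempt == 0:
        raise ConnectionError("websocket closed")
    return "8"

def confirm_without_continue(expected=8):
    # Mirrors the buggy loop: the exception is swallowed, but there is
    # no `continue`, so control falls through to the comparison.
    for attempt in range(3):
        try:
            result = int(flaky_read(attempt))
        except ConnectionError:
            if attempt >= 2:
                raise
            # BUG: no `continue` here
        if result == expected:  # UnboundLocalError on attempt 0
            break
    return result
```

Calling `confirm_without_continue()` raises `UnboundLocalError` on the very first iteration, before any retry can happen.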
Step-by-step proof
1. `confirm_nfsd_processes(expected=8)` is called.
2. Loop begins with `attempt=0`. `sleep(0.5)` executes.
3. Inside the `try` block, `ssh("cat /proc/fs/nfsd/threads")` raises a `ClientException` with errno 103 (WebSocket closed).
4. The `except ClientException` handler fires. `_is_ws_disconnect(e)` returns `True` and `attempt` is `0` (which is `< 2`), so the `raise` is skipped.
5. With no `continue` or `return`, execution proceeds past the `except` block to `if result == expected:` on line 335.
6. `result` was never assigned (the assignment on line 332 was inside the `try` that raised), so Python raises `UnboundLocalError`.
Secondary issue on subsequent iterations
Even if a previous iteration successfully assigned `result`, the missing `continue` means the code checks a stale `result` from the prior iteration rather than retrying the SSH call. This could cause the function to incorrectly `break` out of the retry loop with an outdated value.
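For contrast, a sketch of the loop with the `continue` restored; the hypothetical `make_flaky` factory stands in for `ssh()` and `ConnectionError` for the WebSocket `ClientException`:

```python
def make_flaky(values, fail_first=1):
    # Hypothetical stand-in for ssh(): raises ConnectionError for the
    # first `fail_first` calls, then yields values in order.
    state = {"calls": 0}
    def read():
        state["calls"] += 1
        if state["calls"] <= fail_first:
            raise ConnectionError("websocket closed")
        return values[min(state["calls"] - fail_first - 1, len(values) - 1)]
    return read

def confirm_with_continue(read, expected):
    # Corrected loop: `continue` retries the read instead of checking a
    # stale (or unbound) `result`.
    for attempt in range(3):
        try:
            result = int(read())
        except ConnectionError:
            if attempt >= 2:
                raise
            continue  # retry the read
        if result == expected:
            break
    assert result == expected, result
    return result

assert confirm_with_continue(make_flaky(["8"]), 8) == 8
```

With the `continue` in place, a single transient disconnect is absorbed and the next iteration re-runs the read; only three consecutive disconnects propagate the exception.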
Impact
This bug turns a transient, recoverable WebSocket disconnection into an unhandled `UnboundLocalError`, crashing the test. This directly undermines the stated goal of the PR (adding resilience to WebSocket instability). The other retry helpers in this PR (`resilient_nfs_control`, `resilient_nfs_update`) all correctly handle this pattern.
Fix
Add `continue` after the guarded `raise` in the `except` block:
```python
except ClientException as e:
    if not _is_ws_disconnect(e) or attempt >= 2:
        raise
    continue  # retry the ssh call
```

```python
current = call("nfs.config")
if all(current.get(k) == v for k, v in payload.items()):
    # Config was applied but the restart may still be in progress.
    # Poll until the service is RUNNING so that callers can safely
    # query rpcbind or /etc/nfs.conf without racing.
    for _ in range(30):
        if get_nfs_service_state() == 'RUNNING':
            break
        sleep(1)
    return current
```
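The quoted poll loop is an instance of a general wait-for-state pattern; a sketch with a hypothetical `wait_for_state` helper (not part of the PR):

```python
import time

def wait_for_state(get_state, want="RUNNING", tries=30, delay=1.0):
    # Sketch of the poll pattern quoted above: check up to `tries` times,
    # sleeping `delay` seconds between unsuccessful checks.
    for _ in range(tries):
        if get_state() == want:
            return True
        time.sleep(delay)
    return False

# Example: the service reports RUNNING on the third check.
states = iter(["STOPPED", "STOPPED", "RUNNING"])
assert wait_for_state(lambda: next(states), tries=5, delay=0)
```

Returning a boolean (rather than raising) lets the caller decide whether a timeout is fatal, which matches how the quoted code simply falls through after 30 tries.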
🟡 The `resilient_nfs_update` config-applied check at line 258 produces a false negative when `payload={"servers": None}`. The NFS API (`nfs_extend()`) transforms `servers=None` into a computed integer (e.g. 12), so `nfs.config()` returns `{"servers": 12}` and the comparison `12 == None` is `False`, triggering an unnecessary retry and NFS restart. Consider filtering `None` values from the payload comparison or reading the pre-transform DB value.
Extended reasoning
What the bug is
The `resilient_nfs_update` function (lines 249–268) retries `nfs.update` on transient WebSocket disconnections. After a failed call, it reads back the current config via `call("nfs.config")` and checks whether the update was already applied:
```python
current = call("nfs.config")
if all(current.get(k) == v for k, v in payload.items()):
```

This comparison assumes that the values returned by `nfs.config()` are identical to the values passed in `payload`. However, the NFS service extend logic (`nfs_extend()` in `nfs.py:170–176`) transforms certain input values before storing/returning them. Specifically, `servers=None` is replaced with `min(max(core_count, 1), 32)`, a computed integer like 12.
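The transform described above can be sketched as follows (assumption: this mirrors the `min(max(core_count, 1), 32)` clamp attributed to `nfs_extend()`):

```python
def computed_servers(core_count):
    # Sketch of the auto-compute applied when servers is None:
    # clamp the CPU core count to the range [1, 32].
    return min(max(core_count, 1), 32)

assert computed_servers(12) == 12   # typical host: value passes through
assert computed_servers(0) == 1     # floor at 1 thread
assert computed_servers(64) == 32   # ceiling at 32 threads
```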
How it manifests
When `resilient_nfs_update({"servers": None})` is called and a WebSocket disconnect occurs:

- The first `nfs.update({"servers": None})` call succeeds on the server side (`servers` is set to auto-managed in the DB), but the client gets a `ClientException`.
- The retry logic calls `nfs.config()`, which returns `{"servers": 12, ...}` (the computed value).
- The check evaluates `12 == None` → `False`.
- The function concludes the update was NOT applied and retries, calling `nfs.update({"servers": None})` again, causing an unnecessary NFS restart.
- If all 3 attempts hit WebSocket disconnects, the function raises despite the update having succeeded on the very first call.
Step-by-step proof with test_service_update
`test_service_update` (line 883) parametrizes with `nfsd=None` and calls `resilient_nfs_update({"servers": None})` at line 895. With `cores=4` mocked:
1. Call 1: `nfs.update({"servers": None})` → server sets `servers=4` internally → WS disconnect → `ClientException` raised.
2. Config check: `call("nfs.config")` returns `{"servers": 4, ...}`.
3. Comparison: `4 == None` → `False` → retry triggered.
4. Call 2: `nfs.update({"servers": None})` again → unnecessary NFS restart → if WS disconnect again, one more retry.
5. Call 3: if WS disconnect again → `raise` → test fails despite the update having succeeded on attempt 1.
Impact and severity
The practical impact is limited: (1) The retry is idempotent — it just causes an unnecessary NFS service restart. (2) Test failure only occurs if all 3 attempts hit WS disconnects, which is unlikely. (3) This is CI test infrastructure code, not production code. This makes it a nit-level issue.
How to fix
Filter out None values from the payload before comparison, since None means "let the server compute the value":
```python
filtered = {k: v for k, v in payload.items() if v is not None}
if all(current.get(k) == v for k, v in filtered.items()):
```

Alternatively, for `None` values, just check that the key exists in the config (any non-`None` value means the server accepted and transformed it).
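A self-contained sketch of the filtered comparison; the `config_applied` helper is hypothetical, and the sample `current` dict stands in for an `nfs.config()` result:

```python
def config_applied(current, payload):
    # Ignore None values: None means "let the server compute the value",
    # so the stored config will never literally contain None.
    filtered = {k: v for k, v in payload.items() if v is not None}
    return all(current.get(k) == v for k, v in filtered.items())

# Server transformed servers=None into a computed integer (e.g. 12).
current = {"servers": 12, "udp": False}

# Naive comparison yields a false negative: 12 == None is False.
assert not all(current.get(k) == v for k, v in {"servers": None}.items())

# Filtered comparison correctly treats the update as applied.
assert config_applied(current, {"servers": None})
assert not config_applied(current, {"udp": True})
```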
The NFS CI test module needs updating, cleanup and refactoring. However, there are CI test failures that should be addressed first.
The NFS CI tests have been failing often due to timing issues and, more recently, websocket instability.
This PR adds resilience to timing and websocket issues for the NFS CI test module.
With these changes, the NFS CI test module, `test_300_nfs.py`, was run 10 times in a row without errors. This was done against a Jenkins VM and a local VM.

This is safe for 26 beta 2.
Adding WIP label to investigate websocket instability.