fix: File-Metrics-Collector race condition #2615
Priyanshu-u07 wants to merge 3 commits into kubeflow:master
Conversation
Signed-off-by: Priyanshu-u07 <connect.priyanshu8271@gmail.com>
🎉 Welcome to the Kubeflow Katib repo! 🎉 Thanks for opening your first PR! We're excited to have you onboard 🚀
Feel free to ask questions in the comments. Thanks again for contributing! 🙏
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has not yet been approved by an approver; needs approval from an approver in each of these files. The full list of commands accepted by this bot can be found here.
@anencore94 @Electronic-Waste @johnugeorge |
anencore94 left a comment
Thanks for the clear bug report linkage and for adding tests with realistic marker-file
scenarios.
I really like the direction of adding a marker-based fallback for fast-exit training
jobs.
I have one concern to consider before merge: we may still miss a race case where GetMainProcesses() returns a non-zero but incorrect mainPid (e.g., a remaining top-level sidecar process), so the new fallback branch is skipped and WaitPIDs() can still wait indefinitely.
Would you consider making completion-marker detection independent of the `err != nil || mainPid == 0` condition alone, or retrying main-process detection together with the marker check until a timeout?
	// Fallback: If main process not found (race condition where training
	// exits before we can detect it), check for existing completion marker
	if err != nil || mainPid == 0 {
I think this condition is still too narrow.
A race can also return a non-zero but wrong mainPid (e.g., a sidecar process after the training process has already exited). In that case we skip this fallback logic and may block in WaitPIDs() forever.
Could we make the completion-marker check independent from this condition? (Or retry main-process detection with the marker check until a timeout?)
Fixed. Please check and resolve
	if err != nil {
		klog.Warningf("Failed to read marker file %s: %v", f, err)
Can we return (bool, error) from isAlreadyCompleted instead of logging a warning and skipping? I think swallowing the error here could make this race path harder to debug.
Fixed. Please check and resolve
		"testing"
	)

	func TestIsAlreadyCompleted(t *testing.T) {
Along with these tests to validate isAlreadyCompleted, I think we need to test the main regression case at the WaitMainProcesses level (PID detection + fallback decision).
Could we add a test that covers the actual race path? (e.g., the main process has already exited, but the marker exists)
For example:
- GetMainProcesses fails, but the marker has completed -> return success
- GetMainProcesses returns a wrong PID, but the marker has completed -> return success
Fixed. Please check and resolve
Signed-off-by: Priyanshu-u07 <connect.priyanshu8271@gmail.com>
@anencore94 I have addressed the suggested changes.

The 6 failing E2E checks are unrelated to this PR. They use the StdOut collector, while this PR only modifies the file-metrics-collector. All unit tests pass.
I apologise if this is redundant with the ongoing review. I happened to stumble across this PR and just wanted to flag a remaining edge case: if training is still running when the initial check happens and the completion marker is only written afterwards, the one-time fallback misses it. A periodic marker check inside the WaitPIDs loop would cover this:

	// Inside the WaitPIDs for loop, each iteration:
	if opts.CompletedMarkedDirPath != "" {
		if completed, _ := isAlreadyCompleted(opts.CompletedMarkedDirPath); completed {
			klog.Info("Training completed detected via marker during wait loop")
			return nil
		}
	}

This aligns with @anencore94's earlier suggestion about retrying marker checks until timeout. That is all.
Signed-off-by: Priyanshu-u07 <connect.priyanshu8271@gmail.com>
@ruskaruma Thanks for pointing this out.
@anencore94, this should address it. Please review.
Description
This PR fixes a race condition in the file-metrics-collector where the sidecar may fail to exit if the main training container finishes before the collector can detect its PID. This commonly occurs with fast-running training jobs or when sidecars are present, leaving the Trial Pod stuck in a Running state and blocking experiment completion.
To address this, a hybrid fallback mechanism is introduced. The collector first attempts the standard PID-based detection via /proc. If the main process is not found, it falls back to scanning the /katib directory for existing .pid marker files (for example, 45.pid containing completed). When a valid completion marker is detected, the collector safely proceeds to collect metrics and exits cleanly instead of waiting indefinitely.
Changes Included
pkg/metricscollector/v1beta1/common/pns.go
pkg/metricscollector/v1beta1/common/pns_test.go (new), covering:
- Absence of marker files
- Detection of a valid completed marker
- Ignoring early-stopped markers
- Multi-process environments (e.g. Istio sidecars with multiple .pid files)
- Marker files with whitespace and newline variations
Verification
Fixes #2614