
fix: File-Metrics-Collector race condition#2615

Open
Priyanshu-u07 wants to merge 3 commits into kubeflow:master from Priyanshu-u07:metrics-collector

Conversation

@Priyanshu-u07

Description

This PR fixes a race condition in the file-metrics-collector where the sidecar may fail to exit if the main training container finishes before the collector can detect its PID. This commonly occurs with fast-running training jobs or when sidecars are present, leaving the Trial Pod stuck in a Running state and blocking experiment completion.

To address this, a hybrid fallback mechanism is introduced. The collector first attempts the standard PID-based detection via /proc. If the main process is not found, it falls back to scanning the /katib directory for existing .pid marker files (for example, 45.pid containing completed). When a valid completion marker is detected, the collector safely proceeds to collect metrics and exits cleanly instead of waiting indefinitely.

Changes Included

pkg/metricscollector/v1beta1/common/pns.go

  • Added isAlreadyCompleted helper to scan the marker directory for .pid files containing a completed status.
  • Updated WaitMainProcesses to introduce fallback logic when GetMainProcesses fails or no mainPid is detected.
  • Improved error handling so PID detection errors are returned only when no valid completion marker is found.

pkg/metricscollector/v1beta1/common/pns_test.go (new)

  • Added comprehensive unit tests covering:
    - Absence of marker files
    - Detection of a valid completed marker
    - Ignoring early-stopped markers
    - Multi-process environments (e.g. Istio sidecars with multiple .pid files)
    - Marker files with whitespace and newline variations

Verification

  • Added unit tests covering fast-exit and multiple PID marker scenarios.
  • Reproduced the race condition locally using Docker and confirmed clean exit when a completion marker exists.

Fixes #2614

Checklist:

  • Docs included if any changes are user facing

Signed-off-by: Priyanshu-u07 <connect.priyanshu8271@gmail.com>
@github-actions

github-actions bot commented Feb 3, 2026

🎉 Welcome to the Kubeflow Katib repo! 🎉

Thanks for opening your first PR! We're excited to have you onboard 🚀

Next steps:

Feel free to ask questions in the comments. Thanks again for contributing! 🙏

@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign andreyvelich for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@Priyanshu-u07
Author

@anencore94 @Electronic-Waste @johnugeorge
Please check once.

Member

@anencore94 anencore94 left a comment


Thanks for the clear bug report linkage and for adding tests with realistic marker-file
scenarios.
I really like the direction of adding a marker-based fallback for fast-exit training
jobs.

I have one concern to consider before merge: we may still miss a race case where GetMainProcesses() returns a non-zero but incorrect mainPid (e.g., a remaining top-level sidecar process), so the new fallback branch is skipped and WaitPIDs() can still wait indefinitely.
Would you consider making completion-marker detection independent of the err != nil || mainPid == 0 condition, or retrying main-process detection together with the marker check until timeout?

Comment on lines +48 to +50
// Fallback: If main process not found (race condition where training
// exits before we can detect it), check for existing completion marker
if err != nil || mainPid == 0 {
Member


I think this condition is still too narrow.
A race can also return a non-zero but wrong mainPid (e.g., a sidecar process after the training process has already exited). In that case we skip this fallback logic and may block in WaitPIDs() forever.
Could we make the completion-marker check independent of this condition? (Or retry main-process detection together with the marker check until timeout?)

Author


Fixed. Please check and resolve

Comment on lines +83 to +84
if err != nil {
klog.Warningf("Failed to read marker file %s: %v", f, err)
Member


Can we return (bool, error) from isAlreadyCompleted instead of logging a warning and skipping the file?
I think swallowing the error here could make this race path harder to debug.

Author


Fixed. Please check and resolve

"testing"
)

func TestIsAlreadyCompleted(t *testing.T) {
Member


Along with these tests validating isAlreadyCompleted, I think we need to test the main regression case at the WaitMainProcesses level (PID detection + fallback decision).
Could we add a test that covers the actual race path (e.g., the main process has already exited, but the marker exists)?

For example:

  1. GetMainProcesses fails, but the marker says completed -> return success
  2. GetMainProcesses returns a wrong PID, but the marker says completed -> return success

Author


Fixed. Please check and resolve

Signed-off-by: Priyanshu-u07 <connect.priyanshu8271@gmail.com>
@Priyanshu-u07
Author

@anencore94 I have addressed the suggested changes.
Could you please take another look when you have time?
Thanks!

@Priyanshu-u07
Author

The 6 failing E2E checks are unrelated to this PR. They use the StdOut collector, while this PR only modifies the file-metrics-collector. All unit tests pass.

@ruskaruma

Apologies if this is redundant with the ongoing review; I stumbled across this PR and wanted to flag a remaining edge case. If training is still running when isAlreadyCompleted returns false, but exits between that check and GetMainProcesses scanning /proc, the sidecar PID can get picked up as mainPid. WaitPIDs then blocks on the sidecar forever, since it only checks marker files for tracked PIDs after they disappear from /proc (line ~135), and the sidecar never disappears.

A periodic isAlreadyCompleted check inside the WaitPIDs polling loop (around line 128) would close this window:

// Inside the WaitPIDs for loop, each iteration:
if opts.CompletedMarkedDirPath != "" {
    if completed, _ := isAlreadyCompleted(opts.CompletedMarkedDirPath); completed {
        klog.Info("Training completed detected via marker during wait loop")
        return nil
    }
}

This aligns with @anencore94's earlier suggestion about retrying marker checks until timeout. That is all.

Signed-off-by: Priyanshu-u07 <connect.priyanshu8271@gmail.com>
@Priyanshu-u07
Author

@ruskaruma Thanks for pointing this out.
I have addressed the changes suggested by you.

@ruskaruma

@anencore94, this should close the remaining edge case. Please review.

Successfully merging this pull request may close these issues.

File-metricscollector does not exit after collecting metrics, trial still running
