fix: File-Metrics-Collector race condition #2615
Priyanshu-u07 wants to merge 3 commits into kubeflow:master
Conversation
Signed-off-by: Priyanshu-u07 <connect.priyanshu8271@gmail.com>
🎉 Welcome to the Kubeflow Katib repo! 🎉 Thanks for opening your first PR! We're excited to have you onboard 🚀
Feel free to ask questions in the comments. Thanks again for contributing! 🙏
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has not yet been approved by an approver; needs approval from an approver in each of these files. The full list of commands accepted by this bot can be found here.
@anencore94 @Electronic-Waste @johnugeorge |
anencore94 left a comment
Thanks for the clear bug report linkage and for adding tests with realistic marker-file
scenarios.
I really like the direction of adding a marker-based fallback for fast-exit training
jobs.
I have one concern to consider before merge: we may still miss a race case where GetMainProcesses() returns a non-zero but incorrect mainPid (e.g., a remaining top-level sidecar process), so the new fallback branch is skipped and WaitPIDs() can still wait indefinitely.
Would you consider making completion-marker detection independent of the `err != nil || mainPid == 0` condition alone, or retrying main-process detection together with the marker check until a timeout?
	// Fallback: If main process not found (race condition where training
	// exits before we can detect it), check for existing completion marker
	if err != nil || mainPid == 0 {
I think this condition is still too narrow.
A race can also return a non-zero but wrong mainPid (e.g., a sidecar process after the training process has already exited). In that case we skip this fallback logic and may block in WaitPIDs() forever.
Could we make the completion-marker check independent from this condition? (Or retry main-process detection with the marker check until a timeout?)
Fixed. Please check and resolve
	if err != nil {
		klog.Warningf("Failed to read marker file %s: %v", f, err)
Can we return (bool, error) from isAlreadyCompleted instead of logging a warning and skipping? I think swallowing the error here could make this race path harder to debug.
Fixed. Please check and resolve
		"testing"
	)

	func TestIsAlreadyCompleted(t *testing.T) {
Along with these tests to validate isAlreadyCompleted, I think we need to test the main regression case at the WaitMainProcesses level (PID detection + fallback decision).
Could we add a test that covers the actual race path? (e.g., the main process has already exited, but the marker exists)
For example:
- GetMainProcesses fails, but the marker has completed -> return success
- GetMainProcesses returns a wrong PID, but the marker has completed -> return success
Fixed. Please check and resolve
Signed-off-by: Priyanshu-u07 <connect.priyanshu8271@gmail.com>
@anencore94 I have addressed the suggested changes.

The 6 failing E2E checks are unrelated to this PR. They use the StdOut collector, while this PR only modifies the file-metrics-collector. All unit tests pass.
I apologise if this is redundant with the ongoing review. I happened to stumble across this PR and just wanted to flag a remaining edge case: if training is still running when the initial check happens and the completion marker is only written afterwards, the one-time fallback misses it. A periodic marker check inside the WaitPIDs loop would cover this:

	// Inside the WaitPIDs for loop, each iteration:
	if opts.CompletedMarkedDirPath != "" {
		if completed, _ := isAlreadyCompleted(opts.CompletedMarkedDirPath); completed {
			klog.Info("Training completed detected via marker during wait loop")
			return nil
		}
	}

This aligns with @anencore94's earlier suggestion about retrying marker checks until timeout. That is all.
Signed-off-by: Priyanshu-u07 <connect.priyanshu8271@gmail.com>
@ruskaruma Thanks for pointing this out.
@anencore94, this should address it. Please review.
Description
This PR fixes a race condition in the file-metrics-collector where the sidecar may fail to exit if the main training container finishes before the collector can detect its PID. This commonly occurs with fast-running training jobs or when sidecars are present, leaving the Trial Pod stuck in a Running state and blocking experiment completion.
To address this, a hybrid fallback mechanism is introduced. The collector first attempts the standard PID-based detection via /proc. If the main process is not found, it falls back to scanning the /katib directory for existing .pid marker files (for example, 45.pid containing completed). When a valid completion marker is detected, the collector safely proceeds to collect metrics and exits cleanly instead of waiting indefinitely.
Changes Included
pkg/metricscollector/v1beta1/common/pns.go
pkg/metricscollector/v1beta1/common/pns_test.go (new), covering:
- Absence of marker files
- Detection of a valid completed marker
- Ignoring early-stopped markers
- Multi-process environments (e.g. Istio sidecars with multiple .pid files)
- Marker files with whitespace and newline variations
Verification
Fixes #2614