
Conversation


mb-ii commented Jul 28, 2025

The ClearML Agent immediately deletes pods that have completed, whether successfully or not. This is not always desirable, because it makes debugging hard when pods fail for unexpected reasons. For example, ML workloads often run out of memory, which results in OOM events that kill pods. Once a pod is killed, the agent deletes it, without any message or reason visible in the ClearML UI. This PR adds debug logging inside the agent process that can be used to understand the state of pods that were deleted.
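For context, here is a minimal sketch of how a pod's termination state can be inspected before the pod is deleted, assuming kubectl is available to the agent; the function name is hypothetical and this is not the actual PR diff:

```python
# Hypothetical helper: inspect why a pod's containers terminated before deleting it.
# Assumes kubectl is on PATH and the pod name/namespace are known.
import json
import subprocess


def get_pod_termination_info(pod_name, namespace="default"):
    """Return (container, exit_code, reason) for every terminated container in the pod."""
    out = subprocess.check_output(
        ["kubectl", "get", "pod", pod_name, "-n", namespace, "-o", "json"]
    )
    pod = json.loads(out)
    info = []
    for cs in pod.get("status", {}).get("containerStatuses", []):
        terminated = cs.get("state", {}).get("terminated")
        if terminated:
            info.append((cs["name"], terminated.get("exitCode"), terminated.get("reason")))
    return info

# An OOM-killed container typically reports exitCode 137 and reason "OOMKilled".
```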

@jkhenning (Member)

Hi @mb-ii,

Thanks for this contribution!

Since this introduces a new behavior, I'd like to limit the effect on existing usages and suggest a few changes.
The existing implementation sends a single command to delete all completed/failed pods; moving this to a per-pod command (after interrogating the exit status) will make it less stable in large-scale environments.
Since this is a debug feature, accuracy in terms of "sampling first and deleting later" will not have a huge effect on its usability, so I suggest:

  • Adding an environment variable to control this option (e.g. K8S_GLUE_GET_POD_CLEANUP_INFO), only querying the pods' status and reporting it if this feature is on
  • Keeping the old "delete all matching pods" implementation and simply adding the new "interrogation" before doing so (and not individually deleting pods)
  • Logging the status not only to the debug log but also to the task log (so it will be visible in the ClearML UI as well)
  • Logging the exit status of all containers in the pod (in case there is more than one)
  • Moving this new code to a helper function (in a new python module under clearml_agent/glue/utilities.py or similar) to try and limit the growth of the k8s.py file 🙂
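
For illustration, here is a minimal sketch of what such a helper might look like under these suggestions; the module path clearml_agent/glue/utilities.py is taken from the comment above, while the function name, env-var handling, and output format are assumptions rather than the merged code:

```python
# Hypothetical helper (e.g. clearml_agent/glue/utilities.py): collect container exit
# statuses for all pods matching a label selector, gated by an environment variable.
import json
import os
import subprocess


def get_pods_cleanup_info(selector, namespace):
    """Return {pod_name: [per-container exit status strings]} for matching pods."""
    if os.environ.get("K8S_GLUE_GET_POD_CLEANUP_INFO", "0").lower() not in ("1", "true", "yes"):
        return {}
    out = subprocess.check_output(
        ["kubectl", "get", "pods", "-l", selector, "-n", namespace, "-o", "json"]
    )
    report = {}
    for pod in json.loads(out).get("items", []):
        name = pod["metadata"]["name"]
        states = []
        for cs in pod.get("status", {}).get("containerStatuses", []):
            terminated = cs.get("state", {}).get("terminated") or {}
            states.append(
                "{}: exitCode={}, reason={}".format(
                    cs.get("name"), terminated.get("exitCode"), terminated.get("reason")
                )
            )
        report[name] = states
    return report
```

The agent would then log this report to both its debug log and the task log before issuing the original single "delete all matching pods" command, keeping the existing deletion path unchanged.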


mb-ii commented Jul 29, 2025

@jkhenning Thanks for the feedback! My only concern with this approach is accuracy. When a pod has been running for hours or days and then just fails and disappears, it's hard and costly to reproduce what happened. If we don't break it up into two steps (first list, then delete) and instead delete everything matching a selector, we risk race conditions. This would not be a debugging feature for us in the sense that we would turn it on only occasionally; rather, we'd like to have this information at all times. Do you see a different way to achieve this?
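
For contrast, a rough sketch of the two-step flow described here, in which the agent deletes exactly the pods it has already interrogated so nothing that completes in between is removed unobserved (names are hypothetical):

```python
# Hypothetical per-pod flow: delete only pods whose status was already captured.
import subprocess


def delete_interrogated_pods(pod_names, namespace):
    for name in pod_names:  # names collected by the earlier status query
        subprocess.call(["kubectl", "delete", "pod", name, "-n", namespace])
```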

@jkhenning (Member)

@mb-ii I have no issue with supporting both and choosing the behavior using an env var, so that users who have many pods will not experience any delay.

mb-ii force-pushed the master branch 2 times, most recently from 1136bd2 to 48b51a8 (August 2, 2025 11:23)

mb-ii commented Aug 2, 2025

@jkhenning I've implemented the change according to your suggestions. I've also fixed 3 other small issues I came across. I've tested the changes by building an image and using it in a staging environment.

mb-ii force-pushed the master branch 2 times, most recently from db8747a to c81c02c (August 2, 2025 18:04)

mb-ii commented Sep 4, 2025

Hey @jkhenning, were you able to take a look at this by any chance? We've built a container with these changes and have been using it in production for the last few weeks without issues. It would be great if we could use an official build.


mb-ii commented Sep 4, 2025

An example of what the output looks like:

[screenshot: example output]


mb-ii commented Sep 16, 2025

Hey @jkhenning, any chance we can move forward with this?


mb-ii commented Oct 1, 2025

@jkhenning
