
Conversation


mb-ii commented Jul 28, 2025

The ClearML Agent immediately deletes pods that have completed, whether successfully or not. This is not always desirable, because it makes debugging hard when pods fail for unexpected reasons. For example, ML workloads often run out of memory, which results in OOM events that kill pods. Once a pod is killed, the agent deletes it, without any message or reason visible in the ClearML UI. This PR adds debug logging inside the agent process that can be used to understand the state of pods that were deleted.
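For context, here is a minimal sketch of how a pod's termination state can be inspected before the pod is deleted, assuming kubectl is available to the agent; the function name is hypothetical and this is not the actual PR diff:

```python
# Hypothetical helper: inspect why a pod's containers terminated before deleting it.
# Assumes kubectl is on PATH and the pod name/namespace are known.
import json
import subprocess


def get_pod_termination_info(pod_name, namespace="default"):
    """Return (container, exit_code, reason) for every terminated container in the pod."""
    out = subprocess.check_output(
        ["kubectl", "get", "pod", pod_name, "-n", namespace, "-o", "json"]
    )
    pod = json.loads(out)
    info = []
    for cs in pod.get("status", {}).get("containerStatuses", []):
        terminated = cs.get("state", {}).get("terminated")
        if terminated:
            info.append((cs["name"], terminated.get("exitCode"), terminated.get("reason")))
    return info

# An OOM-killed container typically reports exitCode 137 and reason "OOMKilled".
```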

@jkhenning (Member)

Hi @mb-ii,

Thanks for this contribution!

Since this introduces a new behavior, I'd like to limit the effect on existing usages and suggest a few changes.
The existing implementation sends a single command to delete all completed/failed pods; moving this to a per-pod command (after interrogating the exit status) will make it less stable in large-scale environments.
Since this is a debug feature, accuracy in terms of "sampling first and deleting later" will not have a huge effect on its usability, so I suggest:

  • Adding an environment variable to control this option (e.g. K8S_GLUE_GET_POD_CLEANUP_INFO), only querying the pods' status and reporting it if this feature is on
  • Keeping the old "delete all matching pods" implementation and simply adding the new "interrogation" before doing so (and not individually deleting pods)
  • Logging the status not only to the debug log but also to the task log (so it will be visible in the ClearML UI as well)
  • Logging the exit status of all containers in the pod (in case there is more than one)
  • Moving this new code to a helper function (in a new python module under clearml_agent/glue/utilities.py or similar) to try and limit the growth of the k8s.py file 🙂
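
For illustration, here is a minimal sketch of what such a helper might look like under these suggestions; the module path clearml_agent/glue/utilities.py is taken from the comment above, while the function name, env-var handling, and output format are assumptions rather than the merged code:

```python
# Hypothetical helper (e.g. clearml_agent/glue/utilities.py): collect container exit
# statuses for all pods matching a label selector, gated by an environment variable.
import json
import os
import subprocess


def get_pods_cleanup_info(selector, namespace):
    """Return {pod_name: [per-container exit status strings]} for matching pods."""
    if os.environ.get("K8S_GLUE_GET_POD_CLEANUP_INFO", "0").lower() not in ("1", "true", "yes"):
        return {}
    out = subprocess.check_output(
        ["kubectl", "get", "pods", "-l", selector, "-n", namespace, "-o", "json"]
    )
    report = {}
    for pod in json.loads(out).get("items", []):
        name = pod["metadata"]["name"]
        states = []
        for cs in pod.get("status", {}).get("containerStatuses", []):
            terminated = cs.get("state", {}).get("terminated") or {}
            states.append(
                "{}: exitCode={}, reason={}".format(
                    cs.get("name"), terminated.get("exitCode"), terminated.get("reason")
                )
            )
        report[name] = states
    return report
```

The agent would then log this report to both its debug log and the task log before issuing the original single "delete all matching pods" command, keeping the existing deletion path unchanged.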


mb-ii commented Jul 29, 2025

@jkhenning Thanks for the feedback! My only concern with this approach is accuracy. When a pod has been running for hours or days and then just fails and disappears, it's hard and costly to reproduce what happened. If we don't break it up into two steps (first list, then delete) and instead delete everything matching a selector, we risk race conditions. This would not be a debugging feature for us in the sense that we would turn it on only occasionally; rather, we'd like to have this information at all times. Do you see a different way to achieve this?
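
For contrast, a rough sketch of the two-step flow described here, in which the agent deletes exactly the pods it has already interrogated so nothing that completes in between is removed unobserved (names are hypothetical):

```python
# Hypothetical per-pod flow: delete only pods whose status was already captured.
import subprocess


def delete_interrogated_pods(pod_names, namespace):
    for name in pod_names:  # names collected by the earlier status query
        subprocess.call(["kubectl", "delete", "pod", name, "-n", namespace])
```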

@jkhenning (Member)

@mb-ii I have no issue with supporting both and choosing the behavior using an env var, so that users who have many pods will not experience any delay.

mb-ii force-pushed the master branch 2 times, most recently from 1136bd2 to 48b51a8 (August 2, 2025 11:23)

mb-ii commented Aug 2, 2025

@jkhenning I've implemented the change according to your suggestions. I've also fixed 3 other small issues I came across. I've tested the changes by building an image and using it in a staging environment.

mb-ii force-pushed the master branch 2 times, most recently from db8747a to c81c02c (August 2, 2025 18:04)

mb-ii commented Sep 4, 2025

Hey @jkhenning, were you able to take a look at this by any chance? We've built a container with these changes and have been using it in production for the last few weeks without issues. It would be great if we could use an official build.


mb-ii commented Sep 4, 2025

An example of what the output looks like:

[screenshot: example output]


mb-ii commented Sep 16, 2025

Hey @jkhenning, any chance we can move forward with this?


mb-ii commented Oct 1, 2025

@jkhenning
