feat: log debug pod information before deleting pods by agent #249
base: master
Hi @mb-ii, Thanks for this contribution! Since this introduces a new behavior, I'd like to limit the effect on existing usages and suggest a few changes.
@jkhenning Thanks for the feedback! My only concern with this approach is accuracy. When a pod runs for hours or days and then just fails and disappears, it's hard and costly to reproduce what happened. If we don't break the cleanup into two steps (first list, then delete) and instead delete everything matching a selector, we risk race conditions. This would not be a debugging feature for us in the sense of something we turn on occasionally; we'd like to have this information always. Do you see a different way to achieve this?
@mb-ii I have no issue with supporting both and choosing the behavior using an env var, so that users who have many pods will not experience any delay.
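The trade-off discussed above can be sketched roughly as follows. This is not the PR's actual implementation: the environment variable name `EXAMPLE_LOG_PODS_BEFORE_DELETE`, the function `cleanup_pods`, and its parameters are hypothetical, and the `run` callable stands in for whatever the agent uses to invoke `kubectl`. The single-call path is fast but can remove a pod that was never listed or logged; the two-step path only deletes pods it has already logged, at the cost of one extra call per pod.

```python
import json
import os

# Hypothetical variable name, used for illustration only (not taken from the PR).
ENV_FLAG = "EXAMPLE_LOG_PODS_BEFORE_DELETE"


def cleanup_pods(namespace, label_selector, field_selector, run, log=print, env=None):
    """Delete pods matching the selectors; return the names deleted one by one.

    `run` takes a kubectl argument list and returns the command's stdout as a string.
    """
    env = os.environ if env is None else env
    base = ["kubectl", "-n", namespace]
    selectors = ["-l", label_selector, "--field-selector=" + field_selector]

    if env.get(ENV_FLAG, "").lower() not in ("1", "true", "yes"):
        # Fast path: a single call deletes whatever matches at that moment.
        run(base + ["delete", "pods"] + selectors)
        return []

    # Two-step path: list first, then log and delete each listed pod by name,
    # so nothing is deleted without having been logged.
    listing = json.loads(run(base + ["get", "pods"] + selectors + ["-o", "json"]))
    deleted = []
    for pod in listing.get("items", []):
        name = pod["metadata"]["name"]
        phase = pod.get("status", {}).get("phase")
        log("about to delete pod %s (phase=%s)" % (name, phase))
        run(base + ["delete", "pod", name])
        deleted.append(name)
    return deleted
```

In this sketch the flag defaults to off, which matches the stated goal of not slowing down existing users who have many pods.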
1136bd2 to 48b51a8
@jkhenning I've implemented the change according to your suggestions. I've also fixed 3 other small issues I came across. I've tested the changes by building an image and using it in a staging environment.
db8747a to c81c02c
Hey @jkhenning, were you able to take a look at this by any chance? We've built a container with these changes and have been using it in production for the last few weeks without issues. It would be great if we could use an official build.
Hey @jkhenning, any chance we can move forward with this?
The ClearML Agent immediately deletes pods that have completed, whether successfully or not. This is not always desirable, because it makes debugging hard when pods fail for reasons that are difficult to reproduce. For example, ML workloads often run out of memory, which results in OOM events that kill pods. Once a pod is killed, the agent deletes it without any message or reason being visible in the ClearML UI. This PR adds debug logging inside the agent process that can be used to understand the state of pods that were deleted.
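As an illustration of what such a debug line could contain, here is a minimal sketch that condenses a pod object (as returned by `kubectl get pod -o json`) into one log-friendly string. The function name `summarize_pod` and the output format are assumptions made for this example, not the PR's code; the fields read (`status.phase`, `status.containerStatuses[].state.terminated` and `lastState.terminated`, with their `reason` and `exitCode`) are standard parts of the Kubernetes pod status.

```python
def summarize_pod(pod):
    """Return a one-line summary of a pod dict, including termination reasons."""
    meta = pod.get("metadata", {})
    status = pod.get("status", {})
    parts = [
        "pod=%s" % meta.get("name", "?"),
        "phase=%s" % status.get("phase", "?"),
    ]
    for cs in status.get("containerStatuses", []):
        # Prefer the current terminated state; fall back to the last one recorded.
        terminated = (cs.get("state") or {}).get("terminated") or (
            cs.get("lastState") or {}
        ).get("terminated")
        if terminated:
            parts.append(
                "container=%s reason=%s exit_code=%s"
                % (cs.get("name"), terminated.get("reason"), terminated.get("exitCode"))
            )
    return " ".join(parts)


example_pod = {
    "metadata": {"name": "example-pod"},
    "status": {
        "phase": "Failed",
        "containerStatuses": [
            {"name": "main", "state": {"terminated": {"reason": "OOMKilled", "exitCode": 137}}}
        ],
    },
}
print(summarize_pod(example_pod))
# prints: pod=example-pod phase=Failed container=main reason=OOMKilled exit_code=137
```

A line like this, written before the delete call, would preserve the OOM reason in the agent log even after the pod itself is gone.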