✨🎨Computational backend: Automatically stop a running job if no logs are detected for 1h #8549
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #8549 +/- ##
==========================================
+ Coverage 87.55% 89.47% +1.92%
==========================================
Files 2012 1382 -630
Lines 78982 58411 -20571
Branches 1369 187 -1182
==========================================
- Hits 69151 52264 -16887
+ Misses 9425 6088 -3337
+ Partials 406 59 -347
e6ada92 to 69cb65c
🧪 CI Insights
Here's what we observed from your CI run for 69968f7.
✅ Passed Jobs With Interesting Signals
fe881f5 to 0d6cfc4
c7b9919 to 904e796
Pull Request Overview
This PR implements automatic termination of computational jobs that produce no logs for 1 hour, protecting the platform from hanging jobs and preventing unnecessary charges. It also improves error handling for out-of-memory scenarios with better log visibility.
Key changes:
- Added configurable timeout (`DASK_SIDECAR_MAX_LOG_SILENCE_TIMEOUT`) for monitoring service log activity
- Introduced new exception types (`ServiceTimeoutLoggingError`, `ServiceOutOfMemoryError`) with detailed error messages
- Enhanced error handling and logging throughout the computational sidecar execution flow
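The first two key changes could look roughly like the sketch below. It is not the code from this PR: the base class `ContainerTaskError`, the constructor arguments, the message wording, and the helper `max_log_silence_timeout` are hypothetical, and parsing the environment variable as a number of seconds is an assumption (the actual settings class may accept other formats). Only the two exception names, the variable name, and the 1 hour default come from the PR.

```python
import os
from collections.abc import Mapping
from datetime import timedelta


class ContainerTaskError(Exception):
    """Hypothetical common base class for errors raised while running a service."""


class ServiceTimeoutLoggingError(ContainerTaskError):
    """The service produced no logs for longer than the allowed silence."""

    def __init__(self, service_key: str, service_version: str, timeout: timedelta) -> None:
        # Forwarding every argument to Exception keeps the instance picklable,
        # which matters when errors travel between a worker and a client.
        super().__init__(service_key, service_version, timeout)
        self.service_key = service_key
        self.service_version = service_version
        self.timeout = timeout

    def __str__(self) -> str:
        return (
            f"Service {self.service_key}:{self.service_version} produced no logs "
            f"for {self.timeout} and was stopped."
        )


class ServiceOutOfMemoryError(ContainerTaskError):
    """The service container was killed because it ran out of memory."""

    def __init__(self, service_key: str, service_version: str) -> None:
        super().__init__(service_key, service_version)
        self.service_key = service_key
        self.service_version = service_version

    def __str__(self) -> str:
        return f"Service {self.service_key}:{self.service_version} ran out of memory."


def max_log_silence_timeout(env: Mapping[str, str] = os.environ) -> timedelta:
    """Read the timeout from the environment (assumed to be seconds); default 1 hour."""
    raw = env.get("DASK_SIDECAR_MAX_LOG_SILENCE_TIMEOUT")
    return timedelta(seconds=float(raw)) if raw else timedelta(hours=1)
```

Sharing a base class lets callers catch both failure modes with a single `except ContainerTaskError` while still being able to tell them apart.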
Reviewed Changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| services/dask-sidecar/src/simcore_service_dask_sidecar/settings.py | Added configuration for maximum log silence timeout (default: 1 hour) |
| packages/dask-task-models-library/src/dask_task_models_library/container_tasks/errors.py | Introduced new exception hierarchy with ServiceTimeoutLoggingError and ServiceOutOfMemoryError |
| services/dask-sidecar/src/simcore_service_dask_sidecar/computational_sidecar/docker_utils.py | Implemented timeout detection in log monitoring and improved error handling |
| services/dask-sidecar/src/simcore_service_dask_sidecar/computational_sidecar/core.py | Added OOMKilled detection and enhanced error reporting for service execution |
| services/dask-sidecar/src/simcore_service_dask_sidecar/worker.py | Added exception handler to avoid dask serialization issues |
| services/dask-sidecar/tests/unit/test_computational_sidecar_tasks.py | Added comprehensive tests for timeout and OOM scenarios |
| services/dask-sidecar/tests/unit/conftest.py | Added RAM resources to test cluster configuration |
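As an illustration of the OOMKilled detection listed for `core.py`, the sketch below inspects the `State` section that Docker reports for a finished container (`OOMKilled` and `ExitCode` are fields of the Docker container inspect response). The function name `check_container_exit`, the stand-in exception classes, and the message texts are hypothetical and not taken from the PR.

```python
from collections.abc import Mapping
from typing import Any


class ServiceRuntimeError(Exception):
    """Stand-in for a generic 'service exited with an error' exception."""


class ServiceOutOfMemoryError(Exception):
    """Stand-in for the exception type introduced in this PR."""


def check_container_exit(container_inspect: Mapping[str, Any], service: str) -> None:
    """Raise a specific error if the container was OOM-killed or exited non-zero."""
    state = container_inspect.get("State", {})

    # Check OOMKilled first: an OOM-killed container also has a non-zero exit
    # code, and the more specific error is more useful to the user.
    if state.get("OOMKilled", False):
        raise ServiceOutOfMemoryError(
            f"Service {service} was killed because it ran out of memory."
        )

    exit_code = state.get("ExitCode", 0)
    if exit_code != 0:
        raise ServiceRuntimeError(f"Service {service} exited with code {exit_code}.")
```

For example, `check_container_exit({"State": {"OOMKilled": True, "ExitCode": 137}}, "demo")` raises `ServiceOutOfMemoryError` rather than the generic error.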
matusdrobuliak66
left a comment
Thanks a lot
pcrespov
left a comment
thx!
GitHK
left a comment
thanks
@mergify queue
🛑 Configuration not compatible with a branch protection setting
The branch protection setting
@sanderegg, maybe we should consider this an issue on our side and not bill the user? This can be done automatically if you send an ERROR state in the Rabbit message that is sent to RUT.



What do these changes do?
Hanging jobs are becoming more and more frequent. This change protects the platform from long-running jobs that are probably hanging, and it protects users from paying for jobs that have hung.
If a job produces no logs for 1h, a timeout triggers and the job is stopped.
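The log-silence timeout described above can be sketched with plain `asyncio`. This is a minimal illustration under stated assumptions, not the implementation in `docker_utils.py`: the function `consume_logs`, its parameters, and the stand-in exception class are hypothetical, and the log source is assumed to be an async iterator of lines.

```python
import asyncio
from collections.abc import AsyncIterator
from datetime import timedelta


class ServiceTimeoutLoggingError(Exception):
    """Stand-in for the exception type introduced in this PR."""


async def consume_logs(
    log_stream: AsyncIterator[str], max_silence: timedelta
) -> list[str]:
    """Collect log lines; fail if the stream stays silent longer than max_silence."""
    lines: list[str] = []
    iterator = log_stream.__aiter__()
    while True:
        try:
            # The timeout applies to each wait for the next line, so every new
            # line effectively resets the silence timer.
            line = await asyncio.wait_for(
                iterator.__anext__(), timeout=max_silence.total_seconds()
            )
        except StopAsyncIteration:
            # The stream ended normally: the job finished on its own.
            return lines
        except asyncio.TimeoutError:
            # The caller is expected to stop the job when it sees this error.
            raise ServiceTimeoutLoggingError(
                f"No logs received for {max_silence}; the job is assumed to hang."
            ) from None
        lines.append(line)
```

Because the timer restarts with each line, a job that runs for many hours but keeps logging is not affected; only continuous silence for the whole window triggers the error.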
Bonus:
Related issue/s
How to test
Dev-ops