Skip to content

Conversation

@sanderegg
Copy link
Member

@sanderegg sanderegg commented Oct 23, 2025

What do these changes do?

As it happens more and more this is a measure to protect the platform from long running jobs that are probably hanging and it protects the user from paying for jobs that are hanged.

If a job does not produce any logs during 1h a timeout triggers and will stop the job.

Bonus:

image

Related issue/s

How to test

Dev-ops

@sanderegg sanderegg self-assigned this Oct 23, 2025
@sanderegg sanderegg added a:dask-service Any of the dask services: dask-scheduler/sidecar or worker a:computational clusters labels Oct 23, 2025
@codecov
Copy link

codecov bot commented Oct 23, 2025

Codecov Report

❌ Patch coverage is 87.96296% with 13 lines in your changes missing coverage. Please review.
✅ Project coverage is 89.47%. Comparing base (de246e2) to head (69968f7).
⚠️ Report is 1 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #8549      +/-   ##
==========================================
+ Coverage   87.55%   89.47%   +1.92%     
==========================================
  Files        2012     1382     -630     
  Lines       78982    58411   -20571     
  Branches     1369      187    -1182     
==========================================
- Hits        69151    52264   -16887     
+ Misses       9425     6088    -3337     
+ Partials      406       59     -347     
Flag Coverage Δ
integrationtests 63.87% <ø> (-0.04%) ⬇️
unittests 87.75% <87.96%> (+1.46%) ⬆️
Components Coverage Δ
pkg_aws_library ∅ <ø> (∅)
pkg_celery_library ∅ <ø> (∅)
pkg_dask_task_models_library 79.37% <100.00%> (+0.36%) ⬆️
pkg_models_library ∅ <ø> (∅)
pkg_notifications_library ∅ <ø> (∅)
pkg_postgres_database ∅ <ø> (∅)
pkg_service_integration ∅ <ø> (∅)
pkg_service_library ∅ <ø> (∅)
pkg_settings_library ∅ <ø> (∅)
pkg_simcore_sdk 84.95% <ø> (ø)
agent 93.10% <ø> (ø)
api_server 91.35% <ø> (ø)
autoscaling 95.83% <ø> (ø)
catalog 92.06% <ø> (ø)
clusters_keeper 99.14% <ø> (ø)
dask_sidecar 91.51% <86.73%> (-0.32%) ⬇️
datcore_adapter 97.95% <ø> (ø)
director 75.72% <ø> (ø)
director_v2 90.90% <ø> (ø)
dynamic_scheduler 96.66% <ø> (ø)
dynamic_sidecar 90.44% <ø> (ø)
efs_guardian 89.83% <ø> (ø)
invitations 90.90% <ø> (ø)
payments 92.80% <ø> (ø)
resource_usage_tracker 92.16% <ø> (+0.05%) ⬆️
storage 86.84% <ø> (-0.09%) ⬇️
webclient ∅ <ø> (∅)
webserver 87.06% <ø> (+0.06%) ⬆️

Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update de246e2...69968f7. Read the comment docs.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@sanderegg sanderegg force-pushed the computational-backend/stop-running-job-if-no-feedback branch from e6ada92 to 69cb65c Compare October 23, 2025 06:55
@mergify
Copy link
Contributor

mergify bot commented Oct 23, 2025

🧪 CI Insights

Here's what we observed from your CI run for 69968f7.

✅ Passed Jobs With Interesting Signals

Pipeline Job Signal Health on master Retries 🔍 CI Insights 📄 Logs
CI system-tests You had a 25% chance of failing… lucky you! 🎲 Flaky Configure an automatic retry View View
unit-tests You had a 50% chance of failing… lucky you! 🎲 Flaky Configure an automatic retry View View

@sanderegg sanderegg force-pushed the computational-backend/stop-running-job-if-no-feedback branch 2 times, most recently from fe881f5 to 0d6cfc4 Compare October 30, 2025 07:58
@sanderegg sanderegg force-pushed the computational-backend/stop-running-job-if-no-feedback branch from c7b9919 to 904e796 Compare November 6, 2025 12:59
@sanderegg sanderegg changed the title 🎨Computational backend: Automatically stop a running job if no logs are detected for 1h ✨🎨Computational backend: Automatically stop a running job if no logs are detected for 1h Nov 6, 2025
@sanderegg sanderegg requested a review from Copilot November 6, 2025 18:32
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR implements automatic termination of computational jobs that produce no logs for 1 hour, protecting the platform from hanging jobs and preventing unnecessary charges. It also improves error handling for out-of-memory scenarios with better log visibility.

Key changes:

  • Added configurable timeout (DASK_SIDECAR_MAX_LOG_SILENCE_TIMEOUT) for monitoring service log activity
  • Introduced new exception types (ServiceTimeoutLoggingError, ServiceOutOfMemoryError) with detailed error messages
  • Enhanced error handling and logging throughout the computational sidecar execution flow

Reviewed Changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
services/dask-sidecar/src/simcore_service_dask_sidecar/settings.py Added configuration for maximum log silence timeout (default: 1 hour)
packages/dask-task-models-library/src/dask_task_models_library/container_tasks/errors.py Introduced new exception hierarchy with ServiceTimeoutLoggingError and ServiceOutOfMemoryError
services/dask-sidecar/src/simcore_service_dask_sidecar/computational_sidecar/docker_utils.py Implemented timeout detection in log monitoring and improved error handling
services/dask-sidecar/src/simcore_service_dask_sidecar/computational_sidecar/core.py Added OOMKilled detection and enhanced error reporting for service execution
services/dask-sidecar/src/simcore_service_dask_sidecar/worker.py Added exception handler to avoid dask serialization issues
services/dask-sidecar/tests/unit/test_computational_sidecar_tasks.py Added comprehensive tests for timeout and OOM scenarios
services/dask-sidecar/tests/unit/conftest.py Added RAM resources to test cluster configuration

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Collaborator

@matusdrobuliak66 matusdrobuliak66 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot

Copy link
Member

@pcrespov pcrespov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thx!

Copy link
Contributor

@GitHK GitHK left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks

@sanderegg sanderegg added the 🤖-automerge marks PR as ready to be merged for Mergify label Nov 7, 2025
@sanderegg
Copy link
Member Author

@mergify queue

@mergify
Copy link
Contributor

mergify bot commented Nov 7, 2025

queue

🛑 Configuration not compatible with a branch protection setting

The branch protection setting Require branches to be up to date before merging is not compatible with draft PR checks. To keep this branch protection enabled, update your Mergify configuration to enable in-place checks: set merge_queue.max_parallel_checks: 1, set every queue rule batch_size: 1, and avoid two-step CI (make merge_conditions identical to queue_conditions). Otherwise, disable this branch protection.

@matusdrobuliak66
Copy link
Collaborator

@sanderegg, maybe we should consider this an issue on our side and not bill the user? This can be done automatically if you send an ERROR state in the Rabbit message that is sent to RUT.

@sonarqubecloud
Copy link

sonarqubecloud bot commented Nov 7, 2025

@sanderegg sanderegg merged commit 059a777 into ITISFoundation:master Nov 7, 2025
146 of 148 checks passed
@sanderegg sanderegg deleted the computational-backend/stop-running-job-if-no-feedback branch November 7, 2025 15:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

🤖-automerge marks PR as ready to be merged for Mergify a:computational clusters a:dask-service Any of the dask services: dask-scheduler/sidecar or worker

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature]: Ensure manually aborted task also upload the logs if any

5 participants