giancarloromeo
Contributor

@giancarloromeo giancarloromeo commented Aug 20, 2025

What do these changes do?

This PR fixes a critical issue with Celery's Redis client lifecycle management where Redis connections were not properly cleaned up, leading to resource leaks. The fix ensures proper initialization and shutdown of Redis clients throughout the application.

  • Centralizes Redis client lifecycle management for Celery operations
  • Refactors worker initialization to remove redundant celery_settings parameter
  • Updates method names for consistency (lifespan → start_and_hold)
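
A minimal sketch of the lifecycle pattern described above, not the code from this PR. `FakeRedisClient`, `redis_client_lifespan`, and `demo` are hypothetical names used only for illustration. The client is created on entry and always shut down on exit, even if the body raises.

```python
import asyncio
from collections.abc import AsyncIterator
from contextlib import asynccontextmanager


class FakeRedisClient:
    """Hypothetical stand-in for a Redis client wrapper (not the real one)."""

    def __init__(self) -> None:
        self.is_open = False

    async def setup(self) -> None:
        self.is_open = True

    async def shutdown(self) -> None:
        self.is_open = False


@asynccontextmanager
async def redis_client_lifespan() -> AsyncIterator[FakeRedisClient]:
    client = FakeRedisClient()
    await client.setup()
    try:
        yield client
    finally:
        # Runs on normal exit and on errors, so the connection is not leaked.
        await client.shutdown()


async def demo() -> tuple[bool, bool]:
    async with redis_client_lifespan() as client:
        open_inside = client.is_open
    # After the block exits, the client has been shut down.
    return open_inside, client.is_open
```

Keeping setup and shutdown in one place makes it harder to forget the shutdown half, which is the kind of leak this PR addresses.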

BONUS:

This PR also fixes an issue that caused tests to hang. They now always complete, in about 3 minutes.

Related issue/s

How to test

cd packages/celery-library
make tests

Dev-ops

@giancarloromeo giancarloromeo self-assigned this Aug 20, 2025
@giancarloromeo giancarloromeo added a:celery-library a:storage issue related to storage service labels Aug 20, 2025
@giancarloromeo giancarloromeo added this to the Voyager milestone Aug 20, 2025

codecov bot commented Aug 20, 2025

Codecov Report

❌ Patch coverage is 30.76923% with 9 lines in your changes missing coverage. Please review.
✅ Project coverage is 86.70%. Comparing base (35e7048) to head (8125b5d).

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #8237      +/-   ##
==========================================
- Coverage   88.03%   86.70%   -1.34%     
==========================================
  Files        1919     1449     -470     
  Lines       74341    59854   -14487     
  Branches     1305      682     -623     
==========================================
- Hits        65449    51894   -13555     
+ Misses       8499     7733     -766     
+ Partials      393      227     -166     
Flag Coverage Δ
integrationtests 59.60% <ø> (-4.63%) ⬇️
unittests 86.29% <30.76%> (-0.39%) ⬇️
Components Coverage Δ
pkg_aws_library ∅ <ø> (∅)
pkg_celery_library 85.20% <100.00%> (-2.18%) ⬇️
pkg_dask_task_models_library ∅ <ø> (∅)
pkg_models_library ∅ <ø> (∅)
pkg_notifications_library ∅ <ø> (∅)
pkg_postgres_database ∅ <ø> (∅)
pkg_service_integration ∅ <ø> (∅)
pkg_service_library 72.37% <0.00%> (+0.02%) ⬆️
pkg_settings_library ∅ <ø> (∅)
pkg_simcore_sdk 65.29% <ø> (-19.75%) ⬇️
agent 93.53% <ø> (ø)
api_server 92.84% <ø> (ø)
autoscaling 95.89% <ø> (ø)
catalog 92.34% <ø> (ø)
clusters_keeper 99.13% <ø> (ø)
dask_sidecar 92.37% <ø> (+0.56%) ⬆️
datcore_adapter 97.94% <ø> (ø)
director 75.81% <ø> (ø)
director_v2 85.40% <ø> (-5.52%) ⬇️
dynamic_scheduler 96.27% <ø> (ø)
dynamic_sidecar 89.19% <ø> (-0.91%) ⬇️
efs_guardian 89.62% <ø> (ø)
invitations 91.44% <ø> (ø)
payments 92.61% <ø> (ø)
resource_usage_tracker 92.13% <ø> (+0.21%) ⬆️
storage ∅ <ø> (∅)
webclient ∅ <ø> (∅)
webserver 88.10% <ø> (-0.04%) ⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Contributor

mergify bot commented Aug 20, 2025

🧪 CI Insights

Here's what we observed from your CI run for 8125b5d.

❌ Failed Jobs

| Pipeline | Job               | Health on base branch | Retries |
|----------|-------------------|-----------------------|---------|
| CI       | integration-tests | Healthy               | 0       |
|          | system-tests      | Healthy               | 0       |
|          | unit-tests        | Healthy               | 0       |

@giancarloromeo giancarloromeo requested a review from Copilot August 20, 2025 19:54
Contributor

@Copilot Copilot AI left a comment

Reviewed Changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 1 comment.

| File | Description |
|------|-------------|
| services/storage/src/simcore_service_storage/modules/celery/__init__.py | Complete refactor to manage Redis client lifecycle with proper setup/shutdown |
| services/storage/src/simcore_service_storage/core/application.py | Updated to use new setup_celery function name and reorganized conditional logic |
| packages/service-library/src/servicelib/fastapi/celery/app_server.py | Added task_manager property and renamed lifespan method to start_and_hold |
| packages/service-library/src/servicelib/celery/app_server.py | Made task_manager abstract property and renamed lifespan method |
| packages/celery-library/src/celery_library/signals.py | Simplified worker initialization by removing redundant celery_settings parameter |
| packages/celery-library/src/celery_library/common.py | Removed create_task_manager function that was causing lifecycle issues |
| services/storage/tests/conftest.py | Updated test fixture to match simplified worker initialization |
| packages/celery-library/tests/conftest.py | Updated test fixtures with proper Redis client lifecycle management |
| packages/service-library/src/servicelib/celery/models.py | Fixed parameter name from task_context to task_filter |

@giancarloromeo giancarloromeo changed the title 🐛 Fix Celery's Redis client lifecycle 🐛 Ensure Proper Redis Client Shutdown in Celery Aug 22, 2025
@giancarloromeo giancarloromeo changed the title 🐛 Ensure Proper Redis Client Shutdown in Celery 🐛 Ensure proper Redis client shutdown in Celery Aug 22, 2025
@giancarloromeo giancarloromeo marked this pull request as ready for review August 22, 2025 11:17
Collaborator

@matusdrobuliak66 matusdrobuliak66 left a comment

Thanks

Contributor

@GitHK GitHK left a comment

Just some minor things

Member

@sanderegg sanderegg left a comment

Not sure I completely get where the issue was. Was it in the celery package tests?
Why are we not just using the real thing in the celery package tests instead of that complicated fake?

        task_manager: TaskManager = self.app.state.task_manager
        return task_manager

    async def start_and_hold(self, startup_completed_event: threading.Event) -> None:
Member

It is a lifespan, and the one problem I see here is that the return type is wrong. It should be AsyncIterator[None], which I think would remove the confusion.

Contributor Author

We don't yield anything here. This is the place in which the initialized FastAPI instance stays parked waiting for the shutdown event.
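
To make the distinction in this thread concrete, here is a hedged sketch contrasting the two shapes. It is not the code under review: the `shutdown_event` parameter, the `events` list, and `demo` are assumptions added so the example is self-contained. A lifespan-style function yields control back to its caller and is typed `AsyncIterator[None]`, while a parked coroutine yields nothing and returns `None` only once shutdown has been signalled.

```python
import asyncio
import threading
from collections.abc import AsyncIterator
from contextlib import asynccontextmanager

events: list[str] = []


@asynccontextmanager
async def lifespan() -> AsyncIterator[None]:
    # Lifespan style: the caller decides what runs between setup and teardown.
    events.append("setup")
    try:
        yield
    finally:
        events.append("teardown")


async def start_and_hold(
    startup_completed_event: threading.Event,
    shutdown_event: asyncio.Event,  # hypothetical: how shutdown is signalled here
) -> None:
    # Parked style: nothing is yielded; the coroutine itself waits for shutdown.
    async with lifespan():
        startup_completed_event.set()
        await shutdown_event.wait()


async def demo() -> bool:
    started = threading.Event()
    shutdown = asyncio.Event()
    task = asyncio.create_task(start_and_hold(started, shutdown))
    await asyncio.sleep(0)  # let the task run until it parks
    was_started = started.is_set()
    shutdown.set()
    await task
    return was_started
```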

Member

Since this is the primary entrypoint for this service, I would call it run_until_shutdown, which emphasizes the lifecycle clearly (and is reminiscent of the naming in the asyncio library).

Regarding @sanderegg's comment:

In other parts of the code, our approach is to provide a context-manager-like function that includes the setup and teardown parts in one place (see https://github.com/ITISFoundation/osparc-simcore/blob/master/packages/service-library/src/servicelib/fastapi/postgres_lifespan.py#L31C11-L31C37).

The approach here is different, since this member function encapsulates the setup and teardown parts AND runs them. That reduces flexibility, but I guess you do not need it here.

I understand this function can also only be called once, so I would add a guard for that.
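
One possible shape for such a call-once guard, as a sketch only. `AppServer` here is a hypothetical, simplified class and the `shutdown_event` parameter is an assumption, not the signature used in this PR.

```python
import asyncio


class AppServer:
    """Hypothetical, simplified server used only to illustrate the guard."""

    def __init__(self) -> None:
        self._started = False

    async def run_until_shutdown(self, shutdown_event: asyncio.Event) -> None:
        if self._started:
            msg = "run_until_shutdown() may only be called once per instance"
            raise RuntimeError(msg)
        self._started = True
        await shutdown_event.wait()


async def demo() -> str:
    server = AppServer()
    shutdown = asyncio.Event()
    shutdown.set()  # already signalled, so the first call returns immediately
    await server.run_until_shutdown(shutdown)
    try:
        await server.run_until_shutdown(shutdown)
    except RuntimeError as err:
        return str(err)
    return "no error"
```

Raising on the second call surfaces the misuse immediately instead of silently running the setup and teardown twice.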

Member

TIP: use log_context(INFO,...) instead of _logger.info

     @abstractmethod
-    async def lifespan(
+    async def start_and_hold(
Member

check my other comment about renaming this

Member

For interfaces, please add some documentation about what is expected, especially when the name does not reveal all the details.
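
A sketch of what such interface documentation could look like. `BaseAppServer` is a hypothetical, trimmed-down class. The listed expectations are assumptions inferred from this thread (the method parks until shutdown and yields nothing) and from the parameter name, not a statement of the actual contract.

```python
import threading
from abc import ABC, abstractmethod


class BaseAppServer(ABC):
    """Hypothetical, trimmed-down interface for illustration."""

    @abstractmethod
    async def start_and_hold(self, startup_completed_event: threading.Event) -> None:
        """Start the application and block until shutdown is requested.

        Implementations are expected to (assumed contract):
        - run the application's startup logic,
        - set ``startup_completed_event`` once startup has finished,
        - stay parked until a shutdown is signalled, then run teardown.

        Nothing is yielded or returned; the call ends only after teardown.
        """
```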

-    async def lifespan(self, startup_completed_event: threading.Event) -> None:
+    @property
+    def task_manager(self) -> TaskManager:
+        task_manager = self.app.state.task_manager
Member

The order in which the app state is set up is very important, and here I do not see how it is guaranteed. Can you please show me offline how the workflow works?
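
One way to make an ordering problem of this kind fail loudly, sketched under assumptions. `AppServerSketch` and the placeholder `TaskManager` class are hypothetical, and `SimpleNamespace` stands in for an application object with a `.state` attribute. This does not guarantee the setup order; it only turns an access before setup into an explicit error.

```python
from types import SimpleNamespace


class TaskManager:
    """Placeholder type for the sketch."""


class AppServerSketch:
    def __init__(self, app: SimpleNamespace) -> None:
        self.app = app  # stands in for an app object with a `.state` attribute

    @property
    def task_manager(self) -> TaskManager:
        task_manager = getattr(self.app.state, "task_manager", None)
        if not isinstance(task_manager, TaskManager):
            msg = "task_manager is not set on app.state; was Celery set up first?"
            raise RuntimeError(msg)
        return task_manager
```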

Development

Successfully merging this pull request may close these issues.

Celery worker not calling shutdown on Redis instance
5 participants