Describe the bug
I’m using Flower 1.20.0 with the standard ServerApp/ClientApp components. Each client trains a small FEMNIST CNN, torch.set_num_threads(1) is enforced in the client app, and the strategy is FedAvg(accept_failures=True).
Every client runs in its own container. The image is based on Ubuntu 22.04, bundles Python 3, the CPU build of PyTorch, Flower 1.20.0, and the usual training dependencies. Each container launches python -m flwr run clientapp=<client app module> --server-address <host:port> so that all containers connect to the same Flower server.
When a container starts, I pass its client ID, dataset partition path, and training hyperparameters (batch size, learning rate, local epochs, etc.) via environment variables/CLI arguments. The client process immediately reads its FEMNIST slice from a shared volume, creates the dataloaders, and then connects to the Flower server to wait for tasks.
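A sketch of how each container’s per-client configuration is consumed; the environment-variable names (CLIENT_ID, PARTITION_PATH, BATCH_SIZE, LR, LOCAL_EPOCHS) and defaults are assumptions for illustration, not taken from the actual project:

```python
import os

def load_client_config(env=os.environ):
    """Parse the per-container client settings from environment variables.

    All variable names and default values here are hypothetical.
    """
    return {
        "client_id": int(env.get("CLIENT_ID", "0")),
        "partition_path": env.get("PARTITION_PATH", "/data/femnist/part-0"),
        "batch_size": int(env.get("BATCH_SIZE", "32")),
        "lr": float(env.get("LR", "0.01")),
        "local_epochs": int(env.get("LOCAL_EPOCHS", "1")),
    }

# Example: container started with CLIENT_ID=3, BATCH_SIZE=64, LR=0.05
cfg = load_client_config({"CLIENT_ID": "3", "BATCH_SIZE": "64", "LR": "0.05"})
```

The dataloaders are then built from cfg["partition_path"] before the client connects to the server.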
On an 8‑vCPU host everything works as long as I start at most 4 concurrent Flower clients (≤50 % of the CPU cores). The moment I launch 5 or more clients, the server begins to log repeated failures: fit Failure: Message contains an Error (reason: Error: Message Unavailable - The requested message could not be found in the database. It may have expired due to its TTL or never existed.). It originated during client-side execution of a message. Some clients then fail to respond even though their processes are still alive. It looks like the server loses track of some in-flight FitIns results.
Environment
- Flower
flwr==1.20.0
- Python 3.12.3
- OS: Ubuntu 24.04.3 LTS
Steps/Code to Reproduce
Run the containerized setup at https://github.com/xiaoyanshen799/fleet and start 5 or more concurrent clients on an 8-vCPU host.
Expected Results
Clients should stay alive and respond to the server in every round.
Actual Results
In each round, some clients respond correctly while the others fail with: fit Failure: Message contains an Error (reason: Error: Message Unavailable - The requested message could not be found in the database. It may have expired due to its TTL or never existed.)