[Bug]: fit Failure: Message contains an Error (reason: Error: Message Unavailable - The requested message could not be found in the database. It may have expired due to its TTL or never existed.) #6185

@xiaoyanshen799

Describe the bug

I’m using Flower 1.20.0 with the standard ServerApp/ClientApp components. Each client runs a small FEMNIST CNN, torch.set_num_threads(1) is enforced in client_app, and the strategy is FedAvg(accept_failures=True).

Every client runs in its own container. The image is based on Ubuntu 22.04, bundles Python 3, the CPU build of PyTorch, Flower 1.20.0, and the usual training dependencies. Each container launches python -m flwr run clientapp=<client app module> --server-address <host:port> so that all containers connect to the same Flower server.

When a container starts, I pass its client ID, dataset partition path, and training hyperparameters (batch size, learning rate, local epochs, etc.) via environment variables/CLI arguments. The client process immediately reads its FEMNIST slice from a shared volume, creates the dataloaders, and then connects to the Flower server to wait for tasks.
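
For illustration, the per-container configuration described above could be read along these lines. This is only a minimal sketch: the environment variable names (CLIENT_ID, PARTITION_PATH, BATCH_SIZE, LEARNING_RATE, LOCAL_EPOCHS), the default values, and the ClientConfig / load_client_config names are all hypothetical and are not taken from the actual repository.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ClientConfig:
    """Settings one client container receives at startup (hypothetical layout)."""
    client_id: int
    partition_path: str
    batch_size: int
    learning_rate: float
    local_epochs: int


def load_client_config(env: Optional[Mapping[str, str]] = None) -> ClientConfig:
    """Build a ClientConfig from environment variables.

    The variable names and defaults below are assumptions for this sketch.
    CLIENT_ID and PARTITION_PATH are required; a missing one raises KeyError.
    """
    env = os.environ if env is None else env
    return ClientConfig(
        client_id=int(env["CLIENT_ID"]),
        partition_path=env["PARTITION_PATH"],
        batch_size=int(env.get("BATCH_SIZE", "32")),
        learning_rate=float(env.get("LEARNING_RATE", "0.01")),
        local_epochs=int(env.get("LOCAL_EPOCHS", "1")),
    )
```

Passing a plain dict instead of os.environ makes it easy to check the parsing outside a container.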

On an 8‑vCPU host everything works when I start at most 4 concurrent Flower clients (≤50 % of the CPU cores). As soon as I launch 5 or more clients, the server begins to repeatedly log: fit Failure: Message contains an Error (reason: Error: Message Unavailable - The requested message could not be found in the database. It may have expired due to its TTL or never existed.). It originated during client-side execution of a message. Some of the clients fail to respond even though their processes are still alive. It looks as if the server loses track of some in-flight FitIns results.

Environment

  • Flower flwr==1.20.0
  • Python 3.12.3
  • OS: Ubuntu 24.04.3 LTS

Steps/Code to Reproduce

Run the clients in this containerized setup: https://github.com/xiaoyanshen799/fleet

Expected Results

All clients should stay alive and respond to the server.

Actual Results

In each round, some clients respond correctly while the others do not, and the server logs the error: fit Failure: Message contains an Error (reason: Error: Message Unavailable - The requested message could not be found in the database. It may have expired due to its TTL or never existed.)
