
Commit 7e67171

Introducing Task Runner workspace to enable interoperability with Flower workloads (#1433)
Squashed commits:

* enable flwr workspace
* add tmp dir to minimize how much patching is needed
* add hash verification to setup data
* remove patch
* fix save location
* docstring
* remove superfluous edit
* update readme
* remove debugger
* update aggregator client
* formatting
* remove todo
* minor functionality fixes
* update docstrings
* update docstrings
* update README instructions
* do not add connector to settings unless connector exists
* check attribute for connector availability
* formatting
* formatting
* code cleanup
* fix readme
* update plan to reflect refactoring
* grammar fix
* remove __all__
* remove superfluous init
* change flwr home dir
* remove patch comment, .sort() added in flwr 1.16
* update name for grpc protocols and components
* remove Connector abc
* refactoring local grpc to interop
* import fixes
* remove duplicate self.callback
* fix hashes
* move into workspace for relative path installation
* update readme to remove connector ABC

Signed-off-by: kta-intel <[email protected]>
1 parent c589186 · commit 7e67171

37 files changed: +2161 −32 lines

openfl-docker/Dockerfile.workspace

Lines changed: 2 additions & 1 deletion

```diff
@@ -19,7 +19,8 @@ WORKDIR /
 ARG WORKSPACE_NAME
 COPY ${WORKSPACE_NAME}.zip /workspace.zip
 RUN fx workspace import --archive /workspace.zip && \
-    pip install --no-cache-dir -r /workspace/requirements.txt
+    cd /workspace && \
+    pip install --no-cache-dir -r ./requirements.txt

 # Build enclaves
 WORKDIR /workspace
```
openfl-workspace/flower-app-pytorch/.workspace

Whitespace-only changes.
Lines changed: 267 additions & 0 deletions
# Open(FL)ower

This workspace demonstrates new functionality in OpenFL for interoperating with [Flower](https://flower.ai/). In particular, a user can now use the Flower API to run workloads on OpenFL infrastructure. OpenFL acts as an intermediary between the Flower SuperLink and Flower SuperNode, relaying messages across the network using OpenFL's transport mechanisms.

## Overview

In this repository, you'll notice a directory under `src` called `app-pytorch`. This is essentially a Flower PyTorch app, created with Flower's `flwr new` command and modified to run a local federation. `client_app.py` and `server_app.py` dictate what is run by the client and server, respectively, while `task.py` defines the logic executed by each app, such as the model definition and the train/test tasks. In `server_app.py`, a section titled "Save Model" saves the `best.pbuf` and `last.pbuf` models from the experiment into your local workspace under `./save`. This uses native OpenFL logic to store the model as a `.pbuf` so that it can later be retrieved with `fx model save` in a native format (limited to `.npz` to stay deep-learning-framework agnostic), but it can be overridden to save the model directly, following Flower's recommended method for [saving model checkpoints](https://flower.ai/docs/framework/how-to-save-and-load-model-checkpoints.html).
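Flower's checkpointing guidance boils down to intercepting the aggregated parameters on the server side and serializing them each round. Here is a minimal framework-free sketch of that idea (the function name, file layout, and use of `pickle` are illustrative assumptions, not code from this workspace; a real server app would typically call `torch.save` on a `state_dict`):

```python
import pickle
from pathlib import Path

def save_checkpoint(parameters, round_num, save_dir="./save"):
    """Serialize aggregated model parameters after a round.

    `parameters` stands in for whatever the server aggregates
    (e.g. a list of weight arrays). Illustrative only.
    """
    save_dir = Path(save_dir)
    save_dir.mkdir(parents=True, exist_ok=True)
    path = save_dir / f"round_{round_num}.pkl"
    with open(path, "wb") as f:
        pickle.dump(parameters, f)
    return path
```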
## Getting Started

### Install OpenFL

Follow the [installation guide](https://openfl.readthedocs.io/en/latest/installation.html).

### Create a Workspace

Start by creating a workspace:

```sh
fx workspace create --template flower-app-pytorch --prefix my_workspace
cd my_workspace
```

This will create a workspace in your current working directory called `./my_workspace` and install the Flower app defined in `./app-pytorch`. This is where the experiment takes place.
### Configure the Experiment

Under `./plan`, you will find the familiar OpenFL YAML files that configure the experiment. `cols.yaml` and `data.yaml` are populated with the collaborators that will run the Flower client app and the respective data shard or directory each will train and test on.

`plan.yaml` configures the experiment itself. The Open-Flower integration makes a few key changes to `plan.yaml`:

1. Introduction of a new top-level key (`connector`) to configure a newly introduced component called `ConnectorFlower`. This component is run by the aggregator and is responsible for initializing the Flower `SuperLink` and connecting to the OpenFL server. The `SuperLink` parameters can be configured via `connector.settings.superlink_params`. If nothing is supplied, it simply runs `flower-superlink --insecure` with the command's default settings as dictated by Flower. It also includes the option to run the `flwr run` command via `connector.settings.flwr_run_params`. If `flwr_run_params` is not provided, the user is expected to run `flwr run <app>` from the aggregator machine to initiate the experiment.

```yaml
connector:
  defaults: plan/defaults/connector.yaml
  template: openfl.component.ConnectorFlower
  settings:
    superlink_params:
      insecure: True
      serverappio-api-address: 127.0.0.1:9091
      fleet-api-address: 127.0.0.1:9092
      exec-api-address: 127.0.0.1:9093
    flwr_run_params:
      flwr_app_name: "app-pytorch"
      federation_name: "local-poc"
```
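The keys under `superlink_params` mirror `flower-superlink`'s CLI flags. A hypothetical sketch of how such a settings dict could be flattened into a command line (the helper name and the exact translation rule are assumptions, not OpenFL's actual implementation):

```python
def superlink_command(superlink_params):
    """Translate a settings dict into flower-superlink CLI arguments.

    A boolean True becomes a bare flag (e.g. --insecure); any other
    value becomes a --key value pair. Illustrative only.
    """
    cmd = ["flower-superlink"]
    for key, value in superlink_params.items():
        if value is True:
            cmd.append(f"--{key}")
        else:
            cmd.extend([f"--{key}", str(value)])
    return cmd
```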
2. `FlowerTaskRunner`, which executes the `start_client_adapter` task. This task starts the Flower `SuperNode` and connects to the OpenFL client. The `FlowerTaskRunner` also has an additional setting, `auto_shutdown`, which defaults to `True`. When set to `True`, the task runner shuts down the `SuperNode` at the completion of an experiment; otherwise, it runs continuously.

```yaml
task_runner:
  defaults: plan/defaults/task_runner.yaml
  template: openfl.federated.task.runner_flower.FlowerTaskRunner
  settings:
    auto_shutdown: True
```
3. `FlowerDataLoader`, with similar high-level functionality to other OpenFL dataloaders.

**IMPORTANT NOTE**: `aggregator.settings.rounds_to_train` is set to 1. __Do not edit this__. The actual number of rounds for the experiment is controlled by Flower logic inside `./app-pytorch/pyproject.toml`; the entire Flower experiment runs within a single OpenFL round. Increasing this value will cause OpenFL to attempt to run the experiment again. The single aggregator round exists to stop the OpenFL components once the experiment completes.
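For reference, apps generated by `flwr new` typically carry the round count in their `pyproject.toml` under Flower's app config table. The snippet below is an assumption about that layout (exact key names depend on your Flower version), not an excerpt from this workspace:

```toml
# Hypothetical excerpt of ./app-pytorch/pyproject.toml
[tool.flwr.app.config]
num-server-rounds = 3  # Flower rounds, all run inside one OpenFL round
```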
4. `Task`: we introduce a `tasks_connector.yaml` that allows the collaborator to connect to the Flower framework via a local gRPC server. It also handles the task runner's `start_client_adapter` method, which actually starts the Flower component and the local gRPC server. Setting `local_server_port` to 0 makes the port dynamically allocated; this is mainly for local experiments, to avoid port collisions.

```yaml
tasks:
  settings:
    connect_to: Flower
  start_client_adapter:
    function: start_client_adapter
    kwargs:
      local_server_port: 0
```
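Binding to port 0 is the standard way to let the operating system pick a free port, which is presumably what the dynamic allocation behind `local_server_port: 0` relies on. A small stdlib illustration:

```python
import socket

def pick_free_port() -> int:
    """Bind to port 0 so the OS assigns an unused port, then return it.

    Mirrors the idea behind `local_server_port: 0`: the caller never
    hard-codes a port, so parallel local runs cannot collide.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]
```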
## Execution Methods

There are two ways to execute this workspace:

1. Automatic shutdown, which spawns a `server-app` in isolation and triggers experiment termination once it shuts down. (Default/Recommended)
2. Running the `SuperLink` and `SuperNode` as [long-lived components](#long-lived-superlink-and-supernode) that wait indefinitely for new runs. (Limited functionality)
## Running the Workspace

We proceed with the automatic shutdown method of execution. Run the workspace as normal (certify the workspace, initialize the plan, register the collaborators, etc.):
```SH
# Generate a Certificate Signing Request (CSR) for the Aggregator
fx aggregator generate-cert-request

# The CA signs the aggregator's request, which is now available in the workspace
fx aggregator certify --silent

# Initialize FL Plan and Model Weights for the Federation
fx plan initialize

################################
# Setup Collaborator 1
################################

# Create a collaborator named "collaborator1" that will use data from "data/1"
fx collaborator create -n collaborator1 -d data/1

# Generate a CSR for collaborator1
fx collaborator generate-cert-request -n collaborator1

# The CA signs collaborator1's certificate
fx collaborator certify -n collaborator1 --silent

################################
# Setup Collaborator 2
################################

# Create a collaborator named "collaborator2" that will use data from "data/2"
fx collaborator create -n collaborator2 -d data/2

# Generate a CSR for collaborator2
fx collaborator generate-cert-request -n collaborator2

# The CA signs collaborator2's certificate
fx collaborator certify -n collaborator2 --silent

##############################
# Start the Federation
##############################

# Run the Aggregator
fx aggregator start
```
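The collaborator onboarding steps above follow a fixed three-command pattern (create, generate CSR, certify), so scripting them for many collaborators is straightforward. A small illustrative helper (not part of the workspace):

```python
def collaborator_setup_cmds(name: str, data_path: str) -> list:
    """Return the fx command sequence that onboards one collaborator:
    create it, generate its CSR, and have the CA sign its certificate."""
    return [
        f"fx collaborator create -n {name} -d {data_path}",
        f"fx collaborator generate-cert-request -n {name}",
        f"fx collaborator certify -n {name} --silent",
    ]
```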
This will prepare the workspace and start the OpenFL aggregator, the Flower `SuperLink`, and the Flower `ServerApp`. You should see something like:

```SH
INFO 🧿 Starting the Aggregator Service.
.
.
.
INFO : Starting Flower SuperLink
WARNING : Option `--insecure` was set. Starting insecure HTTP server.
INFO : Flower Deployment Engine: Starting Exec API on 127.0.0.1:9093
INFO : Flower ECE: Starting ServerAppIo API (gRPC-rere) on 127.0.0.1:9091
INFO : Flower ECE: Starting Fleet API (GrpcAdapter) on 127.0.0.1:9092
.
.
.
INFO : [INIT]
INFO : Using initial global parameters provided by strategy
INFO : Starting evaluation of initial global parameters
INFO : Evaluation returned no results (`None`)
INFO :
INFO : [ROUND 1]
```
### Start Collaborators

Open two additional terminals, one per collaborator.

In collaborator 1's terminal, run:

```SH
fx collaborator start -n collaborator1
```

In collaborator 2's terminal, run:

```SH
fx collaborator start -n collaborator2
```

This starts the collaborator nodes, the Flower `SuperNode`, and the Flower `ClientApp`, and begins running the Flower experiment. You should see something like:

```SH
INFO 🧿 Starting a Collaborator Service.
.
.
.
INFO : Starting Flower SuperNode
WARNING : Option `--insecure` was set. Starting insecure HTTP channel to 127.0.0.1:...
INFO : Starting Flower ClientAppIo gRPC server on 127.0.0.1:...
INFO :
INFO : [RUN 297994661073077505, ROUND 1]
```
### Completion of the Experiment

Upon completion of the experiment, in the `aggregator` terminal, the Flower components should print an experiment summary while the `SuperLink` continues to receive requests from the `SuperNode`:

```SH
INFO : [SUMMARY]
INFO : Run finished 3 round(s) in 93.29s
INFO : History (loss, distributed):
INFO : round 1: 2.0937052175497555
INFO : round 2: 1.8027011854633406
INFO : round 3: 1.6812996898487116
```
If `automatic_shutdown` is enabled, this is shortly followed by the OpenFL `aggregator` receiving "results" from the collaborators and subsequently shutting down:

```SH
INFO Round 0: Collaborators that have completed all tasks: ['collaborator1', 'collaborator2']
INFO Experiment Completed. Cleaning up...
INFO Sending signal to collaborator collaborator2 to shutdown...
INFO Sending signal to collaborator collaborator1 to shutdown...
INFO [OpenFL Connector] Stopping server process with PID: ...
INFO : SuperLink terminated gracefully.
INFO [OpenFL Connector] Server process stopped.
```
Upon completion of the experiment, in the `collaborator` terminals, the Flower components should output information about the run:

```SH
INFO : [RUN ..., ROUND 3]
INFO : Received: evaluate message
INFO : Start `flwr-clientapp` process
INFO : [flwr-clientapp] Pull `ClientAppInputs` for token ...
INFO : [flwr-clientapp] Push `ClientAppOutputs` for token ...
```
If `automatic_shutdown` is enabled, this is shortly followed by the OpenFL `collaborator` shutting down:

```SH
INFO : SuperNode terminated gracefully.
INFO SuperNode process terminated.
INFO Shutting down local gRPC server...
INFO local gRPC server stopped.
INFO Waiting for tasks...
INFO Received shutdown signal. Exiting...
```

Congratulations, you have run a Flower experiment through OpenFL's task runner!
## Advanced Usage

### Long-lived SuperLink and SuperNode

You can set `automatic_shutdown: False` in the `Connector` settings of `plan.yaml`:

```yaml
connector:
  defaults: plan/defaults/connector.yaml
  template: openfl.component.ConnectorFlower
  settings:
    automatic_shutdown: False
```
By doing so, Flower's `ServerApp` and `ClientApp` still shut down at the completion of the Flower experiment, but the `SuperLink` and `SuperNode` continue to run. As a result, in the `aggregator` terminal you will see constant requests coming from the `SuperNode`:

```SH
INFO : GrpcAdapter.PullTaskIns
INFO : GrpcAdapter.PullTaskIns
INFO : GrpcAdapter.PullTaskIns
```

You can run another experiment by opening another terminal, navigating to this workspace, and running:

```SH
flwr run ./src/app-pytorch
```

Once you are done, you can manually shut down OpenFL's `collaborator` and Flower's `SuperNode` with `CTRL+C`. This triggers a task completion by the task runner, which then begins the graceful shutdown of the OpenFL and Flower components.
### Running in an SGX Enclave

Gramine does not support all Linux system calls. The Flower FAB is normally built and installed at runtime; during this, `utime()` is called, which is an [unsupported system call](https://gramine.readthedocs.io/en/latest/devel/features.html#list-of-system-calls) that results in errors or unexpected behavior. To navigate this, when running in an SGX enclave, we instead build and install the FAB during initialization and package it alongside the OpenFL workspace. To make this work, we introduce patches to Flower's build command that circumvent the unsupported system call and minimize read/write access.

To enable these patches, add `patch: True` to the `Connector` and `Task Runner` settings. For the `Task Runner`, also include the name of the Flower app to build and install:

```yaml
connector :
  defaults : plan/defaults/connector.yaml
  template : openfl.component.ConnectorFlower
  settings :
    superlink_params :
      insecure : True
      serverappio-api-address : 127.0.0.1:9091
      fleet-api-address : 127.0.0.1:9092
      exec-api-address : 127.0.0.1:9093
    flwr_run_params :
      flwr_app_name : "app-pytorch"
      federation_name : "local-poc"
    patch : True

task_runner :
  defaults : plan/defaults/task_runner.yaml
  template : openfl.federated.task.runner_flower.FlowerTaskRunner
  settings :
    patch : True
    flwr_app_name : "app-pytorch"
```
Lines changed: 5 additions & 0 deletions

```yaml
# Copyright (C) 2024 Intel Corporation
# Licensed subject to the terms of the separately executed evaluation license agreement between Intel Corporation and you.

collaborators:
```
Lines changed: 2 additions & 0 deletions

```yaml
# Copyright (C) 2024 Intel Corporation
# Licensed subject to the terms of the separately executed evaluation license agreement between Intel Corporation and you.
```
Lines changed: 62 additions & 0 deletions

```yaml
# Copyright (C) 2024 Intel Corporation
# Licensed subject to the terms of the separately executed evaluation license agreement between Intel Corporation and you.

aggregator :
  defaults : plan/defaults/aggregator.yaml
  template : src.aggregator.AggregatorFlower
  settings :
    rounds_to_train : 1 # DO NOT EDIT. This is to indicate OpenFL communication rounds
    persist_checkpoint : false
    write_logs : false

connector :
  defaults : plan/defaults/connector.yaml
  template : src.connector_flower.ConnectorFlower
  settings :
    superlink_params :
      insecure : True
      serverappio-api-address : 127.0.0.1:9091
      fleet-api-address : 127.0.0.1:9092
      exec-api-address : 127.0.0.1:9093
    flwr_run_params :
      flwr_app_name : "app-pytorch"
      federation_name : "local-poc"
    patch: True

collaborator :
  defaults : plan/defaults/collaborator.yaml
  template : src.collaborator.CollaboratorFlower

data_loader :
  defaults : plan/defaults/data_loader.yaml
  template : src.loader.FlowerDataLoader
  settings :
    collaborator_count : 2

task_runner :
  defaults : plan/defaults/task_runner.yaml
  template : src.runner.FlowerTaskRunner
  settings :
    flwr_app_name: app-pytorch
    patch: True

network :
  defaults : plan/defaults/network.yaml

assigner :
  defaults : plan/defaults/assigner.yaml
  template : openfl.component.RandomGroupedAssigner
  settings :
    task_groups :
      - name : Connector_Flower
        percentage : 1.0
        tasks :
          - start_client_adapter

tasks :
  defaults : plan/defaults/tasks_connector.yaml
  settings :
    connect_to : Flower

compression_pipeline :
  defaults : plan/defaults/compression_pipeline.yaml
```
Lines changed: 1 addition & 0 deletions

```
./src/app-pytorch
```
