Commit 11700f7

docs(trainer): fix architecture diagrams and documentation accuracy
- docker.md: show logs streaming from all nodes, clarify conditional cleanup
- podman.md: correct architecture text (Docker→Podman), align diagram with implementation, remove incorrect workflow details
- local_process.md: update diagram to reflect bash script generation and single subprocess execution

These changes address reviewer feedback and align documentation with actual SDK implementation.

Signed-off-by: sh4shv4t <shashvat.k.singh.16@gmail.com>
1 parent 72015bc commit 11700f7

3 files changed: +20 -25 lines changed

content/en/docs/components/trainer/user-guides/local-execution-mode/docker.md

Lines changed: 2 additions & 1 deletion
@@ -35,7 +35,8 @@ graph LR
 end

 C1 -->|4. Logs| Logs[Stream Logs]
-C1 -.->|5. Clean| Remove[Auto-Remove]
+C2 -->|4. Logs| Logs
+SDK -.->|"5. Cleanup (if auto_remove)"| Remove[Delete Containers & Network]
 ```

 ## Prerequisites

content/en/docs/components/trainer/user-guides/local-execution-mode/local_process.md

Lines changed: 7 additions & 8 deletions
@@ -225,18 +225,17 @@ The Local Process Backend operates by orchestrating native OS processes. It bypa
 graph LR
 User([User Script]) -->|TrainerClient.train| SDK[Kubeflow SDK]

-SDK -->|1. Create| Venv[Python Venv]
-Venv -->|2. Install| Deps[Dependencies]
-SDK -->|3. Extract| Script[Training Script .py]
+SDK -->|1. Generate| Script[Bash Script]

-subgraph LocalExec [Local Execution]
+subgraph LocalExec ["Single Subprocess (bash -c)"]
 direction TB
-Deps --> Process[Python Process]
-Script --> Process
+Script -->|2. Execute| Venv[Create Venv + pip]
+Venv --> Deps[Install Dependencies]
+Deps --> Train[Run Entrypoint]
+Train -.-> Clean["Delete Venv (if cleanup_venv)"]
 end

-Process -->|4. Logs| Logs[Stream Logs]
-Process -.->|5. Clean| Cleanup[Delete Venv]
+Train -->|3. Logs| Logs[Stream Logs]
 ```

 ## How It Works
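
Reviewer note: the updated local_process.md diagram describes one generated bash script executed in a single `bash -c` subprocess. A minimal Python sketch of that flow follows, to make the ordering concrete. It is not the SDK's code: `build_script`, `run_script`, and their parameters are hypothetical names chosen for illustration, and only `cleanup_venv` comes from the diagram.

```python
import shlex
import subprocess
from typing import Iterator, List


def build_script(venv_dir: str, packages: List[str], entrypoint: str,
                 cleanup_venv: bool = True) -> str:
    """Return one bash script covering venv creation, install, run, cleanup."""
    venv = shlex.quote(venv_dir)
    steps = [
        "set -e",                    # abort at the first failing step
        f"python3 -m venv {venv}",   # create the virtual environment
    ]
    if packages:
        pkgs = " ".join(shlex.quote(p) for p in packages)
        steps.append(f"{venv}/bin/pip install {pkgs}")
    steps.append(f"{venv}/bin/python {shlex.quote(entrypoint)}")
    if cleanup_venv:
        # Simplification: with `set -e`, this line is skipped when an
        # earlier step fails, so a failed run leaves the venv behind.
        steps.append(f"rm -rf {venv}")
    return "\n".join(steps)


def run_script(script: str) -> Iterator[str]:
    """Run the script in a single `bash -c` subprocess, yielding log lines."""
    proc = subprocess.Popen(
        ["bash", "-c", script],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr into the same log stream
        text=True,
    )
    assert proc.stdout is not None
    for line in proc.stdout:
        yield line.rstrip("\n")
    proc.wait()
```

Because every step lives in the one script, the caller only has to manage a single process and a single log stream, which matches the "Single Subprocess" box in the new diagram.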

content/en/docs/components/trainer/user-guides/local-execution-mode/podman.md

Lines changed: 11 additions & 16 deletions
@@ -170,35 +170,30 @@ backend_config = ContainerBackendConfig(

 ## Architecture

-The Container Backend with Docker uses a local orchestration layer to manage TrainJobs within Docker containers. This ensures environment parity between your local machine and production Kubernetes clusters.
+The Container Backend with Podman uses a local orchestration layer to manage TrainJobs within Podman containers. This ensures environment parity between your local machine and production Kubernetes clusters.

 ```mermaid
 graph LR
 User([User Script]) -->|TrainerClient.train| SDK[Kubeflow SDK]

-SDK -->|1. Prep| PodConfig[Podman Config]
-SDK -->|2. Mount| LocalDir[Local Dir Mounts]
-SDK -->|3. Exec| Podman[Podman CLI/API]
+SDK -->|1. Pull| Image[Podman Image]
+SDK -->|2. Net| Net[DNS-Enabled Bridge Network]
+SDK -->|3. Run| Podman[Podman Engine]

-subgraph PodmanEnv [Podman Container - Rootless]
+subgraph PodmanEnv [Local Podman Environment]
 direction TB
-Podman --> Process[Training Process]
-Process --> Security[User Namespace Isolation]
+Podman -->|Spawn| C1[Node 0]
+Podman -->|Spawn| C2[Node 1]
+C1 <-->|DDP| C2
 end

-Process -->|4. Logs| Logs[Stream Logs]
-Process -->|5. Clean| Exit[Exit & Cleanup]
+C1 -->|4. Logs| Logs[Stream Logs]
+C2 -->|4. Logs| Logs
+SDK -.->|"5. Cleanup (if auto_remove)"| Remove[Delete Containers & Network]
 ```



-### Workflow Detail
-1. **Image Management:** The SDK identifies the required training image. If `pull_policy` is set, it ensures the latest image is available.
-2. **Network Creation:** A dedicated Docker bridge network is created for the job to allow containers (nodes) to communicate via hostnames (e.g., `job-node-0`).
-3. **Container Spawning:** The SDK instructs the Docker Daemon to start containers. It injects environment variables like `MASTER_ADDR`, `MASTER_PORT`, `RANK`, and `WORLD_SIZE` to enable distributed frameworks (e.g., PyTorch DDP).
-4. **Log Streaming:** Logs are streamed from the containers back to the SDK's `TrainerClient`.
-5. **Lifecycle Management:** Once the training process exits, the SDK handles the removal of containers and the temporary network if `auto_remove=True`.
-
 ## Multi-Node Distributed Training

 The Podman backend automatically sets up networking and environment variables for distributed training:
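
Reviewer note: the hunk ends before the document lists those environment variables. As a hedged illustration of how a training script inside a node container might consume them, the sketch below reads `MASTER_ADDR`, `MASTER_PORT`, `RANK`, and `WORLD_SIZE`, the names that appear in the removed Workflow Detail text and that PyTorch's `env://` initialization conventionally uses. Whether the Podman backend injects exactly this set is an assumption; `DistEnv` and `read_dist_env` are hypothetical helpers, and the defaults are chosen only so the code also runs on a single node.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class DistEnv:
    master_addr: str
    master_port: int
    rank: int
    world_size: int


def read_dist_env(env: Mapping[str, str] = os.environ) -> DistEnv:
    """Parse distributed-training settings, falling back to single-node values."""
    return DistEnv(
        master_addr=env.get("MASTER_ADDR", "127.0.0.1"),
        master_port=int(env.get("MASTER_PORT", "29500")),
        rank=int(env.get("RANK", "0")),
        world_size=int(env.get("WORLD_SIZE", "1")),
    )
```

Passing a plain dictionary instead of `os.environ` makes the parsing easy to check without starting any containers.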