docs(trainer): fix architecture diagrams and documentation accuracy

sh4shv4t · sh4shv4t · commit 11700f742f6b · 2026-02-07T04:42:32.000+05:30
- docker.md: show logs streaming from all nodes, clarify conditional cleanup
- podman.md: correct architecture text (Docker→Podman), align diagram with implementation, remove incorrect workflow details
- local_process.md: update diagram to reflect bash script generation and single subprocess execution

These changes address reviewer feedback and align the documentation with the actual SDK implementation.

Signed-off-by: sh4shv4t <shashvat.k.singh.16@gmail.com>
diff --git a/content/en/docs/components/trainer/user-guides/local-execution-mode/docker.md b/content/en/docs/components/trainer/user-guides/local-execution-mode/docker.md
@@ -35,7 +35,8 @@ graph LR
     end
     
     C1 -->|4. Logs| Logs[Stream Logs]
-    C1 -.->|5. Clean| Remove[Auto-Remove]
+    C2 -->|4. Logs| Logs
+    SDK -.->|"5. Cleanup (if auto_remove)"| Remove[Delete Containers & Network]
 ```
 
 ## Prerequisites
diff --git a/content/en/docs/components/trainer/user-guides/local-execution-mode/local_process.md b/content/en/docs/components/trainer/user-guides/local-execution-mode/local_process.md
@@ -225,18 +225,17 @@ The Local Process Backend operates by orchestrating native OS processes. It bypa
 graph LR
     User([User Script]) -->|TrainerClient.train| SDK[Kubeflow SDK]
     
-    SDK -->|1. Create| Venv[Python Venv]
-    Venv -->|2. Install| Deps[Dependencies]
-    SDK -->|3. Extract| Script[Training Script .py]
+    SDK -->|1. Generate| Script[Bash Script]
     
-    subgraph LocalExec [Local Execution]
+    subgraph LocalExec ["Single Subprocess (bash -c)"]
         direction TB
-        Deps --> Process[Python Process]
-        Script --> Process
+        Script -->|2. Execute| Venv[Create Venv + pip]
+        Venv --> Deps[Install Dependencies]
+        Deps --> Train[Run Entrypoint]
+        Train -.-> Clean["Delete Venv (if cleanup_venv)"]
     end
     
-    Process -->|4. Logs| Logs[Stream Logs]
-    Process -.->|5. Clean| Cleanup[Delete Venv]
+    Train -->|3. Logs| Logs[Stream Logs]
 ```
 
 ## How It Works
diff --git a/content/en/docs/components/trainer/user-guides/local-execution-mode/podman.md b/content/en/docs/components/trainer/user-guides/local-execution-mode/podman.md
@@ -170,35 +170,30 @@ backend_config = ContainerBackendConfig(
 
 ## Architecture
 
-The Container Backend with Docker uses a local orchestration layer to manage TrainJobs within Docker containers. This ensures environment parity between your local machine and production Kubernetes clusters.
+The Container Backend with Podman uses a local orchestration layer to manage TrainJobs within Podman containers. This ensures environment parity between your local machine and production Kubernetes clusters.
 
 ```mermaid
 graph LR
     User([User Script]) -->|TrainerClient.train| SDK[Kubeflow SDK]
     
-    SDK -->|1. Prep| PodConfig[Podman Config]
-    SDK -->|2. Mount| LocalDir[Local Dir Mounts]
-    SDK -->|3. Exec| Podman[Podman CLI/API]
+    SDK -->|1. Pull| Image[Podman Image]
+    SDK -->|2. Net| Net[DNS-Enabled Bridge Network]
+    SDK -->|3. Run| Podman[Podman Engine]
     
-    subgraph PodmanEnv [Podman Container - Rootless]
+    subgraph PodmanEnv [Local Podman Environment]
         direction TB
-        Podman --> Process[Training Process]
-        Process --> Security[User Namespace Isolation]
+        Podman -->|Spawn| C1[Node 0]
+        Podman -->|Spawn| C2[Node 1]
+        C1 <-->|DDP| C2
     end
     
-    Process -->|4. Logs| Logs[Stream Logs]
-    Process -->|5. Clean| Exit[Exit & Cleanup]
+    C1 -->|4. Logs| Logs[Stream Logs]
+    C2 -->|4. Logs| Logs
+    SDK -.->|"5. Cleanup (if auto_remove)"| Remove[Delete Containers & Network]
 ```
 
 
 
-### Workflow Detail
-1. **Image Management:** The SDK identifies the required training image. If `pull_policy` is set, it ensures the latest image is available.
-2. **Network Creation:** A dedicated Docker bridge network is created for the job to allow containers (nodes) to communicate via hostnames (e.g., `job-node-0`).
-3. **Container Spawning:** The SDK instructs the Docker Daemon to start containers. It injects environment variables like `MASTER_ADDR`, `MASTER_PORT`, `RANK`, and `WORLD_SIZE` to enable distributed frameworks (e.g., PyTorch DDP).
-4. **Log Streaming:** Logs are streamed from the containers back to the SDK's `TrainerClient`.
-5. **Lifecycle Management:** Once the training process exits, the SDK handles the removal of containers and the temporary network if `auto_remove=True`.
-
 ## Multi-Node Distributed Training
 
 The Podman backend automatically sets up networking and environment variables for distributed training: