Merge --base-docker-image and --docker-image flag #585

Open
wants to merge 5 commits into
base: develop

Conversation

ycchenzheng
Collaborator

Fixes / Features

  • Merge --base-docker-image and --docker-image flag

Testing / Documentation

Tested with https://github.com/AI-Hypercomputer/maxtext/blob/wstcliyu/pw-405b-scale-test/benchmarks/recipes/pw_mcjax_benchmark_recipe.py for both mcjax and Pathways.
Changed https://github.com/AI-Hypercomputer/maxtext/blob/wstcliyu/pw-405b-scale-test/benchmarks/maxtext_xpk_runner.py#L624 to:

    docker_image_flag = f'--docker-image="{wl_config.base_docker_image}"'

mcjax uses RUNNER = "maxtext_base_image" as the runner image, and Pathways uses RUNNER = "gcr.io/tpu-prod-env-multipod/wstcliyu_latest:latest".
mcjax pushes the local maxtext_base_image to the remote registry for pods to pull, while Pathways pulls its image directly from the remote registry.
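The difference between the two runner values above can be illustrated with a small sketch. This is not XPK's implementation; `has_registry_host` is a hypothetical helper that applies Docker's usual image-reference convention (the first path component is a registry host if it contains `.` or `:`, or is `localhost`) to tell a local image name from a remote one.

```python
def has_registry_host(image_ref: str) -> bool:
    """Return True if image_ref names a registry host (hypothetical helper).

    Assumption: follows Docker's reference convention, where the part before
    the first "/" is a registry host only if it contains "." or ":" or is
    exactly "localhost". This is an illustration, not XPK's actual check.
    """
    first, sep, _rest = image_ref.partition("/")
    if not sep:
        # No "/" at all, e.g. "maxtext_base_image" or "python:3.10".
        return False
    return "." in first or ":" in first or first == "localhost"


# The two RUNNER values from the test description.
print(has_registry_host("maxtext_base_image"))
print(has_registry_host("gcr.io/tpu-prod-env-multipod/wstcliyu_latest:latest"))
```

Under this convention the mcjax value is a bare local name, which matches the build, tag, and push steps in the log, while the Pathways value already points at a registry.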
XPK log:

[XPK] Building /usr/local/google/home/chzheng/maxtext into docker image.
[XPK] Task: `Building script_dir into docker image` is implemented by `docker buildx build --platform=linux/amd64 -f /tmp/tmpvl105_wh -t chzheng-runner /usr/local/google/home/chzheng/maxtext`, streaming output live.
[+] Building 0.0s (0/1)                                                                                                                                  docker:default
[+] Building 0.9s (1/2)                                                                                                                                  docker:default
[+] Building 2.0s (6/8)                                                                                                                                  docker:default
[+] Building 3.0s (8/9)                                                                                                                                  docker:default
[+] Building 3.8s (9/9) FINISHED                                                                                                                         docker:default
 => [internal] load build definition from tmpvl105_wh                                                                                                              0.0s
 => => transferring dockerfile: 212B                                                                                                                               0.0s
 => [internal] load metadata for docker.io/library/python:3.10                                                                                                     1.3s
 => [internal] load .dockerignore                                                                                                                                  0.0s
 => => transferring context: 45B                                                                                                                                   0.0s
 => [1/4] FROM docker.io/library/python:3.10@sha256:6ff000548a4fa34c1be02624836e75e212d4ead8227b4d4381c3ae998933a922                                               0.0s
 => [internal] load build context                                                                                                                                  0.0s
 => => transferring context: 39.83kB                                                                                                                               0.0s
 => CACHED [2/4] WORKDIR /app                                                                                                                                      0.0s
 => [3/4] COPY . .                                                                                                                                                 1.3s
 => [4/4] WORKDIR /app                                                                                                                                             0.0s
 => exporting to image                                                                                                                                             1.0s
 => => exporting layers                                                                                                                                            1.0s
 => => writing image sha256:8f7f59fdd22171fa0ac861a9e7559c4c58f80978950a7bb04eaf2ee37f004ffd                                                                       0.0s
 => => naming to docker.io/library/chzheng-runner                                                                                                                  0.0s
Waiting for `chzhe-pw-2-wtf`, for 14 seconds
[XPK] Task: `Building script_dir into docker image` terminated with code `0`
[XPK] Adding Docker Image: gcr.io/tpu-prod-env-one-vm/chzheng-runner:xitg-2025-08-08-17-56-39 to tpu-prod-env-one-vm
[XPK] Task: `Tag Docker Image` is implemented by `docker tag chzheng-runner gcr.io/tpu-prod-env-one-vm/chzheng-runner:xitg-2025-08-08-17-56-39`, streaming output live.
Waiting for `chzhe-pw-2-wtf`, for 15 seconds
[XPK] Task: `Tag Docker Image` terminated with code `0`
[XPK] Task: `Upload Docker Image` is implemented by `docker push gcr.io/tpu-prod-env-one-vm/chzheng-runner:xitg-2025-08-08-17-56-39`, streaming output live.
Waiting for `chzhe-pw-2-wtf`, for 16 seconds
Waiting for `chzhe-pw-2-wtf`, for 17 seconds
The push refers to repository [gcr.io/tpu-prod-env-one-vm/chzheng-runner]
5f70bf18a086: Layer already exists 
917a4b2a5731: Pushing [=>                                                 ]  7.141MB/191.7MB
917a4b2a5731: Pushing [=====>                                             ]  20.51MB/191.7MB
917a4b2a5731: Pushing [========>                                          ]   33.3MB/191.7MB
917a4b2a5731: Pushing [===========>                                       ]  44.44MB/191.7MB
917a4b2a5731: Pushing [==============>                                    ]  54.42MB/191.7MB
917a4b2a5731: Pushing [================>                                  ]  64.45MB/191.7MB
917a4b2a5731: Pushing [====================>                              ]  77.25MB/191.7MB
917a4b2a5731: Pushing [=======================>                           ]  88.93MB/191.7MB
917a4b2a5731: Pushing [==========================>                        ]  101.2MB/191.7MB
917a4b2a5731: Pushing [=============================>                     ]  112.9MB/191.7MB
917a4b2a5731: Pushing [================================>                  ]  124.6MB/191.7MB
917a4b2a5731: Pushing [===================================>               ]  135.1MB/191.7MB
917a4b2a5731: Pushing [======================================>            ]  146.3MB/191.7MB
917a4b2a5731: Pushing [========================================>          ]  156.3MB/191.7MB
917a4b2a5731: Pushing [===========================================>       ]    168MB/191.7MB
917a4b2a5731: Pushing [==============================================>    ]  179.6MB/191.7MB
917a4b2a5731: Pushing [=================================================> ]  190.2MB/191.7MB
917a4b2a5731: Pushed 
Waiting for `chzhe-pw-2-wtf`, for 27 seconds
Waiting for `chzhe-pw-2-wtf`, for 28 seconds
Waiting for `chzhe-pw-2-wtf`, for 29 seconds
Waiting for `chzhe-pw-2-wtf`, for 30 seconds
Waiting for `chzhe-pw-2-wtf`, for 31 seconds
Waiting for `chzhe-pw-2-wtf`, for 32 seconds
Waiting for `chzhe-pw-2-wtf`, for 33 seconds
Waiting for `chzhe-pw-2-wtf`, for 34 seconds
Waiting for `chzhe-pw-2-wtf`, for 35 seconds
Waiting for `chzhe-pw-2-wtf`, for 36 seconds
xitg-2025-08-08-17-56-39: digest: sha256:766a71cc50f9c8100c98ccba5d451dbbb82d15f6b275ee884f1b2cb278153f36 size: 2420
[XPK] Task: `Upload Docker Image` terminated with code `0`
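
The three tasks in the log above (build, tag, upload) can be summarized as a short sketch. The function name `image_publish_commands` and its parameters are hypothetical; only the `docker` command lines themselves are taken from the log. The sketch composes the commands without running them.

```python
def image_publish_commands(script_dir, dockerfile, local_name, project, tag):
    """Compose the build, tag, and push commands seen in the XPK log.

    Hypothetical helper for illustration: it only builds argument lists and
    does not invoke docker. The gcr.io/<project>/<name>:<tag> layout mirrors
    the image name that appears in the log.
    """
    remote = f"gcr.io/{project}/{local_name}:{tag}"
    return [
        # "Building script_dir into docker image"
        ["docker", "buildx", "build", "--platform=linux/amd64",
         "-f", dockerfile, "-t", local_name, script_dir],
        # "Tag Docker Image"
        ["docker", "tag", local_name, remote],
        # "Upload Docker Image"
        ["docker", "push", remote],
    ]


commands = image_publish_commands(
    script_dir="/usr/local/google/home/chzheng/maxtext",
    dockerfile="/tmp/tmpvl105_wh",
    local_name="chzheng-runner",
    project="tpu-prod-env-one-vm",
    tag="xitg-2025-08-08-17-56-39",
)
for command in commands:
    print(" ".join(command))
```

Each list could be passed to `subprocess.run` in a real script; here the values are the ones from this test run.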

Pod log:

Events:
  Type     Reason                           Age                From               Message
  ----     ------                           ----               ----               -------
  Normal   Scheduled                        32s                default-scheduler  Successfully assigned default/chzhe-pw-2-wtf-slice-job-1-0-t8nn6 to gke-tpu-e96bd525-tn7c
  Normal   Pulling                          32s                kubelet            Pulling image "gcr.io/tpu-prod-env-one-vm/chzheng-runner:xitg-2025-08-08-17-56-39"
  Normal   Pulled                           28s                kubelet            Successfully pulled image "gcr.io/tpu-prod-env-one-vm/chzheng-runner:xitg-2025-08-08-17-56-39" in 4.365s (4.365s including waiting). Image size: 455215023 bytes.
  Normal   Created                          28s                kubelet            Created container: jax-tpu
  Normal   Started                          27s                kubelet            Started container jax-tpu
  Warning  FailedToRetrieveImagePullSecret  26s (x3 over 32s)  kubelet            Unable to retrieve some image pull secrets (None); attempting to pull the image may not succeed.

  • [ y ] Tests pass
  • [ y ] Appropriate changes to documentation are included in the PR

@ycchenzheng ycchenzheng self-assigned this Aug 8, 2025
@ycchenzheng
Collaborator Author

@SujeethJinesh

@ycchenzheng ycchenzheng force-pushed the chzheng/docker_image_flag branch from 3d26153 to 954377f Compare August 8, 2025 18:39
@ycchenzheng ycchenzheng force-pushed the chzheng/docker_image_flag branch 2 times, most recently from 36b0cfc to 95e6fa0 Compare August 11, 2025 01:04
@ycchenzheng ycchenzheng force-pushed the chzheng/docker_image_flag branch from 95e6fa0 to b70f1b8 Compare August 11, 2025 18:05
@SujeethJinesh
Collaborator

Hmm, I'm rethinking if we should be merging these flags together. I think we should still support both of these flags, but when we're using the benchmark runner in maxtext, we should support Pathways being able to use --base-docker-image or --docker-image

@ycchenzheng
Collaborator Author

> Hmm, I'm rethinking if we should be merging these flags together. I think we should still support both of these flags, but when we're using the benchmark runner in maxtext, we should support Pathways being able to use --base-docker-image or --docker-image

https://github.com/AI-Hypercomputer/xpk/blob/chzheng/docker_image_flag/src/xpk/core/docker_image.py#L228 checks --docker-image first, then --base-docker-image, and finally falls back to DEFAULT_DOCKER_IMAGE.
This change is still backward compatible.
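
A minimal sketch of that lookup order, assuming an argparse-style namespace. The function names and the placeholder value of `DEFAULT_DOCKER_IMAGE` are illustrative assumptions, not the code in `docker_image.py`.

```python
import argparse

# Placeholder value for illustration only; the real constant lives in XPK.
DEFAULT_DOCKER_IMAGE = "python:3.10"


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser exposing both flags so old invocations still work."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--docker-image", default=None)
    parser.add_argument("--base-docker-image", default=None)
    return parser


def resolve_image(args: argparse.Namespace) -> str:
    """Prefer --docker-image, then --base-docker-image, then the default."""
    return args.docker_image or args.base_docker_image or DEFAULT_DOCKER_IMAGE


parser = build_parser()
print(resolve_image(parser.parse_args(["--base-docker-image", "maxtext_base_image"])))
print(resolve_image(parser.parse_args([])))
```

Because the older flag is still consulted when the newer one is absent, existing commands that pass only `--base-docker-image` keep resolving to the same image.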

Collaborator

@SujeethJinesh SujeethJinesh left a comment

Thanks Zheng!

Collaborator

@SujeethJinesh SujeethJinesh left a comment

Thanks Zheng!

Once the commented code is removed, then it looks good to me.

@ycchenzheng ycchenzheng force-pushed the chzheng/docker_image_flag branch from 6bf4461 to 08a7f36 Compare August 12, 2025 21:45
@ycchenzheng
Collaborator Author

> Thanks Zheng!
>
> Once the commented code is removed, then it looks good to me.

Done

@SujeethJinesh
Collaborator

@scaliby Would you be able to take a look at this PR?

2 participants