- 
                Notifications
    You must be signed in to change notification settings 
- Fork 58
          Merge --base-docker-image and --docker-image flag
          #585
        
          New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
3d26153    to
    954377f      
    Compare
  
    36b0cfc    to
    95e6fa0      
    Compare
  
    95e6fa0    to
    b70f1b8      
    Compare
  
    | Hmm, I'm rethinking if we should be merging these flags together. I think we should still support both of these flags, but when we're using the benchmark runner in maxtext, we should support Pathways being able to use  | 
| 
 https://github.com/AI-Hypercomputer/xpk/blob/chzheng/docker_image_flag/src/xpk/core/docker_image.py#L228 will check  | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Zheng!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Zheng!
Once the commented code is removed, then it looks good to me.
6bf4461    to
    08a7f36      
    Compare
  
    | 
 Done | 
| @scaliby Would you be able to take a look at this PR? | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this change! Could you please also add XPK execution command to the XPK log, so I will see with what arguments XPK has been executed?
        
          
                src/xpk/core/docker_image.py
              
                Outdated
          
        
      |  | ||
| Args: | ||
| args: user provided arguments for running the command. | ||
| is_cloud_image = any( | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need distinction between cloud and local images? What if I push my local ubuntu image to gcr and want to use it as a local image? It will be treated as a cloud image.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right. It should be based on the flag --script-dir
I will update it and verify again
@SujeethJinesh
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is updated to check --script-dir and is_cloud_image to determine if building a new image is needed.
I attached the commands and logs below
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need this is_cloud_image check at all? IMO this is completely wrong, as if I as a researcher store my image outside of GCP I might still want to build scripts right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The original purpose of is_cloud_image is to check if it is a local image, if so, we need to push the local image to cloud for pods to pull.
  if (
      args.script_dir and args.script_dir != DEFAULT_SCRIPT_DIR
  ) or not is_cloud_image:
    validate_code = validate_docker_image(docker_image, args)
    if validate_code != 0:
      xpk_exit(validate_code)
    build_code, docker_image = build_docker_image_from_base_image(args)
    if build_code != 0:
      xpk_exit(build_code)
And you are right, it will force the user to build and push the image if the image is outside of GCP.
I will update it. Thanks for the review!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@scaliby I updated the logic and please review it one more time.
Following is the logs:
Local image(maxtext_base_image):
XPK will build and push the image
[XPK] Docker image: maxtext_base_image
[XPK] Local image: True, build and push
[XPK] Building /usr/local/google/home/chzheng/maxtext into docker image.
[XPK] Task: `Building script_dir into docker image` is implemented by `docker buildx build --platform=linux/amd64 -f /tmp/tmplf2ae_b1 -t chzheng-runner /usr/local/google/home/chzheng/maxtext`, streaming output live.
Waiting for `chz-pw-llama3--3-e1g`, for 9 secondsr image`, for 0 seconds...
[+] Building 0.8s (1/2)                                                                                                                                  docker:default
[+] Building 1.7s (6/8)                                                                                                                                  docker:default
[+] Building 2.5s (9/9) FINISHED                                                                                                                         docker:default
 => [internal] load build definition from tmplf2ae_b1                                                                                                              0.0s
 => => transferring dockerfile: 212B                                                                                                                               0.0s
 => [internal] load metadata for docker.io/library/python:3.10                                                                                                     1.3s
 => [internal] load .dockerignore                                                                                                                                  0.0s
 => => transferring context: 45B                                                                                                                                   0.0s
 => [1/4] FROM docker.io/library/python:3.10@sha256:b568817037b2e14fb0d5d0deb972556ac15a915eb85e254bbc66e4eae446aa54                                               0.0s
 => [internal] load build context                                                                                                                                  0.1s
 => => transferring context: 66.74kB                                                                                                                               0.1s
 => CACHED [2/4] WORKDIR /app                                                                                                                                      0.0s
 => [3/4] COPY . .                                                                                                                                                 0.6s
 => [4/4] WORKDIR /app                                                                                                                                             0.0s
 => exporting to image                                                                                                                                             0.5s
 => => exporting layers                                                                                                                                            0.5s
 => => writing image sha256:caf3e50099b114251c19a2481570758515b5051fec14654144b9ff865c500ed6                                                                       0.0s
 => => naming to docker.io/library/chzheng-runner                                                                                                                  0.0s
[XPK] Task: `Building script_dir into docker image` terminated with code `0`
[XPK] Adding Docker Image: gcr.io/cloud-tpu-multipod-dev/chzheng-runner:adsc-2025-09-26-20-04-54 to cloud-tpu-multipod-dev
[XPK] Task: `Tag Docker Image` is implemented by `docker tag chzheng-runner gcr.io/cloud-tpu-multipod-dev/chzheng-runner:adsc-2025-09-26-20-04-54`, streaming output live.
Waiting for `chz-pw-llama3--3-e1g`, for 12 secondss...
[XPK] Task: `Tag Docker Image` terminated with code `0`
[XPK] Task: `Upload Docker Image` is implemented by `docker push gcr.io/cloud-tpu-multipod-dev/chzheng-runner:adsc-2025-09-26-20-04-54`, streaming output live.
Waiting for `chz-pw-llama3--3-e1g`, for 13 secondsonds...
The push refers to repository [gcr.io/cloud-tpu-multipod-dev/chzheng-runner]
5f70bf18a086: Preparing 
5f70bf18a086: Layer already exists 
ad25f8cf280d: Pushing [=======>                                           ]  11.03MB/72.57MB
ad25f8cf280d: Pushing [================>                                  ]  24.29MB/72.57MB
ad25f8cf280d: Pushing [=========================>                         ]  37.45MB/72.57MB
ad25f8cf280d: Pushing [=================================>                 ]  48.58MB/72.57MB
ad25f8cf280d: Pushing [========================================>          ]   58.6MB/72.57MB
ad25f8cf280d: Pushing [================================================>  ]   70.3MB/72.57MB
ad25f8cf280d: Pushed 
607ddfe5f3c3: Layer already exists 
185e04da9d94: Layer already exists 
Waiting for `chz-pw-llama3--3-e1g`, for 15 secondsonds...
Waiting for `chz-pw-llama3--3-e1g`, for 16 secondsonds...
Waiting for `chz-pw-llama3--3-e1g`, for 17 secondsonds...
Waiting for `chz-pw-llama3--3-e1g`, for 18 secondsonds...
Waiting for `chz-pw-llama3--3-e1g`, for 19 secondsonds...
Waiting for `chz-pw-llama3--3-e1g`, for 20 secondsonds...
Waiting for `chz-pw-llama3--3-e1g`, for 21 secondsonds...
Waiting for `chz-pw-llama3--3-e1g`, for 22 secondsonds...
adsc-2025-09-26-20-04-54: digest: sha256:dd5b481a3ad529d67fb0d24c0d025293b4df6858383b3b8a0662fb61ff267bb5 size: 2420
[XPK] Task: `Upload Docker Image` terminated with code `0`
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  12s   default-scheduler  Successfully assigned default/chz-pw-llama3--3-e1g-pathways-head-0-0-jq8rh to gke-pw-scale-test-v5e-32-cpu-np-157a9698-zl4g
  Normal  Pulling    12s   kubelet            Pulling image "us-docker.pkg.dev/cloud-tpu-v2-images-dev/pathways/unsanitized_server:latest"
  Normal  Pulled     12s   kubelet            Successfully pulled image "us-docker.pkg.dev/cloud-tpu-v2-images-dev/pathways/unsanitized_server:latest" in 271ms (271ms including waiting). Image size: 189202517 bytes.
  Normal  Created    12s   kubelet            Created container: pathways-rm
  Normal  Started    12s   kubelet            Started container pathways-rm
  Normal  Pulling    11s   kubelet            Pulling image "us-docker.pkg.dev/cloud-tpu-v2-images-dev/pathways/unsanitized_proxy_server:latest"
  Normal  Pulled     11s   kubelet            Successfully pulled image "us-docker.pkg.dev/cloud-tpu-v2-images-dev/pathways/unsanitized_proxy_server:latest" in 315ms (315ms including waiting). Image size: 176032745 bytes.
  Normal  Created    11s   kubelet            Created container: pathways-proxy
  Normal  Started    11s   kubelet            Started container pathways-proxy
  Normal  Pulling    10s   kubelet            Pulling image "gcr.io/cloud-tpu-multipod-dev/chzheng-runner:adsc-2025-09-26-20-04-54"
  Normal  Pulled     10s   kubelet            Successfully pulled image "gcr.io/cloud-tpu-multipod-dev/chzheng-runner:adsc-2025-09-26-20-04-54" in 330ms (330ms including waiting). Image size: 432209311 bytes.
  Normal  Created    10s   kubelet            Created container: jax-tpu
  Normal  Started    10s   kubelet            Started container jax-tpu
Cloud image outside of GCP (python:3.12):
XPK will not build image if no --script-dir, since it is outside of GCP, validate_docker_image will return 0 and output no logs
[XPK] Docker image: python:3.12
Waiting for `chz-pw-llama3--3-7xs`, for 9 seconds
[XPK] Remote image, move forward
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  37s   default-scheduler  Successfully assigned default/chz-pw-llama3--3-7xs-pathways-head-0-0-8vl6r to gke-pw-scale-test-v5e-32-cpu-np-157a9698-zl4g
  Normal  Pulling    38s   kubelet            Pulling image "us-docker.pkg.dev/cloud-tpu-v2-images-dev/pathways/unsanitized_server:latest"
  Normal  Pulled     38s   kubelet            Successfully pulled image "us-docker.pkg.dev/cloud-tpu-v2-images-dev/pathways/unsanitized_server:latest" in 367ms (367ms including waiting). Image size: 189202517 bytes.
  Normal  Created    38s   kubelet            Created container: pathways-rm
  Normal  Started    38s   kubelet            Started container pathways-rm
  Normal  Pulling    37s   kubelet            Pulling image "us-docker.pkg.dev/cloud-tpu-v2-images-dev/pathways/unsanitized_proxy_server:latest"
  Normal  Pulled     36s   kubelet            Successfully pulled image "us-docker.pkg.dev/cloud-tpu-v2-images-dev/pathways/unsanitized_proxy_server:latest" in 247ms (247ms including waiting). Image size: 176032745 bytes.
  Normal  Created    36s   kubelet            Created container: pathways-proxy
  Normal  Started    36s   kubelet            Started container pathways-proxy
  Normal  Pulling    36s   kubelet            Pulling image "python:3.12"
  Normal  Pulled     35s   kubelet            Successfully pulled image "python:3.12" in 225ms (225ms including waiting). Image size: 410344954 bytes.
  Normal  Created    35s   kubelet            Created container: jax-tpu
  Normal  Started    35s   kubelet            Started container jax-tpu
Cloud image(gcr.io/tpu-prod-env-one-vm/chzheng_latest:latest):
XPK will not build image if no --script-dir
[XPK] Docker image: gcr.io/tpu-prod-env-one-vm/chzheng_latest:latest
[XPK] Remote image, move forward
[XPK] Task: `Validate Docker Image` is implemented by `gcloud container images describe gcr.io/tpu-prod-env-one-vm/chzheng_latest:latest --project cloud-tpu-multipod-dev`, hiding output unless there is an error.
Waiting for `chz-pw-llama3--3-866`, for 10 seconds
Waiting for `chz-pw-llama3--3-866`, for 11 seconds
[XPK] Task: `Validate Docker Image` succeeded.
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  119s  default-scheduler  Successfully assigned default/chz-pw-llama3--3-866-pathways-head-0-0-9rqjk to gke-pw-scale-test-v5e-32-cpu-np-157a9698-zl4g
  Normal  Pulling    119s  kubelet            Pulling image "us-docker.pkg.dev/cloud-tpu-v2-images-dev/pathways/unsanitized_server:latest"
  Normal  Pulled     119s  kubelet            Successfully pulled image "us-docker.pkg.dev/cloud-tpu-v2-images-dev/pathways/unsanitized_server:latest" in 357ms (357ms including waiting). Image size: 189202517 bytes.
  Normal  Created    119s  kubelet            Created container: pathways-rm
  Normal  Started    119s  kubelet            Started container pathways-rm
  Normal  Pulling    119s  kubelet            Pulling image "us-docker.pkg.dev/cloud-tpu-v2-images-dev/pathways/unsanitized_proxy_server:latest"
  Normal  Pulled     118s  kubelet            Successfully pulled image "us-docker.pkg.dev/cloud-tpu-v2-images-dev/pathways/unsanitized_proxy_server:latest" in 375ms (375ms including waiting). Image size: 176032745 bytes.
  Normal  Created    118s  kubelet            Created container: pathways-proxy
  Normal  Started    118s  kubelet            Started container pathways-proxy
  Normal  Pulling    118s  kubelet            Pulling image "gcr.io/tpu-prod-env-one-vm/chzheng_latest:latest"
  Normal  Pulled     117s  kubelet            Successfully pulled image "gcr.io/tpu-prod-env-one-vm/chzheng_latest:latest" in 420ms (420ms including waiting). Image size: 1573876003 bytes.
  Normal  Created    117s  kubelet            Created container: jax-tpu
  Normal  Started    117s  kubelet            Started container jax-tpu
| 
 It will call https://github.com/AI-Hypercomputer/maxtext/blob/main/benchmarks/maxtext_xpk_runner.py which calls  It will build and push the image to remote And the pod event: If using a remote docker image, the XPK command: It will let the pod pull the target image directly from the remote  | 
a32daf1    to
    f27717a      
    Compare
  
    f27717a    to
    16e76b1      
    Compare
  
    | @scaliby can you please review this PR again? | 
| It's very hard for me to see all of the consequences of merging these two flags, as I'm not familiar with all of the usage cases enough (yet). Is merging of these flags just for convenience / simplification, or is there another reason to do it? | 
3bb4c1b    to
    a6b919e      
    Compare
  
    a6b919e    to
    64527de      
    Compare
  
    | 
 @SujeethJinesh can you please share more context here? | 
Fixes / Features
--base-docker-imageand--docker-imageflagTesting / Documentation
Tested with https://github.com/AI-Hypercomputer/maxtext/blob/wstcliyu/pw-405b-scale-test/benchmarks/recipes/pw_mcjax_benchmark_recipe.py for both
mcjaxandpathwaysChanged https://github.com/AI-Hypercomputer/maxtext/blob/wstcliyu/pw-405b-scale-test/benchmarks/maxtext_xpk_runner.py#L624 to
mcjaxusesRUNNER = "maxtext_base_image"andpathwaysusesRUNNER="gcr.io/tpu-prod-env-multipod/wstcliyu_latest:latest"as runner imagemcjaxwill push localmaxtext_base_imageto remote for pods to pull andpathwayswill pull images directly from the remote.XPK log:
Pod log: