@clumsy clumsy commented Oct 16, 2025

Adding support for a custom Kubernetes pod overlay.

This fixes #1067 and #1068.

This way we don't need to add a runopt for each spec parameter.

Test plan:
[x] added unit tests

@meta-cla meta-cla bot added the CLA Signed label Oct 16, 2025

clumsy commented Oct 16, 2025

Please take a look, @kiukchung and @d4l3k, and let me know if this is something that looks good.

I'll then include more testing evidence if that's the case.

@clumsy clumsy force-pushed the feat/k8s_pod_overlay branch from 5d53679 to 1eaaa28 Compare October 16, 2025 21:02
@clumsy clumsy force-pushed the feat/k8s_pod_overlay branch from 1eaaa28 to ce37a51 Compare October 17, 2025 13:12

codecov-commenter commented Oct 17, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 91.59%. Comparing base (79e14da) to head (661bfa3).
⚠️ Report is 4 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1148      +/-   ##
==========================================
- Coverage   91.65%   91.59%   -0.07%     
==========================================
  Files          83       83              
  Lines        6483     6615     +132     
==========================================
+ Hits         5942     6059     +117     
- Misses        541      556      +15     
Flag Coverage Δ
unittests 91.59% <100.00%> (-0.07%) ⬇️

@clumsy clumsy force-pushed the feat/k8s_pod_overlay branch from ce37a51 to 661bfa3 Compare October 19, 2025 21:27

@kiukchung kiukchung left a comment

@kiukchung

> Please take a look, @kiukchung and @d4l3k, and let me know if this is something that looks good.
>
> I'll then include more testing evidence if that's the case.

Thanks for the update! LGTM. One small comment:

  1. Should lists be appended to when overlaying? (We do this for our internal overlays.)

clumsy commented Oct 20, 2025

That's a good question, @kiukchung! Technically we override keys for dicts, so by default it would probably be least surprising if we also override lists, but I think we can add support for appending as well.
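
To make the override behaviour concrete, here is a minimal sketch of a merge in which nested dicts are merged key by key and lists are replaced wholesale. It illustrates the semantics described above and is not the code in this PR; `deep_merge` and the sample pod dicts are hypothetical names and data.

```python
import copy
from typing import Any, Dict


def deep_merge(base: Dict[str, Any], overlay: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new dict: nested dicts merge key by key, any other value
    (lists included) from the overlay replaces the base value."""
    merged = copy.deepcopy(base)
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            # Lists and scalars are overridden, not appended to.
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical pod dict and overlay, for illustration only.
pod = {"spec": {"nodeSelector": {"zone": "a"}, "tolerations": [{"key": "old"}]}}
overlay = {"spec": {"nodeSelector": {"gpu": "true"}, "tolerations": [{"key": "new"}]}}
result = deep_merge(pod, overlay)
```

With this sketch, `nodeSelector` ends up holding both keys while `tolerations` contains only the overlay's entry, so appending would need an extra convention on top.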
I'm not sure what the best approach is, though. I can think of several ways:

  1. Prefix keys with +, e.g. +tolerations to append to tolerations. That should be OK for k8s, where the spec is fixed, but what if +<key> is a legitimate key for some other scheduler? Alternatively, we can use something like key! to explicitly override.
  2. Use a special dict, e.g. {"tolerations": {"__append__": [{"key": "foo"}]}} in a dict and tolerations: {__append__: [{key: foo}]} in YAML.
  3. Use separate keys for update and append, e.g. role.metadata["kubernetes"]["update"] = {"key": "value"} and role.metadata["kubernetes"]["append"] = {}.
  4. Use a YAML-style tag, e.g. tolerations: !extend [*tolerations, {key: new, operator: Exists}], and handle it in the Python lists/dicts.
  5. Pass a lambda as role.metadata["kubernetes"] and let the user express the desired transformation:
role.metadata = {
    "kubernetes": lambda pod_dict: {
        **pod_dict,
        "spec": {
            **pod_dict["spec"],
            "tolerations": pod_dict["spec"].get("tolerations", []) + [{"key": "new"}]
        }
    }
}

and we can load the overlay from YAML (or any other source) in code:

import yaml
import fsspec

with fsspec.open("file:///path/to/overlay.yaml", "r") as f:
    overlay_dict = yaml.safe_load(f)

role.metadata = {
    "kubernetes": lambda pod: {
        **pod,
        "spec": {
            **pod["spec"],
            **overlay_dict.get("spec", {}),
            "tolerations": pod["spec"].get("tolerations", [])
            + overlay_dict.get("spec", {}).get("tolerations", []),
        },
    },
}
  6. Apply RFC 6902 (JSON Patch) on the client side:
{
  "kubernetes": [
    {
      "op": "add",
      "path": "/spec/nodeSelector/gpu",
      "value": "true"
    },
    {
      "op": "add",
      "path": "/spec/tolerations/-",
      "value": {
        "key": "nvidia.com/gpu",
        "operator": "Exists"
      }
    }
  ]
}
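
For the JSON Patch option, a rough sketch of what client-side application could look like is below. It handles only the "add" operation used in the example and is not a full RFC 6902 implementation; `apply_add_ops` and the sample pod dict are hypothetical. A library such as `jsonpatch` would be the more realistic choice if this option were picked.

```python
import copy
from typing import Any, Dict, List


def apply_add_ops(doc: Dict[str, Any], patch: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Apply RFC 6902 "add" operations to a copy of doc (other ops are rejected)."""
    result = copy.deepcopy(doc)
    for op in patch:
        if op["op"] != "add":
            raise NotImplementedError(f"unsupported op: {op['op']}")
        if not op["path"].startswith("/"):
            raise ValueError("this sketch does not support patching the document root")
        # JSON Pointer (RFC 6901): split on "/" and unescape ~1 -> "/", then ~0 -> "~".
        tokens = [t.replace("~1", "/").replace("~0", "~") for t in op["path"].split("/")[1:]]
        target: Any = result
        for token in tokens[:-1]:
            # Walk to the parent; it must already exist, as RFC 6902 requires.
            target = target[int(token)] if isinstance(target, list) else target[token]
        last = tokens[-1]
        if isinstance(target, list):
            # "-" means "append at the end of the array".
            index = len(target) if last == "-" else int(last)
            target.insert(index, copy.deepcopy(op["value"]))
        else:
            target[last] = copy.deepcopy(op["value"])
    return result


# Hypothetical pod dict; the patch mirrors the example above.
pod = {"spec": {"nodeSelector": {}, "tolerations": [{"key": "old"}]}}
patch = [
    {"op": "add", "path": "/spec/nodeSelector/gpu", "value": "true"},
    {"op": "add", "path": "/spec/tolerations/-",
     "value": {"key": "nvidia.com/gpu", "operator": "Exists"}},
]
patched = apply_add_ops(pod, patch)
```

One caveat of this option: "add" requires the parent location to exist, so appending to /spec/tolerations/- fails on a pod that has no tolerations list yet.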

I don't want to invent anything new here, @kiukchung. Ideally it should rely only on standard tools, so perhaps a lambda would be best, especially since TorchX is investing in programmatic approaches.
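
As a rough illustration of the lambda approach, the sketch below shows how a scheduler could invoke a user-supplied callable on the generated pod dict. How the scheduler would actually call it is an assumption; `apply_callable_overlay`, `append_toleration`, and the sample pod dict are hypothetical.

```python
import copy
from typing import Any, Callable, Dict

PodDict = Dict[str, Any]


def apply_callable_overlay(pod_dict: PodDict, overlay: Callable[[PodDict], PodDict]) -> PodDict:
    """Run the user's transformation on a deep copy, so the original pod dict
    stays untouched even if the callable mutates its argument."""
    return overlay(copy.deepcopy(pod_dict))


def append_toleration(pod: PodDict) -> PodDict:
    """User-side transformation: keep everything, append one toleration."""
    spec = pod.get("spec", {})
    return {
        **pod,
        "spec": {**spec, "tolerations": spec.get("tolerations", []) + [{"key": "new"}]},
    }


# Hypothetical pod dict, for illustration only.
pod = {"metadata": {"name": "trainer"}, "spec": {"tolerations": [{"key": "old"}]}}
patched = apply_callable_overlay(pod, append_toleration)
```

The append-versus-override decision then lives entirely in user code, so no new merge syntax is needed. The trade-off is that a callable cannot be written directly in a YAML file.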

@clumsy clumsy force-pushed the feat/k8s_pod_overlay branch from 661bfa3 to 483f7af Compare October 20, 2025 18:05

Labels

CLA Signed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add node selector to KubernetesScheduler run opts

4 participants