feat: add option to validate k8s spec (#1152) #1153

clumsy · 2025-10-21T13:58:15Z

Added a new validate_spec option for the Kubernetes scheduler. It defaults to False for now, but could perhaps be turned on by default later.

Test plan:
[x] added unit tests

clumsy · 2025-10-21T13:58:48Z

Please take a look @kiukchung @d4l3k @andywag @tonykao8080

torchx/schedulers/kubernetes_scheduler.py

kiukchung

LGTM overall, thanks for adding this! If there is no downside (e.g. validation takes too long) to enabling this by default, then we should.

codecov-commenter · 2025-10-23T16:10:18Z

Codecov Report

❌ Patch coverage is 95.00000% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 91.58%. Comparing base (5957532) to head (0ada93d).

Files with missing lines	Patch %	Lines
torchx/schedulers/kubernetes_scheduler.py	95.00%	1 Missing ⚠️

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #1153   +/-   ##
=======================================
  Coverage   91.57%   91.58%           
=======================================
  Files          83       83           
  Lines        6599     6617   +18     
=======================================
+ Hits         6043     6060   +17     
- Misses        556      557    +1

Flag	Coverage Δ
unittests	`91.58% <95.00%> (+<0.01%)`	⬆️


kiukchung · 2025-10-23T19:07:40Z

@azzhipa can you fix this type error:

torchx/schedulers/kubernetes_scheduler.py:681:23 Undefined attribute [16]: `object` has no attribute `__getitem__`.
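For context, Pyre reports error [16] when a value typed as `object` is subscripted. The sketch below shows how such an error typically arises and two common ways to narrow the type. The `resource` dict and both helper functions are hypothetical illustrations, not the code at line 681.

```python
from typing import Dict, List, cast

# Hypothetical resource shape, for illustration only.
resource: Dict[str, object] = {
    "metadata": {"name": "demo-job"},
    "spec": {"tasks": [{"name": "worker"}]},
}

# Writing resource["metadata"]["name"] directly would be flagged, because
# resource["metadata"] is typed as `object`, which has no `__getitem__`.


def job_name(res: Dict[str, object]) -> str:
    metadata = res["metadata"]
    # Runtime narrowing: after the isinstance check the value is known to be a dict.
    if not isinstance(metadata, dict):
        raise TypeError("metadata must be a mapping")
    return str(metadata["name"])


def task_names(res: Dict[str, object]) -> List[str]:
    # Alternative: cast when the shape is known ahead of time (no runtime check).
    spec = cast(Dict[str, List[Dict[str, str]]], res["spec"])
    return [task["name"] for task in spec["tasks"]]
```

The `isinstance` variant adds a runtime check, while `cast` only informs the type checker; which one fits depends on how much the shape of the value can be trusted.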

clumsy · 2025-10-23T19:45:16Z

Done, @kiukchung!
Set the default value to True.

Once we remove runopts this check can stay - it should be tested extensively by the time we remove these runopts.

kiukchung · 2025-10-23T22:05:51Z

Sounds good.

More pyre fixes:

torchx/schedulers/test/kubernetes_scheduler_test.py:114:16 Unused ignore [0]: The `pyre-ignore[16]` or `pyre-fixme[16]` comment is not suppressing type errors, please remove it.
torchx/schedulers/test/kubernetes_scheduler_test.py:138:12 Unused ignore [0]: The `pyre-ignore[16]` or `pyre-fixme[16]` comment is not suppressing type errors, please remove it.
torchx/schedulers/test/kubernetes_scheduler_test.py:520:56 Unused ignore [0]: The `pyre-ignore[16]` or `pyre-fixme[16]` comment is not suppressing type errors, please remove it.
torchx/schedulers/test/kubernetes_scheduler_test.py:929:51 Unused ignore [0]: The `pyre-ignore[16]` or `pyre-fixme[16]` comment is not suppressing type errors, please remove it.

I think the macOS unit test failure suggests that the validation code tries to hit a public URL? If that is the case, we'd have to make sure we turn off validation for unit tests and add a mock validator for the test case that asserts the validation codepath.
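A minimal sketch of that testing approach, assuming a hypothetical `FakeScheduler` with a `_validate_spec` hook (these names are illustrative and are not the torchx classes): most tests opt out of validation, and one test patches the validator to assert that the validation codepath is taken.

```python
import unittest
from typing import Any, Dict
from unittest import mock


class FakeScheduler:
    """Hypothetical stand-in for a scheduler; not the torchx class."""

    def _validate_spec(self, resource: Dict[str, Any]) -> None:
        # A real implementation would call the cluster API (network access).
        raise RuntimeError("network access is not available in unit tests")

    def submit(self, resource: Dict[str, Any], validate_spec: bool = True) -> str:
        if validate_spec:
            self._validate_spec(resource)
        return resource["metadata"]["name"]


class SubmitTest(unittest.TestCase):
    def test_validation_disabled(self) -> None:
        # Regular tests turn validation off so no remote call is attempted.
        name = FakeScheduler().submit(
            {"metadata": {"name": "job"}}, validate_spec=False
        )
        self.assertEqual(name, "job")

    def test_validation_codepath(self) -> None:
        resource = {"metadata": {"name": "job"}}
        # Replace the validator with a mock and assert that it was invoked.
        with mock.patch.object(FakeScheduler, "_validate_spec") as validator:
            FakeScheduler().submit(resource, validate_spec=True)
        validator.assert_called_once_with(resource)
```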

clumsy · 2025-10-24T13:19:25Z

Yes. Weirdly, I get these false positives locally (probably because the package is not installed in editable mode and an old version is being picked up):

torchx/schedulers/test/kubernetes_scheduler_test.py:114:16 Undefined attribute [16]: `object` has no attribute `__getitem__`.
torchx/schedulers/test/kubernetes_scheduler_test.py:137:12 Undefined attribute [16]: `object` has no attribute `__getitem__`.
torchx/schedulers/test/kubernetes_scheduler_test.py:518:16 Undefined attribute [16]: `object` has no attribute `__getitem__`.
torchx/schedulers/test/kubernetes_scheduler_test.py:927:24 Undefined attribute [16]: `object` has no attribute `__getitem__`.

Rebased and removed these for the remote check to pass, @kiukchung

Let me know if you're interested in migrating to ty. Although experimental, it has worked quite solidly for me.

kiukchung · 2025-10-24T17:06:19Z

Unfortunately, we need to keep the type checker in sync with the one we use internally (which is pyre) :(

kiukchung · 2025-10-24T17:07:12Z

@clumsy I'm going to figure out a way to give you access to kick off the CI workflow so that you're not blocked on me to validate that CI passes.

clumsy · 2025-10-24T21:23:26Z

As I was verifying a new version, I had an idea, @kiukchung.
Does it make sense to move this check to the torchx run --dryrun path? We do call create_namespaced_custom_object with dry_run=All, after all.

clumsy · 2025-10-24T21:31:59Z

Not sure whether this would violate the contract of Runner.dryrun in the true sense: https://meta-pytorch.org/torchx/main/runner.html#torchx.runner.Runner.dryrun

For now I'll just fix tests, let me know your preference please, @kiukchung

clumsy · 2025-10-24T22:07:40Z

Done, this should pass the linter, pyre and tests now, @kiukchung.
I think I still prefer an actual k8s dry_run: All (with a remote call) during the actual submission (vs. torchx run --dryrun), as in the original proposal, but now enabled by default (per your suggestion). I think this is safe, since we make the exact same call later without the dry_run option anyway.
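The flow described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: the `submit` helper, the `RecordingApi` test double, and the Volcano group/version/plural values are assumptions; only `create_namespaced_custom_object` and the `dry_run="All"` option come from the discussion.

```python
from typing import Any, Dict, List, Protocol


class CustomObjectsLike(Protocol):
    """Minimal interface this sketch needs from a Kubernetes custom objects client."""

    def create_namespaced_custom_object(self, **kwargs: Any) -> Dict[str, Any]: ...


def submit(
    api: CustomObjectsLike,
    namespace: str,
    resource: Dict[str, Any],
    validate_spec: bool = True,
) -> Dict[str, Any]:
    # Assumed coordinates of the custom resource (illustrative values).
    request: Dict[str, Any] = dict(
        group="batch.volcano.sh",
        version="v1alpha1",
        namespace=namespace,
        plural="jobs",
        body=resource,
    )
    if validate_spec:
        # Server-side dry run: the API server validates the object but does
        # not persist it. An invalid spec is expected to raise here.
        api.create_namespaced_custom_object(**request, dry_run="All")
    # The real submission is the exact same call without the dry_run option.
    return api.create_namespaced_custom_object(**request)


class RecordingApi:
    """Test double that records calls instead of contacting a cluster."""

    def __init__(self) -> None:
        self.calls: List[Dict[str, Any]] = []

    def create_namespaced_custom_object(self, **kwargs: Any) -> Dict[str, Any]:
        self.calls.append(kwargs)
        return kwargs["body"]
```

Passing the client in as a parameter keeps the sketch runnable without a cluster; the extra cost of validation is one additional round trip per submission.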

kiukchung · 2025-10-24T22:32:10Z

@clumsy I just sent you an invite (check the email associated with your GitHub account). Once you accept, you should be able to kick off the CI workflow.

clumsy · 2025-10-24T22:38:47Z

Thanks, @kiukchung - I can see all the controls now, will use next time to finalize the change!

meta-cla bot added the CLA Signed label Oct 21, 2025

kiukchung reviewed Oct 23, 2025

kiukchung approved these changes Oct 23, 2025

clumsy force-pushed the feat/k8s_validate_pod_spec branch from e7ecaec to 1d663e8 on October 23, 2025, 19:45

feat: add option to validate k8s spec (meta-pytorch#1152)

c1a6a40

clumsy force-pushed the feat/k8s_validate_pod_spec branch from 1d663e8 to 40b89f2 on October 24, 2025, 13:16

clumsy force-pushed the feat/k8s_validate_pod_spec branch from 40b89f2 to c1a6a40 on October 24, 2025, 22:04

Merge branch 'main' into feat/k8s_validate_pod_spec

0ada93d
