@clumsy
Collaborator

@clumsy clumsy commented Oct 21, 2025

Added a new validate_spec option for the Kubernetes scheduler (defaults to False for now, but it could perhaps be turned on by default later).

Test plan:
[x] added unit tests

@meta-cla meta-cla bot added the CLA Signed label Oct 21, 2025
@clumsy
Collaborator Author

clumsy commented Oct 21, 2025

Please take a look @kiukchung @d4l3k @andywag @tonykao8080

Contributor

@kiukchung kiukchung left a comment

LGTM overall, thanks for adding this! If there is no downside (e.g. validation takes too long) to enabling this by default, then we should.

@codecov-commenter

codecov-commenter commented Oct 23, 2025

Codecov Report

❌ Patch coverage is 95.00000% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 91.58%. Comparing base (5957532) to head (0ada93d).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| torchx/schedulers/kubernetes_scheduler.py | 95.00% | 1 Missing ⚠️ |
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1153   +/-   ##
=======================================
  Coverage   91.57%   91.58%           
=======================================
  Files          83       83           
  Lines        6599     6617   +18     
=======================================
+ Hits         6043     6060   +17     
- Misses        556      557    +1     
| Flag | Coverage Δ |
|---|---|
| unittests | 91.58% <95.00%> (+<0.01%) ⬆️ |

@kiukchung
Contributor

@azzhipa can you fix this type error:

torchx/schedulers/kubernetes_scheduler.py:681:23 Undefined attribute [16]: `object` has no attribute `__getitem__`.
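
For context, pyre reports error [16] of this shape when code indexes into a value whose static type is `object`, which commonly happens with nested dicts typed as `Dict[str, object]`. Below is a minimal sketch of the pattern and two ways to narrow the type. The `resource` dict and the function names are hypothetical; this is not the actual code at line 681.

```python
from typing import Dict, cast


def get_pod_name(resource: Dict[str, object]) -> str:
    # resource["metadata"] is typed as `object`, so indexing it directly
    # (resource["metadata"]["name"]) is the kind of access pyre flags.
    metadata = resource["metadata"]
    # Narrow with isinstance so the type checker knows this is a dict.
    if not isinstance(metadata, dict):
        raise TypeError("expected `metadata` to be a dict")
    return str(metadata["name"])


def get_pod_name_cast(resource: Dict[str, object]) -> str:
    # Alternative: assert the shape with typing.cast (no runtime check).
    metadata = cast(Dict[str, object], resource["metadata"])
    return str(metadata["name"])
```

The `isinstance` variant also fails loudly at runtime on unexpected input, while `cast` only informs the type checker.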

@clumsy clumsy force-pushed the feat/k8s_validate_pod_spec branch from e7ecaec to 1d663e8 Compare October 23, 2025 19:45
@clumsy
Collaborator Author

clumsy commented Oct 23, 2025

Done, @kiukchung!
Set the default value to True.

Once we remove runopts this check can stay - it should be tested extensively by the time we remove these runopts.

@kiukchung
Contributor

kiukchung commented Oct 23, 2025

sounds good,

more pyre fixes:

torchx/schedulers/test/kubernetes_scheduler_test.py:114:16 Unused ignore [0]: The `pyre-ignore[16]` or `pyre-fixme[16]` comment is not suppressing type errors, please remove it.
torchx/schedulers/test/kubernetes_scheduler_test.py:138:12 Unused ignore [0]: The `pyre-ignore[16]` or `pyre-fixme[16]` comment is not suppressing type errors, please remove it.
torchx/schedulers/test/kubernetes_scheduler_test.py:520:56 Unused ignore [0]: The `pyre-ignore[16]` or `pyre-fixme[16]` comment is not suppressing type errors, please remove it.
torchx/schedulers/test/kubernetes_scheduler_test.py:929:51 Unused ignore [0]: The `pyre-ignore[16]` or `pyre-fixme[16]` comment is not suppressing type errors, please remove it.

I think the macos unittest failure suggests that the validation code tries to hit a public URL? If this is the case then we'd have to make sure we turn off validation for unittests and add a mock validator for the test-case that asserts the validation codepath.
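
A rough sketch of that testing approach, using only `unittest.mock`. `FakeScheduler` and its `_validate_spec` method are hypothetical stand-ins invented for illustration, not the torchx scheduler or its real method names.

```python
import unittest
from unittest import mock


class FakeScheduler:
    """Hypothetical stand-in for a scheduler; not the torchx class."""

    def _validate_spec(self, resource: dict) -> None:
        # A real scheduler would call out to the cluster here.
        raise RuntimeError("network access is not available in unit tests")

    def schedule(self, resource: dict, validate_spec: bool = True) -> str:
        if validate_spec:
            self._validate_spec(resource)
        return "app-id"


class SchedulerTest(unittest.TestCase):
    def test_schedule_without_validation(self) -> None:
        # Most tests turn validation off so no remote call is attempted.
        app_id = FakeScheduler().schedule({}, validate_spec=False)
        self.assertEqual(app_id, "app-id")

    def test_schedule_with_mocked_validation(self) -> None:
        # One test asserts the validation codepath against a mock validator.
        with mock.patch.object(FakeScheduler, "_validate_spec") as validator:
            app_id = FakeScheduler().schedule({"kind": "Job"}, validate_spec=True)
        validator.assert_called_once_with({"kind": "Job"})
        self.assertEqual(app_id, "app-id")
```

Patching at the class level keeps the test independent of how the scheduler instance is constructed.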

@clumsy clumsy force-pushed the feat/k8s_validate_pod_spec branch from 1d663e8 to 40b89f2 Compare October 24, 2025 13:16
@clumsy
Collaborator Author

clumsy commented Oct 24, 2025

Yes, weird: locally I do get these false positives (probably because I'm not installing the package in editable mode and am picking up an old version):

torchx/schedulers/test/kubernetes_scheduler_test.py:114:16 Undefined attribute [16]: `object` has no attribute `__getitem__`.
torchx/schedulers/test/kubernetes_scheduler_test.py:137:12 Undefined attribute [16]: `object` has no attribute `__getitem__`.
torchx/schedulers/test/kubernetes_scheduler_test.py:518:16 Undefined attribute [16]: `object` has no attribute `__getitem__`.
torchx/schedulers/test/kubernetes_scheduler_test.py:927:24 Undefined attribute [16]: `object` has no attribute `__getitem__`.

Rebased and removed these for the remote check to pass, @kiukchung

Let me know if you're interested in migrating to ty. Although it is experimental, it has worked quite solidly for me.

@kiukchung
Contributor

Unfortunately, we need to keep the type checker in sync with the one we use internally (which is pyre) :(

@kiukchung
Contributor

@clumsy I'm going to figure out a way to give you access to kick off the CI workflow so that you're not blocked on me to validate that CI passes.

@clumsy
Collaborator Author

clumsy commented Oct 24, 2025

As I was verifying a new version - I had an idea, @kiukchung.
Does it make sense to move this check to the torchx run --dryrun path? We do call create_namespaced_custom_object with dry_run=All, after all.

@clumsy
Collaborator Author

clumsy commented Oct 24, 2025

I'm not sure whether this would violate the dryrun contract in the strict sense: https://meta-pytorch.org/torchx/main/runner.html#torchx.runner.Runner.dryrun

For now I'll just fix tests, let me know your preference please, @kiukchung

@clumsy clumsy force-pushed the feat/k8s_validate_pod_spec branch from 40b89f2 to c1a6a40 Compare October 24, 2025 22:04
@clumsy
Collaborator Author

clumsy commented Oct 24, 2025

Done, this should pass the linter, pyre and tests now, @kiukchung
I think I still prefer an actual k8s dry_run: All (with a remote call) during actual submission (vs. torchx run --dryrun), as in the original proposal, but now enabled by default (per your suggestion). I think that is safe, since we make the exact same call later without the dry_run option anyway.
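
To illustrate the flow described here, a minimal sketch of "dry run first, then submit" follows. `RecordingApi` is a hypothetical stub that only mimics the call shape of the Kubernetes Python client's `create_namespaced_custom_object` and records calls; the `submit` helper and the group/version/plural values are placeholders, not the PR's actual code.

```python
from typing import Any, Dict, List


class RecordingApi:
    """Hypothetical stub with the same call shape as the Kubernetes client's
    create_namespaced_custom_object; it records calls instead of sending them."""

    def __init__(self) -> None:
        self.calls: List[Dict[str, Any]] = []

    def create_namespaced_custom_object(
        self,
        group: str,
        version: str,
        namespace: str,
        plural: str,
        body: Dict[str, Any],
        **kwargs: Any,
    ) -> Dict[str, Any]:
        self.calls.append({"body": body, **kwargs})
        return body


def submit(
    api: RecordingApi,
    namespace: str,
    body: Dict[str, Any],
    validate_spec: bool = True,
) -> None:
    # Placeholder group/version/plural values, chosen for illustration only.
    common: Dict[str, Any] = dict(
        group="example.com",
        version="v1",
        namespace=namespace,
        plural="jobs",
        body=body,
    )
    if validate_spec:
        # Server-side dry run: the API server validates the object
        # but does not persist it.
        api.create_namespaced_custom_object(**common, dry_run="All")
    # The real submission is the same call without dry_run.
    api.create_namespaced_custom_object(**common)
```

Because the validating call and the real call share the same arguments, an invalid spec should be rejected before anything is persisted, at the cost of one extra round trip.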

@kiukchung
Contributor

@clumsy just sent you an invite (check the email associated with your GitHub account). Once you accept, you should be able to kick off the CI workflow.

@clumsy
Collaborator Author

clumsy commented Oct 24, 2025

Thanks, @kiukchung - I can see all the controls now and will use them next time to finalize the change!

Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.

4 participants