Skip to content

Conversation

@clumsy
Copy link
Contributor

@clumsy clumsy commented Oct 17, 2025

When job is not found we will show State=UNKNOWN and surface the error in Msg

Test plan:
[x] added unit test
[x] torchx status sfai_kubernetes://torchx/real-namespace:bogus-job-id

AppStatus:
    State: UNKNOWN
    Num Restarts: -1
    Roles: 
    Msg: Not Found: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"jobs.batch.volcano.sh \"bogus-jobs-id\" not found","reason":"NotFound","details":{"name":"bogus-job-id","group":"batch.volcano.sh","kind":"jobs"},"code":404}

    Structured Error Msg: <NONE>
    UI URL: None

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Oct 17, 2025
@clumsy
Copy link
Contributor Author

clumsy commented Oct 17, 2025

For your consideration @kiukchung @d4l3k
Looks like we might need the same for other APIs

@clumsy
Copy link
Contributor Author

clumsy commented Oct 17, 2025

setup_pyre.sh lacked +x so I added it here, don't think it justifies a full PR

@kiukchung
Copy link
Contributor

when the job is not found describe API should return None no?

@clumsy
Copy link
Contributor Author

clumsy commented Oct 20, 2025

Do you mean State: None? What would you like it to show @kiukchung ? I'm happy to update my fix as long as it's not an exception.

@kiukchung
Copy link
Contributor

Do you mean State: None? What would you like it to show @kiukchung ? I'm happy to update my fix as long as it's not an exception.

I mean that if volcano raises an exception when the job does not exist, then torchx's describe() API should return None as per its documentation: https://meta-pytorch.org/torchx/latest/runner.html

describe(app_handle: str)) → Optional[AppDef]
  """
   Reconstructs the application (to the best extent) given the app handle. Note that the reconstructed application may not be the complete app as it was submitted via the run API. How much of the app can be reconstructed is scheduler dependent.

    Returns:

        AppDef or None if the app does not exist anymore or if the scheduler does not support describing the app handle
  """

@codecov-commenter
Copy link

codecov-commenter commented Oct 23, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 90.27%. Comparing base (79e14da) to head (61107b6).
⚠️ Report is 4 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1151      +/-   ##
==========================================
- Coverage   91.65%   90.27%   -1.39%     
==========================================
  Files          83       83              
  Lines        6483     6599     +116     
==========================================
+ Hits         5942     5957      +15     
- Misses        541      642     +101     
Flag Coverage Δ
unittests 90.27% <100.00%> (-1.39%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@clumsy
Copy link
Contributor Author

clumsy commented Oct 23, 2025

Oh, that's a good point, @kiukchung

Let me update this PR then!

@clumsy clumsy force-pushed the fix/k8s_job_not_found branch from bad0ca7 to 09d550e Compare October 23, 2025 16:41
@clumsy
Copy link
Contributor Author

clumsy commented Oct 23, 2025

This looks perfect, @kiukchung!
torchx status kubernetes://torchx/real-namespace:bla
produces

torchx 2025-10-23 16:39:56 ERROR    AppDef: real-namespace:bla, does not exist or has been removed from kubernetes's data plane

@kiukchung
Copy link
Contributor

This looks perfect, @kiukchung! torchx status kubernetes://torchx/real-namespace:bla produces

torchx 2025-10-23 16:39:56 ERROR    AppDef: real-namespace:bla, does not exist or has been removed from kubernetes's data plane

cool! nit: can we log the caught exception for debugging purposes?

try:
  ...
except Exception as e:
   # is there a narrower exception we can catch? or make sure this is actually a "app not found"
   if not is_app_not_found_error(e):
     logger.exception("error when querying kubernetes/volcano for the job's status")
   return None

or better just let other errors throw

try:
  ...
except Exception as e:
   if is_app_not_found_error(e):
     return None
   raise # for any other exceptione, re-raise

@clumsy clumsy force-pushed the fix/k8s_job_not_found branch from 09d550e to 61107b6 Compare October 23, 2025 17:48
@clumsy
Copy link
Contributor Author

clumsy commented Oct 23, 2025

Done, @kiukchung

I agree it's cleaner

@kiukchung
Copy link
Contributor

@clumsy can you rebase on top of main?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants