Skip to content

Conversation

everpeace
Copy link

@everpeace everpeace commented Jan 17, 2025

Purpose of this PR

This PR implements Suspend/Resume operation on SparkApplication.

Proposed changes:

  • Add Suspend: true|false field in SparkApplicationSpec
  • State Transition:
    • Suspend: ** --Suspend: true--> SUSPENDING --> SUSPENDED (**=any non-Terminal Application State)
    • Resume: SUSPENDING | SUSPENDED --Suspend: false--> RESUMING --> SUBMITTED

Fixes #2290

Change Category

  • Bugfix (non-breaking change which fixes an issue)
  • Feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that could affect existing functionality)
  • Documentation update

Rationale

Supporting Suspend/Resume feature could make Kueue integration easier.

Checklist

  • I have conducted a self-review of my own code.
  • I have updated documentation accordingly.
  • I have added tests that prove my changes are effective or that my feature works.
  • Existing unit tests pass locally with my changes.

Additional Notes

@Kevinz857
Copy link
Contributor

@everpeace Maybe u can resolve this conflict so that this PR can continue to move forward?

we also think about use kueue with sparkoperator , depend this suspend state feature . It would be better if you could push further down ~

@everpeace
Copy link
Author

@Kevinz857 Hi, thanks for your comment!! I rebaed my PR.

@everpeace
Copy link
Author

we also think about use kueue with sparkoperator , depend this suspend state feature . It would be better if you could push further down ~

My original SparkApplication integration PR in Kueue(kubernetes-sigs/kueue#4032) was actually closed because

However, I already opened a PR for this which will be needed to support SparkApplication integration in Kueue: kubernetes-sigs/kueue#4102.

@ChenYi015
Copy link
Member

/assign @jacobsalway @vara-bonthu @ImpSy

@ChenYi015
Copy link
Member

@Kevinz857 Would you mind help reviewing the PR?

@Kevinz857
Copy link
Contributor

@Kevinz857 Would you mind help reviewing the PR?

@ChenYi015 Very happy, I will review this PR together in detail later.

@Kevinz857
Copy link
Contributor

@everpeace Suggestions for Improvement

  1. Controller logic abstraction
    Consider extracting the suspension check (if app.Spec.Suspend != nil && *app.Spec.Suspend) into a helper function like isSuspended(app) for clarity and potential reuse.
    Also, using a named constant for the event string ("SparkApplicationSuspended") would improve consistency and readability.

  2. Pod termination check reusability
    The logic that checks for Spark driver/executor pod termination before resuming could be encapsulated in a helper function (e.g. areSparkPodsTerminated(app)) to make the code more modular and easier to test.
    Optionally, adding retry limits or exponential backoff can help avoid indefinite retries in corner cases.

  3. Field documentation enhancement
    In sparkapplication_types.go, the Suspend field comment could be more descriptive, such as:

// Suspend indicates whether the SparkApplication should be suspended.
// When true, the controller skips submitting the Spark job.

  1. Test enhancement ideas

Add unit tests for the suspension logic to ensure the controller skips job submission when .spec.suspend is true.
Consider validating whether the correct Kubernetes events (Suspended/Resumed) are emitted using a fake event recorder.

@ChenYi015 If have time, we can take a look together to see if these suggestions are suitable

@everpeace Or if there are any better suggestions, we can discuss them together

Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from jacobsalway. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@everpeace
Copy link
Author

everpeace commented Apr 30, 2025

@Kevinz857 Thanks for your review.

Consider extracting the suspension check (if app.Spec.Suspend != nil && *app.Spec.Suspend) into a helper function like isSuspended(app) for clarity and potential reuse.

hmmm. SparkApplicationSpec.Suspend is not a pointer. So, I can't see any motivation for it.

Suspend bool `json:"suspend"`

Perhaps, you saw ScheduledSparkApplication's suspension check:

if scheduledApp.Spec.Suspend != nil && *scheduledApp.Spec.Suspend {
return ctrl.Result{}, nil
}

Could you elaborate on your suggestion if I missed something?

Also, using a named constant for the event string ("SparkApplicationSuspended") would improve consistency and readability.

hmmm. I think I already did.

EventSparkApplicationSuspending = "SparkApplicationSuspending"
EventSparkApplicationSuspended = "SparkApplicationSuspended"
EventSparkApplicationResuming = "SparkApplicationResuming"
)

Could you elaborate on your suggestion if I missed something?

Field documentation enhancement
In sparkapplication_types.go, the Suspend field comment could be more descriptive, such as:

Thanks. Fixed in b3f8e8c

Test enhancement ideas
Add unit tests for the suspension logic to ensure the controller skips job submission when .spec.suspend is true.

I think it's already covered:

When("reconciling a new SparkApplication with Suspend=True", func() {
BeforeEach(func() {
app.Spec.Suspend = true
Expect(k8sClient.Create(ctx, app)).To(Succeed())
})
AfterEach(func() {

Could you elaborate on your suggestion if I missed something?

Consider validating whether the correct Kubernetes events (Suspended/Resumed) are emitted using a fake event recorder.

IIUC, the current SparkApplication unit tests(internal/controller/sparkapplication/controller_test.go) do not assert emitted events at all. Although I fully agree your suggestion(we should assert emitted events in unit tests), I think the enhancement should be done in another PR so that all the unit tests assert emitted events. Or, should this PR cover event assertions only for suspend/resume in this PR?? If so, I would be happy to handle it.

Copy link

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@andreyvelich
Copy link
Member

/remove-lifecycle stale
@vara-bonthu @ChenYi015 @nabuskey I think, that would be a great feature in Spark Operator APIs to natively integrate it with Kueue!

cc @kannon92 @tenzen-y @mimowo

@mimowo
Copy link

mimowo commented Sep 5, 2025

@vara-bonthu @ChenYi015 @nabuskey I think, that would be a great feature in Spark Operator APIs to natively integrate it with Kueue!

Awesome effort! We had multiple people asking for the integration, so I'm happy to see it rolling.

cc @kannon92 @tenzen-y @mimowo

Overall LGTM, let me comment here on the API itself, and I may also leave some small-ish comments to the code.

hmmm. SparkApplicationSpec.Suspend is not a pointer. So, I can't see any motivation for it.

I don't remember / wasn't involved with the decision when introducing the field to core k8s as a ptr.

I found this question from @soltysh in the k8s design, but there was no explicit answer: kubernetes/enhancements#2234 (comment), just the confirmation of the decision to use ptr in the comment.

Without knowing exact details, I think this is a general practice in k8s APIs to introduce new fields as pointers. I think it is because of the client-go compatibility. So, old API clients may fail to parse the API output after the field is introduced, even if it is not used in the cluster. Users may want to use the old clients because technically the API remains the v1beta2 version.

Maybe @deads2k could validate that statement to keep me honest.

@mimowo
Copy link

mimowo commented Sep 5, 2025

@everpeace please rebase

…cationStates(Suspending/Suspended/Resuming)

Signed-off-by: Shingo Omura <[email protected]>
@everpeace everpeace force-pushed the feature/suspend branch 2 times, most recently from 40e300d to f6e439d Compare September 5, 2025 10:40

r.recordSparkApplicationEvent(app)

_ = r.submitSparkApplication(ctx, app)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't you need to fail on error here and retry?

If the handling of error is not necessary, it is probably worth adding a comment why, because it is far from obvious.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, I noticed this pattern is used in several places in the controller. Let me inspect this.

Copy link
Author

@everpeace everpeace Sep 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The returned error of r.submitSparkApplication was not used anywhere in the controller. Instead, the submission result(success/failure) was recorded in app.Status. So I just removed the returned error from the method. fixed in f4719c1

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, btw, my question was mostly exploratory, so if the repo maintainers prefer the old style, then I'm also fine with that.

@everpeace
Copy link
Author

Without knowing exact details, I think this is a general practice in k8s APIs to introduce new fields as pointers. I think it is because of the client-go compatibility. So, old API clients may fail to parse the API output after the field is introduced, even if it is not used in the cluster. Users may want to use the old clients because technically the API remains the v1beta2 version.

It makes sense. I changed suspend field as pointer in 43bcea2.

@everpeace
Copy link
Author

@mimowo all your comments are addressed. PTAL 🙇

@mimowo
Copy link

mimowo commented Sep 5, 2025

LGTM, thank you 👍

@soltysh
Copy link

soltysh commented Sep 5, 2025

I found this question from @soltysh in the k8s design, but there was no explicit answer: kubernetes/enhancements#2234 (comment), just the confirmation of the decision to use ptr in the comment.

Without knowing exact details, I think this is a general practice in k8s APIs to introduce new fields as pointers. I think it is because of the client-go compatibility. So, old API clients may fail to parse the API output after the field is introduced, even if it is not used in the cluster. Users may want to use the old clients because technically the API remains the v1beta2 version.

Maybe @deads2k could validate that statement to keep me honest.

Just to clarify, since my name was called out the reason for going with *bool is two folded:

  1. The ability to differentiate from the value being set and not.
  2. Most importantly, the k8s API guidelines force all optional fields to be pointers.

Hope that helps.

@mimowo
Copy link

mimowo commented Sep 11, 2025

LGTM, @andreyvelich what are the next steps for merging / releasing this enhancement?

@andreyvelich
Copy link
Member

/assign @nabuskey @vara-bonthu @ChenYi015 @yuchaoran2011 Please can you help with reviewing this PR?

We see a lot of interest of native integration between Kueue and SparkApplication CRD.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add support for suspend
9 participants