# feat: add rest api server proposal #2517
aagumin wants to merge 3 commits into `kubeflow:master` from `aagumin:proposals/000-rest-submit`.
The diff adds two files: a PlantUML sequence diagram of the submission flow, and the KEP document itself.

```plantuml
@startuml
actor User
participant "REST Submit Server" as REST
participant "Kubernetes API" as K8s
participant "Spark Operator" as Operator

User -> REST: POST /sparkapplications (CRD as JSON)
REST -> K8s: CreateNamespacedCustomObject (client-go)
K8s <--> Operator: Existing behavior
K8s --> REST: Event/Status
REST --> User: Response (200 OK / Details)
@enduml
```
# KEP-XXXX: Add REST API Support for SparkApplication CRD

<!--
A lightweight REST API proxy for SparkApplication CRDs in the Spark Operator.
-->

## Summary

Expose a RESTful HTTP interface alongside the Spark Operator to streamline the creation, retrieval, update, listing, and deletion of `SparkApplication` custom resources. By bundling a minimal Go-based HTTP server that proxies JSON payloads directly to the Kubernetes API (using `client-go`), users and external systems (CI/CD pipelines, web UIs, custom dashboards) can manage Spark jobs without requiring `kubectl` or deep Kubernetes expertise.
## Motivation

Currently, submitting Spark jobs via the Spark Operator demands crafting and applying Kubernetes manifests with `kubectl` or invoking client libraries. This creates friction for non-Kubernetes-native workflows and requires boilerplate integration code in external tools.

### Goals

- Provide HTTP endpoints for CRUD operations on `SparkApplication` CRs.
- Allow cluster administrators to configure and integrate the authentication and authorization mechanisms of their choice.
- Package the REST proxy as a container alongside the Spark Operator in Helm charts or manifests.
- Ensure minimal resource overhead and operational complexity.
### Non-Goals

- Replacing general-purpose CLI tools like `kubectl` for arbitrary resources.
- Implementing extensive admission logic or API aggregation capabilities beyond basic proxying.
- Managing non-Spark CRDs or core Kubernetes objects in this phase.
## Proposal

Deploy a companion HTTP server with the Spark Operator that:

1. **Listens** on a configurable port (default 8080) inside the same pod or as a sidecar.
2. **Maps HTTP routes** to Kubernetes operations using `client-go`, operating only within a configured namespace scope (see the handler sketch after this list):
   - `POST /sparkapplications` → Create
   - `GET /sparkapplications/{namespace}/{name}` → Get
   - `PUT /sparkapplications/{namespace}/{name}` → Update
   - `DELETE /sparkapplications/{namespace}/{name}` → Delete
   - `GET /sparkapplications?namespace={ns}` → List
3. **Accepts and returns** only JSON representations of the CRD, ensuring that manifests applied via `kubectl` or submitted via this REST API behave identically.
4. **Leverages in-cluster config** for authentication, mounting a namespaced ServiceAccount token bound to a `Role` (or `ClusterRole`) granting access to `sparkapplications.sparkoperator.k8s.io` within that namespace.
5. **Supports TLS termination** via mounted certificates (cert-manager or manual).
6. **Emits** structured logs and exposes Prometheus metrics for request counts and latencies.
7. *(image attachment)*
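The route-to-API mapping in item 2 could look like the following minimal sketch, assuming the `client-go` dynamic client and the `sparkoperator.k8s.io/v1beta2` API group. The handler name and structure are illustrative, not part of the proposal text; the wiring sketch under Design Details completes this into a runnable program.

```go
package main

import (
	"encoding/json"
	"net/http"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
)

// GroupVersionResource of the SparkApplication CRD served by the Spark Operator.
var sparkAppGVR = schema.GroupVersionResource{
	Group:    "sparkoperator.k8s.io",
	Version:  "v1beta2",
	Resource: "sparkapplications",
}

// createHandler maps POST /sparkapplications to a Create call on the dynamic
// client, proxying the JSON body through unchanged so that the CRD's OpenAPI
// schema still performs all validation and defaulting.
func createHandler(client dynamic.Interface, namespace string) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		obj := &unstructured.Unstructured{}
		if err := json.NewDecoder(r.Body).Decode(&obj.Object); err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		created, err := client.Resource(sparkAppGVR).Namespace(namespace).
			Create(r.Context(), obj, metav1.CreateOptions{})
		if err != nil {
			// A real implementation would map Kubernetes API errors
			// (409 AlreadyExists, 422 Invalid, ...) onto HTTP statuses.
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode(created.Object)
	}
}
```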
### User Stories (Optional)

#### Story 1

As a data engineer, I want to submit Spark jobs by sending a single HTTP request from my CI pipeline, so I don’t need to install or configure `kubectl` on my build agents. An illustrative request follows.
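For illustration only, that single request from a CI step might look like the sketch below; the proxy URL, the manifest file name, and the absence of auth headers are assumptions, since authentication is left to the cluster administrator.

```go
package main

import (
	"bytes"
	"fmt"
	"log"
	"net/http"
	"os"
)

func main() {
	// A SparkApplication manifest serialized as JSON, identical in content
	// to what kubectl would apply as YAML.
	manifest, err := os.ReadFile("spark-pi.json")
	if err != nil {
		log.Fatal(err)
	}

	// Hypothetical proxy endpoint; TLS setup and auth headers depend on
	// how the administrator exposes the server.
	resp, err := http.Post("https://spark-proxy.example.com/sparkapplications",
		"application/json", bytes.NewReader(manifest))
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	fmt.Println("submission status:", resp.Status)
}
```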
#### Story 2

As a platform operator, I want to integrate Spark job submission into our internal web portal using REST calls, so that users can launch jobs without learning Kubernetes details.

#### Story 3

As a user without Kubernetes expertise, I want to use a familiar HTTP API to submit Spark jobs, so I don’t need direct cluster access or knowledge of `kubectl` commands.
### Notes/Constraints/Caveats (Optional)

- This proxy does not implement Kubernetes API aggregation; it is a user-space proxy translating HTTP to Kubernetes API calls.
- All CRD validation and defaulting is still handled by the CRD’s OpenAPI schema and the Spark Operator admission logic.
- TLS and authentication configurations must be explicitly managed by the cluster administrator.
### Risks and Mitigations

| Risk | Mitigation |
|------|------------|
| Exposed HTTP endpoint could be abused | Enforce RBAC, require ServiceAccount tokens, support TLS. |
| Additional component to maintain | Keep proxy logic minimal, reuse `client-go`, align with Operator releases. |
| Single point of failure for submissions | Deploy as a sidecar or run multiple replicas for HA. |
## Design Details

- **Server implementation**: Go HTTP server using Gorilla Mux or the standard `net/http` package, calling the Kubernetes API via `client-go`; a minimal wiring sketch follows this list.
- **Deployment**: Update the Spark Operator Helm chart to include a new Deployment (or sidecar) for the REST proxy, with ServiceAccount and RBAC definitions limited to a namespace.
- **Configuration**: Helm values for port, TLS cert paths, namespace scope filter, and resource limits.
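A minimal wiring sketch for the points above, assuming in-cluster config and the standard `net/http` mux. The flag names are illustrative defaults rather than settled configuration, and `createHandler` is the sketch from the Proposal section.

```go
package main

import (
	"flag"
	"log"
	"net/http"

	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/rest"
)

func main() {
	addr := flag.String("addr", ":8080", "listen address")
	certFile := flag.String("tls-cert", "", "path to TLS certificate (optional)")
	keyFile := flag.String("tls-key", "", "path to TLS private key (optional)")
	namespace := flag.String("namespace", "default", "namespace scope for SparkApplications")
	flag.Parse()

	// Authenticate with the mounted ServiceAccount token via in-cluster config;
	// RBAC on that ServiceAccount bounds what the proxy is allowed to do.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatalf("in-cluster config: %v", err)
	}
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		log.Fatalf("dynamic client: %v", err)
	}

	mux := http.NewServeMux()
	// Create/List live on the collection path; Get/Update/Delete handlers for
	// /sparkapplications/{namespace}/{name} are omitted from this sketch.
	mux.Handle("/sparkapplications", createHandler(client, *namespace))

	// Serve TLS when certificates are mounted, plain HTTP otherwise.
	if *certFile != "" && *keyFile != "" {
		log.Fatal(http.ListenAndServeTLS(*addr, *certFile, *keyFile, mux))
	}
	log.Fatal(http.ListenAndServe(*addr, mux))
}
```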
### Test Plan

- **Unit tests**: Mock `client-go` interactions to verify request-to-API mappings and error handling; a sketch follows this list.
- **Integration tests**: Deploy in a test cluster; execute CRUD operations via HTTP and assert correct CRD states.
- **E2E tests**: Use the existing Spark Operator E2E framework to submit jobs via the proxy and verify job completion.
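As an example of the unit-test style, here is a sketch using `client-go`'s fake dynamic client and `httptest`, exercising the hypothetical `createHandler` and `sparkAppGVR` from the earlier sketches.

```go
package main

import (
	"net/http"
	"net/http/httptest"
	"strings"
	"testing"

	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/apimachinery/pkg/runtime/schema"
	dynamicfake "k8s.io/client-go/dynamic/fake"
)

// TestCreateSparkApplication verifies that POST /sparkapplications proxies
// the JSON body into a Create call against the (fake) Kubernetes API.
func TestCreateSparkApplication(t *testing.T) {
	// Register the CRD's list kind so the fake tracker can handle the GVR.
	scheme := runtime.NewScheme()
	client := dynamicfake.NewSimpleDynamicClientWithCustomListKinds(scheme,
		map[schema.GroupVersionResource]string{sparkAppGVR: "SparkApplicationList"})

	body := `{
		"apiVersion": "sparkoperator.k8s.io/v1beta2",
		"kind": "SparkApplication",
		"metadata": {"name": "spark-pi", "namespace": "default"}
	}`
	req := httptest.NewRequest(http.MethodPost, "/sparkapplications", strings.NewReader(body))
	rec := httptest.NewRecorder()

	createHandler(client, "default").ServeHTTP(rec, req)

	if rec.Code != http.StatusOK {
		t.Fatalf("expected 200, got %d: %s", rec.Code, rec.Body.String())
	}
}
```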
## Graduation Criteria

- Alpha: Basic CRUD endpoints implemented, tested in one real cluster, enabled by a feature flag in Helm.
- Beta: Metrics and documentation completed; rolling upgrades tested.
- Stable: No feature flag; production-grade documentation and test coverage ≥ 90%; promoted in Spark Operator release notes.
## Implementation History

- 2025-04-27: KEP created (provisional).
## Drawbacks

- Introduces an extra deployment and potential attack surface.
- May duplicate future Kubernetes API aggregation capabilities.
- Slight increase in operational complexity for cluster administrators.
## Alternatives

- **Standalone Spark cluster**: Deploy Spark in standalone mode, which natively includes a REST submission server, eliminating the need for an additional proxy component and leveraging Spark’s built-in submission API.
---

Thank you for preparing this proposal @aagumin!

For the data engineers and ML engineers who would like to work with PySpark and interact with the Spark Operator, but don’t want to learn Kubernetes or CRDs, can’t we integrate with the Kubeflow SDK?

Kubeflow SDK KEP: https://docs.google.com/document/d/1rX7ELAHRb_lvh0Y7BK1HBYAbA0zi9enB0F_358ZC58w/edit?tab=t.0
Repository: https://github.com/kubeflow/sdk

As we discussed in the proposal, we can create a dedicated `SparkClient()` for CRUD operations, so users can quickly create their Spark applications and orchestrate them with the Spark Operator without learning Kubernetes. For example, folks are already working on it as part of this work: #2422

It is a great topic to discuss at the next Spark Operator call: https://bit.ly/3VGzP4n

Would love to hear your feedback @aagumin @Shekharrajak @lresende @shravan-achar @akshaychitneni @vara-bonthu @yuchaoran2011 @bigsur0 @jacobsalway @ChenYi015 @franciscojavierarceo @astefanutti @rimolive!
---

Thank you! This is pretty similar to what we thought, and we are trying to use existing solutions.

Some related work and discussions:

- #2422
- https://www.kubeflow.org/docs/components/spark-operator/user-guide/notebooks-spark-operator/

We need to think in terms of Spark application lifecycle management, distributing workloads across clusters, and debugging and maintenance of long-running jobs.
---

@andreyvelich Thank you for your feedback!

Developing an SDK is the right solution, but not a very fast one. For example, a Spark client has already been mostly implemented via the Airflow operator; it is capable of retrieving logs and handling errors and statuses. Also, management via Jupyter cannot be considered the only right way to interact, since there are DE workloads (long batch jobs and Spark streaming) that are managed via pipeline orchestrators (Prefect, Airflow, Flyte, etc.). REST is a universal access interface and is understood by the majority of the IT community.
---

We already have the Kubeflow SDK, the goal of which is to cover all user-facing Kubeflow APIs: https://github.com/kubeflow/sdk.

I know that @Shekharrajak has some thoughts on how we can handle CRUD of the `SparkApplication` CRD as part of `SparkClient()` in that SDK: https://docs.google.com/document/d/1l57bBlpxrW4gLgAGnoq9Bg7Shre7Cglv4OLCox7ER_s/edit?tab=t.0

We can see two ways to integrate Spark with Jupyter. Once we add support for the second, I believe it would be very simple to integrate with orchestrators like Airflow, KFP, or any other, since the Kubeflow SDK is not limited to only Jupyter; it is just an abstraction layer on top of the `SparkApplication` CRD.

That might be a good topic to discuss in one of our upcoming ML Experience WG calls.

I think we should discuss what the benefits of an additional REST API server are when we already have the Kubernetes API server. Platform admins should be able to simply understand the SparkApplication APIs to build their internal services on top of the SparkApplication CRD.
---

Okay, I understand the community’s position now. Thank you for your feedback! I think we can close this PR.
---

@nabuskey @vara-bonthu @jacobsalway @yuchaoran2011 @ImpSy Any thoughts on the above points?
---

@andreyvelich sorry!

I have another question. Are there any plans for clients in other languages? I know many companies that use Java for writing ML models. Or, for example, Golang? Many companies build infrastructure management services in Go. With a REST interface, it would be possible to generate clients for different languages.
---

It is a good question. We discussed it in the proposal design: https://docs.google.com/document/d/1rX7ELAHRb_lvh0Y7BK1HBYAbA0zi9enB0F_358ZC58w/edit?tab=t.0#heading=h.nu1vasakccpy

If we see strong user interest and active contributors, we can discuss potential support for other languages (e.g., Java).