add DRAKWOKDriver dependency to install kwok and dra resources for simulated scale testing #3491

Open. Wants to merge 1 commit into base branch master.

alaypatel07 (Contributor) commented Aug 6, 2025

What type of PR is this?

/kind feature

What this PR does / why we need it:

This PR adds a DRAKWOKDriver as a dependency. This dependency will:

  1. install kwok
  2. install the KWOK stage CRs to reconcile nodes, jobs, pods
  3. create fake DeviceClass and ResourceSlices

Once this is done, the test can start creating workloads for the scale test.
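
For reviewers who want a mental model of the ordering above, here is a minimal Go sketch. It is not the code in this PR and does not use CL2's dependency interface: the `step` type, `runSetup`, `runTeardown`, and `tracedStep` are hypothetical names made up for illustration. Tearing down in reverse order is an assumption, not something the description states.

```go
package main

import (
	"errors"
	"fmt"
)

// step is a hypothetical unit of dependency work; it is not a CL2 type.
type step struct {
	name     string
	setup    func() error
	teardown func() error
}

// runSetup executes steps in order and stops at the first failure.
// It returns the steps that completed so the caller can undo them.
func runSetup(steps []step) ([]step, error) {
	done := make([]step, 0, len(steps))
	for _, s := range steps {
		if err := s.setup(); err != nil {
			return done, fmt.Errorf("setup %q: %w", s.name, err)
		}
		done = append(done, s)
	}
	return done, nil
}

// runTeardown undoes completed steps in reverse order (an assumption)
// and keeps going on failure, returning all errors joined together.
func runTeardown(done []step) error {
	var errs []error
	for i := len(done) - 1; i >= 0; i-- {
		if err := done[i].teardown(); err != nil {
			errs = append(errs, fmt.Errorf("teardown %q: %w", done[i].name, err))
		}
	}
	return errors.Join(errs...)
}

// tracedStep builds a step that only records what it would have done.
func tracedStep(name string, trace *[]string) step {
	return step{
		name: name,
		setup: func() error {
			*trace = append(*trace, "setup "+name)
			return nil
		},
		teardown: func() error {
			*trace = append(*trace, "teardown "+name)
			return nil
		},
	}
}

func main() {
	var trace []string
	steps := []step{
		tracedStep("install kwok", &trace),
		tracedStep("install stage CRs", &trace),
		tracedStep("create DeviceClass and ResourceSlices", &trace),
	}
	done, err := runSetup(steps)
	if err != nil {
		fmt.Println("setup failed:", err)
	}
	// The scale test workloads would be created here.
	if err := runTeardown(done); err != nil {
		fmt.Println("teardown failed:", err)
	}
	for _, line := range trace {
		fmt.Println(line)
	}
}
```

Returning the completed steps from `runSetup` means a partial failure can still be cleaned up without leaving resources behind.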

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

@k8s-ci-robot
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Aug 6, 2025
@alaypatel07 alaypatel07 changed the title add kwok/dra dependency to install kwok and dra resources for simulat… WIP: add kwok/dra dependency to install kwok and dra resources for simulat… Aug 6, 2025
@k8s-ci-robot k8s-ci-robot requested review from mborsz and wojtek-t August 6, 2025 22:12
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: alaypatel07
Once this PR has been reviewed and has the lgtm label, please assign mborsz for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Aug 6, 2025
@alaypatel07 alaypatel07 changed the title WIP: add kwok/dra dependency to install kwok and dra resources for simulat… WIP: add kwok/dra dependency to install kwok and dra resources for simulated scale testing Aug 6, 2025
@alaypatel07 alaypatel07 force-pushed the dra-kwok-dependency branch 2 times, most recently from 7456406 to 9fb09f5 Compare August 7, 2025 11:28
@alaypatel07
Contributor Author

alaypatel07 commented Aug 7, 2025

Ran into a lot of issues making this work:

  1. The helm template command generates YAML files that contain embedded templates for kwok. The CL2 template parser treated them as CL2 templates, so all of them had to be escaped. Cursor with claude-4-sonnet helped a lot in removing these errors.
  2. Two templates, kwok.yaml and deployment.yaml, were skipping namespaces when parsed by the CL2 template parser, so I had to implement a custom parser to work around this error. Edit: debugged.
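
To illustrate the escaping problem from item 1, here is a minimal sketch, assuming the CL2 parser behaves like Go's text/template with the default delimiters. It is not the escaping code from this PR: `escapeInner` and `renderOuter` are hypothetical helpers, and the `.metadata.name` and `.Namespace` fields are made-up placeholders. The idea is to rewrite each inner delimiter as an action that prints the delimiter literally, so the outer pass emits the kwok template unchanged.

```go
package main

import (
	"bytes"
	"fmt"
	"strings"
	"text/template"
)

// escapeInner rewrites every "{{" and "}}" in src as a template action
// that emits the delimiter as a literal string, so an outer text/template
// pass leaves the inner template text untouched.
func escapeInner(src string) string {
	return strings.NewReplacer(
		"{{", `{{"{{"}}`,
		"}}", `{{"}}"}}`,
	).Replace(src)
}

// renderOuter parses and executes src with Go's text/template.
func renderOuter(src string, data any) (string, error) {
	tmpl, err := template.New("outer").Parse(src)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := tmpl.Execute(&buf, data); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	// The first line is meant for the outer pass; the second must survive it.
	outer := "namespace: {{.Namespace}}\n"
	inner := "hostname: {{ .metadata.name }}"

	out, err := renderOuter(outer+escapeInner(inner), map[string]any{
		"Namespace": "kwok-system",
	})
	if err != nil {
		fmt.Println("render failed:", err)
		return
	}
	fmt.Println(out)
}
```

The outer pass substitutes `.Namespace` and leaves `{{ .metadata.name }}` in the output for the inner consumer to evaluate later.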

TODO:

  1. The test is leaving behind resources and needs to clean them up. Edit: done.

But apart from all of this, following README.md and trying the examples directory works:

# ./run-e2e.sh cluster-loader2   --testconfig=pkg/dependency/kwok/examples/test-config.yaml   --provider=kind   --nodes=1   --report-dir=/tmp/kwok-test

I0807 07:29:12.286077 2105173 kwok.go:178] kwok-dra: waiting for 3 nodes to be ready
I0807 07:29:12.287199 2105173 kwok.go:183] kwok-dra: KWOK controller installed successfully
I0807 07:29:12.287220 2105173 simple_test_executor.go:82] All dependencies setup successfully
I0807 07:29:12.287228 2105173 simple_test_executor.go:162] Step "[step: 01] Create GPU ResourceClaimTemplate" started
I0807 07:29:12.488118 2105173 simple_test_executor.go:183] Step "[step: 01] Create GPU ResourceClaimTemplate" ended
I0807 07:29:12.488165 2105173 simple_test_executor.go:162] Step "[step: 02] Create KWOK GPU jobs" started
I0807 07:29:14.494262 2105173 simple_test_executor.go:183] Step "[step: 02] Create KWOK GPU jobs" ended
I0807 07:29:29.519537 2105173 simple_test_executor.go:411] Resources cleanup time: 15.02523962s
I0807 07:29:29.519553 2105173 simple_test_executor.go:414] Tearing down dependencies
I0807 07:29:29.520029 2105173 kwok.go:188] kwok-dra: tearing down KWOK controller
I0807 07:29:44.540710 2105173 kwok.go:198] kwok-dra: KWOK controller removed
I0807 07:29:44.540728 2105173 simple_test_executor.go:418] All dependencies torn down successfully, cleanup time: 15.021173561s
I0807 07:29:44.540738 2105173 clusterloader.go:254] --------------------------------------------------------------------------------
I0807 07:29:44.540743 2105173 clusterloader.go:255] Test Finished
I0807 07:29:44.540747 2105173 clusterloader.go:256]   Test: pkg/dependency/kwok/examples/test-config.yaml
I0807 07:29:44.540752 2105173 clusterloader.go:257]   Status: Success
I0807 07:29:44.540757 2105173 clusterloader.go:261] --------------------------------------------------------------------------------

JUnit report was created: /tmp/kwok-test/junit.xml
I0807 07:29:44.540869 2105173 prometheus.go:331] Get snapshot from Prometheus
I0807 07:29:44.540876 2105173 exec_service.go:130] Exec service: tearing down service

@alaypatel07 alaypatel07 force-pushed the dra-kwok-dependency branch from 9fb09f5 to 21adf6e Compare August 7, 2025 16:30
@alaypatel07 alaypatel07 marked this pull request as ready for review August 7, 2025 16:31
@alaypatel07 alaypatel07 force-pushed the dra-kwok-dependency branch from 3d5efe8 to 30224a9 Compare August 7, 2025 16:37
@alaypatel07 alaypatel07 changed the title WIP: add kwok/dra dependency to install kwok and dra resources for simulated scale testing add DRAKWOKDriver dependency to install kwok and dra resources for simulated scale testing Aug 7, 2025
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 7, 2025
@alaypatel07 alaypatel07 force-pushed the dra-kwok-dependency branch from 30224a9 to 2a4a458 Compare August 7, 2025 16:39
@alaypatel07 alaypatel07 force-pushed the dra-kwok-dependency branch from 2a4a458 to fb7f278 Compare August 7, 2025 16:43
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.