Commit 2a4a458

add kwok/dra dependency to install kwok and dra resources for simulate scale testing
1 parent 48b6c15 commit 2a4a458

31 files changed, +3664 -0 lines changed

clusterloader2/cmd/clusterloader.go

Lines changed: 1 addition & 0 deletions
@@ -44,6 +44,7 @@ import (
 	"k8s.io/perf-tests/clusterloader2/pkg/util"
 
 	_ "k8s.io/perf-tests/clusterloader2/pkg/dependency/dra"
+	_ "k8s.io/perf-tests/clusterloader2/pkg/dependency/kwok/dra"
 	_ "k8s.io/perf-tests/clusterloader2/pkg/measurement/common"
 	_ "k8s.io/perf-tests/clusterloader2/pkg/measurement/common/bundle"
 	_ "k8s.io/perf-tests/clusterloader2/pkg/measurement/common/dns"
Lines changed: 197 additions & 0 deletions
@@ -0,0 +1,197 @@
# KWOK DRA Dependency

This dependency provides fake Kubernetes nodes with Dynamic Resource Allocation (DRA) GPU resources using [KWOK (Kubernetes WithOut Kubelet)](https://kwok.sigs.k8s.io/).

## What it does

- Installs KWOK controller in `kwok-system` namespace
- Creates fake nodes with GPU resources exposed through DRA ResourceSlices
- Enables testing DRA workloads without real GPU hardware

## Configuration

Add the dependency to your ClusterLoader2 test configuration:

```yaml
# In your CL2 test config
dependencyConfigs:
- name: kwok-dra
  params:
    nodes: 4          # Number of fake nodes (default: 2)
    gpusPerNode: 16   # GPUs per node (default: 8)
    timeout: "10m"    # Setup timeout (default: 5m)
```
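
As a quick sanity check on these parameters, the expected footprint follows from the node naming (`kwok-node-0`, `kwok-node-1`, ...) and the per-node GPU count described in this README. The sketch below is a minimal illustration only: `kwok_expected` is a hypothetical helper, not something the dependency ships.

```shell
# Hypothetical helper (not part of the dependency): prints the node names
# and the total fake GPU count implied by `nodes` and `gpusPerNode`.
# The fallbacks mirror the documented defaults (2 nodes, 8 GPUs per node).
kwok_expected() (
  nodes="${1:-2}"
  gpus_per_node="${2:-8}"
  i=0
  while [ "$i" -lt "$nodes" ]; do
    echo "kwok-node-$i"
    i=$((i + 1))
  done
  echo "total-gpus=$((nodes * gpus_per_node))"
)

# The configuration above: 4 nodes with 16 GPUs each.
kwok_expected 4 16
```

For the configuration shown, this lists four node names followed by `total-gpus=64`, which is the upper bound on single-GPU claims that can be allocated at once.
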
24+
25+
## Fake Resources Created
26+
27+
### Nodes
28+
- **Name**: `kwok-node-0`, `kwok-node-1`, etc.
29+
- **Resources**: 32 CPU, 256Gi memory, 110 pods
30+
- **Labels**: `type=kwok`, `kubernetes.io/hostname=kwok-node-N`
31+
- **Taints**: `kwok.x-k8s.io/node=fake:NoSchedule` (prevents real workloads)
32+
33+
### GPU Resources (DRA)
34+
- **Driver**: `cl2-gpu.kwok.x-k8s.io`
35+
- **API Version**: `resource.k8s.io/v1beta2`
36+
- **ResourceSlices**: One per node with configurable GPU devices
37+
- **Device Names**: `gpu0`, `gpu1`, `gpu2`, etc.
38+
- **Capacity**: Each device provides `1` GPU unit (`cl2-gpu.kwok.x-k8s.io/gpu`)
39+
- **Device Attributes**:
40+
- `gpu-type`: "kwok-gpu"
41+
- `memory`: "8Gi"
42+
- `compute-capability`: "7.5"
43+
44+
## Example: Simple GPU Job
45+
46+
Create a job that requests fake GPUs using the provided example files:
47+
48+
### 1. GPU Job Template
49+
See [`examples/kwok-gpu-job.yaml`](examples/kwok-gpu-job.yaml) - A ClusterLoader2 job template that:
50+
- Uses the `job-type: short-lived` labels for KWOK completion simulation
51+
- Includes proper tolerations for KWOK fake nodes
52+
- References the GPU ResourceClaimTemplate
53+
- Uses templating variables (`{{.Name}}`, `{{.Replicas}}`, etc.)
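
The example file is not reproduced here. As a rough sketch of what such a Job needs, the shell function below prints a plain `batch/v1` Job (not the CL2 template) carrying the label, toleration, and claim reference listed above. The helper name `render_kwok_gpu_job`, the `pause` image, and the default claim template name `kwok-gpu-claim` are illustrative assumptions, not taken from the example file.

```shell
# Illustrative only: render_kwok_gpu_job is a hypothetical helper.
# $1 = Job name, $2 = ResourceClaimTemplate name (assumed default below).
render_kwok_gpu_job() {
  cat <<EOF
apiVersion: batch/v1
kind: Job
metadata:
  name: ${1}
spec:
  template:
    metadata:
      labels:
        job-type: short-lived      # label used for KWOK completion simulation
    spec:
      restartPolicy: Never
      tolerations:
      - key: kwok.x-k8s.io/node    # matches the taint on the fake nodes
        operator: Equal
        value: fake
        effect: NoSchedule
      containers:
      - name: main
        image: registry.k8s.io/pause:3.9   # assumption: placeholder image
        resources:
          claims:
          - name: gpu
      resourceClaims:
      - name: gpu
        resourceClaimTemplateName: ${2:-kwok-gpu-claim}
EOF
}

render_kwok_gpu_job demo-gpu-job
```

To try it against a cluster, the output can be piped to `kubectl apply -f -` once a ResourceClaimTemplate with the referenced name exists in the same namespace.
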

### 2. ResourceClaimTemplate
See [`examples/kwok-gpu-resource-claim-template.yaml`](examples/kwok-gpu-resource-claim-template.yaml), which:
- Defines a `v1beta2` ResourceClaimTemplate for requesting GPU devices
- References the `cl2-gpu.kwok.x-k8s.io` DeviceClass
- Must be created in a separate step before any job that references it
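
For orientation, a ResourceClaimTemplate of this kind might look like the output of the sketch below. It assumes the `resource.k8s.io/v1beta2` request shape, where a request names its DeviceClass under `exactly`. The helper name `render_claim_template` and the template name are hypothetical, and the shipped example file may differ.

```shell
# Illustrative only: render_claim_template is a hypothetical helper.
# $1 = template name, $2 = number of devices to request (default 1).
render_claim_template() {
  cat <<EOF
apiVersion: resource.k8s.io/v1beta2
kind: ResourceClaimTemplate
metadata:
  name: ${1}
spec:
  spec:
    devices:
      requests:
      - name: gpu
        exactly:
          deviceClassName: cl2-gpu.kwok.x-k8s.io
          allocationMode: ExactCount
          count: ${2:-1}
EOF
}

render_claim_template kwok-gpu-claim 1
```

Piping the output to `kubectl apply -f -` before creating any jobs mirrors the step ordering the test config relies on.
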

## Running Tests with KWOK DRA

### Using the Main E2E Script

The `run-e2e.sh` script is the main entry point for running performance tests in the perf-tests repository.

```bash
# Basic usage from perf-tests root directory
./run-e2e.sh <tool-name> [options...]

# Run a ClusterLoader2 test with KWOK DRA dependency
./run-e2e.sh cluster-loader2 \
  --testconfig=pkg/dependency/kwok/examples/test-config.yaml \
  --provider=skeleton \
  --nodes=3 \
  --report-dir=/tmp/reports

# Quick test with different node counts
./run-e2e.sh cluster-loader2 \
  --testconfig=pkg/dependency/kwok/examples/test-config.yaml \
  --provider=skeleton \
  --nodes=5 \
  --report-dir=/tmp/reports

# View available test tools
./run-e2e.sh --help
```

### Quick Start

1. **Prerequisites**: Ensure you have a Kubernetes cluster running
2. **Environment**: Set `KUBECONFIG`, or make sure `~/.kube/config` points to your cluster
3. **Run Test**: Execute the script with the desired parameters
4. **Results**: Check the `--report-dir` directory for test results and metrics

### Available Test Tools

The `run-e2e.sh` script supports multiple performance testing tools:

- **`cluster-loader2`** - Kubernetes cluster performance and scale testing
- **`network-performance`** - Network performance benchmarks
- **`kube-dns`** - DNS performance testing
- **`core-dns`** - CoreDNS performance testing
- **`node-local-dns`** - NodeLocalDNS performance testing

### Example ClusterLoader2 Test Config

Use the provided test configuration file:

```bash
# Copy the example config to your test directory
cp pkg/dependency/kwok/examples/test-config.yaml your-test-config.yaml

# Or reference it directly
./run-e2e.sh cluster-loader2 \
  --testconfig=pkg/dependency/kwok/examples/test-config.yaml \
  --provider=kind \
  --nodes=3 \
  --report-dir=/tmp/kwok-dra-test
```

The [`examples/test-config.yaml`](examples/test-config.yaml) includes:
- **KWOK DRA dependency** with 3 nodes and 8 GPUs per node
- **ResourceClaimTemplate creation** step (must run before jobs)
- **10 GPU jobs** requesting fake GPU resources
- **QPS throttling** for controlled job creation

## Job Timing Configuration

To control how long simulated jobs run before completing:

```bash
# Set job duration to 10 seconds (10000ms)
export CL2_JOB_RUNNING_TIME_MS=10000

# Run your ClusterLoader2 test
./clusterloader2 --testconfig=test-config.yaml
```

This affects all jobs with `job-type: short-lived` labels running on KWOK nodes.

## Important Notes

1. **Tolerations Required**: Jobs must tolerate the `kwok.x-k8s.io/node=fake:NoSchedule` taint
2. **DeviceClass**: Uses the built-in `cl2-gpu.kwok.x-k8s.io` DeviceClass provided by the dependency
3. **Step Ordering**: ResourceClaimTemplate must be created before jobs that reference it
4. **Resource Dependencies**: Jobs depend on both DeviceClass (from dependency) and ResourceClaimTemplate (from first step)
5. **Job Labels**: Pods must have `job-type: short-lived` label for KWOK job completion simulation
6. **Job Completion**: Set `CL2_JOB_RUNNING_TIME_MS` environment variable to control simulated job duration (default: 30000ms)
7. **Device Attributes**: v1beta2 API provides rich device metadata (GPU type, memory, compute capability)
8. **Enhanced Scheduling**: ResourceSlices include proper labels for improved resource discovery
9. **Fake Resources**: GPUs are simulated - no actual GPU operations occur
10. **Cleanup**: The dependency automatically cleans up when tests complete

## Troubleshooting

### Common Issues

- **Nodes not ready**: Check KWOK controller logs in `kwok-system` namespace
- **Jobs not scheduling**: Verify tolerations and DeviceClass configuration
- **Timeout errors**: Increase the `timeout` parameter in dependency config
- **ResourceClaimTemplate not found**: Ensure the ResourceClaimTemplate step runs before job creation
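
A first check for the "Nodes not ready" case can be scripted. The sketch below assumes the default column layout of `kubectl get nodes --no-headers` (node name first, status second). `count_ready_nodes` is a hypothetical helper name, and the sample rows are canned illustration data rather than real cluster output.

```shell
# Hypothetical helper: counts nodes whose STATUS column is exactly "Ready".
# Expects `kubectl get nodes --no-headers` output on stdin. Compound
# statuses such as "Ready,SchedulingDisabled" are deliberately not counted.
count_ready_nodes() {
  awk '$2 == "Ready" { n++ } END { print n + 0 }'
}

# Canned input for illustration; against a live cluster, pipe in
#   kubectl get nodes -l type=kwok --no-headers
printf '%s\n' \
  'kwok-node-0   Ready      agent   1m   v1.33.0' \
  'kwok-node-1   NotReady   agent   1m   v1.33.0' |
  count_ready_nodes
```

If the count stays below the configured `nodes` value, the KWOK controller logs in `kwok-system` are the next place to look.
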

### Testing the Setup

```bash
# Test KWOK DRA dependency with example config
./run-e2e.sh cluster-loader2 \
  --testconfig=clusterloader2/pkg/dependency/kwok/examples/test-config.yaml \
  --provider=kind \
  --nodes=3 \
  --report-dir=/tmp/kwok-test

# Check KWOK nodes are created
kubectl get nodes -l type=kwok

# Verify ResourceSlices are available
kubectl get resourceslices

# Check DeviceClass is installed
kubectl get deviceclasses cl2-gpu.kwok.x-k8s.io
```

### Debug Mode

Enable verbose logging by setting environment variables:

```bash
export KLOG_V=2
./run-e2e.sh cluster-loader2 --testconfig=... --v=2
```

## See Also

- [KWOK Documentation](https://kwok.sigs.k8s.io/)
- [Kubernetes DRA Documentation](https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/)
- [ClusterLoader2 Documentation](../../../docs/)
