@mtian29 mtian29 commented Oct 1, 2025

Why are these changes needed?

Support for Volcano Network Topology Aware Scheduling

Closes issue #3641.

Test

Build an image and use it for the KubeRay operator.

Apply a RayCluster with the following labels:

apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: kwok-raycluster-h100-q21-low
  labels:
    ray.io/scheduler-name: volcano 
    volcano.sh/queue-name: queue2 
    ray.io/priority-class-name: ml-tier1 
    volcano.sh/network-topology-mode: hard # <----
    volcano.sh/network-topology-highest-tier-allowed: "1" # <-----
.....

The Volcano PodGroup has the networkTopology field injected:

❯ kg pg -oyaml
apiVersion: v1
items:
- apiVersion: scheduling.volcano.sh/v1beta1
  kind: PodGroup
  metadata:
    annotations:
      volcano.sh/job-allocated-hypernode: hypernode-ad1
    creationTimestamp: "2025-09-30T22:40:12Z"
    generation: 3
    name: ray-kwok-raycluster-h100-q21-low-pg
    namespace: kuberay
.....
  spec:
    minMember: 10
    minResources:
      cpu: "80"
      memory: 80Gi
      nvidia.com/h100: "80"
    networkTopology:
      highestTierAllowed: 1 <--- 
      mode: hard <----- 

Some other label combinations and their results:

volcano.sh/network-topology-mode: soft
volcano.sh/network-topology-highest-tier-allowed: "2"

networkTopology:
  highestTierAllowed: 2
  mode: soft


volcano.sh/network-topology-mode: hard
volcano.sh/network-topology-highest-tier-allowed: "2"

networkTopology:
  highestTierAllowed: 2
  mode: hard


volcano.sh/network-topology-mode: hard

networkTopology:
  highestTierAllowed: 1 <--- default
  mode: hard


volcano.sh/network-topology-mode: soft

networkTopology:
  highestTierAllowed: 1 <--- default
  mode: soft


No label => no network topology:

spec:
  minMember: 20
  minResources:
    cpu: "160"
    memory: 160Gi
    nvidia.com/h100: "160"
  priorityClassName: ml-tier1
  queue: queue2


volcano.sh/network-topology-mode: soft
volcano.sh/network-topology-highest-tier-allowed: "abc"

"PodGroup.Error":"failed to convert volcano.sh/network-topology-highest-tier-allowed label to int: strconv.Atoi: parsing "abc": invalid syntax for podgroup ray-kwok-raycluster-h100-q2-soft-topology2-pg in namespace kuberay
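
The label-to-spec mapping exercised by the cases above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the `networkTopology` struct and the `topologyFromLabels` function are hypothetical stand-ins for the PodGroup's networkTopology field and the operator's PodGroup builder. The label keys, the default tier of 1, and the error wording are taken from the test output above. Returning nil when only the tier label is present is an assumption, since that case is not covered by the tests.

```go
package main

import (
	"fmt"
	"strconv"
)

// Label keys as they appear in the RayCluster manifests above.
const (
	networkTopologyModeLabelKey               = "volcano.sh/network-topology-mode"
	networkTopologyHighestTierAllowedLabelKey = "volcano.sh/network-topology-highest-tier-allowed"
	// defaultHighestTierAllowed mirrors the default of 1 seen in the test output.
	defaultHighestTierAllowed = 1
)

// networkTopology is a hypothetical stand-in for the PodGroup's
// spec.networkTopology field.
type networkTopology struct {
	Mode               string
	HighestTierAllowed int
}

// topologyFromLabels returns nil when the mode label is absent, so the
// PodGroup carries no network topology. (Assumption: a tier label without a
// mode label is also treated as "no topology".)
func topologyFromLabels(labels map[string]string) (*networkTopology, error) {
	mode, ok := labels[networkTopologyModeLabelKey]
	if !ok {
		return nil, nil
	}
	nt := &networkTopology{Mode: mode, HighestTierAllowed: defaultHighestTierAllowed}
	if tier, ok := labels[networkTopologyHighestTierAllowedLabelKey]; ok {
		n, err := strconv.Atoi(tier)
		if err != nil {
			// Wrap the parse error so the user can see which label was invalid.
			return nil, fmt.Errorf("failed to convert %s label to int: %w",
				networkTopologyHighestTierAllowedLabelKey, err)
		}
		nt.HighestTierAllowed = n
	}
	return nt, nil
}
```

Calling `topologyFromLabels` with only the mode label set yields the default tier, while a non-numeric tier such as "abc" yields an error rather than a silently missing field.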

Related issue number

Closes #3641

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@mtian29 mtian29 changed the title Support for Volcano Network Topology Aware Scheduling [Feature] Support for Volcano Network Topology Aware Scheduling Oct 1, 2025
@mtian29 mtian29 changed the title [Feature] Support for Volcano Network Topology Aware Scheduling [Feature] Support for Volcano Network Topology Aware Scheduling for kuberay Oct 1, 2025
mode, modeOk := app.ObjectMeta.Labels[NetworkTopologyModeLabelKey]
highestTier, tierOk := app.ObjectMeta.Labels[NetworkTopologyHighestTierAllowedLabelKey]
if modeOk && tierOk {
	highestTierInt, err := strconv.Atoi(highestTier)
Collaborator

Should we handle this error so that the user will get a better understanding of what happened?

Author

We should. I've updated the PR. Thanks for reviewing.


mode, modeOk := app.ObjectMeta.Labels[NetworkTopologyModeLabelKey]
highestTier, tierOk := app.ObjectMeta.Labels[NetworkTopologyHighestTierAllowedLabelKey]
if modeOk && tierOk {
Collaborator

According to the Network Topology Aware Scheduling policy, highestTierAllowed is not required when mode is soft. With this condition, if highestTierAllowed is not set in soft mode, the NetworkTopologySpec will not be set at all.

Author

This is a good catch.
I've updated the PR.

Collaborator

@win5923 win5923 left a comment

I think we should synchronize the PodGroup's NetworkTopology field when the labels of an existing RayCluster are updated. WDYT?

Author

mtian29 commented Oct 1, 2025

> I think we should synchronize the PodGroup's NetworkTopology field when the labels of an existing RayCluster are updated. WDYT?

Thanks, @win5923. Good point.

  • Where should we do this for an update? I don't see a place to do it.
  • But even if we can, we don't need to. Network topology only matters for scheduling: once a pod is bound to a node and initializing, changing the labels, spec, or network topology won't move it to a different node.

@mtian29 mtian29 changed the title [Feature] Support for Volcano Network Topology Aware Scheduling for kuberay [Feature] Support Volcano Network Topology Aware Scheduling for kuberay Oct 1, 2025
Collaborator

win5923 commented Oct 2, 2025

> Thanks, @win5923. Good point.
>
> Where should we do this for an update? I don't see a place to do it.
> But even if we can, we don't need to. Network topology only matters for scheduling: once a pod is bound to a node and initializing, changing the labels, spec, or network topology won't move it to a different node.

We can do this in syncPodGroup, but I think you’re right.
This is only for scheduling. If users want to change the topology settings, they should recreate the RayCluster.

Collaborator

@win5923 win5923 left a comment

Thanks for the contribution; the changes look good to me.
But I think we should wait until #3972 is merged, as it includes interface changes that might impact this implementation.

Author

mtian29 commented Oct 3, 2025

> Thanks for the contribution; the changes look good to me. But I think we should wait until #3972 is merged, as it includes interface changes that might impact this implementation.

Thanks, @win5923.
The only affected part should be the label lookup: app.ObjectMeta.Labels[xxxxx] => owner.GetLabels()[xxxx]
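
The change in lookup style mentioned above can be illustrated with a small sketch. It assumes the owner is passed as something exposing `GetLabels()`, as Kubernetes API objects do through `metav1.Object`. To stay self-contained, the `labeled` interface, the `fakeCluster` type, and the `labelValue` helper below are all hypothetical and do not come from the KubeRay code base.

```go
package main

// labeled is a hypothetical local interface. In the operator this role would
// be played by an object that exposes GetLabels(), such as a RayCluster.
type labeled interface {
	GetLabels() map[string]string
}

// fakeCluster is an illustrative type standing in for a RayCluster.
type fakeCluster struct {
	labels map[string]string
}

// GetLabels returns the object's labels (possibly a nil map).
func (c fakeCluster) GetLabels() map[string]string { return c.labels }

// labelValue looks a label up through the interface instead of reaching into
// ObjectMeta.Labels on a concrete type. Reading from a nil map is safe in Go,
// so an object without labels simply reports the key as absent.
func labelValue(owner labeled, key string) (string, bool) {
	v, ok := owner.GetLabels()[key]
	return v, ok
}
```

Because the lookup depends only on `GetLabels()`, the same helper would work for any owner type that provides it.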

Do you know when the next release of the KubeRay operator is? Can my change be included?

Collaborator

win5923 commented Oct 3, 2025

> Do you know when the next release of the KubeRay operator is? Can my change be included?

Nov. 1, 2025 (Branch Cut: Oct. 10).
Ref: https://docs.google.com/document/d/1rdXniNitHCNTGfyvvMPdMkp1cDnQtjOmfqNiWuvrS9A/edit?tab=t.0#heading=h.ctb1p12e6p4u

Yes, I believe this PR can be included in v1.5, as the changes are relatively minor and should be safe to merge before the cut-off.

Collaborator

troychiu commented Oct 3, 2025

Sorry for the late reply. I'll review this PR asap.
