This repository was archived by the owner on Sep 19, 2022. It is now read-only.
Open
2 changes: 2 additions & 0 deletions manifests/crd.yaml
@@ -23,6 +23,8 @@ spec:
 properties:
   spec:
     properties:
+      priorityClassName:
+        type: string
       pytorchReplicaSpecs:
         properties:
           Master:
2 changes: 2 additions & 0 deletions manifests/podgroup.yaml
@@ -22,6 +22,8 @@ spec:
     minMember:
       format: int32
       type: integer
+    priorityClassName:
+      type: string
   type: object
 status:
   properties:
4 changes: 4 additions & 0 deletions pkg/apis/pytorch/v1/types.go
@@ -69,6 +69,10 @@ type PyTorchJobSpec struct {
 	// "Worker": PyTorchReplicaSpec,
 	// }
 	PyTorchReplicaSpecs map[PyTorchReplicaType]*common.ReplicaSpec `json:"pytorchReplicaSpecs"`
+
+	//添加判断优先级的属性 (English: add a property used to determine priority)
Member

Please use English here.

+	//add PriorityClassName
+	PriorityClassName string `json:"priorityClassName,omitempty"`
Member

Suggested change
-	PriorityClassName string `json:"priorityClassName,omitempty"`
+	PriorityClassName *string `json:"priorityClassName,omitempty"`

Since it is optional, we can define it as a pointer.
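
To illustrate the point about optional fields, here is a minimal sketch of what a pointer buys over a plain string. The `jobSpec` type and the `describe` helper are hypothetical stand-ins invented for this example, not code from the PR. With a `*string`, a decoded spec can tell "field absent" apart from "field present but empty", which a plain `string` cannot.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// jobSpec is a hypothetical stand-in for PyTorchJobSpec, reduced to the
// single optional field under discussion.
type jobSpec struct {
	PriorityClassName *string `json:"priorityClassName,omitempty"`
}

// describe decodes raw JSON and reports whether the field was absent,
// present but empty, or set to a value.
func describe(raw string) string {
	var s jobSpec
	if err := json.Unmarshal([]byte(raw), &s); err != nil {
		return "invalid: " + err.Error()
	}
	switch {
	case s.PriorityClassName == nil:
		return "unset"
	case *s.PriorityClassName == "":
		return "empty"
	default:
		return "set to " + *s.PriorityClassName
	}
}

func main() {
	fmt.Println(describe(`{}`))
	fmt.Println(describe(`{"priorityClassName":""}`))
	fmt.Println(describe(`{"priorityClassName":"high"}`))
}
```

The trade-off is that every reader of the field then needs a nil check before dereferencing it.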

}

// PyTorchReplicaType is the type for PyTorchReplica. Can be one of "Master" or "Worker".
4 changes: 3 additions & 1 deletion pkg/controller.v1/pytorch/controller.go
@@ -437,7 +437,9 @@ func (pc *PyTorchController) reconcilePyTorchJobs(job *pyv1.PyTorchJob) error {

 	if pc.Config.EnableGangScheduling {
 		minAvailableReplicas := getTotalReplicas(job)
-		_, err := pc.SyncPodGroup(job, minAvailableReplicas)
+		priorityClassName := getPriorityClassName(job)
+		//_, err := pc.SyncPodGroup(job, minAvailableReplicas)
+		_, err := pc.SyncPodGroupTest(job, minAvailableReplicas, priorityClassName)
Member

Why should we use SyncPodGroupTest instead of SyncPodGroup?

Author

I added this code in SyncPodGroupTest:

    Spec: v1alpha1.PodGroupSpec{
        MinMember:         minAvailable.IntVal,
        PriorityClassName: priorityClassName,
    },

The function name is inappropriate; I only used it to test my idea.

Author

@gaocegege Should I delete my SyncPodGroupTest function and move the code I wrote into the original SyncPodGroup function?

Member

I think so.

Author

I got it.
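
For readers following the thread, the outcome agreed on above (folding the priority class into the existing sync path instead of keeping a separate test function) might look roughly like the sketch below. Everything here is an assumption made for illustration: `podGroupSpec` is a local stand-in for `v1alpha1.PodGroupSpec`, and `buildPodGroupSpec` is a hypothetical helper, not a function in the operator or its vendored libraries.

```go
package main

import "fmt"

// podGroupSpec is a hypothetical stand-in for v1alpha1.PodGroupSpec,
// limited to the two fields quoted in the thread.
type podGroupSpec struct {
	MinMember         int32
	PriorityClassName string
}

// buildPodGroupSpec is a hypothetical helper showing the spec that a single,
// merged sync function could construct. An empty priorityClassName leaves
// the field at its zero value.
func buildPodGroupSpec(minAvailable int32, priorityClassName string) podGroupSpec {
	return podGroupSpec{
		MinMember:         minAvailable,
		PriorityClassName: priorityClassName,
	}
}

func main() {
	spec := buildPodGroupSpec(3, "high-priority")
	fmt.Printf("%+v\n", spec)
}
```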

 		if err != nil {
 			logger.Warnf("Sync PodGroup %v: %v", job.Name, err)
 		}
7 changes: 7 additions & 0 deletions pkg/controller.v1/pytorch/job.go
@@ -216,3 +216,10 @@ func getTotalFailedReplicas(job *pyv1.PyTorchJob) int32 {
 	}
 	return totalFailedReplicas
 }
+
+func getPriorityClassName(job *pyv1.PyTorchJob) string {
+	var priorityClassName string
+	priorityClassName = job.Spec.PriorityClassName
+
+	return priorityClassName
+}
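
If the reviewer's pointer suggestion for `PriorityClassName` were adopted, this getter would need a nil check before dereferencing. A minimal sketch under that assumption follows; the `job` and `jobSpec` types are hypothetical stand-ins for `pyv1.PyTorchJob` and its spec, and treating an unset field as the empty string is an assumed choice, not something the PR specifies.

```go
package main

import "fmt"

// jobSpec and job are hypothetical stand-ins for the pyv1 types, assuming
// PriorityClassName has been changed to a pointer.
type jobSpec struct {
	PriorityClassName *string
}

type job struct {
	Spec jobSpec
}

// getPriorityClassName returns the configured priority class name, or the
// empty string when the job or the field is nil.
func getPriorityClassName(j *job) string {
	if j == nil || j.Spec.PriorityClassName == nil {
		return ""
	}
	return *j.Spec.PriorityClassName
}

func main() {
	name := "high-priority"
	withName := &job{Spec: jobSpec{PriorityClassName: &name}}
	fmt.Printf("%q\n", getPriorityClassName(withName))
	fmt.Printf("%q\n", getPriorityClassName(&job{}))
}
```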
1 change: 1 addition & 0 deletions pkg/controller.v1/pytorch/status.go
@@ -69,6 +69,7 @@ func (pc *PyTorchController) updateStatusSingle(job *pyv1.PyTorchJob, rtype pyv1

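 	// Expect to have `replicas - succeeded` pods alive.
 	commonType := common.ReplicaType(rtype)
+	//expected是成功的判断标志,等于0时,成功的数量等于副本数,认为成功 (English: expected is the success indicator; when it is 0, the number of succeeded pods equals the replica count and the job is considered successful)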
Member

Please use English here

 	expected := replicas - int(job.Status.ReplicaStatuses[commonType].Succeeded)
 	running := int(job.Status.ReplicaStatuses[commonType].Active)
 	failed := int(job.Status.ReplicaStatuses[commonType].Failed)
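
The arithmetic described by the comment in this hunk can be shown in isolation. In the sketch below, `replicaStatus` is a hypothetical stand-in for the status struct the controller reads, and `remaining` is an illustrative helper that does not exist in the PR. It computes `replicas - succeeded`, so a result of 0 means every replica of that type has succeeded.

```go
package main

import "fmt"

// replicaStatus is a hypothetical stand-in for the per-replica-type status
// counters used in the hunk (Active, Succeeded, Failed).
type replicaStatus struct {
	Active    int32
	Succeeded int32
	Failed    int32
}

// remaining returns how many replicas have not yet succeeded. A result of 0
// means all replicas of this type have succeeded.
func remaining(replicas int, st replicaStatus) int {
	return replicas - int(st.Succeeded)
}

func main() {
	inProgress := replicaStatus{Active: 1, Succeeded: 2}
	done := replicaStatus{Succeeded: 3}

	fmt.Println(remaining(3, inProgress))
	fmt.Println(remaining(3, done) == 0)
}
```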
