annapoornanarayan commented Aug 5, 2025

Issue #, if available:

Description of changes:
[DO NOT MERGE until the nvidia-training PR is merged] The tests will only pass after those original changes are applied.
This PR changes the nvidia test to deploy the DCGM and CloudWatch manifests when enabled by the --metricDimensions flag.
It also standardizes flag formatting and adds common functions for daemonset deployment, similar to nvidia-training.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

Comment on lines 112 to 118
// Set default values
if testConfig.PytorchImage == "" {
	testConfig.PytorchImage = "763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:2.1.0-gpu-py310-cu121-ubuntu20.04-ec2"
}
if !testConfig.InstallDevicePlugin {
	testConfig.InstallDevicePlugin = true
}

Why was this moved down here? The default flag values should be set before common.ParseFlags(&testConfig), the same way it was done before.
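
To illustrate the ordering concern, here is a minimal sketch using only the standard `flag` package. The `Config` struct, the `parseConfig` helper, the flag names `pytorchImage` and `installDevicePlugin`, and the placeholder image are all assumptions made for the example; the actual `common.ParseFlags` may work differently. Defaults are seeded first and flags override them afterwards, so an explicit `-installDevicePlugin=false` survives. With the quoted post-parse `if !testConfig.InstallDevicePlugin` check, the value would always end up `true`.

```go
package main

import (
	"flag"
	"fmt"
)

// Config is a hypothetical stand-in for the test's testConfig struct.
type Config struct {
	PytorchImage        string
	InstallDevicePlugin bool
}

// parseConfig seeds the defaults first and then lets the flags override them.
func parseConfig(args []string) (Config, error) {
	cfg := Config{
		PytorchImage:        "example.com/pytorch-training:latest", // placeholder default
		InstallDevicePlugin: true,
	}

	fs := flag.NewFlagSet("nvidia-test", flag.ContinueOnError)
	// The current field values are passed as the flag defaults.
	fs.StringVar(&cfg.PytorchImage, "pytorchImage", cfg.PytorchImage, "PyTorch training image")
	fs.BoolVar(&cfg.InstallDevicePlugin, "installDevicePlugin", cfg.InstallDevicePlugin, "install the NVIDIA device plugin")

	if err := fs.Parse(args); err != nil {
		return Config{}, err
	}
	return cfg, nil
}

func main() {
	cfg, err := parseConfig([]string{"-installDevicePlugin=false"})
	if err != nil {
		panic(err)
	}
	fmt.Println(cfg.InstallDevicePlugin, cfg.PytorchImage)
}
```

Because the defaults live in the struct before parsing, no post-parse fix-up is needed, and a user can still turn the device plugin off.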

Comment on lines 120 to 123
renderedCloudWatchAgentManifest, err := manifests.RenderCloudWatchAgentManifest(testConfig.MetricDimensions)
if err != nil {
	log.Printf("Warning: failed to render CloudWatch Agent manifest: %v", err)
}

Same manifest-rendering issue: put this under the if len(testConfig.MetricDimensions) > 0 {} check.
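
A rough sketch of the guarded rendering, under stated assumptions: `renderCloudWatchAgentManifest` here is a simplified stand-in for `manifests.RenderCloudWatchAgentManifest`, `buildManifests` is a hypothetical helper, and `MetricDimensions` is assumed to be a `[]string`. Returning the render error instead of logging a warning is a choice made for the sketch, not something requested in the review.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// renderCloudWatchAgentManifest is a simplified, hypothetical stand-in for
// manifests.RenderCloudWatchAgentManifest.
func renderCloudWatchAgentManifest(metricDimensions []string) ([]byte, error) {
	if len(metricDimensions) == 0 {
		return nil, errors.New("no metric dimensions provided")
	}
	return []byte("dimensions: " + strings.Join(metricDimensions, ",")), nil
}

// buildManifests copies the base manifests and renders the optional
// CloudWatch Agent manifest only when metric dimensions are set.
func buildManifests(base [][]byte, metricDimensions []string) ([][]byte, error) {
	manifestsList := append([][]byte{}, base...)

	if len(metricDimensions) > 0 {
		rendered, err := renderCloudWatchAgentManifest(metricDimensions)
		if err != nil {
			return nil, fmt.Errorf("render CloudWatch Agent manifest: %w", err)
		}
		manifestsList = append(manifestsList, rendered)
	}
	return manifestsList, nil
}

func main() {
	base := [][]byte{[]byte("mpi-operator")}

	withMetrics, err := buildManifests(base, []string{"InstanceType", "ClusterName"})
	if err != nil {
		panic(err)
	}
	withoutMetrics, err := buildManifests(base, nil)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(withMetrics), len(withoutMetrics))
}
```

With the guard in place, a run without the flag never calls the renderer, so it cannot fail or log warnings about a manifest nobody asked for.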

Comment on lines 138 to 175
testenv.Setup(
func(ctx context.Context, config *envconf.Config) (context.Context, error) {
err := fwext.ApplyManifests(config.Client().RESTConfig(), deploymentManifests...)
err := fwext.ApplyManifests(config.Client().RESTConfig(), manifestsList...)
if err != nil {
return ctx, err
}
return ctx, nil
},
deployMPIOperator,
}

if *installDevicePlugin {
deploymentManifests = append(deploymentManifests, manifests.NvidiaDevicePluginManifest)
setUpFunctions = append(setUpFunctions, deployNvidiaDevicePlugin)
}
func(ctx context.Context, config *envconf.Config) (context.Context, error) {
if testConfig.InstallDevicePlugin {
if ctx, err := common.DeployDaemonSet("nvidia-device-plugin-daemonset", "kube-system")(ctx, config); err != nil {
return ctx, err
}
}
if testConfig.EfaEnabled {
if ctx, err := common.DeployDaemonSet("aws-efa-k8s-device-plugin-daemonset", "kube-system")(ctx, config); err != nil {
return ctx, err
}
}
return ctx, nil
}, // Deploy device plugins conditionally

if *efaEnabled {
deploymentManifests = append(deploymentManifests, manifests.EfaDevicePluginManifest)
setUpFunctions = append(setUpFunctions, deployEFAPlugin)
}
func(ctx context.Context, config *envconf.Config) (context.Context, error) {
if len(testConfig.MetricDimensions) > 0 {
if ctx, err := common.DeployDaemonSet("dcgm-exporter", "kube-system")(ctx, config); err != nil {
return ctx, err
}
if ctx, err := common.DeployDaemonSet("cwagent", "amazon-cloudwatch")(ctx, config); err != nil {
return ctx, err
}
}
return ctx, nil
}, // Deploy CloudWatch Agent + DCGM only if MetricDimensions are set

setUpFunctions = append(setUpFunctions, checkNodeTypes)
testenv.Setup(setUpFunctions...)
checkNodeTypes, // Dynamically check node types and capacities after device plugins are ready
)

Could you keep the original implementation and just append another func to `setUpFunctions` that deploys the optional CloudWatch Agent and DCGM exporter, then pass the whole slice to testenv.Setup() at once? The original implementation here is much cleaner.

2 participants