Skip to content

[Feature]: Implement MPI Environment Variable ValidationΒ #3126

@VishalPainjane

Description

@VishalPainjane

What you would like to be added?

I would like to add validation logic to the MPI runtime plugin to prevent users from manually setting internal MPI environment variables.

Specifically, I propose implementing the TODO found in pkg/runtime/framework/plugins/mpi/mpi.go:

// TODO (andreyvelich): Add validation to check that TrainJob doesn't have MPI envs.

Why is this needed?

The current MPI runtime plugin allows users to manually set environment variables in TrainJob.spec.trainer.env that conflict with the operator's internal configuration.

For example, if a user manually sets OMPI_MCA_orte_default_hostfile, it prevents the MPI launcher from correctly discovering workers, leading to training failures that are difficult to debug. Preventing these conflicts allows for a more robust and predictable training experience.

The Goal

We want to prevent users from manually setting specific environment variables (like OMPI_MCA_orte_default_hostfile) in their TrainJob. The MPI plugin manages these automatically, and manual overrides can cause the training job to fail or behave unpredictably.

The Plan

I propose making changes to three specific files. Here is the breakdown:

1. Define the Forbidden List

  • File: pkg/constants/constants.go
  • Action: Create a new constant MPIReservedEnvNames using the existing sets.New() pattern.
  • Content: It will include:
    • OMPI_MCA_orte_default_hostfile
    • OMPI_MCA_plm_rsh_args
    • OMPI_MCA_orte_keep_fqdn_hostnames
    • OMPI_MCA_orte_set_default_slots

2. Enforce the Rule

  • File: pkg/runtime/framework/plugins/mpi/mpi.go
  • Action: Update the Validate function.
  • Logic:
    • Loop through the user-provided environment variables in trainJob.Spec.Trainer.Env.
    • Check if any of them exist in the MPIReservedEnvNames list.
    • If a match is found, reject the job creation with a clear error message (e.g., "must not have reserved envs...").

3. Verify with Tests

  • File: pkg/runtime/framework/plugins/mpi/mpi_test.go
  • Action: Add a new unit test case.
  • Scenario: Attempt to validate a TrainJob that explicitly sets OMPI_MCA_orte_default_hostfile.
  • Expected Result: The test must confirm that the validation fails as expected.

Risk Assessment

  • Safety: This code does not change how valid jobs run. It only affects the validation phase before a job starts.
  • Compatibility: This is technically a breaking change for users who are currently (incorrectly) manually setting these variables. Their jobs will be rejected after this update.
    • My opinion: This is acceptable because those configurations are likely fragile or wrong, and this forces them to use the system correctly.

Love this feature?

Give it a πŸ‘ We prioritize the features with most πŸ‘

Metadata

Metadata

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions