
Conversation

@hehe7318 (Contributor) commented Sep 4, 2025

Description of changes

Enhance the prolog script to safely handle multiple concurrent job executions on the same compute node by implementing comprehensive concurrency control mechanisms.

  • Add config content checking to skip unnecessary service restarts when the running IMEX service is already configured with the correct IP addresses. The new mechanism (sketched after this list) is:
    • A Slurm job is submitted and the prolog executes on several compute nodes.
    • Each node acquires the lock file and checks whether the IMEX config file content matches the expected content.
    • On each node, if the IMEX service is not running, reload (start) it.
    • Each node runs nvidia-imex-ctl to get the currently configured IP address list and compares it with the expected list. A mismatch means some nodes were replaced, so IMEX is reloaded.
  • Enhance logging with PID and JOB ID for better debugging
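
A rough sketch of that flow, assuming illustrative names (the get_expected_ips helper and the lock path are not taken from the PR's actual code):

# Sketch only: helper names and paths are assumptions, not the PR's code.
(
  flock --timeout 60 9 || exit 1            # serialize prolog runs on this node

  expected_ips="$(get_expected_ips)"        # hypothetical helper

  # Reload (start) IMEX if it is not running
  systemctl is-active --quiet nvidia-imex || systemctl start nvidia-imex

  # Compare the currently configured IPs with the expected list
  current_ips="$(/usr/bin/nvidia-imex-ctl -N -j -c "${IMEX_MAIN_CONFIG}" \
    | jq -r '.nodes | to_entries[].value.host' | sort | tr '\n' ' ')"
  if [[ "${current_ips}" != "${expected_ips}" ]]; then
    systemctl restart nvidia-imex           # a mismatch means nodes were replaced
  fi
) 9>/var/lock/nvidia-imex-prolog.lock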

Tests

  • Basic concurrent execution (3 jobs)
  • Forced concurrent writes with different configurations
  • Service restart serialization under high concurrency
  • Real cluster IP handling with forced restarts
  • Integration tests

All tests passed successfully, confirming the script is production-ready for concurrent Slurm job execution environments.

Checklist

  • Make sure you are pointing to the right branch.
  • If you're creating a patch for a branch other than develop, add the branch name as a prefix in the PR title (e.g. [release-3.6]).
  • Check all commits' messages are clear, describing what and why vs how.
  • Make sure to have added unit tests or integration tests to cover the new/modified code.
  • Check if documentation is impacted by this change.

Please review the guidelines for contributing and Pull Request Instructions.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

This commit enhances the prolog script to safely handle multiple concurrent
job executions on the same compute node by implementing comprehensive
concurrency control mechanisms.

Changes:
- Add file locking with retry mechanism and exponential backoff to prevent
  concurrent writes to IMEX config file
- Implement global service lock to serialize nvidia-imex service restarts
  and avoid systemd conflicts
- Add process-specific logging with lock-protected merge to prevent log
  file corruption from concurrent writes
- Enhance logging with PID and JOB ID for better debugging of concurrent
  executions
- Add intelligent config content checking to skip unnecessary service
  restarts when configuration is already correct
- Implement stale lock cleanup mechanism to handle orphaned lock files
  from abnormally terminated processes

Technical details:
- File lock timeout: 60 seconds with 3 retries
- Service lock timeout: 120 seconds
- Exponential backoff: 2s, 4s, 6s between retries
- Stale lock cleanup: Remove locks older than 90 seconds
- Process-specific log files: /var/log/parallelcluster/nvidia-imex-prolog.log.{JOB_ID}.{PID}
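
A minimal sketch of how those parameters could fit together, assuming a lock path and helper name that are illustrative rather than taken from the script:

# Sketch only: the lock path and function name are assumptions, not the PR's code.
LOCK_FILE="/var/lock/nvidia-imex-config.lock"
MAX_RETRIES=3
LOCK_TIMEOUT=60      # seconds, per the file lock timeout above
STALE_AGE=90         # seconds, per the stale lock cleanup above

acquire_config_lock() {
  local attempt
  for attempt in $(seq 1 "${MAX_RETRIES}"); do
    # Stale lock cleanup: remove lock files older than STALE_AGE seconds,
    # left behind by abnormally terminated processes
    if [[ -f "${LOCK_FILE}" ]] && \
       (( $(date +%s) - $(stat -c %Y "${LOCK_FILE}") > STALE_AGE )); then
      rm -f "${LOCK_FILE}"
    fi
    exec 9>"${LOCK_FILE}"
    flock --timeout "${LOCK_TIMEOUT}" 9 && return 0
    sleep $(( attempt * 2 ))   # backoff between retries: 2s, 4s, 6s
  done
  return 1
}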

Tested scenarios:
- Basic concurrent execution (3 jobs)
- Forced concurrent writes with different configurations
- Service restart serialization under high concurrency
- Real cluster IP handling with forced restarts

All tests passed successfully, confirming the script is production-ready
for concurrent Slurm job execution environments.
@hehe7318 hehe7318 requested review from a team as code owners September 4, 2025 21:47
@hehe7318 hehe7318 added skip-changelog-update Disables the check that enforces changelog updates in PRs 3.x labels Sep 4, 2025
…tatus

Problem:
- GB200 integration test failed with IMEX status 'DEGRADED' instead of 'UP'
- Root cause: the mechanism that checks the IMEX config content and skips the IMEX reload when the content is unchanged caused inconsistent behavior across nodes
- First node would update config and restart IMEX, subsequent nodes would skip restart
- Result: partial IMEX initialization leading to cluster degradation

Solution:
- Implement job-based global decision mechanism using decision files
- Ensure all nodes in same job execute identical operations (all skip or all restart)
- Maintain performance benefits of intelligent skipping for unchanged configurations
- Add proper concurrency protection for multi-job scenarios on same node

Key changes:
- Use job-specific decision files: .decision.
- First node makes decision, subsequent nodes follow same decision
- Atomic decision making with file locking
- Automatic cleanup of decision files after job completion
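
A minimal sketch of the decision-file idea (this mechanism is replaced in a later commit); the file naming, lock path, and shared location are assumptions:

# Sketch only: names and paths are assumptions, not the PR's code.
# The decision file is assumed to live on storage visible to all nodes
# in the job, so the first node's decision reaches the others.
DECISION_FILE="/shared/imex-prolog-${SLURM_JOB_ID}.decision"

exec 9>"${DECISION_FILE}.lock"
flock --timeout 60 9                       # atomic decision making
if [[ -f "${DECISION_FILE}" ]]; then
  decision="$(cat "${DECISION_FILE}")"     # follow the first node's decision
else
  decision="restart"                       # first node decides (e.g. config changed)
  echo "${decision}" > "${DECISION_FILE}"
fi
flock -u 9

[[ "${decision}" == "restart" ]] && systemctl restart nvidia-imex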

codecov bot commented Sep 5, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 90.27%. Comparing base (40b07ca) to head (528923e).
⚠️ Report is 14 commits behind head on develop.

Additional details and impacted files
@@           Coverage Diff            @@
##           develop    #6964   +/-   ##
========================================
  Coverage    90.27%   90.27%           
========================================
  Files          182      182           
  Lines        16466    16466           
========================================
  Hits         14864    14864           
  Misses        1602     1602           
Flag Coverage Δ
unittests 90.27% <ø> (ø)


Problem:
- The previous decision file mechanism caused cross-job interference
- Job 4 would skip the IMEX reload after Job 3 updated the shared config file
- This led to inconsistent IMEX cluster state across different job executions

Solution:
- Replace decision file mechanism with nvidia-imex-ctl status checking
- Use 'nvidia-imex-ctl -N -j -c' to query current IMEX service configuration
- Compare current IMEX IPs with expected cluster IPs to determine reload necessity
- Each job independently evaluates IMEX status without cross-job dependencies

Key changes:
- Add check_imex_needs_reload() function using nvidia-imex-ctl status
- Remove job-based decision file mechanism
- Maintain file content comparison for config updates
- Preserve concurrency protection with file and service locks
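
A minimal sketch of what check_imex_needs_reload() could look like, based only on the commands named above; the rest of the structure is an assumption:

# Sketch only: built around the commands named in the commit message.
check_imex_needs_reload() {
  local _expected_ips="$1" _imex_config_file="$2"
  local imex_status_output current_imex_ips

  # If IMEX cannot be queried at all, err on the side of reloading
  if ! imex_status_output=$(timeout 15 /usr/bin/nvidia-imex-ctl -N -j -c "${_imex_config_file}" 2>/dev/null); then
    return 0   # returning 0 means "reload needed"
  fi

  # .nodes is an object keyed by node ID, not an array (see review below)
  current_imex_ips=$(echo "${imex_status_output}" \
    | jq -r '.nodes | to_entries[].value.host' 2>/dev/null | sort | tr '\n' ' ')
  _expected_ips=$(echo "${_expected_ips}" | tr ' ' '\n' | sort | tr '\n' ' ')

  # Reload iff the configured IPs differ from the expected cluster IPs
  [[ "${current_imex_ips}" != "${_expected_ips}" ]]
}
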
info "Sleeping ${WAIT_TIME_TO_STABILIZE} seconds to let IMEX stabilize"
sleep ${WAIT_TIME_TO_STABILIZE}
info "Checking IMEX nodes config ${IMEX_NODES_CONFIG}"
if write_file "${IMEX_NODES_CONFIG}" "${IPS_FROM_CR}"; then
Contributor:

If write_file succeeded, then why are we checking whether IMEX needs a reload or not?

write_file will return success only if it wrote the file, right?

Contributor (Author):

write_file will return success only if it wrote the file, right?

Yes.

If write_file succeeded, then why are we checking whether IMEX needs a reload or not?

I understand your point. I cannot think of a case where the config needs an update but IMEX does not.
This situation shouldn't occur in theory: if the config needs to be updated, the IP address list has changed, and the IMEX status check will also detect a mismatch.

But we should still rely on this check-and-reload mechanism as a double check; I can't think of a definite downside to using it.

timeout ${IMEX_START_TIMEOUT} systemctl start ${IMEX_SERVICE}
info "Attempting to acquire service lock for IMEX reload"

# Use a global lock to prevent concurrent service operations
Contributor:

If we decide to support only exclusive jobs, then we don't need a global lock. For now let's keep this.

Contributor:

I changed my mind! I think we should stick to the boundary of exclusive jobs, so let's remove this.

I would suggest you keep a backup of your code, though!

Contributor:

+1

Contributor (Author):

Done

…service is down

- Add systemctl is-active check before IP comparison
- Ensures IMEX service restarts when configured correctly but not running
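
Roughly, the added guard could look like this (a sketch; its exact placement and log text are assumptions):

# Sketch only: exact placement and log text are assumptions.
if ! systemctl is-active --quiet "${IMEX_SERVICE}"; then
  info "IMEX service is not running, reload needed"
  return 0   # reload even if the configured IPs already match
fi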

# Get current IMEX status
local imex_status_output
if ! imex_status_output=$(timeout 30 /usr/bin/nvidia-imex-ctl -N -j -c "${_imex_config_file}" 2>/dev/null); then
Contributor:

My hesitation about using this command line is that if the nvidia-imex service goes down or has an incorrect configuration, I don't think we can rely on nvidia-imex-ctl.

Contributor (Author):

Let's ignore it in this PR. Merge it and test.

if write_file "${IMEX_NODES_CONFIG}" "${IPS_FROM_CR}"; then
info "IMEX nodes config updated, checking if reload is needed"
if check_imex_needs_reload "${IPS_FROM_CR}" "${IMEX_MAIN_CONFIG}"; then
info "IMEX reload needed, restarting service"
Contributor:

Can we combine this code into a single function? Except for a log message, it does the same thing.

info "IMEX nodes config unchanged, checking if reload is still needed"
    if check_imex_needs_reload "${IPS_FROM_CR}" "${IMEX_MAIN_CONFIG}"; then
      info "IMEX reload needed despite unchanged config, restarting service"
      if reload_imex; then
        info "Sleeping ${WAIT_TIME_TO_STABILIZE} seconds to let IMEX stabilize"
        sleep ${WAIT_TIME_TO_STABILIZE}
      else
        error "Failed to reload IMEX service"
      fi

Contributor:

+1

Contributor (Author):

Done

else
info "IMEX already configured correctly, skipping reload"
fi
else
Contributor:

The else is the same as the if; what is the point of the if condition then?

Contributor (Author):

Yes, done combining them


# Parse JSON to extract current IPs from IMEX status
local current_imex_ips
if ! current_imex_ips=$(echo "${imex_status_output}" | jq -r '.nodes[].host' 2>/dev/null | sort | tr '\n' ' '); then
Contributor:

Double check this: it does not seem aligned with the output schema (it is an array query but the schema is a dictionary). Example:

"nodes": {
      "0": {
       "status": "READY",
       "host": "192.168.103.159",
       "connections": {
        "1": {
         "host": "192.168.107.187",
         "status": "CONNECTED",
         "changed": true
        },

Contributor (Author):

Thanks for pointing that out! Fixed.

@@ -13,12 +13,16 @@ WAIT_TIME_TO_STABILIZE=30
ALLOWED_INSTANCE_TYPES="^(p6e-gb200|g5g)"
IMEX_SERVICE="nvidia-imex"

# Global service lock to prevent concurrent service operations
SERVICE_LOCK_FILE="/var/lock/nvidia-imex-service.lock"
SERVICE_LOCK_TIMEOUT=120
Contributor:

Two minutes seems too long for a lock that interacts with IMEX. How did you come up with this time?

Contributor (Author):

Removed as the service lock is removed

if write_file "${IMEX_NODES_CONFIG}" "${IPS_FROM_CR}"; then
info "IMEX nodes config updated, checking if reload is needed"
if check_imex_needs_reload "${IPS_FROM_CR}" "${IMEX_MAIN_CONFIG}"; then
info "IMEX reload needed, restarting service"
Contributor:

+1

timeout ${IMEX_START_TIMEOUT} systemctl start ${IMEX_SERVICE}
info "Attempting to acquire service lock for IMEX reload"

# Use a global lock to prevent concurrent service operations
Contributor:

+1

hehe7318 and others added 2 commits September 11, 2025 13:16
- Extract duplicate IMEX reload logic into handle_imex_reload function
- Preserve all original log messages for different scenarios
- Fix jq command to correctly parse IMEX status JSON structure
  (.nodes[].host -> .nodes | to_entries[].value.host)
- Reduce nvidia-imex-ctl timeout from 30s to 15s for faster response

The nodes field in IMEX status JSON is an object with node IDs as keys,
not an array, so the previous jq expression was incorrect.
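
A minimal sketch of the extracted function, reconstructed from the snippets quoted earlier in the thread; the exact signature and variable scoping are assumptions:

# Sketch only: reconstructed from the quoted snippets; signature is assumed.
handle_imex_reload() {
  local reload_reason="$1"   # scenario-specific log message
  if check_imex_needs_reload "${IPS_FROM_CR}" "${IMEX_MAIN_CONFIG}"; then
    info "${reload_reason}, restarting service"
    if reload_imex; then
      info "Sleeping ${WAIT_TIME_TO_STABILIZE} seconds to let IMEX stabilize"
      sleep ${WAIT_TIME_TO_STABILIZE}
    else
      error "Failed to reload IMEX service"
    fi
  else
    info "IMEX already configured correctly, skipping reload"
  fi
}
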
@himani2411 (Contributor):

LGTM!

@hehe7318 hehe7318 enabled auto-merge (squash) September 12, 2025 17:30
@hehe7318 hehe7318 merged commit 00a2656 into aws:develop Sep 12, 2025
24 checks passed