
Conversation

@hehe7318 (Contributor) commented Sep 4, 2025

Description of changes

Enhance the prolog script to safely handle multiple concurrent job executions on the same compute node by implementing comprehensive concurrency control mechanisms.

  • Add config content checking to skip unnecessary service restarts when the running IMEX service is already configured with the correct IP addresses. The new mechanism (sketched after this list) is:
    • A Slurm job is submitted and the prolog executes on several compute nodes.
    • Each node acquires the lock file and checks whether the IMEX config file content matches the expected content.
    • On each node, if the IMEX service is not running, reload (start) it.
    • Each node runs nvidia-imex-ctl to get the currently configured IP address list and compares it with the expected list. A mismatch means some nodes were replaced, so IMEX is reloaded.
  • Enhance logging with PID and JOB ID for better debugging
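
A rough sketch of that flow, assuming illustrative names (the get_expected_ips helper and the lock path are not taken from the PR's actual code):

# Sketch only: helper names and paths are assumptions, not the PR's code.
(
  flock --timeout 60 9 || exit 1            # serialize prolog runs on this node

  expected_ips="$(get_expected_ips)"        # hypothetical helper

  # Reload (start) IMEX if it is not running
  systemctl is-active --quiet nvidia-imex || systemctl start nvidia-imex

  # Compare the currently configured IPs with the expected list
  current_ips="$(/usr/bin/nvidia-imex-ctl -N -j -c "${IMEX_MAIN_CONFIG}" \
    | jq -r '.nodes | to_entries[].value.host' | sort | tr '\n' ' ')"
  if [[ "${current_ips}" != "${expected_ips}" ]]; then
    systemctl restart nvidia-imex           # a mismatch means nodes were replaced
  fi
) 9>/var/lock/nvidia-imex-prolog.lock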

Tests

  • Basic concurrent execution (3 jobs)
  • Forced concurrent writes with different configurations
  • Service restart serialization under high concurrency
  • Real cluster IP handling with forced restarts
  • Integration tests

All tests passed successfully, confirming the script is production-ready for concurrent Slurm job execution environments.

Checklist

  • Make sure you are pointing to the right branch.
  • If you're creating a patch for a branch other than develop, add the branch name as a prefix in the PR title (e.g. [release-3.6]).
  • Check all commits' messages are clear, describing what and why vs how.
  • Make sure to have added unit tests or integration tests to cover the new/modified code.
  • Check if documentation is impacted by this change.

Please review the guidelines for contributing and Pull Request Instructions.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

This commit enhances the prolog script to safely handle multiple concurrent
job executions on the same compute node by implementing comprehensive
concurrency control mechanisms.

Changes:
- Add file locking with retry mechanism and exponential backoff to prevent
  concurrent writes to IMEX config file
- Implement global service lock to serialize nvidia-imex service restarts
  and avoid systemd conflicts
- Add process-specific logging with lock-protected merge to prevent log
  file corruption from concurrent writes
- Enhance logging with PID and JOB ID for better debugging of concurrent
  executions
- Add intelligent config content checking to skip unnecessary service
  restarts when configuration is already correct
- Implement stale lock cleanup mechanism to handle orphaned lock files
  from abnormally terminated processes

Technical details:
- File lock timeout: 60 seconds with 3 retries
- Service lock timeout: 120 seconds
- Exponential backoff: 2s, 4s, 6s between retries
- Stale lock cleanup: Remove locks older than 90 seconds
- Process-specific log files: /var/log/parallelcluster/nvidia-imex-prolog.log.{JOB_ID}.{PID}
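
A minimal sketch of how those parameters could fit together, assuming a lock path and helper name that are illustrative rather than taken from the script:

# Sketch only: the lock path and function name are assumptions, not the PR's code.
LOCK_FILE="/var/lock/nvidia-imex-config.lock"
MAX_RETRIES=3
LOCK_TIMEOUT=60      # seconds, per the file lock timeout above
STALE_AGE=90         # seconds, per the stale lock cleanup above

acquire_config_lock() {
  local attempt
  for attempt in $(seq 1 "${MAX_RETRIES}"); do
    # Stale lock cleanup: remove lock files older than STALE_AGE seconds,
    # left behind by abnormally terminated processes
    if [[ -f "${LOCK_FILE}" ]] && \
       (( $(date +%s) - $(stat -c %Y "${LOCK_FILE}") > STALE_AGE )); then
      rm -f "${LOCK_FILE}"
    fi
    exec 9>"${LOCK_FILE}"
    flock --timeout "${LOCK_TIMEOUT}" 9 && return 0
    sleep $(( attempt * 2 ))   # backoff between retries: 2s, 4s, 6s
  done
  return 1
}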

Tested scenarios:
- Basic concurrent execution (3 jobs)
- Forced concurrent writes with different configurations
- Service restart serialization under high concurrency
- Real cluster IP handling with forced restarts

All tests passed successfully, confirming the script is production-ready
for concurrent Slurm job execution environments.
@hehe7318 hehe7318 requested review from a team as code owners September 4, 2025 21:47
@hehe7318 hehe7318 added skip-changelog-update Disables the check that enforces changelog updates in PRs 3.x labels Sep 4, 2025
…tatus

Problem:
- GB200 integration test failed with IMEX status 'DEGRADED' instead of 'UP'
- Root cause: the mechanism that checks the IMEX config content and skips the IMEX reload when the content is unchanged caused inconsistent behavior across nodes
- First node would update config and restart IMEX, subsequent nodes would skip restart
- Result: partial IMEX initialization leading to cluster degradation

Solution:
- Implement job-based global decision mechanism using decision files
- Ensure all nodes in same job execute identical operations (all skip or all restart)
- Maintain performance benefits of intelligent skipping for unchanged configurations
- Add proper concurrency protection for multi-job scenarios on same node

Key changes:
- Use job-specific decision files: .decision.
- First node makes decision, subsequent nodes follow same decision
- Atomic decision making with file locking
- Automatic cleanup of decision files after job completion
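
A minimal sketch of the decision-file idea (this mechanism is replaced in a later commit); the file naming, lock path, and shared location are assumptions:

# Sketch only: names and paths are assumptions, not the PR's code.
# The decision file is assumed to live on storage visible to all nodes
# in the job, so the first node's decision reaches the others.
DECISION_FILE="/shared/imex-prolog-${SLURM_JOB_ID}.decision"

exec 9>"${DECISION_FILE}.lock"
flock --timeout 60 9                       # atomic decision making
if [[ -f "${DECISION_FILE}" ]]; then
  decision="$(cat "${DECISION_FILE}")"     # follow the first node's decision
else
  decision="restart"                       # first node decides (e.g. config changed)
  echo "${decision}" > "${DECISION_FILE}"
fi
flock -u 9

[[ "${decision}" == "restart" ]] && systemctl restart nvidia-imex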

codecov bot commented Sep 5, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 90.27%. Comparing base (40b07ca) to head (528923e).
⚠️ Report is 14 commits behind head on develop.

Additional details and impacted files
@@           Coverage Diff            @@
##           develop    #6964   +/-   ##
========================================
  Coverage    90.27%   90.27%           
========================================
  Files          182      182           
  Lines        16466    16466           
========================================
  Hits         14864    14864           
  Misses        1602     1602           
Flag Coverage Δ
unittests 90.27% <ø> (ø)


Problem:
- The previous decision file mechanism caused cross-job interference
- Job 4 would skip the IMEX reload after Job 3 updated the shared config file
- This led to inconsistent IMEX cluster state across different job executions

Solution:
- Replace decision file mechanism with nvidia-imex-ctl status checking
- Use 'nvidia-imex-ctl -N -j -c' to query current IMEX service configuration
- Compare current IMEX IPs with expected cluster IPs to determine reload necessity
- Each job independently evaluates IMEX status without cross-job dependencies

Key changes:
- Add check_imex_needs_reload() function using nvidia-imex-ctl status
- Remove job-based decision file mechanism
- Maintain file content comparison for config updates
- Preserve concurrency protection with file and service locks
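
A minimal sketch of what check_imex_needs_reload() could look like, based only on the commands named above; the rest of the structure is an assumption:

# Sketch only: built around the commands named in the commit message.
check_imex_needs_reload() {
  local _expected_ips="$1" _imex_config_file="$2"
  local imex_status_output current_imex_ips

  # If IMEX cannot be queried at all, err on the side of reloading
  if ! imex_status_output=$(timeout 15 /usr/bin/nvidia-imex-ctl -N -j -c "${_imex_config_file}" 2>/dev/null); then
    return 0   # returning 0 means "reload needed"
  fi

  # .nodes is an object keyed by node ID, not an array (see review below)
  current_imex_ips=$(echo "${imex_status_output}" \
    | jq -r '.nodes | to_entries[].value.host' 2>/dev/null | sort | tr '\n' ' ')
  _expected_ips=$(echo "${_expected_ips}" | tr ' ' '\n' | sort | tr '\n' ' ')

  # Reload iff the configured IPs differ from the expected cluster IPs
  [[ "${current_imex_ips}" != "${_expected_ips}" ]]
}
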
info "Sleeping ${WAIT_TIME_TO_STABILIZE} seconds to let IMEX stabilize"
sleep ${WAIT_TIME_TO_STABILIZE}
info "Checking IMEX nodes config ${IMEX_NODES_CONFIG}"
if write_file "${IMEX_NODES_CONFIG}" "${IPS_FROM_CR}"; then
Contributor:

If write_file succeeded, then why are we checking whether IMEX needs a reload or not?

write_file will return success only if it wrote the file, right?

Contributor (Author):

write_file will return success only if it wrote the file, right?

Yes.

If write_file succeeded, then why are we checking whether IMEX needs a reload or not?

I understand your point. I cannot think of a case where the config needs an update but IMEX does not.
This situation shouldn't occur in theory: if the config needs to be updated, the IP address list has changed, and the IMEX status check will also detect a mismatch.

But we should still rely on this check-and-reload mechanism as a double check; I can't think of a definite downside to using it.

timeout ${IMEX_START_TIMEOUT} systemctl start ${IMEX_SERVICE}
info "Attempting to acquire service lock for IMEX reload"

# Use a global lock to prevent concurrent service operations
Contributor:

If we decide to support only exclusive jobs, then we don't need a global lock. For now let's keep this.

Contributor:

I changed my mind! I think we should stick to the boundary of exclusive jobs, so let's remove this.

I would suggest you keep a backup of your code, though!

Contributor:

+1

Contributor (Author):

Done

…service is down

- Add systemctl is-active check before IP comparison
- Ensures IMEX service restarts when configured correctly but not running
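
Roughly, the added guard could look like this (a sketch; its exact placement and log text are assumptions):

# Sketch only: exact placement and log text are assumptions.
if ! systemctl is-active --quiet "${IMEX_SERVICE}"; then
  info "IMEX service is not running, reload needed"
  return 0   # reload even if the configured IPs already match
fi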

# Get current IMEX status
local imex_status_output
if ! imex_status_output=$(timeout 30 /usr/bin/nvidia-imex-ctl -N -j -c "${_imex_config_file}" 2>/dev/null); then
Contributor:

My hesitation about using this command line is that if the nvidia-imex service goes down or has an incorrect configuration, I don't think we can rely on nvidia-imex-ctl.

Contributor (Author):

Let's ignore it in this PR. Merge it and test.

if write_file "${IMEX_NODES_CONFIG}" "${IPS_FROM_CR}"; then
info "IMEX nodes config updated, checking if reload is needed"
if check_imex_needs_reload "${IPS_FROM_CR}" "${IMEX_MAIN_CONFIG}"; then
info "IMEX reload needed, restarting service"
Contributor:

Can we combine this code into a single function? Except for a log message, it does the same thing.

info "IMEX nodes config unchanged, checking if reload is still needed"
    if check_imex_needs_reload "${IPS_FROM_CR}" "${IMEX_MAIN_CONFIG}"; then
      info "IMEX reload needed despite unchanged config, restarting service"
      if reload_imex; then
        info "Sleeping ${WAIT_TIME_TO_STABILIZE} seconds to let IMEX stabilize"
        sleep ${WAIT_TIME_TO_STABILIZE}
      else
        error "Failed to reload IMEX service"
      fi

Contributor:

+1

Contributor (Author):

Done

else
info "IMEX already configured correctly, skipping reload"
fi
else
Contributor:

The else is the same as the if; what is the point of the if condition then?

Contributor (Author):

Yes, done combining them


# Parse JSON to extract current IPs from IMEX status
local current_imex_ips
if ! current_imex_ips=$(echo "${imex_status_output}" | jq -r '.nodes[].host' 2>/dev/null | sort | tr '\n' ' '); then
Contributor:

Double check this: it does not seem aligned with the output schema (it is an array query but the schema is a dictionary). Example:

"nodes": {
      "0": {
       "status": "READY",
       "host": "192.168.103.159",
       "connections": {
        "1": {
         "host": "192.168.107.187",
         "status": "CONNECTED",
         "changed": true
        },

Contributor (Author):

Thanks for pointing that out! Fixed.

@@ -13,12 +13,16 @@ WAIT_TIME_TO_STABILIZE=30
ALLOWED_INSTANCE_TYPES="^(p6e-gb200|g5g)"
IMEX_SERVICE="nvidia-imex"

# Global service lock to prevent concurrent service operations
SERVICE_LOCK_FILE="/var/lock/nvidia-imex-service.lock"
SERVICE_LOCK_TIMEOUT=120
Contributor:

Two minutes seems too long for a lock that interacts with IMEX. How did you come up with this time?

Contributor (Author):

Removed as the service lock is removed

if write_file "${IMEX_NODES_CONFIG}" "${IPS_FROM_CR}"; then
info "IMEX nodes config updated, checking if reload is needed"
if check_imex_needs_reload "${IPS_FROM_CR}" "${IMEX_MAIN_CONFIG}"; then
info "IMEX reload needed, restarting service"
Contributor:

+1

timeout ${IMEX_START_TIMEOUT} systemctl start ${IMEX_SERVICE}
info "Attempting to acquire service lock for IMEX reload"

# Use a global lock to prevent concurrent service operations
Contributor:

+1

hehe7318 and others added 2 commits September 11, 2025 13:16
- Extract duplicate IMEX reload logic into handle_imex_reload function
- Preserve all original log messages for different scenarios
- Fix jq command to correctly parse IMEX status JSON structure
  (.nodes[].host -> .nodes | to_entries[].value.host)
- Reduce nvidia-imex-ctl timeout from 30s to 15s for faster response

The nodes field in IMEX status JSON is an object with node IDs as keys,
not an array, so the previous jq expression was incorrect.
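
A minimal sketch of the extracted function, reconstructed from the snippets quoted earlier in the thread; the exact signature and variable scoping are assumptions:

# Sketch only: reconstructed from the quoted snippets; signature is assumed.
handle_imex_reload() {
  local reload_reason="$1"   # scenario-specific log message
  if check_imex_needs_reload "${IPS_FROM_CR}" "${IMEX_MAIN_CONFIG}"; then
    info "${reload_reason}, restarting service"
    if reload_imex; then
      info "Sleeping ${WAIT_TIME_TO_STABILIZE} seconds to let IMEX stabilize"
      sleep ${WAIT_TIME_TO_STABILIZE}
    else
      error "Failed to reload IMEX service"
    fi
  else
    info "IMEX already configured correctly, skipping reload"
  fi
}
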
@himani2411 (Contributor):

LGTM!

@hehe7318 hehe7318 enabled auto-merge (squash) September 12, 2025 17:30
@hehe7318 hehe7318 merged commit 00a2656 into aws:develop Sep 12, 2025
24 checks passed