[Develop][GB200] Add concurrent control mechanisms to nvidia-imex prolog script #6964
Conversation
This commit enhances the prolog script to safely handle multiple concurrent job executions on the same compute node by implementing comprehensive concurrency control mechanisms.

Changes:
- Add file locking with retry mechanism and exponential backoff to prevent concurrent writes to IMEX config file
- Implement global service lock to serialize nvidia-imex service restarts and avoid systemd conflicts
- Add process-specific logging with lock-protected merge to prevent log file corruption from concurrent writes
- Enhance logging with PID and JOB ID for better debugging of concurrent executions
- Add intelligent config content checking to skip unnecessary service restarts when configuration is already correct
- Implement stale lock cleanup mechanism to handle orphaned lock files from abnormally terminated processes

Technical details:
- File lock timeout: 60 seconds with 3 retries
- Service lock timeout: 120 seconds
- Exponential backoff: 2s, 4s, 6s between retries
- Stale lock cleanup: remove locks older than 90 seconds
- Process-specific log files: /var/log/parallelcluster/nvidia-imex-prolog.log.{JOB_ID}.{PID}

Tested scenarios:
- Basic concurrent execution (3 jobs)
- Forced concurrent writes with different configurations
- Service restart serialization under high concurrency
- Real cluster IP handling with forced restarts

All tests passed successfully, confirming the script is production-ready for concurrent Slurm job execution environments.
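As a rough sketch of the file-locking pattern described above (the helper name write_with_lock and the config lock path are illustrative assumptions, not the exact code in this PR):

```bash
#!/bin/bash
# Sketch only: guard writes to the IMEX nodes config with flock, retrying
# with increasing backoff (2s, 4s, 6s) as described in the commit message.
CONFIG_LOCK_FILE="/var/lock/nvidia-imex-config.lock"   # illustrative path
FILE_LOCK_TIMEOUT=60
MAX_RETRIES=3

write_with_lock() {
    local target_file="$1" content="$2" attempt
    for attempt in $(seq 1 "${MAX_RETRIES}"); do
        (
            # Wait up to FILE_LOCK_TIMEOUT seconds for an exclusive lock on fd 200
            flock --timeout "${FILE_LOCK_TIMEOUT}" 200 || exit 1
            echo "${content}" > "${target_file}"
        ) 200>"${CONFIG_LOCK_FILE}" && return 0
        # Backoff between retries: 2s, 4s, 6s
        sleep $((attempt * 2))
    done
    return 1
}
```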
tests/integration-tests/tests/ultraserver/test_gb200/test_gb200/91_nvidia_imex_prolog.sh
…tatus

Problem:
- GB200 integration test failed with IMEX status 'DEGRADED' instead of 'UP'
- Root cause: the "check IMEX config content and skip the IMEX reload if unchanged" mechanism caused inconsistent behavior across nodes
- First node would update config and restart IMEX, subsequent nodes would skip restart
- Result: partial IMEX initialization leading to cluster degradation

Solution:
- Implement job-based global decision mechanism using decision files
- Ensure all nodes in same job execute identical operations (all skip or all restart)
- Maintain performance benefits of intelligent skipping for unchanged configurations
- Add proper concurrency protection for multi-job scenarios on same node

Key changes:
- Use job-specific decision files: .decision.
- First node makes decision, subsequent nodes follow same decision
- Atomic decision making with file locking
- Automatic cleanup of decision files after job completion
Codecov Report
✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:
@@ Coverage Diff @@
## develop #6964 +/- ##
========================================
Coverage 90.27% 90.27%
========================================
Files 182 182
Lines 16466 16466
========================================
Hits 14864 14864
Misses 1602 1602
Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
Problem:
- Previous decision file mechanism caused cross-job interference
- Job 4 would skip IMEX reload after Job 3 updated the shared config file
- Led to inconsistent IMEX cluster state across different job executions

Solution:
- Replace decision file mechanism with nvidia-imex-ctl status checking
- Use 'nvidia-imex-ctl -N -j -c' to query the current IMEX service configuration
- Compare current IMEX IPs with expected cluster IPs to determine reload necessity
- Each job independently evaluates IMEX status without cross-job dependencies

Key changes:
- Add check_imex_needs_reload() function using nvidia-imex-ctl status
- Remove job-based decision file mechanism
- Maintain file content comparison for config updates
- Preserve concurrency protection with file and service locks
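As a minimal sketch of what this check could look like (the function body is an assumption, and it already uses the corrected jq expression adopted later in this thread rather than the code as committed here):

```bash
# Sketch only: decide whether IMEX must be reloaded by comparing the IPs
# reported by nvidia-imex-ctl with the expected cluster IPs.
check_imex_needs_reload() {
    local expected_ips="$1" imex_config_file="$2"
    local status_json current_ips expected_sorted

    # Query current IMEX status as JSON; if the query fails, err on the side of reloading.
    if ! status_json=$(timeout 15 /usr/bin/nvidia-imex-ctl -N -j -c "${imex_config_file}" 2>/dev/null); then
        return 0   # reload needed
    fi

    # 'nodes' is an object keyed by node ID, so iterate over its entries.
    current_ips=$(echo "${status_json}" | jq -r '.nodes | to_entries[].value.host' 2>/dev/null | sort | tr '\n' ' ')
    expected_sorted=$(echo "${expected_ips}" | tr ' ' '\n' | sort | tr '\n' ' ')

    # Any mismatch means some nodes were replaced: a reload is needed (exit status 0).
    [ "${current_ips}" != "${expected_sorted}" ]
}
```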
info "Sleeping ${WAIT_TIME_TO_STABILIZE} seconds to let IMEX stabilize" | ||
sleep ${WAIT_TIME_TO_STABILIZE} | ||
info "Checking IMEX nodes config ${IMEX_NODES_CONFIG}" | ||
if write_file "${IMEX_NODES_CONFIG}" "${IPS_FROM_CR}"; then |
If write_file succeeded, then why are we checking if IMEX needs a reload or not?
write_file will return success only if it wrote the file, right?
write_file will return success only if it wrote the file, right?
Yes.
If write_file succeeded, then why are we checking if IMEX needs a reload or not?
I understand your point. I can't think of a case where the config needs to be updated but IMEX doesn't need a reload.
This situation shouldn't occur in theory. If the config needs to be updated, it means the IP address has changed, and the IMEX status check will also detect a mismatch.
But we should still rely on this check-and-reload mechanism as a double check; I can't think of a definite downside to using it.
timeout ${IMEX_START_TIMEOUT} systemctl start ${IMEX_SERVICE}
info "Attempting to acquire service lock for IMEX reload"

# Use a global lock to prevent concurrent service operations
If we decide on supporting only exclusive jobs, then we don't need a global lock. For now, let's keep this.
I changed my mind! I think we should stick to the boundary of exclusive jobs, so let's remove this.
I would suggest you keep a backup of your code, though!
+1
Done
…service is down
- Add systemctl is-active check before IP comparison
- Ensures IMEX service restarts when configured correctly but not running
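As a small illustrative sketch of the added guard (an assumption about where it sits — ahead of the IP comparison inside the reload check — not the literal diff):

```bash
# Sketch only: if the nvidia-imex service is not running, a restart is needed
# even when the configured IPs already match the expected ones.
imex_service_is_running() {
    systemctl is-active --quiet "${IMEX_SERVICE:-nvidia-imex}"
}

# Illustrative use inside check_imex_needs_reload():
#   imex_service_is_running || return 0   # reload needed
```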
# Get current IMEX status
local imex_status_output
if ! imex_status_output=$(timeout 30 /usr/bin/nvidia-imex-ctl -N -j -c "${_imex_config_file}" 2>/dev/null); then
My hesitation about using this command line is that if the nvidia-imex service goes down or has an incorrect configuration, I don't think we can rely on nvidia-imex-ctl.
Let's ignore it in this PR. Merge it and test.
if write_file "${IMEX_NODES_CONFIG}" "${IPS_FROM_CR}"; then
    info "IMEX nodes config updated, checking if reload is needed"
    if check_imex_needs_reload "${IPS_FROM_CR}" "${IMEX_MAIN_CONFIG}"; then
        info "IMEX reload needed, restarting service"
Can we combine this code into a single function? Except for a log message, it does the same thing.
info "IMEX nodes config unchanged, checking if reload is still needed"
if check_imex_needs_reload "${IPS_FROM_CR}" "${IMEX_MAIN_CONFIG}"; then
info "IMEX reload needed despite unchanged config, restarting service"
if reload_imex; then
info "Sleeping ${WAIT_TIME_TO_STABILIZE} seconds to let IMEX stabilize"
sleep ${WAIT_TIME_TO_STABILIZE}
else
error "Failed to reload IMEX service"
fi
+1
Done
else
    info "IMEX already configured correctly, skipping reload"
fi
else
The else branch is the same as the if branch; what is the point of the if condition then?
Yes, done combining them
# Parse JSON to extract current IPs from IMEX status
local current_imex_ips
if ! current_imex_ips=$(echo "${imex_status_output}" | jq -r '.nodes[].host' 2>/dev/null | sort | tr '\n' ' '); then
Double-check this; it does not seem aligned with the output schema (it is an array query, but the schema is a dictionary). Example:
"nodes": {
"0": {
"status": "READY",
"host": "192.168.103.159",
"connections": {
"1": {
"host": "192.168.107.187",
"status": "CONNECTED",
"changed": true
},
Thanks for pointing that out! Fixed.
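For reference, a hedged example of how the corrected expression walks the dictionary-shaped nodes object (sample data abbreviated from the schema in the comment above):

```bash
# Sketch only: 'nodes' is an object keyed by node ID, not an array, so
# to_entries is used to iterate its values and extract each host.
echo '{"nodes":{"0":{"status":"READY","host":"192.168.103.159"},"1":{"status":"READY","host":"192.168.107.187"}}}' \
  | jq -r '.nodes | to_entries[].value.host' | sort
# Prints:
#   192.168.103.159
#   192.168.107.187
```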
@@ -13,12 +13,16 @@ WAIT_TIME_TO_STABILIZE=30
ALLOWED_INSTANCE_TYPES="^(p6e-gb200|g5g)"
IMEX_SERVICE="nvidia-imex"

# Global service lock to prevent concurrent service operations
SERVICE_LOCK_FILE="/var/lock/nvidia-imex-service.lock"
SERVICE_LOCK_TIMEOUT=120 |
2 minutes seems too long for a lock used to interact with IMEX. How did you come up with this time?
Removed, since the service lock itself has been removed.
if write_file "${IMEX_NODES_CONFIG}" "${IPS_FROM_CR}"; then
    info "IMEX nodes config updated, checking if reload is needed"
    if check_imex_needs_reload "${IPS_FROM_CR}" "${IMEX_MAIN_CONFIG}"; then
        info "IMEX reload needed, restarting service"
+1
timeout ${IMEX_START_TIMEOUT} systemctl start ${IMEX_SERVICE}
info "Attempting to acquire service lock for IMEX reload"

# Use a global lock to prevent concurrent service operations
+1
- Extract duplicate IMEX reload logic into handle_imex_reload function
- Preserve all original log messages for different scenarios
- Fix jq command to correctly parse IMEX status JSON structure (.nodes[].host -> .nodes | to_entries[].value.host)
- Reduce nvidia-imex-ctl timeout from 30s to 15s for faster response

The nodes field in the IMEX status JSON is an object with node IDs as keys, not an array, so the previous jq expression was incorrect.
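One possible shape for the extracted helper (a sketch assuming the branch-specific log message is passed in as a parameter; the actual handle_imex_reload in this PR may differ):

```bash
# Sketch only: shared reload path used by both the "config updated" and
# "config unchanged" branches, with the branch-specific message passed in.
handle_imex_reload() {
    local reload_msg="$1"
    if check_imex_needs_reload "${IPS_FROM_CR}" "${IMEX_MAIN_CONFIG}"; then
        info "${reload_msg}"
        if reload_imex; then
            info "Sleeping ${WAIT_TIME_TO_STABILIZE} seconds to let IMEX stabilize"
            sleep "${WAIT_TIME_TO_STABILIZE}"
        else
            error "Failed to reload IMEX service"
        fi
    else
        info "IMEX already configured correctly, skipping reload"
    fi
}
```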
LGTM!
Description of changes
Enhance the prolog script to safely handle multiple concurrent job executions on the same compute node by implementing comprehensive concurrency control mechanisms.
Use nvidia-imex-ctl to get the currently configured IP address list and compare it with the expected list. If they differ, it means some nodes were replaced, so reload IMEX.

Tests
All tests passed successfully, confirming the script is production-ready for concurrent Slurm job execution environments.
Checklist
- If the PR targets a branch other than develop, add the branch name as prefix in the PR title (e.g. [release-3.6]).

Please review the guidelines for contributing and the Pull Request Instructions.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.