avoid save_to_memory running too fast then causing save_to_storage to fail #1649
Conversation
Codecov Report
❌ Patch coverage is 0.00%. Your patch check has failed because the patch coverage (0.00%) is below the target coverage (90.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #1649       +/-   ##
===========================================
- Coverage   80.13%   15.09%   -65.04%
===========================================
  Files         228      228
  Lines       22213    22228       +15
===========================================
- Hits        17801     3356    -14445
- Misses       4412    18872    +14460

☔ View full report in Codecov by Sentry.
self._event_queue = SharedQueue(name=qname, create=True)
self._notify_queues = []
for i in range(self.local_shard_num):
    self._notify_queues.append(
use a function to generate queue name
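One possible shape for this suggestion, as a rough sketch: a single helper builds every shared-queue name so the event queue and the per-shard notify queues cannot drift apart. The helper name and the name format are assumptions for illustration, not the actual dlrover code.

```python
def _shared_queue_name(prefix: str, local_rank: int) -> str:
    """Build a deterministic name for a per-rank shared queue (hypothetical helper)."""
    return f"{prefix}_{local_rank}"


# Sketch of how the constructor above could use it:
# self._event_queue = SharedQueue(name=_shared_queue_name("event", 0), create=True)
# self._notify_queues = [
#     SharedQueue(name=_shared_queue_name("notify", i), create=True)
#     for i in range(self.local_shard_num)
# ]
```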
 @timer
-def save_to_memory(self, step, state_dict, paths: Dict[str, str]):
+def _save_to_memory(self, step, state_dict, paths: Dict[str, str]):
This should not be modified, as it is inherently a public abstract method.
["model_states", "optim_states"] of the state dict and | ||
the value is the path of storage to save. | ||
""" | ||
if self._checkpoint_event_step > 0: |
This should be a basic implementation provided by the parent class.
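A sketch of one way to read this suggestion: keep save_to_memory public and abstract (per the comment above) and put the wait-for-saver check in a single base-class helper so subclasses do not each duplicate it. The class name and the _wait_saver_notification helper are assumptions for illustration, not the actual dlrover implementation.

```python
from abc import ABC, abstractmethod
from typing import Dict


class CheckpointEngineBase(ABC):
    def __init__(self):
        # Step of the last checkpoint handed to the background saver.
        self._checkpoint_event_step = 0

    def _wait_saver_notification(self):
        """Block until the saver reports the previous buffer was written to storage."""
        if self._checkpoint_event_step > 0:
            # In the real code this would consume this rank's notify queue,
            # e.g. self._notify_queue.get().
            pass

    @abstractmethod
    def save_to_memory(self, step, state_dict, paths: Dict[str, str]):
        """Copy the state dict into shared memory.

        Implementations should call self._wait_saver_notification() before
        touching the shared buffer.
        """
```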
@dataclass
class CheckpointNotifyEvent:
This is not used?
["model_states", "optim_states"] of the state dict and | ||
the value is the path of storage to save. | ||
""" | ||
if self._checkpoint_event_step > 0: |
Also, I don't get the point: why not use 'last_step' instead of '_checkpoint_event_step'?
What changes were proposed in this pull request?
This PR is based on #1585.
Previously, a save_to_memory call could overwrite a shared memory buffer while the background saver process was still reading from it for a save_to_storage operation. To prevent this, a notification queue was introduced in #1585, where the saver process notifies the training process once the buffer is free.
However, the implementation in #1585 only had rank 0 consume this notification. This caused all non-rank-0 processes to hang indefinitely on subsequent checkpoints, as they never consumed the event from their corresponding notification queue.
This patch fixes the issue by ensuring that every rank waits for and consumes the notification from its dedicated queue before proceeding with the next save_to_memory operation. This guarantees that the shared memory buffer is not written to prematurely.
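A minimal, self-contained sketch of the handshake this patch enforces, using multiprocessing.Queue as a stand-in for dlrover's SharedQueue. The function names and the event payload are illustrative assumptions; the point is only the ordering: the saver notifies every local rank, and each rank drains its own queue before reusing the shared buffer.

```python
import multiprocessing as mp


def notify_all_ranks(notify_queues, finished_step: int):
    # Runs in the background saver: after save_to_storage has finished reading
    # the shared-memory buffer, notify every local rank, not only rank 0.
    for q in notify_queues:
        q.put(finished_step)


def save_to_memory(local_rank: int, notify_queues, last_step: int, step: int):
    if last_step > 0:
        # Every rank consumes from its dedicated queue; in #1585 only rank 0
        # did this, which left the other ranks hanging on later checkpoints.
        notify_queues[local_rank].get()
    # The shared-memory buffer for `step` can now be overwritten safely.


if __name__ == "__main__":
    queues = [mp.Queue() for _ in range(2)]
    notify_all_ranks(queues, finished_step=100)
    for rank in range(2):
        save_to_memory(rank, queues, last_step=100, step=200)
```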
Why are the changes needed?
Without proper synchronization across all ranks, tasks would hang during checkpointing, especially when checkpoints were triggered in rapid succession. This fix ensures the reliability of the asynchronous saving process by correctly coordinating the main training processes and the background saver.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Verified in live training jobs. The race condition is resolved and the task no longer hangs.