-
Notifications
You must be signed in to change notification settings - Fork 462
add filestore fix for multi-learn node augsta checkpointing #1198
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Summary of ChangesHello @saurabh111233212, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request implements a critical fix to how checkpoint directories are constructed, specifically for multi-learn node setups. It ensures that all checkpoint paths are correctly routed to the Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request adds a fix for checkpointing on multi-learn nodes on the 'augsta' cluster by ensuring checkpoint paths are prefixed with /filestore. While the intention is correct, the implementation has a bug where it creates an invalid path for relative checkpoint_state_dir values (e.g., my_checkpoints becomes /filestoremy_checkpoints). I've suggested a fix to correctly handle both relative and absolute paths.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
| self.gs_checkpoint_state_dir = f"{self.gs_bucket_path}/{beaker_users}/{checkpoint_dir_name}" | ||
| else: | ||
| self.gs_checkpoint_state_dir = f"{self.gs_bucket_path}/{checkpoint_dir_name}" | ||
| if not checkpoint_dir_name.startswith("/filestore"): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wdyt about making this logic live in mason?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ooh, that's a good idea.
I'll make the change
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@finbarrtimbers hmm, actually, this is the logic for auto-setting checkpoint state dir.
should we move the whole block to mason? Basically, can't move just the filestore bit because the checkpoint state dir might not exist before grpo_fast is actually called
but we could move the entire logic of auto-setting checkpoint_state_dir to mason (the entire if statement in which the if not checkpoint_dir_name.startswith("/filestore") check lives under)
Note
Ensure
checkpoint_state_diris rewritten under/filestorewhengs_bucket_pathis set and the path isn’t already in filestore.gs_bucket_pathis provided (andgs_checkpoint_state_dirunset), rewritecheckpoint_state_dirto/filestore/...if it doesn't already start with/filestore, before GCS download/calibration.Written by Cursor Bugbot for commit 7942a4a. This will update automatically on new commits. Configure here.