-
Notifications
You must be signed in to change notification settings - Fork 133
Ability to rerun a job with checkpoints in Studio using datachain job run CLI command
#1554
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Deploying datachain with
|
| Latest commit: |
aa67b5f
|
| Status: | ✅ Deploy successful! |
| Preview URL: | https://92104e91.datachain-2g6.pages.dev |
| Branch Preview URL: | https://ilongin-12370-job-rerun-from-tiik.datachain-2g6.pages.dev |
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
| parent_job_id: str | None = None, | ||
| rerun_from_job_id: str | None = None, | ||
| run_group_id: str | None = None, | ||
| is_studio_copy: bool = False, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are there any better options for this name?
is_remote_jobis_saas_job- etc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking a lot about naming and the reason I chose is_studio_copy is that it's by default False in both Studio and CLI .. it's only True in rare cases in CLI when we copy job from Studio. It is also very expressive / verbose and those 2 will give us the least amount of mental overhead (and less migration as well).
is_remote_job and is_saas_job will both be True in Studio and False in CLI by default which is little bit of minus IMO.. otherwise they are all pretty much similar.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does is_studio_copy mean it is completely only in Studio or has_saas_copy or even has_studio_copy makes more sense?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But we do not copy job from Studio, we are running it remotely in Studio, this is why I am agains copy in the name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that makes sense too. How about, has_studio_run or is_in_saas?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dreadatour yes, but job is still created and ran in Studio. Locally we just save link / reference / copy to it.
Couple of other ideas:
is_studio_referenceis_executed_remotely
Maybe the second one since it's similar to your suggestion is_remote_job but it doesn't have "job" word in it since it's not needed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right! job is redundant. What do you think about simple is_remote flag? (executed is also not needed IMO)
Anyway, these options looks much better to me personally 👍🙏
…atachain-ai/datachain into ilongin/12370-job-rerun-from-cli-fixed
| script_path = os.path.abspath(query_file) | ||
|
|
||
| rerun_from_job_id = None | ||
| rerun_from_job = catalog.metastore.get_last_job_by_name( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are we fetching it from Studio here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or is it a local metastore / catalog in this case?
|
|
||
| @skip_if_not_sqlite | ||
| def test_studio_run_connect_to_previous_job(capsys, mocker, tmp_dir, studio_token): | ||
| """Test that job run sends rerun_from_job_id from local Studio job copy.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
description is not clear (AI generated) / cleanup, make test name clear enough, remove comment, cleanup other comments and noise in the test
shcheklein
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need docs update
This PR fixes job connection (correctly setting
rerun_job_idfield) when re-running same script multiple times from local to Studio usingdatachain job run my_script.pyThis was fixed by:
is_studio_copyboolean field to job.is_studio_copyis set to True). Job name is set to full script path (similar as we do when running jobs locally via python). Then on re-running we check for jobs that have same full script path as name and if it's copy from studio job and if we find job we sendrerun_from_job_idfield as well to Studio. Then Studio connects new job to previous one correctly and can use checkpoints etc.