Closed
Labels: feature request, p3-nice-to-have, research
Description
From our dvc.org/chat:
jurasan (Yesterday at 9:44 PM)
I found dvc works really well in my workflow. The only problem is that it requires some discipline when changing data files without dvc run.
E.g. we often experiment in a notebook, and when we want to commit we run the notebook again from the CLI with dvc run. But if I forget to do it, I can commit new code to git without updating the dvc checksums.
Now I am planning to add a git commit hook that will run the dvc status command, check if something changed (e.g. just look for the word 'changed'), cancel the commit, and prompt the user to fix the dvc files.
Do you think this is the right way to go about it, or is there an easier path?
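The hook idea above can be sketched roughly as below. This is a minimal illustration, not dvc's actual hook: it assumes `dvc status` prints the word "changed" when something is out of date (the grep heuristic jurasan suggests), and relies on git's rule that a non-zero exit from a pre-commit hook aborts the commit.

```python
import subprocess

def commit_should_be_blocked(status_output: str) -> bool:
    """Heuristic from the discussion: treat any 'changed' marker in
    `dvc status` output as a sign that the dvc files are out of date."""
    return "changed" in status_output.lower()

def pre_commit_check() -> int:
    """Return the exit code a pre-commit hook should use (assumes dvc is on PATH)."""
    result = subprocess.run(["dvc", "status"], capture_output=True, text=True)
    if commit_should_be_blocked(result.stdout):
        print("dvc status reports changes; please update your dvc files "
              "(e.g. re-run dvc run / dvc repro) before committing.")
        return 1  # non-zero exit makes git abort the commit
    return 0
```

Wiring this up means calling `pre_commit_check()` from a script installed as `.git/hooks/pre-commit`. Grepping for a keyword is fragile, which is why ivan suggests a return status or API would be a better check.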
ivan (Yesterday at 11:28 PM)
@jurasan A git hook should do the trick, though there should be a better way to check it (return status or API; @ruslan can give more info on this). A question for you, @jurasan: what OS are you (and your team) using? On some OSs you need to be careful when you update files without dvc run. I also don't like that in your workflow you basically have to run it again even though you already have the files to commit. We need to think a little about how to improve this workflow. Maybe a Python API?
ruslan (Yesterday at 11:32 PM)
I think the upcoming dvc commit should ideally be the way to go, but it is not ready yet. For now, a git commit hook seems like a viable workaround. (edited)
ivan (Yesterday at 11:39 PM)
I'm not sure dvc commit will help with this scenario. When you run something from a notebook, you have to remember to do two things with dvc: first, run dvc remove to clean the previous results; second, run dvc run to update the dvc files (third, run dvc commit in the upcoming version). It would add yet another step to this workflow. As far as I understand the question, it's more about avoiding all these manual steps.
ruslan (Yesterday at 11:53 PM)
We have a command that does all two (three) of those: it is dvc repro, which you can call in your code or in the git hook, if repro's limitations are not a problem.
I got the impression that @jurasan wants a friendly notifier that will say "hey, something changed in your pipeline. Would you like to run dvc repro?". Please correct me if I'm wrong.
Ah, looks like I didn't make it clear about dvc commit: it will also come with dvc install, which will install a git commit hook that notifies you when there are changes in the pipeline worth addressing. Sorry for not clarifying that right away.
October 22, 2018
jurasan (Today at 12:00 AM)
@ivan We use Ubuntu. We can do dvc run --no-exec if the process is too long and we are confident enough. What Python API are we talking about?
Of course we would like to avoid manual steps, but what is more important is to be notified when you forgot something and are about to commit bad stuff.
@ruslan Yes, that's what we want. We need as little discipline as we can afford.
@ruslan If dvc install makes the hook automatically, that would be great.
@ruslan Is this already planned, or just an idea?
ruslan (Today at 12:02 AM)
Ivan forgot to specify that it is the filesystem that matters, not the OS by itself. I suppose you are using ext4 and are not using a shared cache dir for the whole team on the machine?
jurasan (Today at 12:03 AM)
correct
ruslan (Today at 12:05 AM)
Currently dvc install only installs a post-checkout hook that automatically calls dvc checkout. I guess a pre-commit hook is something users would benefit from even without dvc commit; we should probably consider adding it right now instead of waiting for v1.0.
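An installer for such a pre-commit hook could look roughly like this. This is a hypothetical sketch of what a dvc install-style helper might do, not dvc's implementation; the hook body uses the grep heuristic from earlier in the chat.

```python
import os
import stat

# Shell body for the hook; aborts the commit if dvc reports changed stages.
HOOK_BODY = """#!/bin/sh
if dvc status | grep -q changed; then
    echo "dvc files are out of date; run dvc repro before committing." >&2
    exit 1
fi
"""

def install_pre_commit_hook(repo_root: str) -> str:
    """Write an executable pre-commit hook into .git/hooks and return its path."""
    hook_path = os.path.join(repo_root, ".git", "hooks", "pre-commit")
    os.makedirs(os.path.dirname(hook_path), exist_ok=True)
    with open(hook_path, "w") as f:
        f.write(HOOK_BODY)
    # git only runs hooks that are executable.
    os.chmod(hook_path, os.stat(hook_path).st_mode | stat.S_IXUSR)
    return hook_path
```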
jurasan (Today at 12:06 AM)
Should I create a ticket?
ruslan (Today at 12:07 AM)
@jurasan I'll add a ticket and send the patch itself in a moment, so you can try it out :smiley:
jurasan (Today at 12:08 AM)
BTW, what difference would a shared cache dir make in this case?
ruslan (Today at 12:09 AM)
Ah, right. So the reason @ivan asked about your OS is that dvc uses different methods of linking files from the cache to your workspace, depending on the combination of workspace filesystem and cache filesystem.
For example, if you are running Btrfs or XFS on your drive and both your workspace and your cache are located on the same drive, dvc will use reflinks.
^ or APFS on macOS, too.
But if you are using ext4 in the same scenario, dvc will use hardlinks.
If your cache and workspace are on different drives, dvc will use symlinks.
jurasan (Today at 12:11 AM)
So if you are using hardlinks and a shared cache, then dvc status will give a warning every time somebody else changes the cache?
ruslan (Today at 12:12 AM)
In the hardlink and symlink cases, you need to be careful not to modify workspace files (that were already saved to cache) without removing them first, because you run the risk of corrupting the cache file underneath (dvc will throw a warning at you and remove that cache file if that happens).
In the reflink case, you can safely modify your workspace files without running dvc remove first, because those are effectively copies of the cache (with an optimization underneath).
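The hardlink hazard described above can be demonstrated with plain Python. This is only an illustration of filesystem behavior, not dvc code: writing through either name of a hardlinked pair modifies the single shared inode, which is exactly how an in-place edit of a workspace file would corrupt the cache copy behind it.

```python
import os
import tempfile

def demonstrate_hardlink_aliasing() -> bool:
    """Show that modifying a file through one hardlink changes the other."""
    workdir = tempfile.mkdtemp()
    cache_file = os.path.join(workdir, "cache_copy")
    workspace_file = os.path.join(workdir, "workspace_copy")

    with open(cache_file, "w") as f:
        f.write("original data")
    os.link(cache_file, workspace_file)  # hardlink: same inode, two names

    # "Accidentally" edit the workspace file in place...
    with open(workspace_file, "w") as f:
        f.write("modified data")

    # ...and the cache copy is silently changed too.
    with open(cache_file) as f:
        return f.read() == "modified data"
```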
jurasan (Today at 12:13 AM)
So basically hardlinks + shared cache is a bad idea?
ruslan (Today at 12:14 AM)
> So if you are using hardlinks and a shared cache, then dvc status will give a warning every time somebody else changes the cache?
Only if someone has corrupted some cache files and dvc didn't remove them earlier.
Not really a bad idea. If you back up your cache, let's say with dvc push to your remote, you should be safe.
Though we do know about this problem and are preparing to introduce a "protected" default mode, where hardlinks and symlinks will be protected with read-only mode.
jurasan (Today at 12:16 AM)
But it can be very hard to understand what happened and where the error comes from.
ruslan (Today at 12:19 AM)
We understand that, but this is currently the best way to go in such scenarios. More and more filesystems support reflinks, which don't have this limitation. Also, if your files are not giant and you can take the overhead of copying your workspace files a single time, you can set dvc to use copies instead of *links.
E.g. dvc config cache.type copy.
jurasan (Today at 12:20 AM)
Yes, that's what I'm thinking of doing.
ruslan (Today at 12:20 AM)
How large are your files?
jurasan (Today at 12:20 AM)
3GB is the largest
ruslan (Today at 12:20 AM)
And I suppose you are using SSD?
jurasan (Today at 12:20 AM)
yep
ruslan (Today at 12:22 AM)
Sounds like a viable scenario to me. Also, if you would like to avoid copying, you can consider formatting your storage as Btrfs or XFS (the default for Red Hat/CentOS and similar), so dvc will be able to use reflinks.
@jurasan BTW, if we were to set the default link types to reflink, copy (so to use hardlink/symlink you would need to know what you are doing and be aware of the limitations, but at least we won't set RO on links), how would you look at that as a dvc user? (edited)
Clarification: dvc config cache.type reflink,copy will make dvc try only those two types of links, instead of the current reflink, hardlink, symlink, copy. (edited)
jurasan (Today at 12:28 AM)
I think we really need more real-world experience to understand which default is more appropriate.
ruslan (Today at 12:29 AM)
But currently, what is the main thing that you like in dvc? Is it the way it helps organize your projects, or is it something that helps avoid duplication? Or maybe something different? (edited)
jurasan (Today at 12:32 AM)
The biggest problem we had in a previous project was somebody changing an output data file that was used as an input to other notebooks. I really like how dvc helps keep this under control without being intrusive.
But I think the workflows described on the webpage are lacking notebooks. Most data scientists I know spend a lot of time in notebooks and don't want to extract scripts from there. What I do now, as I described before: before committing, I run the notebook with dvc run [params] jupyter nbconvert ....
ruslan (Today at 12:37 AM)
@jurasan Thanks for sharing! So if we were to publish a Python API, how do you see it being used in your notebook? Something like
def func(...):
....
dvc.run(func, arg1, arg2, arg3)
? (edited)
We currently have an internal Python API that is not ready for publishing, and it is more similar to the CLI, so it accepts commands and files as arguments. Something like dvc.run('cmd input output', deps=['input'], outs=['output']), which doesn't seem that useful in a notebook, since you would have to extract your functions into CLI scripts. (edited)
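The function-style API ivan sketches would, at minimum, have to run the function and then record checksums of the declared outputs, since that is what ends up in a .dvc file. A heavily simplified, hypothetical illustration (not dvc's internal API; the names here are invented, but md5-of-file-content is the kind of checksum the chat refers to):

```python
import hashlib
from typing import Callable, Dict, List

def file_md5(path: str) -> str:
    """md5 of a file's contents, the kind of checksum dvc stores."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def run_and_checksum(func: Callable[[], None], outs: List[str]) -> Dict[str, str]:
    """Run `func`, then return {output_path: md5} for its declared outputs."""
    func()
    return {out: file_md5(out) for out in outs}
```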
jurasan (Today at 12:49 AM)
I think we don't actually have to run it again with dvc. We just need to update the checksums. (dvc run --no-exec ???)
What I was thinking is that we need a way to declaratively enforce it. I don't know if notebook magics allow this kind of stuff.
E.g.
%dvc
%dvc-inputs ...
%dvc-outputs ...
Then when the code is run in the notebook, dvc is also called automatically to update the checksums.
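The %dvc magics proposed above don't exist; they are a hypothetical syntax. A first step toward them could be a small parser that extracts the declared inputs/outputs from a cell's source, sketched here purely as an illustration of that idea:

```python
from typing import Dict, List

def parse_dvc_magics(cell_source: str) -> Dict[str, List[str]]:
    """Collect files declared via the (hypothetical) %dvc-inputs /
    %dvc-outputs magic lines in a notebook cell."""
    decls: Dict[str, List[str]] = {"inputs": [], "outputs": []}
    for line in cell_source.splitlines():
        line = line.strip()
        if line.startswith("%dvc-inputs"):
            decls["inputs"] += line.split()[1:]
        elif line.startswith("%dvc-outputs"):
            decls["outputs"] += line.split()[1:]
    return decls
```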
ruslan (Today at 12:52 AM)
Hm, and what if you run with dvc run once and then use dvc repro to re-run, instead of dvc run --no-exec + running it yourself without dvc repro + updating checksums? (edited)
Enforce updating checksums, correct?
jurasan (Today at 12:56 AM)
The problem is I don't want to run the whole notebook all the time. It's common to run only part of a notebook and have errors in other parts, or some long-running cells that you don't want to run right now.
ruslan (Today at 12:56 AM)
BTW, I think we had a similar discussion with another user here: https://github.com/iterative/dvc/issues/919 . The result was that we decided to add dvc commit.
So does the notebook effectively consist of a bunch of dvc stages, or is it a single stage that you just want to run from the middle sometimes?
jurasan (Today at 1:00 AM)
Yeah, I see now that this kind of rechecking of checksums after each cell run doesn't solve the consistency problem.
There's still the possibility that the notebook was not executed fully.
It seems that the most useful things for now would be:
1) a commit hook to check for updated files;
2) maybe a way to declare inputs and outputs inside notebooks, plus a command, something like dvc notebook ..., that will run it without having to write the inputs, outputs, and all the needed nbconvert flags again.
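A dvc notebook command like the one in 2) would essentially expand to the dvc run + nbconvert invocation jurasan already types by hand. A sketch of that expansion (the wrapper itself is imaginary; the nbconvert flags are the standard ones for re-executing a notebook in place):

```python
from typing import List

def expand_dvc_notebook(notebook: str,
                        deps: List[str],
                        outs: List[str]) -> List[str]:
    """Expand a hypothetical `dvc notebook` call into the equivalent
    `dvc run ... "jupyter nbconvert ..."` argv."""
    argv = ["dvc", "run"]
    for dep in deps:
        argv += ["-d", dep]
    for out in outs:
        argv += ["-o", out]
    # Re-execute the notebook in place via nbconvert.
    argv.append(f"jupyter nbconvert --to notebook --execute --inplace {notebook}")
    return argv
```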
ruslan (Today at 1:04 AM)
So a notebook effectively becomes a dvc file of sorts?
Sounds very interesting!
Though it doesn't solve the partial-execution problem, since the simplest way to go would be to consider the notebook as a whole.
But I can see how it would be possible to create "checkpoints" after running every line in the notebook.
The hook patch is coming, sorry for the delay :smiley:
jurasan (Today at 1:18 AM)
No hurry. I won't be able to check it right away anyway :)
I actually now think that running dvc after changes in a notebook is a bad idea. Because then the .dvc file will say that the notebook was executed and the correct checksums are in place, but that might not be true: the notebook may have been executed only partially, and running the whole notebook would give different outputs.
When we have the hook, it will be explicitly shown that the data is not consistent, and we can manually run the whole notebook from the CLI. But point 2) still holds.
A Python dvc.run function would not help much either. If you look at people's notebooks, there's hardly ever a single function that you can run. (edited)
ruslan (Today at 1:25 AM)
With or without a Python API, it is clear that some modifications would be required for people using notebooks.
But if functions are so rare, then it is probably worth thinking about a single notebook as a single stage.
jurasan (Today at 1:26 AM)
Yes. And since it's the predominant way of doing DS, at least at the EDA stage, I think we should really concentrate on simplifying the workflow for it and describing it on the website.
ruslan (Today at 1:26 AM)
I suppose you are still using multiple notebooks in a project, since you are able to use dvc with it right now?
jurasan (Today at 1:26 AM)
Yes, many notebooks.
And yes, if we think about it as a single stage, then running the whole notebook from the CLI is better. We just need to be notified if something is wrong, and make it easier (right now there are too many parameters to type).
ruslan (Today at 1:31 AM)
I see. So let's say we added an integration for notebooks that allows you to specify dependencies/outputs in your notebook (e.g. as a header consisting of comments, or maybe some dvc API calls at the top), and when you run your notebook, we somehow make dvc automatically run and save all the checksums and such after your whole script successfully finishes. How does that sound to you?
I guess the main benefit would be sparing the dvc run command, right?
jurasan (Today at 1:40 AM)
Yes, this sounds good. But I'm afraid we won't be able to run dvc automatically: we don't know if the whole script was run or only part of it.
That's actually the bigger problem we've had in projects. Sometimes people run a part of a notebook that generates a file, the result is good, so they commit the notebook. And afterwards, if somebody runs the notebook to the end, they get a different result. We should look into whether it's possible to determine if all the notebook's code was executed after the last modification, before committing. That seems like it's not a dvc issue, but it is actually integral to data-code consistency.
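Whether a notebook was fully executed can in fact be checked from the .ipynb JSON: each code cell stores an execution_count, and a notebook that was run top to bottom in one pass has counts 1, 2, 3, ... in document order. A sketch of such a checker (not part of dvc; it assumes the standard nbformat JSON layout and is only a heuristic, e.g. it would flag notebooks with re-run or empty cells):

```python
from typing import Any, Dict

def notebook_fully_executed(nb: Dict[str, Any]) -> bool:
    """True if every code cell ran, in top-to-bottom order (counts 1..N)."""
    expected = 1
    for cell in nb.get("cells", []):
        if cell.get("cell_type") != "code":
            continue  # markdown/raw cells have no execution count
        if cell.get("execution_count") != expected:
            return False
        expected += 1
    return True
```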
ruslan (Today at 1:40 AM)
FYI: I've merged the pre-commit hook patch and am releasing 0.19.14 with it right now; it will be ready in an hour or so. Please feel free to try it out when you have time :smiley:
jurasan (Today at 1:41 AM)
Great.
ruslan (Today at 1:41 AM)
That is a great point!
I guess the current manual dvc run kinda helps avoid that problem, right? (edited)
Since you are sure that the whole script has been run.
jurasan (Today at 1:43 AM)
Partially. Imagine a situation where you have changed the notebook but didn't run it. All the data is unchanged, so nothing gives warnings. But the data and code are not consistent anymore.
ruslan (Today at 1:43 AM)
dvc will give a warning,
since the md5 of that notebook has changed since the last run. (edited)
jurasan (Today at 1:43 AM)
Oh, yeah, I forgot about that.
Then you're right, and we get back to point 2).
We just need an easier way to handle notebook running.
Or maybe not, and I'm just being lazy.
ruslan (Today at 1:45 AM)
:smiley:
I mean if the pain point exists, then it is worth trying to solve :smiley:
jurasan (Today at 1:48 AM)
I'm not sure yet. dvc repro is short enough, so you only need to type the inputs and outputs the first time.
The problem I see with putting inputs/outputs inside the notebook is that we would then have two places that hold them: the notebook and the .dvc file. And if we change them by executing dvc run nbconvert ... again, then the notebook will have a stale description.
ruslan (Today at 1:52 AM)
Ah, good point!
Unless dvc checks whether it has a notebook as a dependency and automatically verifies that its description matches the one in the .dvc file.
But that is more magic on top of it.
jurasan (Today at 1:54 AM)
Yep, too fragile in my opinion.
ruslan (Today at 1:54 AM)
At least with manual dvc run there is less magic going on; I think it is easier to wrap your head around.
jurasan (Today at 1:56 AM)
OK. I think I'll need to try the workflow with the new patch now and digest things a little. Thank you for such a quick reaction.
ruslan (Today at 1:57 AM)
Thank you so much for the feedback! I really enjoyed our discussion; it is extremely useful!
jurasan (Today at 1:58 AM)
But I would like to stress one more time how important it is to document the notebook workflow. It was a major stumbling block for me, and I spent a lot of time putting the pieces together.
ruslan (Today at 1:58 AM)
Let me create a ticket for it, so we have a persistent place for everything.
jurasan (Today at 1:59 AM)
If you create a ticket for this documentation, I can help write it when I try things out.
ruslan (Today at 1:59 AM)
I'll create a ticket for that as well. I guess it is worth putting into the "Use Cases" section on dvc.org/documentation.
Oh, that would be amazing!
Just a second, creating it right now.
jurasan (Today at 2:00 AM)
And if you could copy our conversation there, that would be good, I think.