feat: add proposal for data support #650
Conversation
| ``` | ||
| envd data add -f mnist.yaml | ||
| ``` |
I wonder if we can make it as an envd target so we can get rid of yaml?
| ### Access Pattern | ||
| The access pattern of most dataset is write once, read multiple times, and concurrently. Therefore |
Missing some additional context after "therefore"?
| ### Possible versions/tags | ||
| - Version by number, V1, V2, V3 |
How about semantic versioning?
| mnist.yaml | ||
| ```yaml= | ||
| ApiVersion: V1alpha |
What is this version for? How is this different from the version below?
| version: "0.0.1-sample" | ||
| sources: |
What if there are multiple versions for different sources?
| ## Common Scenarios | ||
| ### Possible sources |
Is there any implementation plan for this? Are you aware of any existing solutions that could support multiple sources? This might be helpful as a reference for the range of sources: https://kubernetes.io/docs/concepts/storage/volumes/#volume-types
| ``` | ||
| def data(): | ||
| return [d.mount("mnist", target="./data")] # User can specify mount multiple datasets |
What are the different purposes of the YAML above and this envd syntax?
Data function support
Summary
Provide support for mounting data in envd.
Goals
Design a unified, declarative interface and underlying architecture for providing datasets in the development environment in a scalable way.
Non-goals:
Common Scenarios
Possible sources
Possible form
Access Pattern
The access pattern of most datasets is write once, read many times, often by concurrent readers. Therefore
Possible versions/tags
We could define a new standard for versioning data, similar to semver.
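As a rough illustration of what semver-style dataset versions could look like, the sketch below parses strings such as the `0.0.1-sample` value from the `mnist.yaml` draft and orders them. The names `DatasetVersion`, `parse_version`, and `sort_key` are hypothetical and not part of envd. The ordering is also simplified: pre-release tags are compared as plain strings and build metadata is not handled, so this is not a full implementation of the semver precedence rules.

```python
import re
from typing import NamedTuple, Tuple

# Simplified pattern: MAJOR.MINOR.PATCH with an optional "-prerelease" tag.
# Build metadata ("+...") is intentionally not supported in this sketch.
_SEMVER = re.compile(r"^(\d+)\.(\d+)\.(\d+)(?:-([0-9A-Za-z.-]+))?$")


class DatasetVersion(NamedTuple):
    """Hypothetical parsed form of a dataset version string."""
    major: int
    minor: int
    patch: int
    prerelease: str  # empty string for a normal release


def parse_version(text: str) -> DatasetVersion:
    """Parse a semver-style string, raising ValueError if it does not match."""
    match = _SEMVER.match(text)
    if match is None:
        raise ValueError(f"not a semver-style version: {text!r}")
    major, minor, patch, pre = match.groups()
    return DatasetVersion(int(major), int(minor), int(patch), pre or "")


def sort_key(version: DatasetVersion) -> Tuple[int, int, int, bool, str]:
    """Order versions so a release sorts after its own pre-releases."""
    return (
        version.major,
        version.minor,
        version.patch,
        version.prerelease == "",  # False (pre-release) sorts before True
        version.prerelease,        # simplification: plain string comparison
    )


if __name__ == "__main__":
    raw = ["1.0.0", "0.0.1", "0.0.1-sample"]
    ordered = sorted((parse_version(r) for r in raw), key=sort_key)
    for v in ordered:
        print(v)
```

A plain scheme such as `V1`, `V2`, `V3` would be rejected by this parser, so a real design would need to decide whether both styles are accepted.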
Proposal
Each version of a dataset is immutable. Because the data is immutable, we can cache it and replicate it easily, which increases read throughput in several ways.
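To make the caching argument concrete, here is a minimal sketch of a read-through cache keyed by `(name, version)`. Because a given version never changes, an entry never has to be invalidated once it is stored. The `DatasetCache` class and the `fetch` callback are assumptions made for illustration and do not describe an actual envd component.

```python
from typing import Callable, Dict, Tuple


class DatasetCache:
    """Hypothetical read-through cache for immutable dataset versions."""

    def __init__(self, fetch: Callable[[str, str], bytes]) -> None:
        # fetch(name, version) stands in for the slow read from the real source.
        self._fetch = fetch
        self._entries: Dict[Tuple[str, str], bytes] = {}
        self.fetch_count = 0  # how many times the source was actually read

    def read(self, name: str, version: str) -> bytes:
        key = (name, version)
        if key not in self._entries:
            # First read of this version: go to the source once and keep it.
            self.fetch_count += 1
            self._entries[key] = self._fetch(name, version)
        # Immutability means the cached bytes can never be stale.
        return self._entries[key]


if __name__ == "__main__":
    def fake_fetch(name: str, version: str) -> bytes:
        return f"{name}@{version}".encode()

    cache = DatasetCache(fake_fetch)
    cache.read("mnist", "0.0.1-sample")
    cache.read("mnist", "0.0.1-sample")
    print(cache.fetch_count)
```

The same property is what makes replication simple: any replica holding a given `(name, version)` pair can serve reads without coordinating with the others.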
Usage
Users need to create the dataset beforehand, then declare the mount in the build.envd file.
Users can create multiple datasets with the same name, but each must have a different version.
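The naming rule above could be enforced roughly as follows. This is only a sketch: `DatasetRegistry`, its methods, and the example source URLs are hypothetical, and the proposal does not specify how `envd data add` stores or validates datasets.

```python
from typing import Dict, Iterable, List, Tuple


class DatasetRegistry:
    """Hypothetical in-memory registry of datasets keyed by name and version."""

    def __init__(self) -> None:
        # name -> {version: sources}
        self._datasets: Dict[str, Dict[str, Tuple[str, ...]]] = {}

    def add(self, name: str, version: str, sources: Iterable[str]) -> None:
        versions = self._datasets.setdefault(name, {})
        if version in versions:
            # Versions are immutable, so re-registering one is an error.
            raise ValueError(f"dataset {name!r} version {version!r} already exists")
        versions[version] = tuple(sources)

    def versions(self, name: str) -> List[str]:
        """Return the registered versions of a dataset, sorted as strings."""
        return sorted(self._datasets.get(name, {}))


if __name__ == "__main__":
    registry = DatasetRegistry()
    # The source URLs below are placeholders, not real locations.
    registry.add("mnist", "0.0.1-sample", ["https://example.com/mnist-sample.tar"])
    registry.add("mnist", "0.0.2", ["https://example.com/mnist.tar"])
    print(registry.versions("mnist"))
```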
mnist.yaml
build.envd