Add cli support to move, remove and copy file to storage using Studio #1221
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,109 @@ | ||
# cp | ||
|
||
Copy storage files and directories between cloud and local storage. | ||
|
||
## Synopsis | ||
|
||
```usage | ||
usage: datachain cp [-h] [-v] [-q] [-r] [--team TEAM] | ||
[-s] [--anon] [--update] | ||
source_path destination_path | ||
``` | ||
|
||
## Description | ||
|
||
This command copies files and directories between local and/or remote storage. By default it uses the credentials configured on your system; with the Studio option it uses cloud authentication from Studio instead. | ||
|
||
The command supports two main modes of operation: | ||
|
||
- By default, the command operates directly with clouds using credentials in your system, supporting various copy scenarios between local and remote storage. | ||
- When using `-s` or `--studio-cloud-auth` flag, the command uses credentials from Studio for cloud operations. This mode provides enhanced authentication and access control for cloud storage operations. | ||
|
||
|
||
## Arguments | ||
|
||
* `source_path` - Path to the source file or directory to copy | ||
* `destination_path` - Path to the destination file or directory to copy to | ||
|
||
## Options | ||
|
||
* `-r`, `-R`, `--recursive` - Copy directories recursively | ||
* `--team TEAM` - Team name to use the credentials from. (Default: from config) | ||
* `-s`, `--studio-cloud-auth` - Use credentials from Studio for cloud operations (Default: False) | ||
shcheklein marked this conversation as resolved.
|
||
* `--anon` - Use anonymous access for cloud operations (Default: False) | ||
* `--update` - Update cached list of files for the source when downloading from cloud using local credentials. | ||
* `-h`, `--help` - Show the help message and exit | ||
* `-v`, `--verbose` - Be verbose | ||
* `-q`, `--quiet` - Be quiet | ||
|
||
|
||
## Notes | ||
* When using Studio cloud auth mode, you must be authenticated with `datachain auth login` before using it | ||
* The default mode operates directly with storage providers | ||
|
||
|
||
## Examples | ||
### Local to Local | ||
|
||
**Operation**: Direct local file system copy | ||
- Uses the local filesystem's native copy operation | ||
- Fastest operation as no network transfer is involved | ||
- Supports both files and directories | ||
|
||
```bash | ||
datachain cp /path/to/source/file.py /path/to/destination/file.py | ||
datachain cp -r /path/to/source/directory /path/to/destination/directory | ||
``` | ||
|
||
### Local to Remote | ||
I don't think we need so many examples ...
I think otherwise. There are exactly 4 different types of examples to show what is supported. Within each type, we have files and folders, each using local and Studio authentication. If you think otherwise, let me know which examples to keep.
Just keep 2-3 examples; the command and options should be more or less self-descriptive. Or make the examples more comprehensive, e.g. do Studio auth, etc. Again, otherwise we just repeat the option descriptions.
Which ones? Asking before making any changes to avoid back and forth. |
||
|
||
**Operation**: Upload to cloud storage | ||
- Uploads local files/directories to remote storage | ||
- Supports both default mode and Studio cloud auth mode | ||
- Requires `--recursive` flag for directories | ||
|
||
```bash | ||
# Upload single file | ||
datachain cp /path/to/local/file.py gs://my-bucket/data/file.py | ||
|
||
# Upload single file with Studio cloud auth | ||
datachain cp /path/to/local/file.py gs://my-bucket/data/file.py --studio-cloud-auth | ||
|
||
# Upload directory recursively | ||
datachain cp --recursive /path/to/local/directory gs://my-bucket/data/ | ||
|
||
# Upload directory recursively with Studio cloud auth | ||
datachain cp --recursive /path/to/local/directory gs://my-bucket/data/ --studio-cloud-auth | ||
``` | ||
|
||
### Remote to Local | ||
|
||
**Operation**: Download from cloud storage | ||
- Downloads remote files/directories to local storage | ||
- Uses the source filename when the destination is a directory | ||
- Creates destination directory if it doesn't exist | ||
|
||
```bash | ||
# Download single file | ||
datachain cp gs://my-bucket/data/file.py /path/to/local/directory/ | ||
|
||
# Download single file with Studio cloud auth | ||
datachain cp gs://my-bucket/data/file.py /path/to/local/directory/ --studio-cloud-auth | ||
|
||
# Download directory recursively | ||
datachain cp -r gs://my-bucket/data/directory /path/to/local/directory/ | ||
``` | ||
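The "destination is a directory" rule above can be illustrated with a small sketch. This is not the code from this PR: `resolve_local_destination` is a hypothetical helper, and the assumption that a trailing separator or an existing directory marks a directory destination is ours.

```python
import os
import posixpath
from urllib.parse import urlparse


def resolve_local_destination(source_url: str, destination: str) -> str:
    """Return the local path a single remote file would be written to."""
    # Assumed rule: a trailing separator or an existing directory means
    # "place the file inside this directory under its original name".
    looks_like_dir = destination.endswith(("/", os.sep)) or os.path.isdir(destination)
    if not looks_like_dir:
        # The destination already names the target file.
        return destination
    # Take the last path component of the remote URL as the filename.
    filename = posixpath.basename(urlparse(source_url).path)
    return os.path.join(destination, filename)
```

With this rule, copying `gs://my-bucket/data/file.py` to `/path/to/local/directory/` would write to `file.py` inside that directory, while a destination that names a file is used unchanged.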
|
||
### Remote to Remote | ||
|
||
**Operation**: Copy within cloud storage | ||
- Copies files between cloud storage locations | ||
- Requires `--recursive` flag for directories | ||
|
||
```bash | ||
# Copy within same bucket | ||
Is it possible to copy from one remote (bucket or even remote type) to another one?
Yes. |
||
datachain cp gs://my-bucket/data/file.py gs://my-bucket/archive/file.py | ||
|
||
# Copy within same bucket with Studio cloud auth | ||
datachain cp gs://my-bucket/data/file.py gs://my-bucket/archive/file.py --studio-cloud-auth | ||
``` |
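The four copy scenarios documented above differ only in whether each path is local or remote. A minimal sketch of that classification follows, assuming remote paths are recognized by URL scheme; `REMOTE_SCHEMES`, `is_remote`, and `copy_scenario` are hypothetical names, and the scheme set is borrowed from the protocols listed in the `mv` docs rather than confirmed for `cp`.

```python
from urllib.parse import urlparse

# Assumption: the same schemes the mv docs list (s3, gs, az) apply to cp.
REMOTE_SCHEMES = {"s3", "gs", "az"}


def is_remote(path: str) -> bool:
    """Treat a path as remote when its URL scheme is a known cloud scheme."""
    return urlparse(path).scheme in REMOTE_SCHEMES


def copy_scenario(source: str, destination: str) -> str:
    """Name the copy scenario for a source/destination pair."""
    src = "remote" if is_remote(source) else "local"
    dst = "remote" if is_remote(destination) else "local"
    return f"{src} to {dst}"
```

For example, `copy_scenario("/path/to/local/file.py", "gs://my-bucket/data/file.py")` classifies the call as an upload ("local to remote").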
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,79 @@ | ||
# mv | ||
|
||
Move storage files and directories in cloud storage or on the local filesystem. | ||
|
||
## Synopsis | ||
|
||
```usage | ||
usage: datachain mv [-h] [-v] [-q] [--recursive] | ||
[--team TEAM] [-s] path new_path | ||
``` | ||
|
||
## Description | ||
|
||
This command moves files and directories within storage. The command supports both individual files and directories, with the `--recursive` flag required for moving directories. | ||
|
||
## Arguments | ||
|
||
* `path` - Path to the storage file or directory to move | ||
* `new_path` - New path where the file or directory should be moved to | ||
|
||
## Options | ||
|
||
* `--recursive` - Move recursively | ||
is it a regular thing to do require
I am not sure; we do have it with the call to fs.mv. LMK if we want to remove this. |
||
* `--team TEAM` - Team name to use the credentials from. (Default: from config) | ||
* `-s`, `--studio-cloud-auth` - Use credentials from Studio for cloud operations (Default: False) | ||
* `-h`, `--help` - Show the help message and exit | ||
* `-v`, `--verbose` - Be verbose | ||
* `-q`, `--quiet` - Be quiet | ||
|
||
## Notes | ||
* When using Studio cloud auth mode, you must be authenticated with `datachain auth login` before using it | ||
* The default mode operates directly with storage providers | ||
* **Warning**: This is a destructive operation. Always double-check the path before executing the command | ||
|
||
## Examples | ||
|
||
The command supports moving files and directories within the same bucket: | ||
|
||
### Move Single File | ||
|
||
```bash | ||
# Move file | ||
datachain mv gs://my-bucket/data/file.py gs://my-bucket/archive/file.py | ||
|
||
# Move file with Studio cloud auth | ||
datachain mv gs://my-bucket/data/file.py gs://my-bucket/archive/file.py --studio-cloud-auth | ||
``` | ||
|
||
### Move Directory Recursively | ||
|
||
```bash | ||
# Move directory | ||
datachain mv gs://my-bucket/data/directory gs://my-bucket/archive/directory --recursive | ||
|
||
# Move directory with Studio cloud auth | ||
datachain mv gs://my-bucket/data/directory gs://my-bucket/archive/directory --recursive --studio-cloud-auth | ||
``` | ||
|
||
### Additional Examples | ||
|
||
```bash | ||
# Move a file using another team's Studio credentials | ||
datachain mv -s --team other-team gs://my-bucket/data/file.py gs://my-bucket/backup/file.py | ||
``` | ||
|
||
|
||
## Supported Storage Protocols | ||
|
||
The command supports the following storage protocols: | ||
- **AWS S3**: `s3://bucket-name/path` | ||
- **Google Cloud Storage**: `gs://bucket-name/path` | ||
- **Azure Blob Storage**: `az://container-name/path` | ||
|
||
## Limitations and Edge Cases | ||
- **Cannot move between different buckets**: The source and destination must be in the same bucket. Attempting to move between different buckets will result in an error: "Cannot move between different buckets" | ||
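The same-bucket restriction can be sketched as a pre-flight check. This is an illustration only, not the validation from this PR: `bucket_of` and `check_same_bucket` are hypothetical helpers, and comparing the URL scheme together with the bucket name is an assumption.

```python
from urllib.parse import urlparse


def bucket_of(url: str) -> tuple[str, str]:
    """Return (scheme, bucket) for a storage URL such as gs://my-bucket/path."""
    parsed = urlparse(url)
    return parsed.scheme, parsed.netloc


def check_same_bucket(path: str, new_path: str) -> None:
    """Raise if the source and destination are not in the same bucket."""
    if bucket_of(path) != bucket_of(new_path):
        # Error text taken from the limitation described above.
        raise ValueError("Cannot move between different buckets")
```

Calling `check_same_bucket` before the move lets the command fail early, before any object is touched.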
|
||
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,57 @@ | ||
# rm | ||
|
||
Delete storage files and directories from cloud storage or the local filesystem. | ||
|
||
## Synopsis | ||
|
||
```usage | ||
usage: datachain rm [-h] [-v] [-q] [--recursive] [--team TEAM] [-s] path | ||
``` | ||
|
||
## Description | ||
|
||
This command deletes files and directories within storage. The command supports both individual files and directories, with the `--recursive` flag required for deleting directories. This is a destructive operation that permanently removes files and cannot be undone. | ||
|
||
## Arguments | ||
|
||
* `path` - Path to the storage file or directory to delete | ||
|
||
## Options | ||
|
||
* `--recursive` - Delete recursively | ||
* `--team TEAM` - Team name to use the credentials from. (Default: from config) | ||
* `-s`, `--studio-cloud-auth` - Use credentials from Studio for cloud operations (Default: False) | ||
* `-h`, `--help` - Show the help message and exit | ||
* `-v`, `--verbose` - Be verbose | ||
* `-q`, `--quiet` - Be quiet | ||
|
||
|
||
## Notes | ||
* When using Studio cloud auth mode, you must be authenticated with `datachain auth login` before using it | ||
* The default mode operates directly with storage providers | ||
* **Warning**: This is a destructive operation. Always double-check the path before executing the command | ||
|
||
|
||
## Examples | ||
|
||
The command supports deleting files and directories: | ||
|
||
### Delete Single File | ||
|
||
```bash | ||
# Delete file | ||
datachain rm gs://my-bucket/data/file.py | ||
|
||
# Delete file with Studio cloud auth | ||
datachain rm gs://my-bucket/data/file.py --studio-cloud-auth | ||
``` | ||
|
||
### Delete Directory Recursively | ||
|
||
```bash | ||
# Delete directory | ||
datachain rm gs://my-bucket/data/directory --recursive | ||
|
||
# Delete directory with Studio cloud auth | ||
datachain rm gs://my-bucket/data/directory --recursive --studio-cloud-auth | ||
``` |
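The rule that directories are only deleted when `--recursive` is passed can be sketched against the local filesystem. This is a stand-in using the standard library, not the handler code from this PR; `remove_path` is a hypothetical name.

```python
import shutil
from pathlib import Path


def remove_path(path: str, recursive: bool = False) -> None:
    """Delete a file, or a directory tree when recursive=True."""
    target = Path(path)
    if target.is_dir():
        if not recursive:
            # Refuse to delete a directory unless explicitly asked to.
            raise IsADirectoryError(
                f"{path} is a directory; pass recursive=True to delete it"
            )
        shutil.rmtree(target)
    else:
        # Raises FileNotFoundError if the file does not exist.
        target.unlink()
```

Requiring an explicit opt-in for directories is one way to make a destructive command harder to misuse.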
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,10 @@ | ||
from .local import LocalCredentialsBasedFileHandler | ||
from .studio import StudioAuthenticatedFileHandler | ||
from .utils import build_file_paths, validate_upload_args | ||
|
||
__all__ = [ | ||
"LocalCredentialsBasedFileHandler", | ||
"StudioAuthenticatedFileHandler", | ||
"build_file_paths", | ||
"validate_upload_args", | ||
] |
Probably I'm missing something, but what was the motivation to develop this on our own in the `datachain` CLI when there are already tools that do the same thing and are more specialized (e.g. https://github.com/rclone/rclone)? To me this almost pollutes our CLI / API, as it adds a completely different domain that doesn't look related to datasets (and other related things), not to mention it's added at the root level of the CLI.
I believe the main requirements are these two:
For that, this is used.