109 changes: 109 additions & 0 deletions docs/commands/cp.md
@@ -0,0 +1,109 @@
# cp

Copy storage files and directories between cloud and local storage.

## Synopsis

```usage
usage: datachain cp [-h] [-v] [-q] [-r] [--team TEAM]
[-s] [--anon] [--update]
source_path destination_path
```

## Description

This command copies files and directories between local and/or remote storage. By default it uses the credentials configured on your system; alternatively, it can use cloud authentication from Studio.
**Contributor:**

Probably I'm missing something, but what was the motivation to develop this on our own in the datachain CLI when there are already more specialized tools doing the same thing (e.g. https://github.com/rclone/rclone)?

To me this almost pollutes our CLI/API, as it adds a completely different domain that doesn't look related to datasets (and other related things), not to mention it's added at the root level of the CLI.

**Contributor Author:**

I believe the main requirements are these two:

  1. To be able to track changes made from the CLI in the Studio file browser, as requested by the consumer
  2. To be able to use the credentials stored in Studio to perform file operations

That is what this command is for.

The command supports two main modes of operation:

- By default, the command operates directly against cloud providers using the credentials configured on your system, supporting all copy scenarios between local and remote storage.
- With the `-s`/`--studio-cloud-auth` flag, the command uses credentials stored in Studio for cloud operations. This mode lets Studio manage authentication and access control for cloud storage.

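For example (the paths and bucket name here are illustrative), the same copy can run in either mode:

```bash
# Default mode: use the credentials already configured on this machine
datachain cp gs://my-bucket/data/file.py ./file.py

# Studio cloud auth mode: use credentials stored in Studio
datachain cp -s gs://my-bucket/data/file.py ./file.py
```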

## Arguments

* `source_path` - Path to the source file or directory to copy
* `destination_path` - Path to the destination file or directory to copy to

## Options

* `-r`, `-R`, `--recursive` - Copy directories recursively
* `--team TEAM` - Team name to use the credentials from. (Default: from config)
* `-s`, `--studio-cloud-auth` - Use credentials from Studio for cloud operations (Default: False)
* `--anon` - Use anonymous access for cloud operations (Default: False)
* `--update` - Update cached list of files for the source when downloading from cloud using local credentials.
* `-h`, `--help` - Show the help message and exit
* `-v`, `--verbose` - Be verbose
* `-q`, `--quiet` - Be quiet


## Notes
* When using Studio cloud auth mode, you must first authenticate with `datachain auth login`
* The default mode operates directly with storage providers

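A minimal end-to-end sketch of the Studio cloud auth workflow (bucket and team names are placeholders):

```bash
# Authenticate with Studio once
datachain auth login

# Then run cloud operations with Studio-managed credentials
datachain cp -s --team my-team gs://my-bucket/data/file.py ./file.py
```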

## Examples
### Local to Local

**Operation**: Direct local file system copy
- Uses the local filesystem's native copy operation
- Fastest operation as no network transfer is involved
- Supports both files and directories

```bash
datachain cp /path/to/source/file.py /path/to/destination/file.py
datachain cp -r /path/to/source/directory /path/to/destination/directory
```

### Local to Remote
**Member:**

I don't think we need so many examples ...

**Contributor Author:**

I think otherwise. There are exactly four different types of examples, showing what is supported. Within each type, we cover files and folders, each using local and Studio authentication.

If you think otherwise, let me know which examples to keep.

**Member:**

Just keep 2-3 examples; the command and options should be more or less self-descriptive.

Or make the examples more comprehensive - e.g. do Studio auth, etc ...

Again, otherwise we just repeat the option descriptions.

**Contributor Author:**

> just keep 2-3 examples, command and options should be more or less self descriptive

Which ones? Asking before making any changes, to avoid back and forth.

**Operation**: Upload to cloud storage
- Uploads local files/directories to remote storage
- Supports both default mode and Studio cloud auth mode
- Requires `--recursive` flag for directories

```bash
# Upload single file
datachain cp /path/to/local/file.py gs://my-bucket/data/file.py

# Upload single file with Studio cloud auth
datachain cp /path/to/local/file.py gs://my-bucket/data/file.py --studio-cloud-auth

# Upload directory recursively
datachain cp --recursive /path/to/local/directory gs://my-bucket/data/

# Upload directory recursively with Studio cloud auth
datachain cp --recursive /path/to/local/directory gs://my-bucket/data/ --studio-cloud-auth
```

### Remote to Local

**Operation**: Download from cloud storage
- Downloads remote files/directories to local storage
- Derives the destination filename automatically when the destination is a directory
- Creates destination directory if it doesn't exist

```bash
# Download single file
datachain cp gs://my-bucket/data/file.py /path/to/local/directory/

# Download single file with Studio cloud auth
datachain cp gs://my-bucket/data/file.py /path/to/local/directory/ --studio-cloud-auth

# Download directory recursively
datachain cp -r gs://my-bucket/data/directory /path/to/local/directory/
```

### Remote to Remote

**Operation**: Copy within cloud storage
- Copies files between cloud storage locations, including across buckets and providers
- Requires `--recursive` flag for directories

```bash
# Copy within same bucket
datachain cp gs://my-bucket/data/file.py gs://my-bucket/archive/file.py

# Copy within same bucket with Studio cloud auth
datachain cp gs://my-bucket/data/file.py gs://my-bucket/archive/file.py --studio-cloud-auth
```

**Contributor:**

Is it possible to copy from one remote (bucket or even remote type) to another one?

**Contributor Author:**

Yes.
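
As confirmed in the thread above, the source and destination may also be different buckets, or even different providers; a hypothetical example:

```bash
# Copy across buckets and providers
datachain cp gs://source-bucket/data/file.py s3://destination-bucket/data/file.py
```
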
79 changes: 79 additions & 0 deletions docs/commands/mv.md
@@ -0,0 +1,79 @@
# mv

Move files and directories within cloud or local storage.

## Synopsis

```usage
usage: datachain mv [-h] [-v] [-q] [--recursive]
[--team TEAM] [-s] path new_path
```

## Description

This command moves files and directories within storage. The command supports both individual files and directories, with the `--recursive` flag required for moving directories.

## Arguments

* `path` - Path to the storage file or directory to move
* `new_path` - New path where the file or directory should be moved to

## Options

* `--recursive` - Move recursively

**Member:**

Is it a regular thing to require `--recursive` for `mv`?

**Contributor Author** (amritghimire, Aug 6, 2025):

I am not sure; we do have it with the call to `fs.mv`. Let me know if we want to remove this.

* `--team TEAM` - Team name to use the credentials from. (Default: from config)
* `-s`, `--studio-cloud-auth` - Use credentials from Studio for cloud operations (Default: False)
* `-h`, `--help` - Show the help message and exit
* `-v`, `--verbose` - Be verbose
* `-q`, `--quiet` - Be quiet

## Notes
* When using Studio cloud auth mode, you must first authenticate with `datachain auth login`
* The default mode operates directly with storage providers
* **Warning**: This is a destructive operation. Always double-check the path before executing the command

## Examples

The command supports moving files and directories within the same bucket:

### Move Single File

```bash
# Move file
datachain mv gs://my-bucket/data/file.py gs://my-bucket/archive/file.py

# Move file with Studio cloud auth
datachain mv gs://my-bucket/data/file.py gs://my-bucket/archive/file.py --studio-cloud-auth
```

### Move Directory Recursively

```bash
# Move directory
datachain mv gs://my-bucket/data/directory gs://my-bucket/archive/directory --recursive

# Move directory with Studio cloud auth
datachain mv gs://my-bucket/data/directory gs://my-bucket/archive/directory --recursive --studio-cloud-auth
```

### Additional Examples

```bash
# Move a file using another team's Studio credentials:
datachain mv -s --team other-team gs://my-bucket/data/file.py gs://my-bucket/backup/file.py
```


## Supported Storage Protocols

The command supports the following storage protocols:
- **AWS S3**: `s3://bucket-name/path`
- **Google Cloud Storage**: `gs://bucket-name/path`
- **Azure Blob Storage**: `az://container-name/path`
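
The same commands work across these providers; for instance (the container name is illustrative):

```bash
datachain mv az://my-container/data/file.py az://my-container/archive/file.py
```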

## Limitations and Edge Cases
- **Cannot move between different buckets**: The source and destination must be in the same bucket. Attempting to move between different buckets will result in an error: "Cannot move between different buckets"
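
For instance (bucket names are hypothetical), the following would be rejected:

```bash
# Fails: source and destination are in different buckets
datachain mv gs://bucket-a/data/file.py gs://bucket-b/data/file.py
# Error: Cannot move between different buckets
```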

57 changes: 57 additions & 0 deletions docs/commands/rm.md
@@ -0,0 +1,57 @@
# rm

Delete files and directories from cloud or local storage.

## Synopsis

```usage
usage: datachain rm [-h] [-v] [-q] [--recursive] [--team TEAM] [-s] path
```

## Description

This command deletes files and directories within storage. The command supports both individual files and directories, with the `--recursive` flag required for deleting directories. This is a destructive operation that permanently removes files and cannot be undone.

## Arguments

* `path` - Path to the storage file or directory to delete

## Options

* `--recursive` - Delete recursively
* `--team TEAM` - Team name to use the credentials from. (Default: from config)
* `-s`, `--studio-cloud-auth` - Use credentials from Studio for cloud operations (Default: False)
* `-h`, `--help` - Show the help message and exit
* `-v`, `--verbose` - Be verbose
* `-q`, `--quiet` - Be quiet


## Notes
* When using Studio cloud auth mode, you must first authenticate with `datachain auth login`
* The default mode operates directly with storage providers
* **Warning**: This is a destructive operation. Always double-check the path before executing the command
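
One way to double-check is to list the path first, assuming the `datachain ls` command is available in your installation:

```bash
# Inspect what would be removed
datachain ls gs://my-bucket/data/directory

# Then delete it
datachain rm gs://my-bucket/data/directory --recursive
```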


## Examples

The command supports deleting files and directories:

### Delete Single File

```bash
# Delete file
datachain rm gs://my-bucket/data/file.py

# Delete file with Studio cloud auth
datachain rm gs://my-bucket/data/file.py --studio-cloud-auth
```

### Delete Directory Recursively

```bash
# Delete directory
datachain rm gs://my-bucket/data/directory --recursive

# Delete directory with Studio cloud auth
datachain rm gs://my-bucket/data/directory --recursive --studio-cloud-auth
```
3 changes: 3 additions & 0 deletions mkdocs.yml
@@ -107,6 +107,9 @@ nav:
        - cancel: commands/job/cancel.md
        - ls: commands/job/ls.md
        - clusters: commands/job/clusters.md
      - rm: commands/rm.md
      - mv: commands/mv.md
      - cp: commands/cp.md
  - 📚 User Guide:
      - Overview: guide/index.md
      - 📡 Interacting with remote storage: guide/remotes.md
63 changes: 54 additions & 9 deletions src/datachain/cli/__init__.py
Expand Up @@ -3,7 +3,7 @@
import sys
import traceback
from multiprocessing import freeze_support
from typing import Optional
from typing import TYPE_CHECKING, Optional

from datachain.cli.utils import get_logging_level
from datachain.error import DataChainError as DataChainError
@@ -25,6 +25,11 @@

logger = logging.getLogger("datachain")

if TYPE_CHECKING:
    from argparse import Namespace

    from datachain.catalog import Catalog as Catalog


def main(argv: Optional[list[str]] = None) -> int:
    from datachain.catalog import get_catalog
@@ -97,6 +102,8 @@ def handle_command(args, catalog, client_config) -> int:
"gc": lambda: garbage_collect(catalog),
"auth": lambda: process_auth_cli_args(args),
"job": lambda: process_jobs_args(args),
"mv": lambda: handle_mv_command(args, catalog),
"rm": lambda: handle_rm_command(args, catalog),
}

    handler = command_handlers.get(args.command)
@@ -109,15 +116,53 @@ def handle_command(args, catalog, client_config) -> int:
        return 1


def handle_cp_command(args, catalog):
    catalog.cp(
        args.sources,
        args.output,
        force=bool(args.force),
        update=bool(args.update),
        recursive=bool(args.recursive),
        no_glob=args.no_glob,
    )
def _get_file_handler(args: "Namespace"):
    from datachain.cli.commands.storage import (
        LocalCredentialsBasedFileHandler,
        StudioAuthenticatedFileHandler,
    )
    from datachain.config import Config

    config = Config().read().get("studio", {})
    token = config.get("token")
    studio = False if not token else args.studio_cloud_auth
    return (
        StudioAuthenticatedFileHandler if studio else LocalCredentialsBasedFileHandler
    )


def handle_cp_command(args, catalog):
    file_handler = _get_file_handler(args)
    return file_handler(
        catalog=catalog,
        team=args.team,
        source_path=args.source_path,
        destination_path=args.destination_path,
        update=args.update,
        recursive=args.recursive,
        anon=args.anon,
    ).cp()


def handle_mv_command(args, catalog):
    file_handler = _get_file_handler(args)
    return file_handler(
        catalog=catalog,
        team=args.team,
        path=args.path,
        new_path=args.new_path,
        recursive=args.recursive,
    ).mv()


def handle_rm_command(args, catalog):
    file_handler = _get_file_handler(args)
    return file_handler(
        catalog=catalog,
        team=args.team,
        path=args.path,
        recursive=args.recursive,
    ).rm()


def handle_clone_command(args, catalog):
10 changes: 10 additions & 0 deletions src/datachain/cli/commands/storage/__init__.py
@@ -0,0 +1,10 @@
from .local import LocalCredentialsBasedFileHandler
from .studio import StudioAuthenticatedFileHandler
from .utils import build_file_paths, validate_upload_args

__all__ = [
"LocalCredentialsBasedFileHandler",
"StudioAuthenticatedFileHandler",
"build_file_paths",
"validate_upload_args",
]