Skip to content
Closed
Show file tree
Hide file tree
Changes from 8 commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
24 changes: 24 additions & 0 deletions docs/assessment/scheduler/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,24 @@
# Installing the Usage Collection Scheduler

## Why do we need a usage collection scheduler?
- Some data warehouses do not preserve query history for longer than a few days or are limited to only a few rows. As an example, Amazon Redshift only preserves usage history for around 2-5 days
- Having adequate usage history is critical to provide an accurate Total Cost of Ownership (TCO) estimate
- Having too little history could produce an inaccurate representation of data warehouse query patterns and system utilization

## Supported operating systems
The usage collection scheduler can be installed on the following operating systems:

1. MacOS
2. Linux

## Local vs. Remote Deployments

The usage collection scheduler can be executed in the following deployments:

1. **Local scheduler** - Scheduler is installed on a local laptop or workstation, provided that the local system can connect to a bastion host and forward connectivity to the data warehouse
2. **Remote scheduler** - Scheduler is installed directly on a bastion host in a separate subnet or a compute instance deployed within the data warehouse's subnet


## Automatic restart

The usage collection scheduler utilizes the host operating system to run as a background process. As a result, the process will automatically resume processing after a system restart; no intervention is needed during a host system restart. The scheduler will automatically detect usage history gaps and proactively backfill when possible.
68 changes: 68 additions & 0 deletions docs/assessment/scheduler/linux/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,68 @@
# (Linux) Installing the Usage Collection Scheduler

## Installation Steps:

1. Open a new Terminal window and navigate to the remorph repository

2. Copy the unit file to `/etc/systemd/system/`:

```bash
$ cp ./src/databricks/labs/remorph/assessments/scheduler/install/linux/remorph_usage_collection.service /etc/systemd/system/
```

3. Copy the timer file to `/etc/systemd/system/`:

```bash
$ cp ./src/databricks/labs/remorph/assessments/scheduler/install/linux/remorph_usage_collection.timer /etc/systemd/system/
```

4. Reload the systmd process to pick up the new files:

```bash
$ sudo systemctl daemon-reload
```

5. Enable the system timer

```bash
$ sudo systemctl enable remorph_usage_collection.timer
```

6. Start the system timer

```bash
$ sudo systemctl start remorph_usage_collection.timer
```

## Description of the Unit File Elements

1. **Description**: A description of the service

2. **After**: Ensures the service starts only after the specified system service is up (network in this case)

3. **ExecStart**: Specifies the command and arguments (python3 interpreter and the script path)

4. **WorkingDirectory**: The directory where the script should run

5. **StandardOutput**: Redirects the standard out to the system logs

6. **StandardError**: Redirects the error out to the system logs

7. **Restart**: Restarts the script if the system reboots

8. **User**: Runs the script as the specified user

9. **WantedBy**: Ensures the service starts when the system reaches multi-user mode


## Description of the Timer File Elements

1. **Description**: A description of the timer

2. **Persistent** : Ensures the script runs after reboots

3. **OnBootSec**: Ensures the script runs 1 min after bootin

4. **OnUnitActiveSec**: Runs the script every 15 minutes

5. **WantedBy**: Ensures the timer is activated when the system is running timers
37 changes: 37 additions & 0 deletions docs/assessment/scheduler/macos/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,37 @@
# (MacOS) Installing the Usage Collection Scheduler

## Installation Steps:

1. Open a new Terminal window and navigate to the remorph repository

2. Copy the plist file to `~/Library/LaunchAgents/` for a user-specific task or `/Library/LaunchDaemons/` for a system-wide task:

```bash
$ cp ./src/databricks/labs/remorph/assessments/scheduler/install/macos/com.remorph.usagecollection.plist ~/Library/LaunchAgents/
```

3. Next, load the process by executing the following command:

```bash
$ launchctl load ~/Library/LaunchAgents/com.remorph.usagecollection.plist
```

4. Grant the usage collection script with execution permissions:

```bash
$ chmod +x ./src/databricks/labs/remorph/assessments/scheduler/usage_collector.py
```

## Description of the `plist` Elements

1. **Label**: A unique identifier for the job

2. **ProgramArguments**: Specifies the command and arguments (python3 interpreter and the script path)

3. **StartInterval**: Runs the script every 900 seconds (15 minutes)

4. **RunAtLoad**: Ensures the script runs immediately after loading

5. **StandardOutPath**: Logs output to `~/Downloads/remorph/scheduler/stdout` for debugging

6. **StandardErrorPath**: Logs errors to `~/Downloads/remorph/scheduler/stderr` for debugging
1 change: 1 addition & 0 deletions pyproject.toml
Original file line number Diff line number Diff line change
Expand Up @@ -63,6 +63,7 @@ dependencies = [
"mypy~=1.10.0",
"numpy==1.26.4",
"pandas==1.4.1",
"snowflake-sqlalchemy>=1.7.3"
]

[project.entry-points.databricks]
Expand Down
36 changes: 27 additions & 9 deletions src/databricks/labs/remorph/assessments/pipeline.py
Original file line number Diff line number Diff line change
Expand Up @@ -5,6 +5,7 @@
import subprocess
import yaml
import duckdb
from enum import StrEnum

from databricks.labs.remorph.connections.credential_manager import cred_file

Expand All @@ -17,6 +18,12 @@
DB_NAME = "profiler_extract.db"


class StepExecutionStatus(StrEnum):
COMPLETE = "COMPLETE"
ERROR = "ERROR"
SKIPPED = "SKIPPED"


class PipelineClass:
def __init__(self, config: PipelineConfig, executor: DatabaseManager):
self.config = config
Expand All @@ -28,16 +35,26 @@ def execute(self):
for step in self.config.steps:
if step.flag == "active":
logging.debug(f"Executing step: {step.name}")
self._execute_step(step)
self.execute_step(step)
logging.info("Pipeline execution completed")

def _execute_step(self, step: Step):
if step.type == "sql":
self._execute_sql_step(step)
elif step.type == "python":
self._execute_python_step(step)
else:
logging.error(f"Unsupported step type: {step.type}")
def execute_step(self, step: Step) -> StrEnum:
try:
if step.type == "sql":
logging.info(f"Executing SQL step {step.name}")
self._execute_sql_step(step)
status = StepExecutionStatus.COMPLETE
elif step.type == "python":
logging.info(f"Executing Python step {step.name}")
self._execute_python_step(step)
status = StepExecutionStatus.COMPLETE
else:
logging.error(f"Unsupported step type: {step.type}")
raise RuntimeError(f"Unsupported step type: {step.type}")
except RuntimeError as e:
logging.error(e)
status = StepExecutionStatus.ERROR
return status

def _execute_sql_step(self, step: Step):
logging.debug(f"Reading query from file: {step.extract_source}")
Expand Down Expand Up @@ -86,7 +103,8 @@ def _save_to_db(self, result, step_name: str, mode: str, batch_size: int = 1000)
self._create_dir(self.db_path_prefix)
conn = duckdb.connect(str(self.db_path_prefix) + '/' + DB_NAME)
columns = result.keys()
# TODO: Add support for figuring out data types from SQLALCHEMY result object result.cursor.description is not reliable
# TODO: Add support for figuring out data types from SQLALCHEMY
# result object result.cursor.description is not reliable
schema = ' STRING, '.join(columns) + ' STRING'

# Handle write modes
Expand Down
Original file line number Diff line number Diff line change
@@ -0,0 +1,14 @@
[Unit]
Description=Remorph usage collection service
After=network.target

[Service]
ExecStart=/usr/bin/python3 ~/Downloads/remorph/scheduler/usage_collector.py
WorkingDirectory=~/Downloads/remorph/scheduler/
StandardOutput=journal
StandardError=journal
Restart=always
User=root

[Install]
WantedBy=multi-user.target
Original file line number Diff line number Diff line change
@@ -0,0 +1,14 @@
[Unit]
Description=Remorph usage collection every 15 minutes
After=network.target

[Service]
ExecStart=/usr/bin/python3 ~/Downloads/remorph/scheduler/usage_collector.py
WorkingDirectory=~/Downloads/remorph/scheduler/
StandardOutput=journal
StandardError=journal
Restart=always
User=root

[Install]
WantedBy=multi-user.target
Original file line number Diff line number Diff line change
@@ -0,0 +1,27 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
<key>Label</key>
<string>com.remorph.usagecollection</string>

<key>ProgramArguments</key>
<array>
<string>/Library/Frameworks/Python.framework/Versions/3.11/bin/python3</string>
<string>~/Downloads/remorph/scheduler/usage_collector.py</string>
</array>

<key>StartInterval</key>
<integer>900</integer>

<key>RunAtLoad</key>
<true/>

<key>StandardOutPath</key>
<string>~/Downloads/remorph/scheduler/stdout/usagecollection.log</string>

<key>StandardErrorPath</key>
<string>~/Downloads/remorph/scheduler/stderr/usagecollection_error.log</string>
</dict>
</plist>

Loading
Loading