diff --git a/docs/lakebridge/docs/overview.mdx b/docs/lakebridge/docs/overview.mdx index 2926202ccf..24cb0b3bb1 100644 --- a/docs/lakebridge/docs/overview.mdx +++ b/docs/lakebridge/docs/overview.mdx @@ -45,10 +45,11 @@ For migrating your SQL workloads we provide transpilers that can: - Translate SQL code from a variety of source platforms to Databricks SQL. - Translate some orchestration and ETL code to Databricks SQL. -Internally, Lakebridge can use two different transpilers: +Internally, Lakebridge can use three different transpilers: - *BladeBridge*, a mature transpiler that can handle a wide range of source dialects as well as some ETL/orchestration. - *Morpheus*, a next-generation transpiler that currently handles a smaller set of dialects, but includes experimental support for dbt. + - *Switch*, an LLM-powered transpiler that uses Large Language Models to convert SQL and other source formats to Databricks notebooks. The table below summarizes the source platforms that we currently support: diff --git a/docs/lakebridge/docs/transpile/overview.mdx b/docs/lakebridge/docs/transpile/overview.mdx index 75ea7810bb..ccc34c85a5 100644 --- a/docs/lakebridge/docs/transpile/overview.mdx +++ b/docs/lakebridge/docs/transpile/overview.mdx @@ -133,6 +133,20 @@ Whether you're migrating a single job or thousands, BladeBridge delivers predict More details about the BladeBridge converter [here](/docs/transpile/pluggable_transpilers/bladebridge_overview) +### Switch + +Switch is an LLM-powered Lakebridge transpiler that extends beyond SQL to convert various source formats into Databricks-compatible outputs. Using [Mosaic AI Model Serving](https://docs.databricks.com/aws/en/machine-learning/model-serving/), it understands code intent and semantics to handle complex transformations. + +Key characteristics: +- **LLM-powered extensibility** - Built-in prompts for multiple SQL dialects (T-SQL, Snowflake, Teradata, Oracle, etc.) and generic formats (Python, Scala, Airflow) +- **Custom prompt support** - Add YAML prompts to handle any source format not covered by built-in options +- **Flexible outputs** - Python notebooks with Spark SQL (default), SQL notebooks (experimental), or any text-based format +- **Databricks-native processing** - Executes as scalable Lakeflow Jobs with serverless compute +- **Model flexibility** - Choose any Model Serving endpoint to match your organization's requirements +- **Complex logic handling** - Excels at stored procedures and business logic beyond ANSI SQL/PSM standards + +Switch's LLM-based approach enables support for any source format through custom prompts. For built-in prompt details and custom prompt creation, see the [Switch documentation](/docs/transpile/pluggable_transpilers/switch). + ### Supported dialects | Source Platform | Converter: BladeBridge | Converter: Morpheus | Output: SQL | Output: SparkSql | Output: PySpark | Other features available | diff --git a/docs/lakebridge/docs/transpile/pluggable_transpilers/index.mdx b/docs/lakebridge/docs/transpile/pluggable_transpilers/index.mdx index 4090b18a6c..3feee06f68 100644 --- a/docs/lakebridge/docs/transpile/pluggable_transpilers/index.mdx +++ b/docs/lakebridge/docs/transpile/pluggable_transpilers/index.mdx @@ -64,10 +64,11 @@ custom: # this section is optional, it is passed to the transpiler at startup : # can be pretty much anything ``` -Databricks provides 2 transpilers: _Morpheus_, its AST-based transpiler, and _BladeBridge_, a pattern-based transpiler. 
+Databricks provides 3 transpilers: _Morpheus_, its AST-based transpiler, _BladeBridge_, a pattern-based transpiler, and _Switch_, an LLM-powered transpiler. These transpilers are installed by `Lakebridge` itself as part of running the `install-transpile` command, as follows: - the latest _Morpheus_ is fetched from [Maven Central](https://central.sonatype.com/), and installed at `.databricks/labs/remorph-transpilers/databricks-morph-plugin/`. - the latest _BladeBridge_ is fetched from [PyPi](https://pypi.org/), and installed at `.databricks/labs/remorph-transpilers/bladebridge/`. + - the latest _Switch_ is fetched from [PyPi](https://pypi.org/), and installed as an optional direct dependency (enabled with `--include-llm-transpiler true`). Installing 3rd party transpilers is the responsibility of their provider. diff --git a/docs/lakebridge/docs/transpile/pluggable_transpilers/switch.mdx b/docs/lakebridge/docs/transpile/pluggable_transpilers/switch.mdx new file mode 100644 index 0000000000..63e1c55dad --- /dev/null +++ b/docs/lakebridge/docs/transpile/pluggable_transpilers/switch.mdx @@ -0,0 +1,701 @@ +--- +title: Switch +sidebar_position: 5 +--- + +import CodeBlock from '@theme/CodeBlock'; + +**Switch** is a Lakebridge transpiler plugin that uses Large Language Models (LLMs) to convert SQL and other source formats into Databricks notebooks or generic files. Switch leverages [Mosaic AI Model Serving](https://docs.databricks.com/aws/en/machine-learning/model-serving/) to understand code intent and semantics, generating equivalent Python notebooks with Spark SQL or other target formats. + +This LLM-powered approach excels at converting complex stored procedures, business logic, and ETL workflows where context and intent matter more than syntactic transformation. While generated notebooks may require manual adjustments, they provide a valuable foundation for Databricks migration. + +--- + +## How Switch Works + +Switch operates through three key components that distinguish it from rule-based transpilers: + +### LLM-Powered Semantic Understanding +Instead of parsing rules, Switch uses [Mosaic AI Model Serving](https://docs.databricks.com/aws/en/machine-learning/model-serving/) to: +- Interpret code intent and business context beyond syntax +- Handle SQL dialects, programming languages, and workflow definitions +- Support complex logic patterns and proprietary extensions +- Enable extensible conversion through custom YAML prompts + +### Native Databricks Integration +Switch runs entirely within the Databricks workspace: +- **Jobs API**: Executes as scalable Databricks Jobs for batch processing +- **Model Serving**: Direct integration with Databricks LLM endpoints, with concurrent processing for multiple files +- **Delta Tables**: Tracks conversion progress and results + +### Flexible Output Formats +- **Notebooks**: Python notebooks containing Spark SQL (primary output) +- **Generic Files**: YAML workflows, JSON configurations, and other text formats +- **Experimental**: Additional SQL notebook output converted from generated Python notebooks + +--- + +## Requirements + +Before installing Switch, ensure your Databricks environment meets the following requirements: + +### User Entitlements + +Users installing and running Switch require **Workspace Access** entitlement. 
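+
+As a quick pre-flight check, the following minimal sketch (an illustration, not part of Lakebridge) uses the `databricks-sdk` Python package to confirm that the Unity Catalog resources and model serving endpoint described in the subsections below exist and are visible to your user. The names are the example values used elsewhere on this page; substitute your own.
+
+```python
+# Pre-flight sketch (illustrative): requires the databricks-sdk package and
+# workspace authentication (e.g., a configured Databricks CLI profile).
+from databricks.sdk import WorkspaceClient
+
+w = WorkspaceClient()
+
+catalog, schema, volume = "main", "switch", "switch_volume"    # example values from this page
+endpoint = "databricks-claude-sonnet-4-5"                      # example Foundation Model endpoint
+
+print("User:    ", w.current_user.me().user_name)
+print("Catalog: ", w.catalogs.get(catalog).name)
+print("Schema:  ", w.schemas.get(f"{catalog}.{schema}").full_name)
+print("Volume:  ", w.volumes.read(f"{catalog}.{schema}.{volume}").full_name)
+print("Endpoint:", w.serving_endpoints.get(endpoint).name)
+```
+
+If any of these calls fail with a not-found or permissions error, create the resource or request the corresponding grant before running `install-transpile`.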
+ +### Workspace Resources + +- **Serverless job compute** (used by default for Switch job execution) + - If unavailable: Manually configure the Switch job to use classic job compute with DBR 14.3 LTS or higher +- **Foundation Model API** enabled workspace + +### Unity Catalog Resources + +Switch requires a Databricks catalog, schema, and volume: +- **Catalog and Schema**: Store Delta tables for state management and conversion results +- **Volume**: Store uploaded input source files + +**Required permissions:** +- **If using existing resources**: `USE CATALOG`, `USE SCHEMA`, `CREATE TABLE`, `READ VOLUME`, `WRITE VOLUME` +- **If creating new resources**: Catalog/schema/volume creation permissions (Switch will create them automatically during installation) + +--- + +## Source Format Support + +Switch uses LLMs to convert arbitrary source formats through custom prompts, with built-in prompts for SQL dialects, programming languages, and workflow systems. + +### Built-in Prompts: SQL Dialects + +Convert SQL from various dialects to Databricks Python notebooks. + +| Source Technology | Source Systems | +|-------------|----------------| +| `mssql` | Microsoft SQL Server, Azure SQL Database, Azure SQL Managed Instance, Amazon RDS for SQL Server | +| `mysql` | MySQL, MariaDB, and MySQL-compatible services (including Amazon Aurora MySQL, RDS, Google Cloud SQL) | +| `netezza` | IBM Netezza | +| `oracle` | Oracle Database, Oracle Exadata, and Oracle-compatible services (including Amazon RDS) | +| `postgresql` | PostgreSQL and PostgreSQL-compatible services (including Amazon Aurora PostgreSQL, RDS, Google Cloud SQL) | +| `redshift` | Amazon Redshift | +| `snowflake` | Snowflake | +| `synapse` | Azure Synapse Analytics (dedicated SQL pools) | +| `teradata` | Teradata | + +### Built-in Prompts: Non-SQL Sources + +Convert non-SQL files to notebooks or other formats. + +| Source Technology | Source → Target | +|--------------|-----------------| +| `python` | Python Script → Databricks Python Notebook | +| `scala` | Scala Code → Databricks Python Notebook | +| `airflow` | Airflow DAG → Databricks Jobs YAML + Operator conversion guidance (SQL→sql_task, Python→notebook, etc.) | + +### Custom Prompts: Any Source Format + +Switch's LLM-based architecture supports additional conversion types through custom YAML conversion prompts, making it extensible beyond built-in options. + +For custom prompt creation, see the [Customizable Prompts](#customizable-prompts) section. + +--- + +## Installation + +Install Switch using the `install-transpile` command with the `--include-llm-transpiler` option: + +```bash +databricks labs lakebridge install-transpile --include-llm-transpiler true +``` + +During installation, you will be prompted to configure: + +1. **Catalog**: Catalog for state tables +2. **Schema**: Schema for storing conversion results +3. **Volume**: Volume for uploading input files +4. **Foundation Model**: Model serving endpoint (e.g., `databricks-claude-sonnet-4-5`) + +**Note**: If you specify non-existing catalog, schema, or volume, Switch will attempt to create them (requires appropriate creation permissions). + +The installation automatically: + +1. **Creates Databricks Job**: Configured job in your workspace for running conversions +2. **Uploads Notebooks**: Switch processing notebooks to workspace +3. **Saves Configuration**: Stores catalog, schema, volume, and foundation_model settings in Lakebridge config + +--- + +## Usage + +Use the `llm-transpile` command to run Switch conversions. 
The command takes local file paths as input and automatically uploads them to Unity Catalog Volumes before processing: + +```bash +databricks labs lakebridge llm-transpile \ + --input-source /local/path/to/sql \ + --output-folder /Workspace/path/to/notebooks \ + --source-tech snowflake +``` + +When executing the above command, the response will look like this: + +```json +{ + "transpiler": "switch", + "job_id": 12345, + "run_id": 67890, + "run_url": "https://your-workspace.databricks.com/#job/12345/run/67890" +} +``` + +### Operational Notes + +Switch operates differently from other Lakebridge transpilers: + +- **Local Input Paths**: Input files are read from your local filesystem and automatically uploaded to Unity Catalog Volumes +- **Workspace Output Paths**: Output is written to Databricks Workspace paths (e.g., `/Workspace/path/to/...`) +- **Jobs API Execution**: Switch runs as a Databricks Job in your workspace, not as a local process +- **Asynchronous by Default**: The command returns immediately with a job URL, allowing you to monitor progress in the Databricks workspace +- **Monitoring**: Use the returned job URL to track conversion progress and view logs + +--- + +## Configuration + +Switch provides flexible configuration through three tiers: command-line parameters for each execution, Lakebridge configuration for installation settings, and Switch configuration file for customizing conversion behavior. + +### Command-Line Parameters + +The `llm-transpile` command requires only three parameters: + +| Parameter | Description | Example | +|-----------|-------------|---------| +| `--input-source` | Local file system path containing files to convert (automatically uploaded to Volume) | `/local/path/to/sql` | +| `--output-folder` | Databricks workspace path for generated outputs (must start with `/Workspace/`) | `/Workspace/path/to/notebooks` | +| `--source-tech` | Source technology/dialect to convert from (see [Source Format Support](#source-format-support) for available options) | `snowflake`, `oracle`, `python`, `airflow`, etc. | + +### Lakebridge Configuration File + +These settings are configured during `install-transpile` and stored in Lakebridge's config file. + +**File location:** `/Workspace/Users/{user}/.lakebridge/config.yml` (under `transpiler_options` section) + +| Parameter | Description | Example | +|-----------|-------------|---------| +| `catalog` | Databricks Unity Catalog for Switch Delta tables and Volume | `main` | +| `schema` | Schema within the catalog for Switch Delta tables and Volume | `switch` | +| `volume` | Unity Catalog Volume for uploaded input source files | `switch_volume` | +| `foundation_model` | Model serving endpoint name for conversions | `databricks-claude-sonnet-4-5` | + +### Switch Configuration File + +Additional conversion parameters are managed in the Switch configuration file. You can edit this file directly in your workspace to customize Switch's conversion behavior. + +**File location:** `/Workspace/Users/{user}/.lakebridge/switch/resources/switch_config.yml` + +| Parameter | Description | Default Value | Available Options | +|-----------|-------------|---------------|-------------------| +| `target_type` | Output format type. `notebook` for Python notebooks with validation and error fixing, `file` for generic file formats. See [Conversion Flow Overview](#conversion-flow-overview) for processing differences. | `notebook` | `notebook`, `file` | +| `source_format` | Source file format type. 
`sql` performs SQL comment removal and whitespace compression preprocessing before conversion. `generic` processes files as-is without preprocessing. Preprocessing affects token counting and conversion quality. See [analyze_input_files](#analyze_input_files) for preprocessing details. | `sql` | `sql`, `generic` | +| `comment_lang` | Language for generated comments. | `English` | `English`, `Japanese`, `Chinese`, `French`, `German`, `Italian`, `Korean`, `Portuguese`, `Spanish` | +| `log_level` | Logging verbosity level. | `INFO` | `DEBUG`, `INFO`, `WARNING`, `ERROR` | +| `token_count_threshold` | Maximum tokens per file for processing. Files exceeding this limit are automatically excluded from conversion. Adjust based on your model's context window and conversion complexity. See [Token Management](#token-management) for detailed configuration guidelines and file splitting strategies. | `20000` | Any positive integer | +| `concurrency` | Number of parallel LLM requests for processing multiple files simultaneously. Higher values improve throughput but may hit rate limits. Default is optimized for Claude models. See [Performance Optimization](#performance-optimization) for scaling guidance and model-specific considerations. | `4` | Any positive integer | +| `max_fix_attempts` | Maximum number of automatic syntax error correction attempts per file. Each attempt sends error context back to the LLM for fixing. Set to 0 to skip automatic fixes. See [fix_syntax_with_llm](#fix_syntax_with_llm) for details on the error correction process. | `1` | 0 or any positive integer | +| `conversion_prompt_yaml` | Custom conversion prompt YAML file path. When specified, overrides the built-in prompt for the selected `--source-tech`, enabling support for additional source formats or specialized conversion requirements. See [Customizable Prompts](#customizable-prompts) for YAML structure and creation guide. | `null` | Full workspace path to YAML file | +| `output_extension` | File extension for output files when `target_type=file`. Required for non-notebook output formats like YAML workflows or JSON configurations. See [File Conversion Flow](#file-conversion-flow) for usage examples. | `null` | Any extension (e.g., `.yml`, `.json`) | +| `sql_output_dir` | (Experimental) When specified, triggers additional conversion of Python notebooks to SQL notebook format. This optional post-processing step may lose some Python-specific logic. See [convert_notebook_to_sql](#convert_notebook_to_sql-optional) for details on the SQL conversion process. | `null` | Full workspace path | +| `request_params` | Additional request parameters passed to the model serving endpoint. Use for advanced configurations like extended thinking mode or custom token limits. See [Model Requirements](#model-requirements) for configuration examples including Claude's extended thinking mode. 
| `null` | JSON format string (e.g., `{"max_tokens": 64000}`) | + +--- + +## Usage Examples + +### Example 1: SQL conversion using built-in prompt + +Convert Snowflake SQL to Databricks Python notebooks using the built-in Snowflake conversion prompt: + +```bash +databricks labs lakebridge llm-transpile \ + --input-source /local/path/to/sql \ + --output-folder /Workspace/path/to/notebooks \ + --source-tech snowflake +``` + +### Example 2: SQL conversion using custom prompt + +Convert SQL using a custom conversion prompt YAML file: + +First, edit `switch_config.yml` to specify your custom prompt (leave other parameters unchanged): +```yaml +conversion_prompt_yaml: "/Workspace/path/to/my_custom_prompt.yml" +``` + +Then run (note: when using a custom prompt, `--source-tech` can be any value since the custom prompt takes precedence): +```bash +databricks labs lakebridge llm-transpile \ + --input-source /local/path/to/sql \ + --output-folder /Workspace/path/to/notebooks \ + --source-tech oracle +``` + +### Example 3: Python script to Databricks notebook + +Convert Python scripts to Databricks Python notebooks: + +First, edit `switch_config.yml` to set `source_format: "generic"` (leave other parameters unchanged): +```yaml +source_format: "generic" +``` + +Then run: +```bash +databricks labs lakebridge llm-transpile \ + --input-source /local/path/to/scripts \ + --output-folder /Workspace/path/to/notebooks \ + --source-tech python +``` + +### Example 4: Airflow DAG to Databricks Jobs YAML + +Convert Airflow DAGs to Databricks Jobs YAML definitions: + +First, edit `switch_config.yml` (leave other parameters unchanged): +```yaml +source_format: "generic" +target_type: "file" +output_extension: ".yml" +``` + +Then run: +```bash +databricks labs lakebridge llm-transpile \ + --input-source /local/path/to/dags \ + --output-folder /Workspace/path/to/workflows \ + --source-tech airflow +``` + +--- + +## Customizable Prompts + +You can create custom conversion prompts for Switch to handle new SQL dialects or specialized conversion requirements. + +### Creating Custom Conversion Prompts + +To create a custom conversion prompt: + +1. **Create a YAML file** with the required structure +2. **Place it in your Databricks workspace** +3. **Specify the full path** in the `conversion_prompt_yaml` parameter + +Custom conversion prompts require two main sections: + +#### Required Structure + +Here's a simple example showing the basic structure of a custom conversion prompt: + +```yaml +system_message: | + Convert SQL code to Python code that runs on Databricks according to the following instructions: + + # Input and Output + - Input: A single SQL file containing one or multiple T-SQL statements + - Output: Python code with Python comments (in {comment_lang}) explaining the code + + ${common_python_instructions_and_guidelines} + + # Additional Instructions + 1. Convert SQL queries to spark.sql() format + 2. Add clear Python comments explaining the code + 3. Use DataFrame operations instead of loops when possible + 4. 
Handle errors using try-except blocks + +few_shots: +- role: user + content: | + SELECT name, age + FROM users + WHERE active = 1; +- role: assistant + content: | + # Get names and ages of active users + active_users = spark.sql(""" + SELECT name, age + FROM users + WHERE active = 1 + """) + display(active_users) +``` + +#### Key Elements + +**`system_message` Section**: +- Clear explanation of the conversion purpose +- Definition of input and output formats +- Additional instructions for specific conversions +- (Optional) Comment language specification (`{comment_lang}` will be replaced automatically by Switch) +- (Optional) Common instructions placeholder (`${common_python_instructions_and_guidelines}` will be replaced automatically by Switch with built-in conversion guidelines) + +**`few_shots` Section** (Optional but recommended): +- Include examples ranging from simple to complex cases +- Each example demonstrates specific patterns for LLM understanding +- Shows typical conversion patterns for your SQL dialect + +### Reference: Built-in YAML Files + +Switch includes built-in YAML configuration files for supported conversion types, including SQL dialects and other source formats. When creating custom prompts, these built-in configurations serve as excellent starting points - even for supported conversion types, customizing the default prompts based on your specific input patterns can significantly improve conversion accuracy. + +**Location**: You can find built-in YAML files in the `/Workspace/Users/{user}/.lakebridge/switch/resources/builtin_prompts/` directory within the Switch installation. These files demonstrate the proper structure and provide conversion-specific examples that you can adapt for your custom requirements. + +--- + +## Transpiler Selection Guide + +Lakebridge offers both LLM-powered and rule-based transpilers, each optimized for different conversion scenarios. 
+ +### LLM-Powered Conversion (Switch) + +Switch is best suited for scenarios requiring semantic understanding and flexibility: +- **Complex logic requiring contextual understanding** - Stored procedures and business logic where intent matters more than syntax +- **Source formats not covered by rule-based transpilers** - Any SQL dialect or programming language through custom prompts +- **Extensible conversion through custom YAML prompts** - Adapt to proprietary or uncommon source formats +- **Python notebook output for SQL beyond ANSI SQL/PSM standards** - Complex transformations that benefit from procedural code + +### Rule-Based Conversion (BladeBridge & Morpheus) + +Rule-based transpilers excel in scenarios requiring consistency and speed: +- **Deterministic output with guaranteed syntax equivalence** - Every conversion produces the same predictable result +- **High-volume batch processing** - Efficiently handle thousands of files without API rate limits +- **Fast local execution without API dependencies** - Sub-minute processing with no external service calls +- **Production-grade SQL aligned with Databricks SQL evolution** - Leverages SQL Scripting, Stored Procedures, and latest DBSQL features + +### Key Decision Factors + +Consider these aspects when choosing between transpilers: +- **Complexity**: Switch excels at complex logic requiring contextual understanding; rule-based transpilers handle standard SQL patterns efficiently +- **Volume**: Rule-based transpilers process large volumes quickly; Switch is better for selective, complex conversions +- **Customization**: Switch adapts to any source format through prompts; rule-based transpilers support specific, well-defined dialects +- **Output format**: Switch primarily generates Python notebooks; rule-based transpilers produce SQL files + +### Best Practices for Switch + +When using Switch, follow these guidelines for optimal results: +1. **Start with representative sample files** - Test conversion quality before processing entire codebase +2. **Iteratively refine prompts based on results** - Adjust YAML prompts to improve accuracy for your specific patterns +3. **Review generated code before production use** - LLM outputs may require manual adjustments +4. **Document successful patterns for future reference** - Keep track of effective prompts and configurations for reuse + +--- + +## FAQ: Can Switch Support My Source System? + +A common question is whether Switch can handle sources beyond the built-in conversion types. The answer is: **try it!** + +Switch already supports various source formats including SQL dialects (MySQL, Snowflake, Oracle, etc.), programming languages (Python scripts, Scala code), and workflows (Airflow DAGs). + +**For SQL-based sources**: Creating a custom prompt YAML file should work well for most SQL dialects. 
Since LLMs understand SQL syntax patterns, you can typically achieve good results by: +- Starting with a similar built-in dialect's YAML as a template +- Adding specific syntax examples from your source system +- Testing and iterating based on results + +**Tips for efficient prompt creation:** +- **Quick baseline creation**: Feed built-in prompts and your source dialect's representative features to an advanced LLM to quickly generate a baseline YAML configuration +- **Dialect-specific patterns**: Reference open source projects like [SQLGlot dialects](https://github.com/tobymao/sqlglot/tree/main/sqlglot/dialects) for insights into dialect-specific transformation patterns + +**For other source formats**: Switch's LLM-based architecture means it can potentially handle various conversions beyond the built-in types. Modern LLMs have strong comprehension capabilities across many languages and formats. You can experiment by: +- Creating custom prompts that define your source format +- Providing clear conversion examples in the few-shots section +- Testing with representative source samples + +Rather than waiting for additional built-in examples, we encourage experimentation with custom prompts. The flexibility of LLM-based conversion means many use cases are possible with the right prompt engineering. + +--- + +## Databricks Implementation Details + +When you run Switch via the CLI, it executes as Databricks Jobs using a sophisticated multi-stage processing pipeline. This section covers the internal architecture and configuration options. + +## Processing Architecture + +Switch executes as a Databricks Job that runs the main orchestration notebook (`00_main`), which routes to specialized orchestrators that coordinate the conversion pipeline: + +### Main Orchestration +The **`00_main`** notebook serves as the entry point when Switch is executed via Databricks Jobs API. It: +- Validates all input parameters from the job configuration +- Routes execution to the appropriate orchestrator based on `target_type` (notebook or file output) +- Handles orchestrator results and displays final conversion summary + +### Conversion Flow Overview + +Switch supports two target types with different processing workflows. 
The main entry point (`00_main`) routes to the appropriate orchestrator based on the `target_type` parameter: + +```mermaid +flowchart TD + main["00_main (Entry Point)"] e1@==> decision{target_type?} + + decision e2@==>|notebook| notebookOrch["orchestrate_to_notebook (Full Processing Pipeline)"] + decision e3@==>|file| fileOrch["orchestrate_to_file (Simplified Processing Pipeline)"] + + notebookOrch e4@==> notebookFlow["7-Step Workflow: analyze → convert → validate → fix → split → export → sql_convert(optional)"] + fileOrch e5@==> fileFlow["3-Step Workflow: analyze → convert → export"] + + notebookFlow e6@==> notebookOutput[Python/SQL Notebooks] + fileFlow e7@==> fileOutput["Generic Files (YAML, JSON, etc.)"] + e1@{ animate: true } + e2@{ animate: true } + e3@{ animate: true } + e4@{ animate: true } + e5@{ animate: true } + e6@{ animate: true } + e7@{ animate: true } +``` + +### Notebook Conversion Flow + +For `target_type=notebook`, the `orchestrate_to_notebook` orchestrator executes a comprehensive 7-step processing pipeline: + +```mermaid +flowchart TD + orchestrator[orchestrate_to_notebook] e1@==>|sequentially calls| processing + + subgraph processing ["Notebook Processing Workflow"] + direction TB + analyze[analyze_input_files] e2@==> convert[convert_with_llm] + convert e3@==> validate[validate_python_notebook] + validate e4@==> fix[fix_syntax_with_llm] + fix e5@==> split[split_code_into_cells] + split e6@==> export[export_to_notebook] + export -.-> sqlExport["convert_notebook_to_sql
(Optional)"] + end + + processing <-->|Read & Write| table[Conversion Result Table] + + convert -.->|Uses| endpoint[Model Serving Endpoint] + fix -.->|Uses| endpoint + sqlExport -.->|Uses| endpoint + + export e7@==> notebooks[Python Notebooks] + sqlExport -.-> sqlNotebooks["SQL Notebooks
(Optional Output)"] + e1@{ animate: true } + e2@{ animate: true } + e3@{ animate: true } + e4@{ animate: true } + e5@{ animate: true } + e6@{ animate: true } + e7@{ animate: true } +``` + +### File Conversion Flow + +For `target_type=file`, the `orchestrate_to_file` orchestrator uses a simplified 3-step processing pipeline optimized for generic file output: + +```mermaid +flowchart TD + orchestrator[orchestrate_to_file] e1@==>|sequentially calls| processing + + subgraph processing ["File Processing Workflow"] + direction TB + analyze[analyze_input_files] e2@==> convert[convert_with_llm] + convert e3@==> export[export_to_file] + end + + processing <-->|Read & Write| table[Conversion Result Table] + + convert -.->|Uses| endpoint[Model Serving Endpoint] + + export e4@==> files["Generic Files (YAML, JSON, etc.)"] + e1@{ animate: true } + e2@{ animate: true } + e3@{ animate: true } + e4@{ animate: true } +``` + +**Key Differences:** +- **File conversion skips** syntax validation, error fixing, and cell splitting steps +- **Direct export** from converted content to specified file format with custom extension +- **Optimized** for non-notebook outputs like YAML workflows, JSON configurations, etc. + +--- + +## Processing Steps + +The following sections describe each processing step used in the workflows above: + +### analyze_input_files +Scans the input directory recursively and performs initial analysis. Stores all file contents, metadata, and analysis results in a timestamped Delta table. For SQL sources, creates preprocessed versions with comments removed and whitespace normalized in the table. Counts tokens using model-specific tokenizers (Claude uses ~3.4 characters per token, OpenAI and other models use tiktoken) to determine if files exceed the `token_count_threshold`. Files exceeding the threshold are excluded from conversion. + +### convert_with_llm +Loads conversion prompts (built-in or custom YAML) and sends file content to the configured model serving endpoint. Multiple files are processed concurrently (configurable, default: 4) for efficiency. The LLM transforms source code based on the conversion prompt, preserving business logic while adapting to Databricks patterns. For SQL sources, generates Python code with `spark.sql()` calls. For generic sources, adapts content to the specified target format. + +### validate_python_notebook +Performs syntax validation on the generated code. Python syntax is checked using `ast.parse()`, while SQL statements within `spark.sql()` calls are validated using Spark's `EXPLAIN` command. Any errors are recorded in the result table for potential fixing in the next step. + +### fix_syntax_with_llm +Attempts automatic error correction when syntax issues are detected. Sends error context back to the model serving endpoint, which suggests corrections. The validation and fix process repeats up to `max_fix_attempts` times (default: 1) until errors are resolved or the retry limit is reached. + +### split_code_into_cells +Transforms raw converted Python code into well-structured notebook cells. Analyzes code flow and dependencies, splitting content at logical boundaries like imports, function definitions, and major operations. Adds appropriate markdown cells for documentation and readability. + +### export_to_notebook +Creates Databricks-compatible `.py` notebooks in the specified output directory. Each notebook includes proper metadata, source file references, and any syntax check results as comments. 
Handles large files (up to 10MB) and preserves the original directory structure. + +### convert_notebook_to_sql (Optional) +When `sql_output_dir` is specified, this optional step uses the model serving endpoint to convert Python notebooks into SQL notebook format with Databricks SQL syntax. Useful for teams preferring SQL-only workflows, though some Python logic may be lost in the conversion process. + +--- + +## Model Requirements + +### Supported LLM Endpoints + +Switch offers flexible model selection - choose any Model Serving endpoint through configuration parameters. + +**Default Setup** +Pre-configured with the latest Claude model via Foundation Model API for immediate use (no configuration required). This ensures strong code comprehension and large context handling out of the box. + +**Model Options** +Switch works with Foundation Model API endpoints backed by models with large context windows and strong code comprehension capabilities: + +- **Pay-per-token pricing**: For light to moderate usage +- **Provisioned Throughput**: For high-volume workloads with predictable cost and guaranteed performance + +**Advanced Configuration** +For complex code transformations and intricate business logic, Claude’s extended thinking mode can significantly improve conversion accuracy. This mode allows the model to reason through complex transformations more thoroughly, though it increases processing time and token usage. Configure via `request_params`: + +```json +{"max_tokens": 64000, "thinking": {"type": "enabled", "budget_tokens": 16000}} +``` + +### Token Management + +LLMs have limits on how much text they can process at once. Switch uses a configurable threshold approach to ensure stable processing. + +**Token Limits and Calculation:** +- **Default threshold**: 20,000 tokens per file +- **Token estimation**: + - Claude models: ~3.4 characters per token (20,000 tokens ≈ 68,000 characters) + - Other models: Uses tiktoken library with o200k_base encoding +- **Preprocessing**: SQL comments removed and whitespace compressed before counting + +**Threshold Configuration:** +- **Base configuration**: 20,000 tokens (tested with Claude 3.7 Sonnet, 128k context) +- **Adjustment guidelines**: + - Complex transformations with extended thinking: Lower threshold (e.g., 8,000 tokens) + - Larger context models (e.g., Claude Sonnet 4, 200k context): Higher threshold possible + +**Handling Oversized Files:** +When files exceed the threshold: +- Automatically excluded during analyze phase +- Marked with status "Not converted" +- Require manual splitting before processing + +**File Splitting Strategies:** +If your input files exceed the threshold, consider logical splitting points: +- Separate stored procedures into individual files +- Split by functional modules or business domains +- Maintain referential integrity across split files + +### Performance Optimization + +**Concurrency Settings:** +- **Default concurrency**: Set to 4 based on testing with Claude models for stable operation +- **Model-specific considerations**: Different foundation models have varying rate limits and optimal concurrency levels +- **Scaling for large workloads**: For processing many files simultaneously, consider: + - **Increased concurrency**: Test higher values with your chosen model to find optimal settings + - **Provisioned Throughput**: Deploy dedicated Foundation Model API capacity with guaranteed throughput + +**Monitoring:** +- Watch for rate limiting or throttling responses from model endpoints +- Consider enabling [Inference
Tables](https://docs.databricks.com/aws/en/machine-learning/model-serving/inference-tables) to automatically capture requests and responses for detailed monitoring and debugging + +--- + +## State Management + +Switch uses a Delta table to track conversion progress and results. Each conversion job creates a timestamped table: `{catalog}.{schema}.lakebridge_switch_{timestamp}_{random}` (e.g., `main.default.lakebridge_switch_20250115143022_7ka9`) + +The table stores input file information (path, content, token counts), conversion results (generated notebooks, token usage, processing time), error details when conversions fail, and syntax check results from validation stages. This allows you to monitor which files were processed successfully and investigate any issues that occurred during conversion. + +### Conversion Result Table Schema + +Switch creates Delta tables with the following complete schema: + +| Column | Type | Description | +|--------|------|-------------| +| `input_file_number` | int | Unique integer identifier for each input file (starts from 1) | +| `input_file_path` | string | Full path to the input file | +| `input_file_encoding` | string | Detected encoding of the input file (e.g., UTF-8) | +| `tokenizer_type` | string | Type of tokenizer used (claude or openai) | +| `tokenizer_model` | string | Specific tokenizer model/encoding used | +| `input_file_token_count` | int | Total number of tokens in the input file | +| `input_file_token_count_preprocessed` | int | Token count of preprocessed content (SQL comments removed for SQL files, original count for generic files) | +| `input_file_content` | string | Entire content of the input file | +| `input_file_content_preprocessed` | string | Preprocessed content (SQL comments removed for SQL files, original content for generic files) | +| `is_conversion_target` | boolean | Whether file should be processed (updated during conversion) | +| `model_serving_endpoint_for_conversion` | string | Model endpoint used for conversion | +| `model_serving_endpoint_for_fix` | string | Model endpoint used for syntax error fixing | +| `request_params_for_conversion` | string | Conversion request parameters in JSON format | +| `request_params_for_fix` | string | Fix request parameters in JSON format | +| `result_content` | string | Generated notebook content (initially null) | +| `result_prompt_tokens` | int | Number of prompt tokens used (initially null) | +| `result_completion_tokens` | int | Number of completion tokens generated (initially null) | +| `result_total_tokens` | int | Total tokens used (prompt + completion, initially null) | +| `result_processing_time_seconds` | float | Processing time in seconds (initially null) | +| `result_timestamp` | timestamp | UTC timestamp when processing completed (initially null) | +| `result_error` | string | Any conversion errors encountered (initially null) | +| `result_python_parse_error` | string | Python syntax errors found using ast.parse (initially null) | +| `result_extracted_sqls` | array<string> | SQL statements extracted from Python code (initially null) | +| `result_sql_parse_errors` | array<string> | SQL syntax errors found using EXPLAIN (initially null) | +| `export_output_path` | string | Path to the exported file (initially null) | +| `export_status` | string | Export processing status (initially null) | +| `export_error` | string | Export error information (initially null) | +| `export_timestamp` | timestamp | UTC timestamp when export completed (initially null) | +| 
`export_content_size_bytes` | long | Size of exported content in bytes (initially null) | + +--- + +## Conversion Results and Troubleshooting + +### Understanding Conversion Results + +After your Switch job completes, review the conversion results displayed at the end of the `00_main` notebook execution. The results table shows the status of each input file: + +- **Successfully converted files**: Ready to use as Databricks notebooks +- **Files requiring attention**: May need manual review or re-processing + +If you encounter files that didn't convert successfully, here are the most common issues and their solutions: + +### Files Not Converting (Status: `Not converted`) + +These files were skipped during the conversion process, typically because they're too large for the model to process effectively. + +**Cause**: Input files exceed the token count threshold + +**Solutions**: +- Split large input files into smaller, more manageable parts +- Increase the `token_count_threshold` parameter if your model can handle larger inputs + +### Conversion with Errors (Status: `Converted with errors`) + +These files were successfully processed by the LLM but the generated code contains syntax errors that need to be addressed. + +**Cause**: Files were converted but contain syntax errors + +**Solutions**: +- Review syntax error messages in the result table's `result_python_parse_error` and `result_sql_parse_errors` columns +- Manually fix errors in the converted notebooks/files +- Increase `max_fix_attempts` for more automatic error correction attempts + +### Export Failures (Status: `Export failed` or `Not exported`) + +These files were converted successfully but couldn't be exported to the output directory. + +**Causes**: +- Content exceeds the 10MB size limit of Databricks notebooks +- File system permissions issues +- Invalid output paths + +**Solutions**: +- Check the `export_error` column in the result table for specific error details +- For size issues: Manually split large converted content into smaller units +- For permission issues: Verify workspace access to the output directory +- For path issues: Ensure output directory paths are valid workspace locations
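+
+To make the triage above concrete, here is a minimal sketch (intended for a Databricks notebook, where `spark` is predefined) that queries the conversion result table using the columns documented in the schema above. The table name is hypothetical and follows the naming pattern described under State Management; treat the filters as starting points and adjust them if the status semantics differ in your Switch version.
+
+```python
+# Hypothetical table name following the documented pattern; replace with your run's table.
+result_table = "main.default.lakebridge_switch_20250115143022_7ka9"
+
+# Files not selected for conversion (assumed to be flagged via is_conversion_target,
+# e.g., because they exceeded token_count_threshold)
+spark.sql(f"""
+    SELECT input_file_path, input_file_token_count_preprocessed
+    FROM {result_table}
+    WHERE NOT is_conversion_target
+""").show(truncate=False)
+
+# Converted files whose generated code still contains Python or SQL syntax errors
+spark.sql(f"""
+    SELECT input_file_path, result_python_parse_error, result_sql_parse_errors
+    FROM {result_table}
+    WHERE result_python_parse_error IS NOT NULL
+       OR size(result_sql_parse_errors) > 0
+""").show(truncate=False)
+
+# Files that converted but could not be exported (size, permission, or path issues)
+spark.sql(f"""
+    SELECT input_file_path, export_status, export_error
+    FROM {result_table}
+    WHERE export_error IS NOT NULL
+""").show(truncate=False)
+```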