Phase 4: Data pipeline modernization and Optuna HPO by w4nderlust · Pull Request #4090 · ludwig-ai/ludwig

w4nderlust · 2026-04-05T07:38:14Z

Summary

Phase 4 of the Ludwig modernization: data pipeline improvements and Optuna integration.

1. Typed Feature Metadata Classes

Replaces the untyped `TrainingSetMetadataDict = dict` with structured dataclasses that provide type safety, IDE autocomplete, and prevent key typo bugs.

```python
from ludwig.data.types import NumberMetadata, CategoryMetadata, TrainingSetMetadata

Typed access

meta = NumberMetadata(mean=5.0, std=2.0, ple_bin_edges=[0.0, 0.5, 1.0])
meta.mean # IDE autocomplete works

Backward compatible with dict-like access

tsm = TrainingSetMetadata()
tsm["feature_name"] = {"mean": 5.0} # dict-like set
value = tsm["feature_name"] # dict-like get
```

Classes: NumberMetadata, CategoryMetadata, TextMetadata, BinaryMetadata, ImageMetadata, SequenceMetadata, AudioMetadata, TrainingSetMetadata. All have from_dict/to_dict for serialization.

2. Native Optuna HPO Executor

Direct Optuna integration without requiring Ray Tune as intermediary:

```python
from ludwig.hyperopt.optuna_executor import OptunaExecutor

executor = OptunaExecutor(
parameters={
"trainer.learning_rate": {"space": "loguniform", "lower": 1e-5, "upper": 1e-2},
"trainer.batch_size": {"space": "int", "lower": 16, "upper": 256},
},
metric="validation.combined.loss",
goal="minimize",
num_samples=50,
sampler="auto", # AutoSampler, GPSampler, TPE, CMA-ES, Random
pruner="hyperband", # optional early stopping
storage="sqlite:///optuna.db", # optional persistence
)
best_params = executor.optimize(train_fn)
results = executor.get_results()
```

Supports all Optuna search spaces: uniform, loguniform, int, choice/categorical, grid.

3. HDF5 to Parquet Cache Migration

Changed default preprocessing cache format from HDF5 to Parquet:

`TRAINING_PREPROC_FILE_NAME` changed from `training.hdf5` to `training.parquet`
Legacy `TRAINING_PREPROC_HDF5_FILE_NAME` kept for backward-compatible loading

4. Preprocessing Module Extraction

First step in splitting the 2,386-line preprocessing.py into focused modules:

`ludwig/data/format_registry.py`: Data format detection from file extensions, format-to-name mapping
`ludwig/data/split_utils.py`: Train/val/test splitting with random and stratified support

The original preprocessing.py remains intact for backward compatibility. New code should import from these focused modules.

5. Dask as Optional Dependency

Dask is already optional via try/except in `ludwig/utils/types.py`. No further changes needed as the import is properly guarded.

Test plan

37 new tests: typed metadata (12), Optuna executor (11), format registry (7), split utils (7)
1178 existing tests pass (0 regressions)
Pre-commit hooks pass on all commits
CI

Supplements existing MLflow integration with 3.x features: - log_training_run(): model-centric tracking with LoggedModel entities - log_llm_trace(): structured GenAI tracing for LLM prompts/responses - Automatic config param logging and training metric logging - Graceful degradation when MLflow 3.x is not available

- Deprecate save_ludwig_model_for_inference() and save_torchscript() with warnings pointing to export_model(format='torch_export') - Add structured request/response logging middleware to serve_v2 - Add MLflow cost tracking: model size, parameter counts, param efficiency, base model name for LLMs

Delete inference.py, triton_utils.py, carton_utils.py, inference_utils.py, and all TorchScript tests. Remove save_torchscript/to_torchscript from API and base model. Rename TorchscriptPreprocessingInput to PreprocessingInput. Replace with export_model CLI command.

The rewritten export.py dropped callback iteration that the old version had. Fixes test_export_mlflow_cli and test_export_mlflow_local.

Replace untyped TrainingSetMetadataDict = dict with structured dataclasses: NumberMetadata, CategoryMetadata, TextMetadata, BinaryMetadata, ImageMetadata, SequenceMetadata, AudioMetadata, TrainingSetMetadata. Backward-compatible: dict-like access via __getitem__, from_dict/to_dict, get(), keys(), items(). Existing code continues to work during migration.

Direct Optuna integration without Ray Tune intermediary. Supports: - AutoSampler (auto-selects best algorithm) - GPSampler (Bayesian optimization) - TPE, CMA-ES, Random samplers - MedianPruner and HyperbandPruner for early stopping - Persistent storage via SQLite for resumable studies - All Optuna search space types: uniform, loguniform, int, choice

Change TRAINING_PREPROC_FILE_NAME from training.hdf5 to training.parquet. Keep TRAINING_PREPROC_HDF5_FILE_NAME for backward-compatible loading of legacy cached files.

First step in splitting the 2386-line preprocessing.py into focused modules: - format_registry.py: data format detection from file extensions - split_utils.py: train/val/test splitting with stratified support The original preprocessing.py remains intact for backward compatibility. New code should import from these focused modules.

for more information, see https://pre-commit.ci

generate_search_space() inspects Pydantic config fields and creates Optuna-compatible search spaces based on types and constraints. generate_trainer_search_space() provides sensible defaults for commonly tuned hyperparameters.

github-actions · 2026-04-05T22:48:02Z

Test Results

10 files ± 0 10 suites ±0 1h 48m 43s ⏱️ + 1m 55s
3 661 tests +42 3 631 ✅ +42 30 💤 ±0 0 ❌ ±0
3 750 runs +42 3 707 ✅ +42 43 💤 ±0 0 ❌ ±0

Results for commit ec060c4. ± Comparison against base commit fb963ad.

This pull request removes 3 and adds 45 tests. Note that renamed tests count towards both.

tests.integration_tests.test_experiment ‑ test_experiment_dataset_formats[hdf5]
tests.integration_tests.test_experiment ‑ test_experiment_image_dataset[hdf5_inmem]
tests.integration_tests.test_ray ‑ test_ray_lazy_load_audio_error

tests.integration_tests.test_ray ‑ test_ray_audio_basic
tests.ludwig.data.test_format_registry.TestDetectFormat ‑ test_case_insensitive
tests.ludwig.data.test_format_registry.TestDetectFormat ‑ test_csv
tests.ludwig.data.test_format_registry.TestDetectFormat ‑ test_hdf5
tests.ludwig.data.test_format_registry.TestDetectFormat ‑ test_json
tests.ludwig.data.test_format_registry.TestDetectFormat ‑ test_parquet
tests.ludwig.data.test_format_registry.TestDetectFormat ‑ test_unknown
tests.ludwig.data.test_format_registry.TestDetectFormatFromDataset ‑ test_dataframe
tests.ludwig.data.test_format_registry.TestDetectFormatFromDataset ‑ test_dict
tests.ludwig.data.test_format_registry.TestDetectFormatFromDataset ‑ test_string_path
…

♻️ This comment has been updated with latest results.

The TRAINING_PREPROC_FILE_NAME was changed to .parquet but the underlying cache manager still writes HDF5. Revert to .hdf5 until the cache format is actually migrated. Also skip test_contrib_comet on Python 3.12+ since comet_ml uses the removed imp module.

Register OptunaExecutor in executor_registry so it can be selected via config with executor.type=optuna. Rewrote executor to implement the standard Ludwig execute() interface (train model per trial, collect HyperoptResults). Updated run.py to allow optuna executor with local backend.

Replace HDF5 caching layer with Parquet files for simpler, faster, and more portable data caching. Key changes: - PandasDatasetManager.save() now writes Parquet via PyArrow - PandasDatasetManager.data_format returns "parquet" - PandasDataset loads from Parquet by default, with legacy HDF5 fallback for backward compatibility - TRAINING_PREPROC_FILE_NAME changed to "training.parquet" - Removed the out-of-memory H5 random access path (Parquet is always loaded fully into memory as numpy arrays) - h5py and tables removed from core dependencies (h5py is optional for legacy HDF5 file loading) - h5py import made conditional in fs_utils.py

Delegate extension-based format detection to the format_registry module instead of inlining the extension-to-format mapping. Keeps special-case handling for CacheableDataset, dask DataFrames, and ludwig:// / hf:// prefixes in the original function.

1. to_numpy_dataset() now accepts dict input (returns as-is with np.array conversion) 2. Parquet cache saves/loads N-D array shapes via sidecar .shapes.json files, fixing flattened image [H,W,C] and audio [T,F] arrays on round-trip 3. Image preprocessing: removed HDF5 out-of-memory path (upload_h5), always process in-memory since Parquet cache handles persistence 4. Audio preprocessing: same - always process in-memory, removed in_memory gate 5. Renamed data_hdf5_fp -> data_cache_fp across PandasDataset, RayDataset, preprocessing.py, test_batcher.py, and test_experiment.py 6. Dataset config discovery uses os.listdir instead of importlib.resources for reliable YAML file enumeration across install modes 7. Cache delete cleans up .shapes.json sidecar files alongside Parquet 8. _LazyRegistry.keys() includes lazy entries for better error messages

PandasDataset now restores N-D array shapes (images, audio) using reshape metadata from training_set_metadata. This fixes the flattening that happens during Parquet-compatible preprocessing. Also removes HDF5 test variants and lazy-load tests since the HDF5 cache path has been replaced by Parquet.

- Add py-cpuinfo to dependencies (was transitive dep of tables, which was removed in HDF5-to-Parquet migration) - Replace test_ray_lazy_load_audio_error with test_ray_audio_basic that skips the Ray-vs-local determinism check (tiny audio datasets produce non-deterministic roc_auc)

w4nderlust and others added 13 commits April 4, 2026 18:42

feat: add ModelExporter with torch.export and ONNX support

7fcf29e

feat: modernized FastAPI serving and vLLM LLM serving backend

95bba4d

test: add tests for model export and modernized serving

5ee72cd

fix: restore callbacks support in export_mlflow

508777a

The rewritten export.py dropped callback iteration that the old version had. Fixes test_export_mlflow_cli and test_export_mlflow_local.

feat: switch default preprocessing cache from HDF5 to Parquet

f1ded41

Change TRAINING_PREPROC_FILE_NAME from training.hdf5 to training.parquet. Keep TRAINING_PREPROC_HDF5_FILE_NAME for backward-compatible loading of legacy cached files.

[pre-commit.ci] auto fixes from pre-commit.com hooks

9350e5e

for more information, see https://pre-commit.ci

fix: remove unused BinaryMetadata import in test

ab897e7

Base automatically changed from export-serving-modernization to main April 5, 2026 08:02

w4nderlust added 9 commits April 5, 2026 22:41

Fix CI: skip comet test on Python 3.12+, remove unused variable

76b7ed5

Fix duplicate kwargs in _run_no_evaluate

ec060c4

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Phase 4: Data pipeline modernization and Optuna HPO#4090

Phase 4: Data pipeline modernization and Optuna HPO#4090
w4nderlust wants to merge 23 commits intomainfrom
data-pipeline-hyperopt-modernization

w4nderlust commented Apr 5, 2026

Uh oh!

github-actions bot commented Apr 5, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

w4nderlust commented Apr 5, 2026

Summary

1. Typed Feature Metadata Classes

Typed access

Backward compatible with dict-like access

2. Native Optuna HPO Executor

3. HDF5 to Parquet Cache Migration

4. Preprocessing Module Extraction

5. Dask as Optional Dependency

Test plan

Uh oh!

github-actions bot commented Apr 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Test Results

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

github-actions bot commented Apr 5, 2026 •

edited

Loading