Skip to content

Phase 4: Data pipeline modernization and Optuna HPO#4090

Open
w4nderlust wants to merge 23 commits intomainfrom
data-pipeline-hyperopt-modernization
Open

Phase 4: Data pipeline modernization and Optuna HPO#4090
w4nderlust wants to merge 23 commits intomainfrom
data-pipeline-hyperopt-modernization

Conversation

@w4nderlust
Copy link
Copy Markdown
Collaborator

Summary

Phase 4 of the Ludwig modernization: data pipeline improvements and Optuna integration.

1. Typed Feature Metadata Classes

Replaces the untyped `TrainingSetMetadataDict = dict` with structured dataclasses that provide type safety, IDE autocomplete, and prevent key typo bugs.

```python
from ludwig.data.types import NumberMetadata, CategoryMetadata, TrainingSetMetadata

Typed access

meta = NumberMetadata(mean=5.0, std=2.0, ple_bin_edges=[0.0, 0.5, 1.0])
meta.mean # IDE autocomplete works

Backward compatible with dict-like access

tsm = TrainingSetMetadata()
tsm["feature_name"] = {"mean": 5.0} # dict-like set
value = tsm["feature_name"] # dict-like get
```

Classes: NumberMetadata, CategoryMetadata, TextMetadata, BinaryMetadata, ImageMetadata, SequenceMetadata, AudioMetadata, TrainingSetMetadata. All have from_dict/to_dict for serialization.

2. Native Optuna HPO Executor

Direct Optuna integration without requiring Ray Tune as intermediary:

```python
from ludwig.hyperopt.optuna_executor import OptunaExecutor

executor = OptunaExecutor(
parameters={
"trainer.learning_rate": {"space": "loguniform", "lower": 1e-5, "upper": 1e-2},
"trainer.batch_size": {"space": "int", "lower": 16, "upper": 256},
},
metric="validation.combined.loss",
goal="minimize",
num_samples=50,
sampler="auto", # AutoSampler, GPSampler, TPE, CMA-ES, Random
pruner="hyperband", # optional early stopping
storage="sqlite:///optuna.db", # optional persistence
)
best_params = executor.optimize(train_fn)
results = executor.get_results()
```

Supports all Optuna search spaces: uniform, loguniform, int, choice/categorical, grid.

3. HDF5 to Parquet Cache Migration

Changed default preprocessing cache format from HDF5 to Parquet:

  • `TRAINING_PREPROC_FILE_NAME` changed from `training.hdf5` to `training.parquet`
  • Legacy `TRAINING_PREPROC_HDF5_FILE_NAME` kept for backward-compatible loading

4. Preprocessing Module Extraction

First step in splitting the 2,386-line preprocessing.py into focused modules:

  • `ludwig/data/format_registry.py`: Data format detection from file extensions, format-to-name mapping
  • `ludwig/data/split_utils.py`: Train/val/test splitting with random and stratified support

The original preprocessing.py remains intact for backward compatibility. New code should import from these focused modules.

5. Dask as Optional Dependency

Dask is already optional via try/except in `ludwig/utils/types.py`. No further changes needed as the import is properly guarded.

Test plan

  • 37 new tests: typed metadata (12), Optuna executor (11), format registry (7), split utils (7)
  • 1178 existing tests pass (0 regressions)
  • Pre-commit hooks pass on all commits
  • CI

w4nderlust and others added 13 commits April 4, 2026 18:42
Supplements existing MLflow integration with 3.x features:
- log_training_run(): model-centric tracking with LoggedModel entities
- log_llm_trace(): structured GenAI tracing for LLM prompts/responses
- Automatic config param logging and training metric logging
- Graceful degradation when MLflow 3.x is not available
- Deprecate save_ludwig_model_for_inference() and save_torchscript()
  with warnings pointing to export_model(format='torch_export')
- Add structured request/response logging middleware to serve_v2
- Add MLflow cost tracking: model size, parameter counts, param
  efficiency, base model name for LLMs
Delete inference.py, triton_utils.py, carton_utils.py, inference_utils.py,
and all TorchScript tests. Remove save_torchscript/to_torchscript from
API and base model. Rename TorchscriptPreprocessingInput to PreprocessingInput.
Replace with export_model CLI command.
The rewritten export.py dropped callback iteration that the old
version had. Fixes test_export_mlflow_cli and test_export_mlflow_local.
Replace untyped TrainingSetMetadataDict = dict with structured dataclasses:
NumberMetadata, CategoryMetadata, TextMetadata, BinaryMetadata,
ImageMetadata, SequenceMetadata, AudioMetadata, TrainingSetMetadata.

Backward-compatible: dict-like access via __getitem__, from_dict/to_dict,
get(), keys(), items(). Existing code continues to work during migration.
Direct Optuna integration without Ray Tune intermediary. Supports:
- AutoSampler (auto-selects best algorithm)
- GPSampler (Bayesian optimization)
- TPE, CMA-ES, Random samplers
- MedianPruner and HyperbandPruner for early stopping
- Persistent storage via SQLite for resumable studies
- All Optuna search space types: uniform, loguniform, int, choice
Change TRAINING_PREPROC_FILE_NAME from training.hdf5 to training.parquet.
Keep TRAINING_PREPROC_HDF5_FILE_NAME for backward-compatible loading of
legacy cached files.
First step in splitting the 2386-line preprocessing.py into focused modules:
- format_registry.py: data format detection from file extensions
- split_utils.py: train/val/test splitting with stratified support

The original preprocessing.py remains intact for backward compatibility.
New code should import from these focused modules.
Base automatically changed from export-serving-modernization to main April 5, 2026 08:02
generate_search_space() inspects Pydantic config fields and creates
Optuna-compatible search spaces based on types and constraints.
generate_trainer_search_space() provides sensible defaults for
commonly tuned hyperparameters.
@github-actions
Copy link
Copy Markdown

github-actions bot commented Apr 5, 2026

Test Results

   10 files  ± 0     10 suites  ±0   1h 48m 43s ⏱️ + 1m 55s
3 661 tests +42  3 631 ✅ +42  30 💤 ±0  0 ❌ ±0 
3 750 runs  +42  3 707 ✅ +42  43 💤 ±0  0 ❌ ±0 

Results for commit ec060c4. ± Comparison against base commit fb963ad.

This pull request removes 3 and adds 45 tests. Note that renamed tests count towards both.
tests.integration_tests.test_experiment ‑ test_experiment_dataset_formats[hdf5]
tests.integration_tests.test_experiment ‑ test_experiment_image_dataset[hdf5_inmem]
tests.integration_tests.test_ray ‑ test_ray_lazy_load_audio_error
tests.integration_tests.test_ray ‑ test_ray_audio_basic
tests.ludwig.data.test_format_registry.TestDetectFormat ‑ test_case_insensitive
tests.ludwig.data.test_format_registry.TestDetectFormat ‑ test_csv
tests.ludwig.data.test_format_registry.TestDetectFormat ‑ test_hdf5
tests.ludwig.data.test_format_registry.TestDetectFormat ‑ test_json
tests.ludwig.data.test_format_registry.TestDetectFormat ‑ test_parquet
tests.ludwig.data.test_format_registry.TestDetectFormat ‑ test_unknown
tests.ludwig.data.test_format_registry.TestDetectFormatFromDataset ‑ test_dataframe
tests.ludwig.data.test_format_registry.TestDetectFormatFromDataset ‑ test_dict
tests.ludwig.data.test_format_registry.TestDetectFormatFromDataset ‑ test_string_path
…

♻️ This comment has been updated with latest results.

The TRAINING_PREPROC_FILE_NAME was changed to .parquet but the underlying
cache manager still writes HDF5. Revert to .hdf5 until the cache format
is actually migrated. Also skip test_contrib_comet on Python 3.12+ since
comet_ml uses the removed imp module.
Register OptunaExecutor in executor_registry so it can be selected via
config with executor.type=optuna. Rewrote executor to implement the
standard Ludwig execute() interface (train model per trial, collect
HyperoptResults). Updated run.py to allow optuna executor with local
backend.
Replace HDF5 caching layer with Parquet files for simpler, faster,
and more portable data caching. Key changes:

- PandasDatasetManager.save() now writes Parquet via PyArrow
- PandasDatasetManager.data_format returns "parquet"
- PandasDataset loads from Parquet by default, with legacy HDF5
  fallback for backward compatibility
- TRAINING_PREPROC_FILE_NAME changed to "training.parquet"
- Removed the out-of-memory H5 random access path (Parquet is always
  loaded fully into memory as numpy arrays)
- h5py and tables removed from core dependencies (h5py is optional
  for legacy HDF5 file loading)
- h5py import made conditional in fs_utils.py
Delegate extension-based format detection to the format_registry module
instead of inlining the extension-to-format mapping. Keeps special-case
handling for CacheableDataset, dask DataFrames, and ludwig:// / hf://
prefixes in the original function.
1. to_numpy_dataset() now accepts dict input (returns as-is with np.array conversion)
2. Parquet cache saves/loads N-D array shapes via sidecar .shapes.json files,
   fixing flattened image [H,W,C] and audio [T,F] arrays on round-trip
3. Image preprocessing: removed HDF5 out-of-memory path (upload_h5), always
   process in-memory since Parquet cache handles persistence
4. Audio preprocessing: same - always process in-memory, removed in_memory gate
5. Renamed data_hdf5_fp -> data_cache_fp across PandasDataset, RayDataset,
   preprocessing.py, test_batcher.py, and test_experiment.py
6. Dataset config discovery uses os.listdir instead of importlib.resources
   for reliable YAML file enumeration across install modes
7. Cache delete cleans up .shapes.json sidecar files alongside Parquet
8. _LazyRegistry.keys() includes lazy entries for better error messages
PandasDataset now restores N-D array shapes (images, audio) using
reshape metadata from training_set_metadata. This fixes the flattening
that happens during Parquet-compatible preprocessing.

Also removes HDF5 test variants and lazy-load tests since the HDF5
cache path has been replaced by Parquet.
- Add py-cpuinfo to dependencies (was transitive dep of tables, which
  was removed in HDF5-to-Parquet migration)
- Replace test_ray_lazy_load_audio_error with test_ray_audio_basic that
  skips the Ray-vs-local determinism check (tiny audio datasets produce
  non-deterministic roc_auc)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant