Phase 4: Data pipeline modernization and Optuna HPO #4090
Open
w4nderlust wants to merge 23 commits into main from
Conversation
Supplements existing MLflow integration with 3.x features:
- log_training_run(): model-centric tracking with LoggedModel entities
- log_llm_trace(): structured GenAI tracing for LLM prompts/responses
- Automatic config param logging and training metric logging
- Graceful degradation when MLflow 3.x is not available
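The graceful-degradation behavior above can be sketched as follows. This is a minimal illustration, not the actual Ludwig code: the helper names (`mlflow_3x_available`, `log_training_run_safe`) and the version check are assumptions.

```python
def mlflow_3x_available() -> bool:
    """Return True only when MLflow is installed and its major version is >= 3."""
    try:
        import mlflow  # optional dependency
    except ImportError:
        return False
    try:
        major = int(mlflow.__version__.split(".")[0])
    except (AttributeError, ValueError):
        return False
    return major >= 3


def log_training_run_safe(params: dict) -> bool:
    """Log params via MLflow when available; silently no-op otherwise."""
    if not mlflow_3x_available():
        return False
    import mlflow

    with mlflow.start_run():
        mlflow.log_params(params)
    return True
```

The point of the guard is that callers never need to know whether MLflow is installed; tracking simply becomes a no-op when it is absent.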
- Deprecate save_ludwig_model_for_inference() and save_torchscript() with warnings pointing to export_model(format='torch_export')
- Add structured request/response logging middleware to serve_v2
- Add MLflow cost tracking: model size, parameter counts, param efficiency, base model name for LLMs
Delete inference.py, triton_utils.py, carton_utils.py, inference_utils.py, and all TorchScript tests. Remove save_torchscript/to_torchscript from API and base model. Rename TorchscriptPreprocessingInput to PreprocessingInput. Replace with export_model CLI command.
The rewritten export.py dropped the callback iteration that the old version had; restore it. Fixes test_export_mlflow_cli and test_export_mlflow_local.
Replace untyped TrainingSetMetadataDict = dict with structured dataclasses: NumberMetadata, CategoryMetadata, TextMetadata, BinaryMetadata, ImageMetadata, SequenceMetadata, AudioMetadata, TrainingSetMetadata. Backward-compatible: dict-like access via __getitem__, from_dict/to_dict, get(), keys(), items(). Existing code continues to work during migration.
Direct Optuna integration without a Ray Tune intermediary. Supports:
- AutoSampler (auto-selects the best algorithm)
- GPSampler (Bayesian optimization)
- TPE, CMA-ES, and Random samplers
- MedianPruner and HyperbandPruner for early stopping
- Persistent storage via SQLite for resumable studies
- All Optuna search space types: uniform, loguniform, int, choice
Change TRAINING_PREPROC_FILE_NAME from training.hdf5 to training.parquet. Keep TRAINING_PREPROC_HDF5_FILE_NAME for backward-compatible loading of legacy cached files.
First step in splitting the 2,386-line preprocessing.py into focused modules:
- format_registry.py: data format detection from file extensions
- split_utils.py: train/val/test splitting with stratified support

The original preprocessing.py remains intact for backward compatibility. New code should import from these focused modules.
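A stratified train/val/test split in the spirit of split_utils.py can be sketched as below. The function name, signature, and default fractions are illustrative assumptions, not the module's actual API.

```python
import random
from collections import defaultdict


def stratified_split(rows, label_key, fractions=(0.7, 0.1, 0.2), seed=42):
    """Split rows into train/val/test while preserving per-label proportions."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    train, val, test = [], [], []
    for group in by_label.values():
        rng.shuffle(group)
        n_train = int(len(group) * fractions[0])
        n_val = int(len(group) * fractions[1])
        train.extend(group[:n_train])
        val.extend(group[n_train:n_train + n_val])
        test.extend(group[n_train + n_val:])  # remainder goes to test
    return train, val, test
```

Splitting each label group independently guarantees that rare classes appear in every split in roughly their original proportions, which a plain random split does not.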
generate_search_space() inspects Pydantic config fields and creates Optuna-compatible search spaces based on types and constraints. generate_trainer_search_space() provides sensible defaults for commonly tuned hyperparameters.
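The mapping from config field constraints to search spaces might look like the sketch below, with Pydantic field metadata approximated by plain `(type, lower, upper)` arguments. `field_to_space` and the log-scale heuristic are illustrative assumptions, not Ludwig's actual implementation.

```python
def field_to_space(py_type, lo, hi):
    """Map a numeric field's type and bounds to an Optuna-compatible space dict."""
    if py_type is int:
        return {"space": "int", "lower": lo, "upper": hi}
    if py_type is float:
        # Heuristic: positive ranges spanning 2+ orders of magnitude
        # (e.g. learning rates) are best searched on a log scale.
        if lo > 0 and hi / lo >= 100:
            return {"space": "loguniform", "lower": lo, "upper": hi}
        return {"space": "uniform", "lower": lo, "upper": hi}
    raise TypeError(f"unsupported field type: {py_type!r}")
```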
Test Results: 10 files ±0, 10 suites ±0, 1h 48m 43s ⏱️ +1m 55s. Results for commit ec060c4; comparison against base commit fb963ad. This pull request removes 3 and adds 45 tests (renamed tests count towards both). This comment has been updated with the latest results.
The TRAINING_PREPROC_FILE_NAME was changed to .parquet but the underlying cache manager still writes HDF5. Revert to .hdf5 until the cache format is actually migrated. Also skip test_contrib_comet on Python 3.12+ since comet_ml uses the removed imp module.
Register OptunaExecutor in executor_registry so it can be selected via config with executor.type=optuna. Rewrote executor to implement the standard Ludwig execute() interface (train model per trial, collect HyperoptResults). Updated run.py to allow optuna executor with local backend.
Replace the HDF5 caching layer with Parquet files for simpler, faster, and more portable data caching. Key changes:
- PandasDatasetManager.save() now writes Parquet via PyArrow
- PandasDatasetManager.data_format returns "parquet"
- PandasDataset loads from Parquet by default, with a legacy HDF5 fallback for backward compatibility
- TRAINING_PREPROC_FILE_NAME changed to "training.parquet"
- Removed the out-of-memory H5 random access path (Parquet is always loaded fully into memory as numpy arrays)
- h5py and tables removed from core dependencies (h5py is optional for legacy HDF5 file loading)
- h5py import made conditional in fs_utils.py
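The legacy-fallback load order can be sketched as follows. `resolve_cache_path` is a hypothetical helper illustrating the preference order described above, not the actual PandasDataset code, and the file names follow the constants mentioned in this PR.

```python
import os


def resolve_cache_path(cache_dir):
    """Prefer the Parquet cache; fall back to a legacy HDF5 file if one exists."""
    parquet_fp = os.path.join(cache_dir, "training.parquet")
    hdf5_fp = os.path.join(cache_dir, "training.hdf5")
    if os.path.exists(parquet_fp):
        return parquet_fp, "parquet"
    if os.path.exists(hdf5_fp):
        return hdf5_fp, "hdf5"  # legacy cache written before the migration
    return parquet_fp, "parquet"  # default format for new caches
```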
Delegate extension-based format detection to the format_registry module instead of inlining the extension-to-format mapping. Keeps special-case handling for CacheableDataset, dask DataFrames, and ludwig:// / hf:// prefixes in the original function.
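A minimal sketch of extension-based detection in the spirit of format_registry.py; the mapping contents, function name, and default are illustrative assumptions rather than the module's actual table.

```python
import os

# Illustrative extension-to-format table; the real registry covers more formats.
_EXT_TO_FORMAT = {
    ".csv": "csv",
    ".tsv": "tsv",
    ".json": "json",
    ".jsonl": "jsonl",
    ".parquet": "parquet",
    ".hdf5": "hdf5",
    ".h5": "hdf5",
}


def detect_format(path: str, default: str = "csv") -> str:
    """Detect a data format from the file extension, case-insensitively."""
    _, ext = os.path.splitext(path.lower())
    return _EXT_TO_FORMAT.get(ext, default)
```

Keeping this table in one module means callers such as preprocessing.py no longer inline their own extension checks.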
1. to_numpy_dataset() now accepts dict input (returns as-is with np.array conversion)
2. Parquet cache saves/loads N-D array shapes via sidecar .shapes.json files, fixing flattened image [H,W,C] and audio [T,F] arrays on round-trip
3. Image preprocessing: removed the HDF5 out-of-memory path (upload_h5); always process in-memory since the Parquet cache handles persistence
4. Audio preprocessing: same; always process in-memory, removed the in_memory gate
5. Renamed data_hdf5_fp -> data_cache_fp across PandasDataset, RayDataset, preprocessing.py, test_batcher.py, and test_experiment.py
6. Dataset config discovery uses os.listdir instead of importlib.resources for reliable YAML file enumeration across install modes
7. Cache delete cleans up .shapes.json sidecar files alongside Parquet
8. _LazyRegistry.keys() includes lazy entries for better error messages
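The shape-sidecar round-trip in item 2 can be illustrated with a small sketch. The function names are hypothetical, and the sidecar is held as a JSON string here instead of a .shapes.json file next to the Parquet cache.

```python
import json

import numpy as np


def flatten_for_cache(arrays):
    """Flatten N-D arrays to 2-D (rows x features) and record per-row shapes."""
    flat, shapes = {}, {}
    for name, arr in arrays.items():
        if arr.ndim > 2:
            shapes[name] = list(arr.shape[1:])  # e.g. [H, W, C] for images
            flat[name] = arr.reshape(arr.shape[0], -1)
        else:
            flat[name] = arr
    return flat, json.dumps(shapes)


def restore_from_cache(flat, shapes_json):
    """Reshape flattened columns back to their original N-D shapes."""
    shapes = json.loads(shapes_json)
    return {
        name: arr.reshape(arr.shape[0], *shapes[name]) if name in shapes else arr
        for name, arr in flat.items()
    }
```

Parquet columns are one-dimensional per row, so without the sidecar the original [H,W,C] or [T,F] structure would be lost on reload.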
PandasDataset now restores N-D array shapes (images, audio) using reshape metadata from training_set_metadata. This fixes the flattening that happens during Parquet-compatible preprocessing. Also removes HDF5 test variants and lazy-load tests since the HDF5 cache path has been replaced by Parquet.
- Add py-cpuinfo to dependencies (it was a transitive dep of tables, which was removed in the HDF5-to-Parquet migration)
- Replace test_ray_lazy_load_audio_error with test_ray_audio_basic, which skips the Ray-vs-local determinism check (tiny audio datasets produce non-deterministic roc_auc)
Summary
Phase 4 of the Ludwig modernization: data pipeline improvements and Optuna integration.
1. Typed Feature Metadata Classes
Replaces the untyped `TrainingSetMetadataDict = dict` with structured dataclasses that provide type safety and IDE autocomplete, and prevent key-typo bugs.
```python
from ludwig.data.types import NumberMetadata, CategoryMetadata, TrainingSetMetadata

# Typed access
meta = NumberMetadata(mean=5.0, std=2.0, ple_bin_edges=[0.0, 0.5, 1.0])
meta.mean  # IDE autocomplete works

# Backward compatible with dict-like access
tsm = TrainingSetMetadata()
tsm["feature_name"] = {"mean": 5.0}  # dict-like set
value = tsm["feature_name"]  # dict-like get
```
Classes: NumberMetadata, CategoryMetadata, TextMetadata, BinaryMetadata, ImageMetadata, SequenceMetadata, AudioMetadata, TrainingSetMetadata. All have from_dict/to_dict for serialization.
2. Native Optuna HPO Executor
Direct Optuna integration without requiring Ray Tune as an intermediary:
```python
from ludwig.hyperopt.optuna_executor import OptunaExecutor
executor = OptunaExecutor(
parameters={
"trainer.learning_rate": {"space": "loguniform", "lower": 1e-5, "upper": 1e-2},
"trainer.batch_size": {"space": "int", "lower": 16, "upper": 256},
},
metric="validation.combined.loss",
goal="minimize",
num_samples=50,
sampler="auto", # AutoSampler, GPSampler, TPE, CMA-ES, Random
pruner="hyperband", # optional early stopping
storage="sqlite:///optuna.db", # optional persistence
)
best_params = executor.optimize(train_fn)
results = executor.get_results()
```
Supports all Optuna search spaces: uniform, loguniform, int, choice/categorical, grid.
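Internally, each parameter spec presumably translates to one of Optuna's `trial.suggest_*` calls. A hedged sketch of that translation (the function name is hypothetical, the spec keys follow the example above, and grid search is omitted):

```python
def suggest_from_spec(trial, name, spec):
    """Map a Ludwig-style parameter spec onto Optuna's suggest_* API."""
    space = spec["space"]
    if space == "uniform":
        return trial.suggest_float(name, spec["lower"], spec["upper"])
    if space == "loguniform":
        return trial.suggest_float(name, spec["lower"], spec["upper"], log=True)
    if space == "int":
        return trial.suggest_int(name, spec["lower"], spec["upper"])
    if space in ("choice", "categorical"):
        return trial.suggest_categorical(name, spec["categories"])
    raise ValueError(f"unknown search space: {space}")
```

Optuna deprecated `suggest_uniform`/`suggest_loguniform` in favor of `suggest_float(..., log=...)`, which is why both continuous spaces funnel through the same call here.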
3. HDF5 to Parquet Cache Migration
Changed default preprocessing cache format from HDF5 to Parquet:
4. Preprocessing Module Extraction
First step in splitting the 2,386-line preprocessing.py into focused modules:
The original preprocessing.py remains intact for backward compatibility. New code should import from these focused modules.
5. Dask as Optional Dependency
Dask is already optional via try/except in `ludwig/utils/types.py`. No further changes needed as the import is properly guarded.
Test plan