Shape-bucketed compilation cache for training with a shared `TrainState`

Following up from Slack with @wsmoses and @avik-pal on training under `AutoReactant()` when one input axis has variable length across samples.

The workable approach is to choose `N` bucket sizes, pad each input up to the smallest bucket that fits it, and cache one compiled executable per bucket. This bounds the number of compilations to `N` and keeps padding overhead under the user's control. The Qwen3 inference example ([examples/Qwen3/main.jl](https://github.com/LuxDL/Lux.jl/blob/main/examples/Qwen3/main.jl), cf. `CachedReactantThunks`) already does this for generation: it keys `Reactant.Compiler.Thunk`s on `(config, size)` and keeps `ps`/`st` in a separate struct, so all compiled variants share the same parameter buffers.
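To make the bucketing step concrete, here is a minimal sketch in plain Julia. It assumes the variable-length axis is the first dimension of a matrix and that zero is an acceptable pad value. `pick_bucket` and `pad_to_bucket` are hypothetical helper names, not part of Lux.jl or Reactant.jl.

```julia
# Hypothetical helpers for illustration; not Lux.jl or Reactant.jl API.

# Return the smallest bucket that fits `len`. `buckets` must be sorted ascending.
function pick_bucket(buckets::AbstractVector{<:Integer}, len::Integer)
    idx = searchsortedfirst(buckets, len)
    if idx > length(buckets)
        throw(ArgumentError("length $len exceeds largest bucket $(last(buckets))"))
    end
    return buckets[idx]
end

# Pad the first (variable-length) axis of `x` up to `bucket` rows.
# Also return a mask marking which rows hold real data, so a loss
# function can ignore the padded rows.
function pad_to_bucket(x::AbstractMatrix{T}, bucket::Integer; pad_value=zero(T)) where {T}
    len, batch = size(x)
    padded = fill(T(pad_value), bucket, batch)
    padded[1:len, :] .= x
    mask = falses(bucket)
    mask[1:len] .= true
    return padded, mask
end

buckets = [128, 256, 512, 1024]      # assumed bucket sizes
x = rand(Float32, 200, 4)            # a sample with 200 rows
bucket = pick_bucket(buckets, size(x, 1))
padded, mask = pad_to_bucket(x, bucket)
```

With these assumed buckets, a 200-row sample lands in the 256 bucket, so 56 rows are padding. Denser buckets reduce that waste at the cost of more compilations.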

The ask is a training-side equivalent. `Lux.Training.single_train_step!(AutoReactant(), ...)` currently specializes on the shapes of `data` and caches inside `TrainState`, so the naive bucketed version would allocate one `TrainState` per bucket. That duplicates `ps`, optimizer state, and gradient scratch `N` times, which doesn't fit in VRAM for reasonable model sizes. What's needed is a supported pattern where one `TrainState` holds a single copy of parameters and optimizer state, and compiled train-step artifacts are cached separately, keyed on input shape (or a user-supplied bucket key).
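The shape of the requested pattern can be sketched in plain Julia, with a closure standing in for a compiled executable. Every name here (`SharedState`, `StepCache`, `fake_compile`, `cached_step!`) is hypothetical, and nothing below calls Lux.jl or Reactant.jl. It only illustrates one mutable state shared across a cache of per-bucket step functions.

```julia
# All names are hypothetical; this is not Lux.jl or Reactant.jl API.

# Single copy of parameters and optimizer state, shared by every bucket.
mutable struct SharedState
    ps::Vector{Float32}
    opt_state::Vector{Float32}
    step::Int
end

# Compiled artifacts live outside the state, keyed on a bucket key.
struct StepCache{K}
    compiled::Dict{K,Any}
    ncompiles::Base.RefValue{Int}
end
StepCache{K}() where {K} = StepCache{K}(Dict{K,Any}(), Ref(0))

# Stand-in for an expensive, shape-specialized compilation.
# The returned closure only accepts inputs with `bucket` rows.
function fake_compile(bucket::Int)
    return function (state::SharedState, x)
        size(x, 1) == bucket || throw(DimensionMismatch("expected $bucket rows"))
        state.ps .-= 0.01f0        # dummy in-place parameter update
        state.step += 1
        return state
    end
end

# Look up the step function for `key`, compiling it only on first use,
# then run it against the one shared state.
function cached_step!(cache::StepCache, key, state::SharedState, x)
    step_fn = get!(cache.compiled, key) do
        cache.ncompiles[] += 1
        fake_compile(key)
    end
    return step_fn(state, x)
end

state = SharedState(zeros(Float32, 8), zeros(Float32, 8), 0)
cache = StepCache{Int}()
for len in (256, 512, 256, 256, 512)
    cached_step!(cache, len, state, ones(Float32, len, 4))
end
```

In this sketch, five steps across two distinct bucket sizes trigger two compilations, and all five updates land on the same `state.ps`. Memory for parameters and optimizer state stays constant in the number of buckets.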

Shape-bucketed compilation cache for training with a shared `TrainState` #1704
