@DivergeAI

… done so far and provide feedback for Jules to continue.

This commit addresses potential memory issues and ensures the `ParameterPredictor` used by the adaptive superpixels feature is trained correctly.

Changes include:

1.  **Engine (`engine.py`):**
    *   Removed `torch.no_grad()` from the `ParameterPredictor` call during training in `train_one_epoch`. This allows the predictor to receive gradients and learn from the main task loss, as originally intended for joint optimization (see the training-step sketch at the end of this note).
    *   Moved the `ParameterPredictor` forward pass into the `torch.cuda.amp.autocast()` context to enable mixed-precision execution for the predictor, potentially reducing its memory footprint and speeding it up.
    *   Added explicit `del` statements for the `pred_k`, `pred_m`, and `spix_id` tensors at the end of each training iteration in which they are generated, dropping the last references so their memory can be reclaimed promptly.

2.  **Datasets (`datasets.py`):**
    *   In `generate_superpixels`, added explicit `del` statements for the intermediate CPU-side NumPy arrays (`img_for_spix_normalized_cpu`, `img_for_spix_denormalized`, `img_for_spix_numpy`, `img_for_spix_rescaled`, `assignment`) and the `slic_engine` object at the end of each iteration of the batch-processing loop. This is a precautionary measure to release per-image CPU memory promptly (see the cleanup sketch at the end of this note).

These changes aim to make training more robust when the adaptive superpixels feature is enabled and the `ParameterPredictor` is being trained. The previously reported memory hangs were most likely caused by the additional memory demand of training the `ParameterPredictor`; these modifications keep that training within standard PyTorch practice (AMP and mixed precision) and add more explicit memory handling.
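
For reference, below is a minimal sketch of the revised training step from item 1. Apart from `pred_k`, `pred_m`, and `spix_id`, the names (model, criterion, scaler, superpixel helper) are placeholders rather than the actual code in `train_one_epoch`; the point is only where the predictor forward pass and the `del` statements now sit.

```python
# Sketch only: apart from pred_k, pred_m and spix_id, the names below are
# placeholders and not the repository's actual identifiers.
import torch

def train_step(model, predictor, criterion, optimizer, scaler,
               images, targets, generate_superpixels):
    optimizer.zero_grad(set_to_none=True)

    # The ParameterPredictor forward pass now runs inside autocast and is no
    # longer wrapped in torch.no_grad(), so it executes in mixed precision and
    # can receive gradients from the task loss (the exact gradient path depends
    # on how its outputs feed the rest of the model).
    with torch.cuda.amp.autocast():
        pred_k, pred_m = predictor(images)      # predicted superpixel parameters
        spix_id = generate_superpixels(images, pred_k, pred_m)
        outputs = model(images, spix_id)
        loss = criterion(outputs, targets)

    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()

    # Explicitly drop the per-iteration tensors so their memory can be
    # reclaimed promptly before the next batch.
    del pred_k, pred_m, spix_id
    return loss.detach()
```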
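
Similarly, a sketch of the per-image cleanup pattern from item 2. The variable names follow the commit message, while the function signature, the denormalization constants, and the SLIC backend (`fast_slic` is assumed here) are illustrative placeholders for the actual `generate_superpixels` in `datasets.py`.

```python
# Sketch only: variable names follow the commit message; the signature, the
# uint8 conversion, and the fast_slic backend are assumptions.
import numpy as np
from fast_slic import Slic  # assumed SLIC implementation

def generate_superpixels(images_cpu, ks, ms, mean, std):
    """images_cpu: (B, C, H, W) CPU tensors; ks/ms: per-image SLIC parameters;
    mean/std: per-channel arrays of shape (C, 1, 1) used for denormalization."""
    assignments = []
    for img, k, m in zip(images_cpu, ks, ms):
        img_for_spix_normalized_cpu = img.detach().cpu().numpy()
        img_for_spix_denormalized = img_for_spix_normalized_cpu * std + mean
        img_for_spix_numpy = np.transpose(img_for_spix_denormalized, (1, 2, 0))
        img_for_spix_rescaled = np.ascontiguousarray(
            (np.clip(img_for_spix_numpy, 0.0, 1.0) * 255).astype(np.uint8))

        slic_engine = Slic(num_components=int(k), compactness=float(m))
        assignment = slic_engine.iterate(img_for_spix_rescaled)
        assignments.append(assignment.copy())

        # Drop the per-image intermediates and the SLIC engine at the end of
        # each iteration (only the copied assignment survives in the output
        # list), so CPU memory is released promptly instead of accumulating.
        del img_for_spix_normalized_cpu, img_for_spix_denormalized
        del img_for_spix_numpy, img_for_spix_rescaled, assignment, slic_engine
    return np.stack(assignments)
```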