How can one force a simple vanilla training process? #21278
Closed
Unanswered
hmf
asked this question in
Lightning Trainer API: Trainer, LightningModule, LightningDataModule
Replies: 1 comment
Problem was in the model (an incorrect scheduler).
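Since the fix turned out to be an incorrect scheduler, one pitfall worth illustrating is stepping frequency: Lightning steps a scheduler returned from `configure_optimizers` once per epoch by default, so a scheduler tuned for a different stepping cadence in a hand-written loop decays the learning rate at the wrong rate. A minimal plain-PyTorch sketch (the `StepLR` numbers here are hypothetical, not from the asker's model) showing how the same scheduler gives very different learning rates depending on how often it is stepped:

```python
import torch
from torch.optim.lr_scheduler import StepLR

def lr_after(steps):
    """LR after `steps` calls to a StepLR meant to decay 10x every 10 epochs.
    (Hypothetical numbers, for illustration only.)"""
    p = torch.nn.Parameter(torch.zeros(1))
    opt = torch.optim.SGD([p], lr=0.1)
    sched = StepLR(opt, step_size=10, gamma=0.1)
    for _ in range(steps):
        opt.step()    # dummy optimizer step so PyTorch does not warn
        sched.step()
    return sched.get_last_lr()[0]

# Stepped once per epoch (intended): after 5 epochs the LR is untouched.
print(lr_after(5))   # 0.1
# Stepped once per *batch* by mistake (say 50 batches/epoch, so 50 steps
# after one epoch): the LR has already decayed five times, to ~1e-6.
print(lr_after(50))
```

If the original loop stepped per batch, the Lightning port needs `"interval": "step"` in the scheduler dict returned from `configure_optimizers` to match it.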
I am currently (3rd attempt) trying to replicate an experiment that is coded in plain PyTorch (training, validation, and test loops). The original code seems able to overfit: the training and validation loss keep decreasing, albeit slowly, up to a maximum of 200 epochs. Basically, a very simple AE is trained to reconstruct cropped samples.
In this new attempt I have altered the code as little as possible. All I have done is replace the training loop with Trainer.fit. I then use the original test script to compare the results, using SSIM and MSE as metrics. With the Trainer version, these values are large. The training loss initially drops quickly and is approximately the same as with the original code, but towards the end (up to the 200-epoch maximum) it oscillates slightly without ever decreasing. If I activate early stopping, training stops anywhere from epoch 48 to 75. My latest changes use the following Trainer-related code:
My question is: what other parameters can I try in order to get a simple, vanilla training loop? Any diagnosis I can do? Anything else I can try? For reference, below I show the original code.
TIA,
HF
The original train loop is as follows:
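The original code itself is not reproduced here. For reference, a generic sketch of the kind of vanilla PyTorch loop described above (train pass, validation pass, up to a fixed number of epochs); the tiny model, data, and epoch count are stand-ins, not the asker's:

```python
import torch
from torch import nn

# Stand-in autoencoder and data; the real model/loader are not shown above.
model = nn.Sequential(nn.Linear(16, 4), nn.ReLU(), nn.Linear(4, 16))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
x = torch.randn(64, 16)              # pretend "cropped samples"
train_x, val_x = x[:48], x[48:]

for epoch in range(5):               # the original runs up to 200 epochs
    model.train()
    opt.zero_grad()
    loss = loss_fn(model(train_x), train_x)   # reconstruction loss
    loss.backward()
    opt.step()

    model.eval()
    with torch.no_grad():
        val_loss = loss_fn(model(val_x), val_x)
```

Porting such a loop to Lightning maps the first block to `training_step` and the `no_grad` block to `validation_step`, with `zero_grad`/`backward`/`step` handled by the Trainer under automatic optimization.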