
Releases: Lightning-AI/pytorch-lightning

Lightning 1.8: Colossal-AI Strategy, Commands and Secrets for Apps, FSDP Improvements and More!

01 Nov 11:13
7ee0994


The core team is excited to announce the release of Lightning 1.8 ⚡

Lightning v1.8 is the culmination of work from 52 contributors who have worked on features, bug fixes, and documentation for a total of over 550 commits since v1.7.

Highlights

Colossal-AI

Colossal-AI focuses on improving efficiency when training large-scale AI models with billions of parameters. With the new Colossal-AI strategy in Lightning 1.8, you can train existing models like GPT-3 with up to half as many GPUs as usually needed. You can also train models up to twice as big with the same number of GPUs, saving you significant cost. Here is how you use it:

# Select the strategy with good defaults
trainer = Trainer(strategy="colossalai")

# or tune parameters to your liking
from lightning.pytorch.strategies import ColossalAIStrategy

trainer = Trainer(strategy=ColossalAIStrategy(placement_policy="cpu", ...))

You can find Colossal-AI's benchmarks with Lightning on GPT-2 here.

Under the hood, Colossal-AI implements different parallelism algorithms that are especially interesting for the development of SOTA transformer models:

  • Data Parallelism
  • Pipeline Parallelism
  • 1D, 2D, 2.5D, 3D Tensor Parallelism
  • Sequence Parallelism
  • Zero Redundancy Optimization

Learn how to install and use Colossal-AI effectively with Lightning here.

NOTE: This strategy is marked as experimental. Stay tuned for more updates in the future.

Secrets for Lightning Apps

Introducing encrypted secrets (#14612), a feature requested by Lightning App users 🎉!

Encrypted secrets allow you to securely pass private data to your apps, like API keys, access tokens, database passwords, or other credentials, without exposing them in your code.

  1. Add a secret to your Lightning account in lightning.ai (read more here)

  2. Add an environment variable to your app to read the secret:

    # somewhere in your Flow or Work:
    GitHubComponent(api_token=os.environ["API_TOKEN"])
  3. Pass the secret to your app run with the following command:

    lightning run app app.py --cloud --secret API_TOKEN=github_api_token

These secrets are encrypted and stored in the Lightning database. Nothing except your app can access the value.
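For illustration, here is a minimal sketch of how the pieces fit together; the GitHubComponent Work and what it does with the token are hypothetical, but the os.environ lookup matches step 2 above:

import os

import lightning as L


class GitHubComponent(L.LightningWork):
    # hypothetical Work that needs an API token to talk to an external service
    def __init__(self, api_token: str):
        super().__init__()
        self.api_token = api_token

    def run(self):
        # use self.api_token to authenticate against the external service
        ...


class Flow(L.LightningFlow):
    def __init__(self):
        super().__init__()
        # API_TOKEN is populated from the encrypted secret passed via --secret
        self.github = GitHubComponent(api_token=os.environ["API_TOKEN"])

    def run(self):
        self.github.run()


app = L.LightningApp(Flow())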

NOTE: This is an experimental feature.

CLI Commands for Lightning Apps

Introducing CLI commands for apps (#13602)!
As a Lightning App builder, if you want to easily create a CLI interface for users to interact with your app, then this is for you.

Here is an example where users can dynamically create notebooks from the CLI.
All you need to do is implement the configure_commands hook on the LightningFlow:

import lightning as L
from commands.notebook.run import RunNotebook


class Flow(L.LightningFlow):
    ...

    def configure_commands(self):
        # Return a list of dictionaries with commands:
        return [{"run notebook": RunNotebook(method=self.run_notebook)}]


app = L.LightningApp(Flow())

Once the app is running with lightning run app app.py, you can connect to the app with the following command:

lightning connect {app name} -y

and run the command that was configured:

lightning run notebook --name=my_notebook_name
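
For context, here is a minimal, self-contained sketch of the simpler variant, where the command maps directly to a Flow method instead of a separate ClientCommand class; the run_notebook handler and its name argument are hypothetical:

import lightning as L


class Flow(L.LightningFlow):
    def __init__(self):
        super().__init__()
        self.notebook_names = []

    def run(self):
        ...

    def run_notebook(self, name: str):
        # hypothetical handler: the --name argument from the CLI is forwarded here
        self.notebook_names.append(name)
        print(f"Notebook {name} requested.")

    def configure_commands(self):
        # map the command name to the method that handles it
        return [{"run notebook": self.run_notebook}]


app = L.LightningApp(Flow())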

For a full tutorial and running example, visit our docs.

NOTE: This is an experimental feature.

Auto-wrapping for FSDP Strategy

In Lightning v1.7, we introduced an integration for PyTorch FSDP in the form of our FSDP strategy, which allows you to train huge models with billions of parameters sharded across hundreds of GPUs and machines.

# Native FSDP implementation
trainer = Trainer(strategy="fsdp_native")

We are continuing to improve the support for this feature by adding automatic wrapping of layers for use cases where the model fits into CPU memory, but not into GPU memory (#14383).

Here are some examples:

Case 1: Model is so large that it does not fit into CPU memory.
Construct your layers in the configure_sharded_model hook and wrap the large ones you want to shard across GPUs:

from torch.distributed.fsdp.wrap import wrap


class MassiveModel(LightningModule):
    ...

    # Create the model here and wrap the large layers for sharding.
    # Lightning provides FSDP's wrapping context when this hook runs.
    def configure_sharded_model(self):
        for i, layer in enumerate(self.block):
            self.block[i] = wrap(layer)
        ...

Case 2: Model fits into CPU memory, but not into GPU memory. In Lightning v1.8, you no longer need to do anything special here, as we can automatically wrap the layers for you using FSDP's policy:

model = MassiveModel()
trainer = Trainer(
    accelerator="gpu", 
    devices=8, 
    strategy="fsdp_native",  # or strategy="fsdp" for fairscale
    precision=16
)

# Automatically wraps the layers here:
trainer.fit(model)

Case 3: Model fits into GPU memory. No action required, use any strategy you want.

Note: if you want to manually wrap layers for more control, you can still do that!

Read more about FSDP and how layer wrapping works in our docs.

New Tuner Callbacks

In this release, we focused on Tuner improvements and introduced two new callbacks that help you customize the batch size finder and learning rate finder to suit your use case.

Batch Size Finder (#11089)

  1. You can customize the BatchSizeFinder callback to run at different epochs. This feature is useful while fine-tuning models since you can't always use the same batch size after unfreezing the backbone.

    from lightning.pytorch.callbacks import BatchSizeFinder
    
    
    class FineTuneBatchSizeFinder(BatchSizeFinder):
        def __init__(self, milestones, *args, **kwargs):
            super().__init__(*args, **kwargs)
            self.milestones = milestones
    
        def on_fit_start(self, *args, **kwargs):
            return
    
        def on_train_epoch_start(self, trainer, pl_module):
            if trainer.current_epoch in self.milestones or trainer.current_epoch == 0:
                self.scale_batch_size(trainer, pl_module)
    
    
    trainer = Trainer(callbacks=[FineTuneBatchSizeFinder(milestones=(5, 10))])
    trainer.fit(...)
  2. Run the batch size finder for validate/test/predict.

    from lightning.pytorch.callbacks import BatchSizeFinder
    
    
    class EvalBatchSizeFinder(BatchSizeFinder):
        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
    
        def on_fit_start(self, *args, **kwargs):
            return
    
        def on_test_start(self, trainer, pl_module):
            self.scale_batch_size(trainer, pl_module)
    
    
    trainer = Trainer(callbacks=[EvalBatchSizeFinder()])
    trainer.test(...)

Learning Rate Finder (#13802)

You can now customize the LearningRateFinder callback to run at different intervals. This is useful when fine-tuning models, for example.

from lightning.pytorch.callbacks import LearningRateFinder


class FineTuneLearningRateFinder(LearningRateFinder):
    def __init__(self, milestones, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.milestones = milestones

    def on_fit_start(self, *args, **kwargs):
        return

    def on_train_epoch_start(self, trainer, pl_module):
        if trainer.current_epoch in self.milestones or trainer.current_epoch == 0:
            self.lr_find(trainer, pl_module)

trainer = Trainer(callbacks=[FineTuneLearningRateFinder(milestones=(5, 10))])
trainer.fit(...)

LightningCLI Improvements

Even though the LightningCLI class is designed for implementing command line tools, there are cases where it is more convenient to run it directly from Python. In Lightning 1.8, you can now do this (#14596):

from lightning.pytorch.cli import LightningCLI

def cli_main(args):
    cli = LightningCLI(MyModel, ..., args=args)
    ...

Anywhere in your program, you can now call the CLI directly:

cli_main(["--trainer.max_epochs=100", "--model.encoder_layers=24"])

Learn about all features of the LightningCLI!

Improvements to the SLURM Support

Multi-node training on SLURM clusters has been supported since the inception of the Lightning Trainer and has seen several improvements over time thanks to many community contributions. And we just keep going! In this release, we've added two quality-of-life improvements:

  • The preemption/termination signal is now configurable (#14626):

    # the default signal is SIGUSR1
    import signal

    from lightning.pytorch.plugins.environments import SLURMEnvironment

    trainer = Trainer(plugins=[SLURMEnvironment(requeue_signal=signal.SIGHUP)])

Apps' secrets & meta tags

20 Oct 15:07
65d29f0


[0.7.0] - 2022-10-20

Added

  • Add --secret option to CLI to allow binding Secrets to app environment variables when running in the cloud (#14612)
  • Added support for adding descriptions to commands either through a docstring or the DESCRIPTION attribute (#15193)
  • Added option to add custom meta tags to the UI container (#14915)
  • Added support to pass a LightningWork to the LightningApp (#15215)

Changed

  • Allowed root path to run the app on /path (#14972)

App with meta tags

07 Oct 20:45


[0.6.3] - 2022-10-07

Added

  • Added option to add custom meta tags to the UI container (#14915)

Changed

  • Allowed root path to run the app on /path (#14972)

Contributors

@pritamsoni-hsr

If we forgot someone due to not matching commit email with GitHub account, let us know :]

PyTorch Lightning 1.7.7: Standard patch release

22 Sep 13:43


[1.7.7] - 2022-09-22

Fixed

  • Fixed the availability check for the neptune-client package (#14714)
  • Break HPU Graphs into two parts (forward + backward as one and optimizer as another) for better performance (#14656)
  • Fixed torchscript error with ensembles of LightningModules (#14657, #14724)
  • Fixed an issue with TensorBoardLogger.finalize creating a new experiment when none was created during the Trainer's execution (#14762)
  • Fixed TypeError on import when torch.distributed is not available (#14809)

Contributors

@awaelchli @Borda @carmocca @dependabot @otaj @raoakarsha

If we forgot someone due to not matching commit email with GitHub account, let us know :)

Minor patch release

22 Sep 15:55


[0.6.2] - 2022-09-22

Changed

  • Improved Lightning App connect logic by disconnecting automatically (#14532)
  • Improved the error message when the LightningWork is missing the run method (#14759)
  • Improved the error message when the root LightningFlow passed to LightningApp is missing the run method (#14760)

Fixed

  • Fixed a bug where the uploaded command file wasn't properly parsed (#14532)
  • Fixed an issue where custom property setters were not being used in the LightningWork class (#14259)
  • Fixed an issue where some terminals would display broken icons in the PL app CLI (#14226)

Contributors

@awaelchli, @Borda, @pranjaldatta, @tchaton

If we forgot someone due to not matching commit email with GitHub account, let us know :]

Memory fixes inbound!

19 Sep 16:26


[0.6.1] - 2022-09-19

Added

  • Add support to upload files to the Drive through an asynchronous upload_file endpoint (#14703)

Changed

  • Application storage prefix moved from app_id to project_id/app_id (#14583)
  • LightningCloud client calls to use keyword arguments instead of positional arguments (#14685)

Fixed

  • Made the threadpool non-default in the LightningCloud client (#14757)
  • Resolved a bug where state change detection using DeepDiff wouldn't work with Path and Drive objects (#14465)
  • Resolved a bug where the wrong client was passed to collect cloud logs (#14684)
  • Resolved the memory leak issue with the Lightning Cloud package and bumped the requirements to use the latest version (#14697)
  • Fixed the 5000-log-line limitation for Lightning AI BYOC cluster logs (#14458)
  • Fixed a bug where the uploaded command file wasn't properly parsed (#14532)
  • Resolved LightningApp(..., debug=True) (#14464)

Contributors

@dmitsf @hhsecond @tchaton @nohalon @krshrimali @pritamsoni-hsr @nmiculinic @ethanwharris @yurijmikhalevich @Felonious-Spellfire @otaj @Borda

If we forgot someone due to not matching commit email with GitHub account, let us know :)

PyTorch Lightning 1.7.6: Standard patch release

13 Sep 19:19


[1.7.6] - 2022-09-13

Changed

  • Improved the error messaging when passing Trainer.method(model, x_dataloader=None) with no module-method implementations available (#14614)

Fixed

  • Reset the dataloaders on OOM failure in batch size finder to use the last successful batch size (#14372)
  • Fixed an issue to keep downscaling the batch size in case there hasn't been even a single successful optimal batch size with mode="power" (#14372)
  • Fixed an issue where self.log-ing a tensor would create a user warning from PyTorch about cloning tensors (#14599)
  • Fixed compatibility when torch.distributed is not available (#14454)

Contributors

@akihironitta @awaelchli @Borda @carmocca @dependabot @krshrimali @mauvilsa @pierocor @rohitgr7 @wangraying

If we forgot someone due to not matching commit email with GitHub account, let us know :)

BYOC cluster management

08 Sep 12:44
9251269


[0.6.0] - 2022-09-08

Added

  • Introduce lightning connect (#14452)
  • Adds PanelFrontend to easily create complex UI in Python (#13531)
  • Add support for Lightning App Commands through the configure_commands hook on LightningFlow and ClientCommand (#13602)
  • Add support for Lightning AI BYOC cluster management (#13835)
  • Add support to see Lightning AI BYOC cluster logs (#14334)
  • Add support to run Lightning apps on Lightning AI BYOC clusters (#13894)
  • Add support for listing Lightning AI apps (#13987)
  • Adds LightningTrainingComponent, which orchestrates multi-node training in the cloud (#13830)
  • Add support for printing application logs using CLI lightning show logs <app_name> [components] (#13634)
  • Add support for Lightning API through the configure_api hook on the LightningFlow and the Post, Get, Delete, Put with HttpMethods (#13945)
  • Added a warning when configure_layout returns URLs configured with HTTP instead of HTTPS (#14233)
  • Add --app_args support from the CLI (#13625)

Changed

  • Default values and parameter names for Lightning AI BYOC cluster management (#14132)
  • Run the flow only if the state has changed from the previous execution (#14076)
  • Increased DeepDiff's verbose level to properly handle dict changes (#13960)
  • Setup: added requirement freeze for the next major version (#14480)

Fixed

  • Unification of app template: moved app.py to root dir for lightning init app <app_name> template (#13853)
  • Fixed an issue with lightning --version command (#14433)
  • Fixed imports of collections.abc for py3.10 (#14345)

Contributors

@adam-lightning, @awaelchli, @Borda, @dmitsf, @manskx, @MarcSkovMadsen, @nicolai86, @tchaton

If we forgot someone due to not matching commit email with GitHub account, let us know :]

PyTorch Lightning 1.7.5: Standard patch release

07 Sep 04:08


[1.7.5] - 2022-09-06

Fixed

  • Squeezed tensor values when logging with LightningModule.log (#14489)
  • Fixed WandbLogger save_dir not being set after creation (#14326)
  • Fixed Trainer.estimated_stepping_batches when maximum number of epochs is not set (#14317)

Contributors

@carmocca @dependabot @robertomest @rohitgr7 @tshu-w

If we forgot someone due to not matching commit email with GitHub account, let us know :)

PyTorch Lightning 1.7.4: Standard patch release

31 Aug 17:21


[1.7.4] - 2022-08-31

Added

  • Added an environment variable PL_DISABLE_FORK that can be used to disable all forking in the Trainer (#14319)

Fixed

  • Fixed LightningDataModule hparams parsing (#12806)
  • Reset epoch progress with batch size scaler (#13846)
  • Fixed restoring the trainer after using lr_find() so that the correct LR schedule is used for the actual training (#14113)
  • Fixed incorrect values after transferring data to an MPS device (#14368)

Contributors

@rohitgr7 @tanmoyio @justusschock @cschell @carmocca @Callidior @awaelchli @j0rd1smit @dependabot @Borda @otaj