I'm opening this issue to keep track of the work needed to port the content of PR #996 to the main branch.
The idea is to split that PR (which is huge and based on quite an old version of the codebase) and, starting from the current state of the main branch, port its main elements in smaller PRs.
I'll keep this issue updated as I work on this.
Many changes are not strictly related to supporting distributed training but may benefit Avalanche in general.
- I'm starting by porting the modernized object detection/segmentation dataset, strategies, and metrics. I'll also port the generalized batch collate functionality.
Changes in Distributed Training PR #996:
Legend:
- 🔲 Not ported
- ⌛ Work in progress
- 💬 PR opened, discussion in progress
- ✔️ Merged into main branch
Base elements
- ✔️ DistributedHelper implementation (Add base elements to support distributed comms. Add supports_distributed plugin flag. #1370)
- 🔲 Distributed value, object, batch, model, tensor, ...
- ✔️ Distributed consistency (hashers) (Add base elements to support distributed comms. Add supports_distributed plugin flag. #1370); see the sketch after this list
- 🔲 Distributed training example (and runner script)
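For the "Distributed consistency (hashers)" item, here is a minimal sketch of the idea using plain `torch.distributed`; all names are illustrative, not the actual `DistributedHelper` API from #996/#1370:

```python
# Illustrative sketch: verify that a tensor (e.g. model weights) is identical
# across ranks by comparing cheap per-rank hashes instead of full tensors.
import hashlib

import torch
import torch.distributed as dist


def tensor_hash(t: torch.Tensor) -> str:
    # Hash the raw bytes of the tensor on CPU.
    return hashlib.sha256(t.detach().cpu().numpy().tobytes()).hexdigest()


def assert_equal_across_ranks(t: torch.Tensor) -> None:
    # Gather every rank's hash and fail fast on divergence.
    hashes = [None] * dist.get_world_size()
    dist.all_gather_object(hashes, tensor_hash(t))
    assert all(h == hashes[0] for h in hashes), "ranks diverged"
```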
Strategy and plugins
- ✔️ New `supports_distributed` plugin field (Add base elements to support distributed comms. Add supports_distributed plugin flag. #1370)
- ✔️ New `_distributed_check` strategy field and related `_check_distributed_training_compatibility()` check (Add base elements to support distributed comms. Add supports_distributed plugin flag. #1370)
- 🔲 New `wrap_distributed_model` strategy lifecycle method. Called from `..._observation.py`
- ✔️ `_obtain_common_dataloader_parameters` strategy method (unrelated to distributed training) (Add base elements to support distributed comms. Add supports_distributed plugin flag. #1370)
- 🔲 Strategy support superclasses
- 🔲 Various plugin adaptations for distributed training (LwF, CWR, ...)
- ✔️ AR1: modernize to use `_obtain_common_dataloader_parameters` (unrelated to distributed training) (Add base elements to support distributed comms. Add supports_distributed plugin flag. #1370)
- 🔲 Strategy templates: wrap various lifecycle methods to allow for seamless support of distributed training (see the sketch after this list)
  - Implementations should now be in `_backward()`, `_forward()`, ... while wrapping happens in `backward`, `forward`. Wrapper methods should be final, but Python is not strict on this (flexibility).
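To make the strategy-templates item concrete, a minimal sketch of the wrapper/implementation split follows; class and field names are illustrative, not the actual template code:

```python
# Illustrative sketch of the wrapper/implementation split: subclasses override
# _forward()/_backward(), while forward()/backward() stay as fixed wrappers
# where distributed-specific behavior (e.g. gathering outputs, no-sync
# contexts) can be hooked without touching user code.
class TemplateSketch:
    def forward(self):
        # Treated as final by convention (Python does not enforce it).
        return self._forward()

    def _forward(self):
        # Override point for subclasses and plugins.
        return self.model(self.mb_x)  # self.model / self.mb_x are assumed fields

    def backward(self):
        self._backward()

    def _backward(self):
        self.loss.backward()  # self.loss is an assumed field
```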
Models
- ✔️ Fixed device issues with dynamic models (unrelated to distributed training) (Add base elements to support distributed comms. Add supports_distributed plugin flag. #1370)
- ✔️ In `avalanche_forward`, generalize using `is_multi_task_module` to consider DDP wrapping (Add base elements to support distributed comms. Add supports_distributed plugin flag. #1370); see the sketch after this list
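A sketch of the `avalanche_forward` generalization described above; the `forward_single_task` check is a stand-in for the real multi-task module test:

```python
# Illustrative sketch: recognize a multi-task module even when it is hidden
# behind a DistributedDataParallel wrapper.
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel


def is_multi_task_module(model: nn.Module) -> bool:
    if isinstance(model, DistributedDataParallel):
        model = model.module  # unwrap DDP before checking
    return hasattr(model, "forward_single_task")  # stand-in for the real check


def avalanche_forward(model, x, task_labels):
    if is_multi_task_module(model):
        return model(x, task_labels)
    return model(x)
```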
Detection
- ✔️ Detection scenario modernization (Typing system overhaul. Improve support for object detection scenarios. #1333)
- ✔️ Detection template (incl. Naive) modernization (Typing system overhaul. Improve support for object detection scenarios. #1333)
- ✔️ Updated detection example (Typing system overhaul. Improve support for object detection scenarios. #1333)
- ✔️ Detection dataset based on new dataset creation procedure (Typing system overhaul. Improve support for object detection scenarios. #1333)
- ⌛ Collate generalization
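For the "Collate generalization" item, the baseline behavior being generalized can be sketched as the classic detection collate (in the style of torchvision's detection references), where variable-sized images and per-image target dicts cannot be stacked into tensors:

```python
# Illustrative sketch: detection batches stay as tuples of lists instead of
# stacked tensors, because images and target dicts have varying sizes.
def detection_collate_fn(batch):
    # batch: list of (image, target, *extras) samples
    # -> (images, targets, *extras), each a tuple aligned by position
    return tuple(zip(*batch))
```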
Data Loader
- ✔️ Use DistributedHelper, remove mock _DistributedHelper (Add base elements to support distributed comms. Add supports_distributed plugin flag. #1370)
- ✔️ Various fixes to address drop_last, shuffle, etcetera
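A sketch of the kind of `drop_last`/`shuffle` fix-ups meant above: in distributed runs these options must move into the `DistributedSampler`, since `DataLoader` rejects `shuffle=True` together with an explicit sampler (illustrative helper, not the Avalanche loader code):

```python
# Illustrative sketch of routing shuffle/drop_last correctly in DDP runs.
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler


def make_loader(dataset, batch_size, shuffle, drop_last, distributed):
    if distributed:
        sampler = DistributedSampler(dataset, shuffle=shuffle, drop_last=drop_last)
        # shuffle must NOT also be passed to the DataLoader here.
        return DataLoader(dataset, batch_size=batch_size, sampler=sampler,
                          drop_last=drop_last)
    return DataLoader(dataset, batch_size=batch_size, shuffle=shuffle,
                      drop_last=drop_last)
```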
Loggers and metrics
- ✔️ Disable loggers creation for non-main processes (Add base elements to support distributed comms. Add supports_distributed plugin flag. #1370)
- ✔️ Default logger: pass 'default' instead of loggers list (Add base elements to support distributed comms. Add supports_distributed plugin flag. #1370)
- ✔️ Strategies constructor: allow strategies to accept a factory for the `evaluator` constructor parameter (`evaluator=default_evaluator()` -> `evaluator=default_evaluator`). (Add base elements to support distributed comms. Add supports_distributed plugin flag. #1370)
- ✔️ All strategy classes: change the default `evaluator` parameter value to use a factory (see the sketch after this list). (Add base elements to support distributed comms. Add supports_distributed plugin flag. #1370)
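To illustrate the factory change in the last two items, a minimal sketch with stand-in classes (not the actual Avalanche `EvaluationPlugin` or strategy signatures):

```python
# Illustrative sketch: the default evaluator is now a callable evaluated per
# strategy instance, so non-main processes can skip building loggers.
class EvaluationPluginStub:
    def __init__(self, loggers=None):
        self.loggers = loggers or []


def default_evaluator() -> EvaluationPluginStub:
    # A fresh plugin per call; a distributed-aware variant could return a
    # logger-free plugin on non-main ranks.
    return EvaluationPluginStub(loggers=["interactive"])


class StrategyStub:
    def __init__(self, evaluator=default_evaluator):  # factory, no parentheses
        self.evaluator = evaluator() if callable(evaluator) else evaluator
```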
Unit tests
- 🔲 Called in both environment-update and unit-test actions
- ✔️ Unit test runner: run_dist_tests.py and related utils (Add base elements to support distributed comms. Add supports_distributed plugin flag. #1370); see the sketch after this list
- End-to-end test script
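For the run_dist_tests.py item above, a minimal sketch of how a distributed test runner can spawn per-rank processes locally (illustrative only; the real utilities live in the ported scripts):

```python
# Illustrative sketch: spawn N local processes, init a gloo process group in
# each, run the selected tests, then tear the group down.
import os

import torch.distributed as dist
import torch.multiprocessing as mp


def _worker(rank: int, world_size: int) -> None:
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=rank, world_size=world_size)
    try:
        print(f"rank {rank}/{world_size}: run selected unit tests here")
    finally:
        dist.destroy_process_group()


if __name__ == "__main__":
    world_size = 2
    mp.spawn(_worker, args=(world_size,), nprocs=world_size)
```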
Typing
- ✔️ Various typing fixes/integrations in AvalancheDataset and FlatData (Typing system overhaul. Improve support for object detection scenarios. #1333)
  - Mostly to improve the programming experience of VSCode users
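As an example of the typing direction (illustrative, not the actual `FlatData` signature), parametrizing sequence containers on their element type is what lets editors like VSCode infer element types:

```python
# Illustrative sketch: a generic sequence wrapper whose element type is
# visible to static checkers and editor autocompletion.
from typing import Sequence, TypeVar

T = TypeVar("T")


class FlatDataSketch(Sequence[T]):
    def __init__(self, data: Sequence[T]):
        self._data = data

    def __getitem__(self, i: int) -> T:
        return self._data[i]

    def __len__(self) -> int:
        return len(self._data)
```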