Intro - Results - Acknowledgements
This repository contains the code for our study of the optimization trajectories of learned vs. traditional optimizers, viewed through the lens of network architecture symmetries and the distributions of proposed parameter updates. Full paper: *Investigation into the Training Dynamics of Learned Optimizers*, Jan Sobotka, Petr Šimánek, Daniel Vašata, 2023.
- Similarly to Lion, learned optimizers break the geometric constraints on gradients that stem from architectural symmetries, and their deviations from these constraints are significantly larger than those observed with previous optimizers such as Adam or SGD (Figure 4). In the case of learned optimizers, a large deviation from these geometric constraints almost always accompanies the initial rapid decrease in loss during optimization. More importantly, regularizing against this symmetry breaking during meta-training severely damages performance, hinting at the importance of this freedom in L2O parameter updates (an illustrative measurement sketch follows the figures below).
- In another experiment, we also see that increasing the symmetry breaking of the Lion-SGD optimizer (an interpolation between Lion and SGD parameter updates) correlates with an increase in performance. This indicates that breaking these strict geometric constraints might be beneficial not only for L2O but also for more traditional, manually designed optimization algorithms (a sketch of one possible interpolation is given below the figures).
- Additionally, in Figure 9 below, one can see that the L2O starts with the largest updates and then slowly approaches the update distribution of Adam.
- Furthermore, by studying the noise and covariance of the L2O parameter updates, we show that, on the one hand, L2O updates exhibit less heavy-tailed stochastic noise (Figure 8, left; a higher alpha corresponds to a less heavy-tailed distribution), while, on the other hand, the variation in updates across different samples is larger. The fact that L2O updates are less heavy-tailed even though the gradients themselves exhibit very heavy-tailed behavior, together with this high cross-sample variation, points to one interesting observation: L2O appears to act as a stabilizing force in the optimization process. While the inherent stochasticity and heavy-tailed nature of gradients might lead to erratic updates and slow convergence, the noise clipping performed by L2O seems to mitigate these issues (a sketch of the tail-index estimation is given below the figures).
| Figure 9: Histograms of the absolute values of parameter updates. | Figure 8: Heavy-tailedness and update covariance. |
|---|---|
| ![]() | ![]() |
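
For readers who want a concrete handle on the symmetry-breaking measurements above, here is a minimal sketch (not the code used in the paper) of one such geometric constraint: for a layer whose output is immediately batch-normalized, the loss is invariant to rescaling of that layer's weights, which forces the gradient to be orthogonal to the weights. The cosine between the weights and a proposed update therefore measures how strongly an optimizer breaks this constraint; the function name and the choice of the scale symmetry are illustrative assumptions.

```python
import torch

def scale_symmetry_breaking(weight: torch.Tensor, update: torch.Tensor) -> float:
    """Cosine between the current weights and a proposed update.

    For a scale-invariant layer (e.g. a linear layer followed by BatchNorm),
    the gradient satisfies <w, grad> = 0, so a pure gradient step keeps this
    cosine at zero; Lion- or L2O-style updates need not, and larger absolute
    values indicate stronger symmetry breaking.
    """
    w, d = weight.flatten(), update.flatten()
    return (torch.dot(w, d) / (w.norm() * d.norm() + 1e-12)).item()
```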
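
The Lion-SGD interpolation could, for example, look like the following sketch, which assumes a simple convex combination of the two update directions with a mixing coefficient `lam` (0 = SGD, 1 = Lion) and omits Lion's decoupled weight decay; the exact scheme and hyperparameter names used in the experiments may differ.

```python
import torch

@torch.no_grad()
def lion_sgd_step(param: torch.Tensor, momentum: torch.Tensor,
                  lr: float = 1e-3, beta1: float = 0.9, beta2: float = 0.99,
                  lam: float = 0.5) -> None:
    """One update step interpolating between SGD (lam=0) and Lion (lam=1)."""
    g = param.grad
    # Lion direction: sign of an interpolation between momentum and gradient.
    lion_dir = torch.sign(beta1 * momentum + (1 - beta1) * g)
    # Lion momentum update.
    momentum.mul_(beta2).add_(g, alpha=1 - beta2)
    # Convex combination of the SGD direction (raw gradient) and the Lion direction.
    update = (1 - lam) * g + lam * lion_dir
    param.add_(update, alpha=-lr)
```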
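
Finally, a minimal sketch of how a tail index alpha can be estimated from a flattened tensor of updates or gradient-noise samples, assuming an estimator of the Mohammadi et al. type commonly used in analyses of stochastic gradient noise (the paper may use a different estimator, and the block size `k2` is an illustrative choice). Values of alpha near 2 indicate Gaussian-like, less heavy-tailed behavior; smaller values indicate heavier tails.

```python
import math
import torch

def estimate_tail_index(x: torch.Tensor, k2: int = 10) -> float:
    """Estimate the alpha-stable tail index of the entries of x.

    Splits the samples into blocks of size k2; for alpha-stable noise the
    block sums scale as k2**(1/alpha), so the difference of mean log
    magnitudes divided by log(k2) estimates 1/alpha.
    """
    x = x.flatten()
    k1 = x.numel() // k2
    x = x[: k1 * k2]
    y = x.view(k1, k2).sum(dim=1)                    # block sums of size k2
    mean_log_x = x.abs().clamp_min(1e-12).log().mean()
    mean_log_y = y.abs().clamp_min(1e-12).log().mean()
    inv_alpha = (mean_log_y - mean_log_x) / math.log(k2)
    return float(1.0 / inv_alpha)
```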
- PyTorch version of the NIPS'16 paper "Learning to learn by gradient descent by gradient descent" from chenwydj/learning-to-learn-by-gradient-descent-by-gradient-descent.
- Original L2O code from AdrienLE/learning_by_grad_by_grad_repro.
- Meta modules from danieltan07/learning-to-reweight-examples.




