
4 Hyperparameters robustness


The experiment includes a crucial hyperparameter-optimization step, which makes the dynamics of these hyperparameters worth examining. One relevant approach is to quantify the standard deviation of the learning rate across the experiment.
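As a rough illustration, this spread could be computed from the tuned learning rates collected over the sweep. The sketch below is only a minimal example; the dictionary structure and its values are hypothetical placeholders, not results from the experiment.

```python
import numpy as np

# Hypothetical example: best learning rate found for each algorithm at every
# value of the Rosenbrock coefficient b in the sweep (values are made up).
tuned_lrs = {
    "Adam":  [1e-2, 8e-3, 5e-3, 3e-3],
    "Rprop": [2e-2, 1.8e-2, 1.5e-2, 1.4e-2],
}

# Standard deviation of the learning rate across the sweep, per algorithm.
lr_std = {name: float(np.std(lrs)) for name, lrs in tuned_lrs.items()}
print(lr_std)
```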

It is important to acknowledge that directly comparing hyperparameters across algorithms from different classes would be unfair. However, the learning rate is common to all of the algorithms considered and enters them in a mathematically identical way, so it gives a general sense of how widely the tuned learning rate spreads across different values of the b parameter of the Rosenbrock function.

Within this context, examining the behavior of the algorithms with fixed hyperparameters across the entire range of the parameter b offers valuable insight. Let us take the hyperparameters tuned for b=100.0, since this is the most challenging coefficient; hyperparameters tuned for it are expected to remain feasible at lower values of b, even if the resulting performance is poorer.
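A minimal sketch of this setup, assuming a PyTorch-style optimizer interface; the starting point, step count, and the fixed hyperparameter values below are illustrative, not the ones used in the experiment:

```python
import torch

def rosenbrock(xy, b):
    """Rosenbrock function f(x, y) = (1 - x)^2 + b * (y - x^2)^2."""
    x, y = xy
    return (1 - x) ** 2 + b * (y - x ** 2) ** 2

def final_loss(optimizer_cls, hparams, b, steps=500):
    """Minimize the Rosenbrock function with a fixed set of hyperparameters
    and return the loss reached after a fixed number of steps."""
    xy = torch.tensor([-2.0, 2.0], requires_grad=True)
    opt = optimizer_cls([xy], **hparams)
    for _ in range(steps):
        opt.zero_grad()
        loss = rosenbrock(xy, b)
        loss.backward()
        opt.step()
    return loss.item()

# Hyperparameters tuned once for b = 100.0 (the value here is illustrative),
# then reused unchanged across the whole range of b.
fixed_hparams = {"lr": 1e-2}
for b in [1.0, 10.0, 50.0, 100.0]:
    print(f"b = {b:6.1f}  final loss = {final_loss(torch.optim.Adam, fixed_hparams, b):.3e}")
```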

A notable observation in the results is that Rprop remains remarkably accurate, far more so than the other algorithms, especially for small values of the coefficient $b$. The remaining algorithms can be clustered into three distinct groups by their sensitivity to b: weakly influenced (NovoGrad, QHAdam, Adam, AdamP, SWATS, AdaMax), moderately influenced (DiffGrad, AdaMod, Lamb, NAdam, RMSprop, Yogi, AMSgrad, AdaGrad, MadGrad), and strongly influenced (AdaBound, PID, SGDW, SGD, AggMo, RAdam).

To assess the impact of fixing the hyperparameters, we can calculate the slope of the linearized loss and compare it to the original slope, which shows how the fixed hyperparameters change the overall behavior. The difference is expressed as the relative difference between the original slope and the slope obtained with fixed hyperparameters. Plotting this difference against the learning-rate variance of each algorithm gives deeper insight into the robustness of the algorithms' hyperparameters.
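The computation could be sketched as follows; the log-space linearization and the names used here are assumptions, since the exact fitting procedure is not spelled out on this page:

```python
import numpy as np

def linearized_slope(b_values, losses):
    """Slope of a straight-line fit to the loss as a function of b,
    both taken in log space (an assumption about the linearization)."""
    return np.polyfit(np.log10(b_values), np.log10(losses), 1)[0]

def relative_slope_difference(b_values, tuned_losses, fixed_losses):
    """Relative difference between the original slope (hyperparameters
    re-tuned for every b) and the slope obtained with the hyperparameters
    frozen at their b = 100.0 values."""
    s_tuned = linearized_slope(b_values, tuned_losses)
    s_fixed = linearized_slope(b_values, fixed_losses)
    return (s_tuned - s_fixed) / abs(s_tuned)

# The resulting value for each algorithm can then be plotted against the
# learning-rate standard deviation computed earlier.
```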

The results provide a clear picture of how the different algorithms behave, showing a consistent negative linear relationship for all algorithms except Lamb. When the decline in slope is small, the learning-rate variance tends to be wide: for such algorithms, a broad distribution of learning rates, rather than a precisely tuned optimal value, is acceptable because fluctuations in the learning rate have only a marginal impact on performance, and vice versa. While the overall trend holds, the specific values for each algorithm are scattered due to their individual characteristics, which contribute to the variation in performance outcomes. Hence, the diagram serves as a valuable point of reference when selecting an algorithm for a particular task, indicating the expected behavior of different algorithms and aiding informed decision-making.
