diff --git a/projects/learning-gradient-descent-with-synthetic-objectives.md b/projects/learning-gradient-descent-with-synthetic-objectives.md new file mode 100644 index 0000000..251ad44 --- /dev/null +++ b/projects/learning-gradient-descent-with-synthetic-objectives.md @@ -0,0 +1,32 @@ +Title: Learning Gradient Descent with Synthetic Objective Functions
Tagline: Develop techniques for training gradient descent optimizers for neural networks
Date: November 2016
Category: Fundamental Research
Mailing list: https://groups.google.com/forum/#!forum/aion-learning-gradient-descent-with-synthetic-objectives
Contact: Chris Ratcliff - c.j.ratcliff@gmail.com


## Problem description
Current optimization algorithms for neural networks, such as SGD, RMSProp and Adam, are hand-crafted and generally quite simple. This can be partly explained by the high-dimensional, non-convex nature of neural networks' objective functions, for which human intuition, limited as it is to three spatial dimensions, is poorly suited. A learning algorithm may therefore be able to design a superior optimizer.

Recently, Andrychowicz et al. attempted to solve this problem by training an LSTM that takes the gradient at a point, together with its own hidden state, as input and outputs the proposed update to the parameters of the network being trained. They trained on one optimization problem at a time (such as an MLP on the MNIST dataset) but found that the learned optimizer failed to generalize properly, even to networks with the same architecture but a different activation function. Using 'synthetic' objective functions (i.e. functions specified explicitly, in the way a quadratic is) allows an arbitrary number of functions to be generated at negligible cost, improving generalization by providing an effectively infinite training set.


## Why this problem matters
Optimization is at the heart of deep learning: the choice of algorithm affects both accuracy and training time. As neural networks become deeper they are also likely to become harder to train, necessitating more sophisticated optimizers.


## How to measure success
A graph of training loss against the number of iterations, for a network trained under each algorithm, is commonly used to judge which algorithm is better by simple visual inspection. One may also consider plotting loss against wall-clock time rather than the number of iterations. This is a harder metric for a learned optimizer, given the expense of computing one step of an LSTM compared to the single scalar multiplication per parameter of standard SGD.


## Project status
A formula for generating synthetic objective functions has been created. These functions are differentiable, and their dimensionality and degree of non-linearity can be controlled with hyperparameters.

A proof-of-concept optimizer trained with supervised learning has shown that the approach does indeed generalize well, but it currently performs no better than SGD. An alternative approach using reinforcement learning is theoretically superior, as it does not have to approximate the task's objective, but has not produced good results so far. Minimal sketches of both the function generator and the optimizer loop follow.
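As a point of reference for the generator, here is a minimal NumPy sketch of one way such functions could be built: a quadratic bowl plus a random mixture of sinusoids, where `dim` controls the dimensionality and `nonlinearity_scale` the degree of non-linearity. The functional form and all names (`make_synthetic_objective` and its parameters) are illustrative assumptions, not the project's actual formula.

```python
import numpy as np

def make_synthetic_objective(dim, n_terms, nonlinearity_scale, rng):
    """Sample a random, differentiable objective f: R^dim -> R.

    Hypothetical generator: a quadratic bowl keeps the function bounded
    below, and a sum of random sinusoids adds non-convex structure whose
    'wiggliness' grows with nonlinearity_scale.
    """
    W = rng.randn(n_terms, dim)       # random projection directions
    b = rng.randn(n_terms)            # random phase offsets
    c = rng.randn(n_terms) / n_terms  # mixing coefficients

    def f(x):
        return 0.5 * np.dot(x, x) + np.sum(
            c * np.sin(nonlinearity_scale * (W.dot(x) + b)))

    def grad_f(x):
        # Analytic gradient, so each sampled function is cheap to
        # differentiate without any autodiff machinery.
        return x + W.T.dot(
            c * nonlinearity_scale * np.cos(nonlinearity_scale * (W.dot(x) + b)))

    return f, grad_f
```

The learned optimizer itself can be viewed as a generic descent harness in which the update rule is a black box. In the sketch below, `step_fn(gradient, state) -> (update, new_state)` is an assumed interface, not the API of any of the cited papers: an LSTM optimizer would carry its hidden state through `state`, while plain SGD, the baseline the supervised proof of concept currently matches, is a stateless special case.

```python
def run_optimizer(f, grad_f, step_fn, x0, state0, n_steps):
    """Descend on f using a pluggable update rule (assumed interface)."""
    x, state, losses = x0.copy(), state0, []
    for _ in range(n_steps):
        losses.append(f(x))
        update, state = step_fn(grad_f(x), state)
        x = x + update
    return x, losses

# Baseline: SGD is a stateless step_fn with a fixed learning rate.
rng = np.random.RandomState(0)
f, grad_f = make_synthetic_objective(dim=10, n_terms=50,
                                     nonlinearity_scale=2.0, rng=rng)
sgd = lambda g, state: (-0.05 * g, state)
_, losses = run_optimizer(f, grad_f, sgd, x0=rng.randn(10),
                          state0=None, n_steps=100)
```

Recording `losses` at each step yields the loss-versus-iteration curve described under "How to measure success"; timing the loop instead gives the harder wall-clock comparison.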
## References
 - [Andrychowicz, M., Denil, M., Gomez, S., Hoffman, M. W., Pfau, D., Schaul, T., and de Freitas, N. Learning to learn by gradient descent by gradient descent.](https://arxiv.org/pdf/1606.04474v1.pdf)
 - [Kingma, D. and Ba, J. Adam: A Method for Stochastic Optimization.](https://arxiv.org/abs/1412.6980)
 - [Koushik, J. and Hayashi, H. Improving Stochastic Gradient Descent with Feedback.](https://arxiv.org/pdf/1611.01505.pdf)