Linear regression with PyTorch: LBFGS vs Adam

I am looking at using PyTorch for the machine learning component of one of the research projects I am currently working on. It involves regression with relatively small tabular datasets produced by computer simulations. Though highly nonlinear, the data has no noise. I am using shallow (1–2 hidden layers) neural networks for this purpose, which is an atypical use of PyTorch. I did some prototyping with the simpler neural network library from scikit-learn. While the prototype performs as expected, scikit-learn does not have GPU support. Hence PyTorch. One question might be why I need GPU support for shallow models and small datasets. The answer is that even though the models are shallow, I am using a bazillion of them for generalization and uncertainty estimation, and that approach can benefit from the parallelization offered by GPUs. Anyway, it gave me an excuse to play with PyTorch in ways I have not done before.

For scientific research projects like the one I am involved in, convergence is important. Second-order optimizers like LBFGS or Levenberg-Marquardt are generally better at converging than first-order optimizers like SGD or Adam. But only a handful of machine learning libraries include second-order optimizers. Both scikit-learn and PyTorch provide an LBFGS optimizer. Matlab provides Levenberg-Marquardt, among others. There is also a relatively less known neural network library called Pyrenn, which provides Levenberg-Marquardt. As far as I know, no other machine learning library provides second-order optimizers for training neural networks. This is because second-order optimizers typically require quadratic storage and cubic computation time for each update, which quickly becomes infeasible for the large datasets and deep models that most modern machine learning libraries target. For my project this is not an issue.

The LBFGS optimizer that comes with PyTorch lacks certain features, such as mini-batch training and a weak Wolfe line search. Mini-batch training is not very important in my case since the datasets are small, and I do not know enough about line search methods to know whether the lack of one would matter. There are third-party implementations of the LBFGS optimizer for PyTorch that overcome some of these shortcomings. But I wanted to explore the performance of the built-in LBFGS optimizer and see if it is adequate for my purposes. Given the problem I am trying to solve, the best way to benchmark the accuracy of the optimizer is to try out simple regression problems with small datasets. Which led me to this post.

Linear regression is the simplest regression model there is. It is simply a line defined by the equation y = mx + b, in other words a zero-hidden-layer model. Determining the slope (m) and the y-intercept (b) from the data constitutes the learning process. For this post the ground-truth function is y = 2x + 2.

LinearModel class

There is only one input (x) and one output (y).
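Something along these lines should work; a single nn.Linear layer with one input and one output holds the slope as its weight and the intercept as its bias (the attribute name linear here is an illustrative choice, not necessarily what the original code used):

```python
import torch
from torch import nn

class LinearModel(nn.Module):
    """A zero-hidden-layer model: y = m*x + b."""

    def __init__(self):
        super().__init__()
        # One input feature, one output feature; the weight plays the role
        # of the slope m and the bias plays the role of the intercept b.
        self.linear = nn.Linear(in_features=1, out_features=1)

    def forward(self, x):
        return self.linear(x)
```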

Dummy data

While it is not necessary for this simple example, PyTorch provides a Dataset class, which can be subclassed, and a DataLoader class to elegantly feed the data to the model during training.
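A sketch of what that can look like, assuming 100 evenly spaced samples from the ground-truth line (the class name DummyDataset, the sample count, and the x range are my own choices for illustration):

```python
import torch
from torch.utils.data import Dataset, DataLoader

class DummyDataset(Dataset):
    """Synthetic samples drawn from the ground truth y = 2x + 2."""

    def __init__(self, n_samples=100):
        self.x = torch.linspace(-5.0, 5.0, n_samples).unsqueeze(1)
        self.y = 2.0 * self.x + 2.0

    def __len__(self):
        return len(self.x)

    def __getitem__(self, idx):
        return self.x[idx], self.y[idx]

dataset = DummyDataset()
# The dataset is tiny, so a single full batch per epoch is fine.
loader = DataLoader(dataset, batch_size=len(dataset), shuffle=True)
```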

Training

With Adam
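A minimal training loop with Adam might look like the following; the learning rate and the 20-epoch count are assumptions, chosen only to mirror the comparison below:

```python
model_adam = LinearModel()
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model_adam.parameters(), lr=0.1)

for epoch in range(20):
    for x_batch, y_batch in loader:
        optimizer.zero_grad()
        loss = criterion(model_adam(x_batch), y_batch)
        loss.backward()   # compute gradients
        optimizer.step()  # one first-order update per batch
    print(f"epoch {epoch}: loss = {loss.item():.6f}")
```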

With LBFGS

The LBFGS optimizer needs to evaluate the function multiple times per step. The PyTorch documentation says that the user needs to supply a closure that allows the optimizer to recompute the loss.
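The loop therefore wraps the forward and backward passes in a closure and hands it to optimizer.step. Again, the learning rate and epoch count here are illustrative:

```python
model_lbfgs = LinearModel()
criterion = nn.MSELoss()
optimizer = torch.optim.LBFGS(model_lbfgs.parameters(), lr=0.1)

for epoch in range(20):
    for x_batch, y_batch in loader:

        def closure():
            # LBFGS may call this several times per step while it searches
            # for a good step length, so the closure must redo the forward
            # and backward passes and return the loss.
            optimizer.zero_grad()
            loss = criterion(model_lbfgs(x_batch), y_batch)
            loss.backward()
            return loss

        optimizer.step(closure)
```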

Comparison

Woah! Look at that loss! LBFGS takes the loss almost to zero after just twenty epochs. Actually even before that, around five epochs. The loss seems to indicate that LBFGS is performing much better than Adam, but it is best to compare the performances by doing some predictions with the models trained with the different optimizers.
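One way to do that is to predict on a grid of test points and plot both trained models against the ground truth. The snippet below is a sketch of such a comparison; the original plot may have been produced differently:

```python
import matplotlib.pyplot as plt

x_test = torch.linspace(-5.0, 5.0, 50).unsqueeze(1)
with torch.no_grad():
    y_adam = model_adam(x_test)
    y_lbfgs = model_lbfgs(x_test)

plt.plot(x_test, 2.0 * x_test + 2.0, label="ground truth")
plt.plot(x_test, y_adam, "--", label="Adam")
plt.plot(x_test, y_lbfgs, ":", label="LBFGS")
plt.xlabel("x")
plt.ylabel("y")
plt.legend()
plt.show()
```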

The graph shows what the loss hinted. For this simple model with a small synthetic dataset, LBFGS performs far better than Adam. It is possible that with more training epochs Adam would converge to a similar result.

With a significantly larger number of training epochs (about 60 times as many), the performance of Adam comes close to that of LBFGS. It still remains to be seen whether the third-party implementations of the LBFGS optimizer give better performance than the built-in one, and how LBFGS fares in general when faced with nonlinear data and/or deeper models. But for now LBFGS wins the crown.

Originally published at https://soham.dev on February 9, 2021.
