1. Mehta D, Chen T, Tang T, Hauenstein JD. The Loss Surface of Deep Linear Networks Viewed Through the Algebraic Geometry Lens. IEEE Transactions on Pattern Analysis and Machine Intelligence 2022; 44:5664-5680. PMID: 33822722. DOI: 10.1109/tpami.2021.3071289.
Abstract
Using the viewpoint of modern computational algebraic geometry, we explore properties of the optimization landscapes of deep linear neural network models. After clarifying the various definitions of "flat" minima, we show that the geometrically flat minima, which are merely artifacts of residual continuous symmetries of the deep linear networks, can be straightforwardly removed by a generalized L2 regularization. We then establish upper bounds on the number of isolated stationary points of these networks with the help of algebraic geometry. Combining these upper bounds with a method from numerical algebraic geometry, we find all stationary points for networks of modest depth and matrix size. We demonstrate that, in the presence of non-zero regularization, deep linear networks can indeed possess local minima that are not global minima. Finally, we show that even though the number of stationary points grows as the number of neurons increases or the regularization parameter decreases, higher-index saddles are surprisingly rare.
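The continuous symmetry behind these geometrically flat minima is easy to demonstrate numerically. Below is a minimal numpy sketch (an illustration written for this listing, not the authors' code) of a two-layer linear network: the unregularized loss is invariant under the rescaling (W1, W2) -> (c W1, W2 / c), whereas an L2 penalty (a plain stand-in for the paper's generalized regularization) breaks the symmetry and removes the flat direction.

```python
# Minimal sketch (not the authors' code): a two-layer linear network with an
# optional L2 penalty standing in for the paper's generalized regularization.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 20))    # inputs,  d_in = 5, 20 samples
Y = rng.normal(size=(3, 20))    # targets, d_out = 3
W1 = rng.normal(size=(4, 5))    # first layer
W2 = rng.normal(size=(3, 4))    # second layer

def loss(W1, W2, lam=0.0):
    residual = W2 @ W1 @ X - Y
    penalty = lam * (np.sum(W1**2) + np.sum(W2**2))
    return 0.5 * np.sum(residual**2) + penalty

# The rescaling (W1, W2) -> (c*W1, W2/c) leaves the product W2 @ W1 unchanged,
# so the unregularized loss is constant along this continuous family of
# parameters (a geometrically flat direction):
c = 2.7
print(np.isclose(loss(W1, W2), loss(c * W1, W2 / c)))            # True
# A non-zero L2 penalty breaks the symmetry, so the flat direction disappears:
print(np.isclose(loss(W1, W2, 0.1), loss(c * W1, W2 / c, 0.1)))  # False
```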

2. Niroomand MP, Cafolla CT, Morgan JWR, Wales DJ. Characterising the area under the curve loss function landscape. Machine Learning: Science and Technology 2022. DOI: 10.1088/2632-2153/ac49a9.
Abstract
One of the most common metrics used to evaluate neural network classifiers is the area under the receiver operating characteristic curve (AUC). However, optimising the AUC directly as the loss function during network training is not standard practice. Here we compare minimising the cross-entropy (CE) loss with optimising the AUC directly. In particular, we analyse the loss function landscape (LFL) of approximate AUC (appAUC) loss functions to discover the organisation of this solution space. We discuss various surrogates for AUC approximation and show their differences. We find that the characteristics of the appAUC landscape differ significantly from those of the CE landscape. The approximate AUC loss function improves testing AUC, and the appAUC landscape has substantially more minima, but these minima are less robust, with larger average Hessian eigenvalues. We provide a theoretical foundation to explain these results. Finally, we provide an overview of how the LFL can help guide loss function analysis and selection more generally.
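To make the idea of an approximate AUC loss concrete, the sketch below implements one common surrogate, a pairwise sigmoid relaxation of the empirical AUC; the paper compares several surrogates, so this particular form should be read as an illustrative assumption rather than the exact appAUC used there.

```python
# Illustrative sketch of one smooth AUC surrogate: replace the step function
# in the empirical AUC (fraction of correctly ranked positive/negative pairs)
# with a sigmoid so the loss becomes differentiable. Not necessarily the
# exact appAUC surrogate analysed in the paper.
import numpy as np

def auc_exact(scores, labels):
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    diffs = pos[:, None] - neg[None, :]          # all positive/negative pairs
    return np.mean((diffs > 0) + 0.5 * (diffs == 0))

def app_auc_loss(scores, labels, beta=10.0):
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    diffs = pos[:, None] - neg[None, :]
    return 1.0 - np.mean(1.0 / (1.0 + np.exp(-beta * diffs)))

labels = np.array([1, 1, 0, 0, 1, 0])
scores = np.array([0.9, 0.4, 0.3, 0.6, 0.8, 0.1])
print(auc_exact(scores, labels))        # 0.888...
print(app_auc_loss(scores, labels))     # differentiable proxy for 1 - AUC
```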

3. Dennis C, Engelbrecht A, Ombuki-Berman BM. An analysis of the impact of subsampling on the neural network error surface. Neurocomputing 2021. DOI: 10.1016/j.neucom.2021.09.023.

4. Frye CG, Simon J, Wadia NS, Ligeralde A, DeWeese MR, Bouchard KE. Critical Point-Finding Methods Reveal Gradient-Flat Regions of Deep Network Losses. Neural Computation 2021; 33:1469-1497. PMID: 34496389. DOI: 10.1162/neco_a_01388.
Abstract
Despite the fact that the loss functions of deep neural networks are highly nonconvex, gradient-based optimization algorithms converge to approximately the same performance from many random initial points. One thread of work has focused on explaining this phenomenon by numerically characterizing the local curvature near critical points of the loss function, where the gradients are near zero. Such studies have reported that neural network losses enjoy a no-bad-local-minima property, in disagreement with more recent theoretical results. We report here that the methods used to find these putative critical points suffer from a bad-local-minima problem of their own: they often converge to or pass through regions where the gradient norm has a stationary point. We call these gradient-flat regions, since they arise when the gradient is approximately in the kernel of the Hessian, so that the loss is locally approximately linear, or flat, in the direction of the gradient. We describe how the presence of these regions necessitates care both in interpreting past results that claimed to find critical points of neural network losses and in designing second-order methods for optimizing neural networks.
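The failure mode is easy to reproduce on a toy function. The numpy sketch below (a toy example constructed for illustration, not the paper's networks or code) runs the usual critical-point search, gradient descent on rho = 0.5 * ||grad f||^2, on f(x, y) = y + x^2. The search halts at a point where the gradient is far from zero but lies in the kernel of the Hessian, exactly the gradient-flat situation described above; the ratio ||H g|| / ||g|| is one simple diagnostic for it.

```python
# Toy illustration of a gradient-flat region. For f(x, y) = y + x**2 the
# gradient is (2x, 1) and the Hessian is [[2, 0], [0, 0]]. Descending
# rho = 0.5 * ||grad f||^2 (whose gradient is H @ g) drives x -> 0, where
# g = (0, 1) is non-zero yet lies in the kernel of H.
import numpy as np

def grad(p):
    x, y = p
    return np.array([2.0 * x, 1.0])

def hess(p):
    return np.array([[2.0, 0.0], [0.0, 0.0]])

p = np.array([1.5, -0.3])
for _ in range(200):
    g, H = grad(p), hess(p)
    p = p - 0.1 * (H @ g)           # gradient step on the squared gradient norm

g, H = grad(p), hess(p)
print("grad norm   :", np.linalg.norm(g))                           # ~1, not a critical point
print("||Hg||/||g||:", np.linalg.norm(H @ g) / np.linalg.norm(g))   # ~0, gradient-flat
```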

Affiliation(s)
- Charles G Frye: Redwood Center for Theoretical Neuroscience and Helen Wills Neuroscience Institute, University of California, Berkeley, CA 94720, U.S.A.
- James Simon: Redwood Center for Theoretical Neuroscience and Department of Physics, University of California, Berkeley, CA 94720, U.S.A.
- Neha S Wadia: Redwood Center for Theoretical Neuroscience and Biophysics Graduate Group, University of California, Berkeley, CA 94720, U.S.A.
- Andrew Ligeralde: Redwood Center for Theoretical Neuroscience and Biophysics Graduate Group, University of California, Berkeley, CA 94720, U.S.A.
- Michael R DeWeese: Redwood Center for Theoretical Neuroscience, Helen Wills Neuroscience Institute, Department of Physics, and Biophysics Graduate Group, University of California, Berkeley, CA 94720, U.S.A.
- Kristofer E Bouchard: Redwood Center for Theoretical Neuroscience and Helen Wills Neuroscience Institute, University of California, Berkeley, CA 94720, U.S.A.; Biological Systems and Engineering Division and Computational Research Division, Lawrence Berkeley National Lab, Berkeley, CA 94720, U.S.A.

5. Verpoort PC, Lee AA, Wales DJ. Archetypal landscapes for deep neural networks. Proceedings of the National Academy of Sciences of the United States of America 2020; 117:21857-21864. PMID: 32843349. PMCID: PMC7486703. DOI: 10.1073/pnas.1919995117.
Abstract
The predictive capabilities of deep neural networks (DNNs) continue to evolve to increasingly impressive levels. However, it is still unclear how training procedures for DNNs succeed in finding parameters that produce good results for such high-dimensional and nonconvex loss functions. In particular, we wish to understand why simple optimization schemes, such as stochastic gradient descent, do not end up trapped in local minima with high loss values that would not yield useful predictions. We explain the optimizability of DNNs by characterizing the local minima and transition states of the loss-function landscape (LFL) along with their connectivity. We show that the LFL of a DNN in the shallow network or data-abundant limit is funneled, and thus easy to optimize. Crucially, in the opposite low-data/deep limit, although the number of minima increases, the landscape is characterized by many minima with similar loss values separated by low barriers. This organization is different from the hierarchical landscapes of structural glass formers and explains why minimization procedures commonly employed by the machine-learning community can navigate the LFL successfully and reach low-lying solutions.
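For context, landscape analyses of this kind classify stationary points by their Hessian index (the number of negative Hessian eigenvalues): local minima have index 0 and transition states have index 1. The sketch below shows that bookkeeping on two hand-picked Hessians; it is a generic illustration, not the authors' landscape-exploration software.

```python
# Generic illustration: classify a stationary point by its Hessian index,
# i.e. the number of negative Hessian eigenvalues. Index 0 is a local
# minimum; index 1 is a transition state connecting two minima.
import numpy as np

def hessian_index(hessian, tol=1e-8):
    eigvals = np.linalg.eigvalsh(hessian)
    return int(np.sum(eigvals < -tol))

H_minimum = np.array([[2.0, 0.3], [0.3, 1.0]])       # positive definite
H_transition = np.array([[2.0, 0.0], [0.0, -0.5]])   # one downhill direction
print(hessian_index(H_minimum), hessian_index(H_transition))   # 0 1
```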

Affiliation(s)
- Philipp C Verpoort: Department of Physics, University of Cambridge, Cambridge CB3 0HE, United Kingdom
- Alpha A Lee: Department of Physics, University of Cambridge, Cambridge CB3 0HE, United Kingdom
- David J Wales: Department of Chemistry, University of Cambridge, Cambridge CB2 1EW, United Kingdom

6. Bosman AS, Engelbrecht A, Helbig M. Visualising basins of attraction for the cross-entropy and the squared error neural network loss functions. Neurocomputing 2020. DOI: 10.1016/j.neucom.2020.02.113.

7. Chitturi SR, Verpoort PC, Lee AA, Wales DJ. Perspective: new insights from loss function landscapes of neural networks. Machine Learning: Science and Technology 2020. DOI: 10.1088/2632-2153/ab7aef.
Abstract
We investigate the structure of the loss function landscape for neural networks subject to dataset mislabelling, increased training set diversity, and reduced node connectivity, using various techniques developed for energy landscape exploration. The benchmarking models are classification problems for atomic geometry optimisation and hand-written digit prediction. We consider the effect of varying the size of the atomic configuration space used to generate initial geometries and find that the number of stationary points increases rapidly with the size of the training configuration space. We introduce a measure of node locality to limit network connectivity and perturb permutational weight symmetry, and examine how this parameter affects the resulting landscapes. We find that highly reduced systems have low capacity and exhibit landscapes with very few minima. On the other hand, small amounts of reduced connectivity can enhance network expressibility and yield more complex landscapes. Investigating the effect of deliberate classification errors in the training data, we find that the variance in testing AUC, computed over a sample of minima, grows significantly with the training error, providing new insight into the role of the bias-variance trade-off when training under noise. Finally, we illustrate how the number of local minima for networks with two and three hidden layers, but a comparable number of variable edge weights, increases significantly with the number of layers and as the amount of training data decreases. This work helps shed further light on neural network loss landscapes and provides guidance for future work on neural network training and optimisation.
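As an illustration of how a locality constraint can limit connectivity and perturb permutational weight symmetry, the sketch below masks a single weight matrix so that each hidden unit only connects to nearby inputs. The specific locality measure (a 1-D position per unit with a distance cutoff) is an assumption made for this example, not the measure defined in the paper.

```python
# Illustrative locality mask (the locality measure here is an assumption,
# not the paper's definition): assign each input and hidden unit a position
# on [0, 1] and keep only connections whose positions are within `locality`.
import numpy as np

def locality_mask(n_in, n_hidden, locality):
    in_pos = np.linspace(0.0, 1.0, n_in)
    hid_pos = np.linspace(0.0, 1.0, n_hidden)
    dist = np.abs(hid_pos[:, None] - in_pos[None, :])
    return (dist <= locality).astype(float)        # shape (n_hidden, n_in)

rng = np.random.default_rng(1)
W = rng.normal(size=(8, 16))                       # dense weight matrix
mask = locality_mask(n_in=16, n_hidden=8, locality=0.2)
W_local = W * mask                                 # reduced-connectivity weights
print(int(mask.sum()), "of", mask.size, "connections kept")
# During training the mask would be reapplied after every update so that the
# pruned connections stay at zero.
```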

8. Becker S, Zhang Y, Lee AA. Geometry of Energy Landscapes and the Optimizability of Deep Neural Networks. Physical Review Letters 2020; 124:108301. PMID: 32216422. DOI: 10.1103/physrevlett.124.108301.
Abstract
Deep neural networks are workhorse models in machine learning with multiple layers of nonlinear functions composed in series. Their loss function is highly nonconvex, yet empirically even gradient descent minimization is sufficient to arrive at accurate and predictive models. It is hitherto unknown why deep neural networks are easily optimizable. We analyze the energy landscape of a spin glass model of deep neural networks using random matrix theory and algebraic geometry. We analytically show that the multilayered structure holds the key to optimizability: Fixing the number of parameters and increasing network depth, the number of stationary points in the loss function decreases, minima become more clustered in parameter space, and the trade-off between the depth and width of minima becomes less severe. Our analytical results are numerically verified through comparison with neural networks trained on a set of classical benchmark datasets. Our model uncovers generic design principles of machine learning models.
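For readers unfamiliar with the model class, the sketch below sets up a spherical 3-spin spin-glass energy, the kind of toy Hamiltonian commonly used as a proxy for deep-network losses in this line of work, and runs projected gradient descent on it. The sizes, couplings, and descent scheme are choices made for illustration; this is not the authors' random-matrix or algebraic-geometry analysis.

```python
# Illustrative setup of a spherical 3-spin spin-glass energy (a standard toy
# proxy for deep-network losses), followed by projected gradient descent on
# the sphere |s|^2 = N. Sizes and couplings are arbitrary choices; this is
# not the authors' analysis.
import numpy as np

rng = np.random.default_rng(2)
N = 30
J = rng.normal(size=(N, N, N))
# Symmetrize the couplings so the gradient takes the simple form 3*J[s, s].
J = sum(J.transpose(p) for p in
        [(0, 1, 2), (0, 2, 1), (1, 0, 2), (1, 2, 0), (2, 0, 1), (2, 1, 0)]) / 6.0

def energy(s):
    return np.einsum('ijk,i,j,k->', J, s, s, s) / N

def sphere_grad(s):
    g = 3.0 * np.einsum('ijk,j,k->i', J, s, s) / N
    return g - (g @ s / N) * s      # project onto the tangent space of the sphere

s = rng.normal(size=N)
s *= np.sqrt(N) / np.linalg.norm(s)                # start on the sphere
for _ in range(500):
    s = s - 0.05 * sphere_grad(s)                  # descend the energy
    s *= np.sqrt(N) / np.linalg.norm(s)            # retract back onto the sphere

print("energy:", energy(s))
print("residual tangent gradient norm:", np.linalg.norm(sphere_grad(s)))
```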

Affiliation(s)
- Simon Becker: Department of Applied Mathematics and Theoretical Physics, University of Cambridge, Wilberforce Road, Cambridge CB3 0WA, United Kingdom
- Yao Zhang: Department of Applied Mathematics and Theoretical Physics, University of Cambridge, Wilberforce Road, Cambridge CB3 0WA, United Kingdom; Cavendish Laboratory, University of Cambridge, Cambridge CB3 0HE, United Kingdom
- Alpha A Lee: Cavendish Laboratory, University of Cambridge, Cambridge CB3 0HE, United Kingdom