[ML Seminars] Why does standard gradient descent work so well in practice?
Tuyen Trung Truong talks about the gradient descent method, and why such a simple method performs so well in practice.
Standard gradient descent is a popular optimisation method used in many fields, including Deep Learning. It was introduced by Cauchy in 1847, and although many of its properties have since been discovered, many more remain to be found. In particular, it has long been debated why this simple method works so efficiently (most of the time it converges to a minimum point). In this talk, I will present a brief overview of the current status of gradient descent methods, in theory and in practice in Deep Learning, and in particular my recent joint work (available at arXiv:1808.05160) with Tuan Hang Nguyen (AXON AI Research). In this work, we prove that for most functions (including all Morse functions), the backtracking variant of gradient descent either converges to a single critical point or diverges to infinity, and we illustrate how it can be used very efficiently in Deep Learning (in particular, it helps avoid manual fine-tuning of learning rates). This result can be used to provide a heuristic explanation for the question in the title.
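For readers unfamiliar with the backtracking variant, the following is a minimal sketch of gradient descent with an Armijo-type backtracking line search, the textbook form of the idea. It is not taken from the talk or the paper: the function name `backtracking_gd`, the parameters `delta0`, `alpha`, `beta`, their default values, and the test function are all assumptions chosen for illustration.

```python
import numpy as np

def backtracking_gd(f, grad, x0, delta0=1.0, alpha=0.5, beta=0.5,
                    tol=1e-8, max_iter=10_000):
    """Gradient descent with Armijo backtracking (illustrative sketch).

    delta0: initial trial learning rate at every step (assumed default)
    alpha:  sufficient-decrease constant in (0, 1) (assumed default)
    beta:   shrink factor for the learning rate in (0, 1) (assumed default)
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        g_norm_sq = float(g @ g)
        if g_norm_sq < tol ** 2:
            break  # gradient is (numerically) zero: near a critical point
        fx = f(x)
        delta = delta0
        # Armijo condition: accept delta once
        #   f(x - delta * g) <= f(x) - alpha * delta * ||g||^2
        # and otherwise shrink delta by the factor beta.
        while f(x - delta * g) > fx - alpha * delta * g_norm_sq:
            delta *= beta
        x = x - delta * g
    return x

# Hypothetical test problem: a simple quadratic with minimum at (1, -3).
def f(x):
    return (x[0] - 1.0) ** 2 + 2.0 * (x[1] + 3.0) ** 2

def grad_f(x):
    return np.array([2.0 * (x[0] - 1.0), 4.0 * (x[1] + 3.0)])

x_min = backtracking_gd(f, grad_f, x0=[0.0, 0.0])
```

The learning rate is chosen automatically at each step rather than fixed in advance, which is the sense in which backtracking can reduce the need for manual tuning. The sketch restarts from `delta0` at every iteration; other schedules are possible.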