Connections between Stochasticity of SGD and Generalizability
Presented at Microsoft Research Lab - India, Bengaluru, India, 2019
This is an attempt to understand how stochasticity in an optimization algorithm affects the generalization properties of a neural network. Current work focuses on characterizing the distribution of stochastic gradient noise, since it correlates directly with generalization in most empirical settings. There has been recent interest in understanding the distributional properties of SGD, as they affect the optimization dynamics in finding good local minima (which have superior generalization properties). Using hypothesis testing, we observe that for batch sizes \(256\) and above, the distribution is best described as a Gaussian, at least in the early phases of training. This holds across datasets, architectures, and other choices. An interesting direction here is to understand what distributional characteristics an optimization algorithm needs for generalizability on loss landscapes like that of a general neural network.
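As a rough illustration of the kind of hypothesis testing involved, the sketch below estimates stochastic gradient noise (minibatch gradient minus full-batch gradient) for a small model and runs a normality test on one gradient coordinate. The model, synthetic data, and choice of test (D'Agostino-Pearson via `scipy.stats.normaltest`) are illustrative assumptions, not the paper's exact protocol.

```python
# Minimal sketch: sample stochastic gradient noise and test for Gaussianity.
# Model, data, and test choice are illustrative assumptions.
import torch
import torch.nn as nn
from scipy import stats

torch.manual_seed(0)

# Synthetic regression data and a tiny MLP (stand-ins for a real task/network).
X, y = torch.randn(4096, 20), torch.randn(4096, 1)
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 1))
loss_fn = nn.MSELoss()

def grad_vector(xb, yb):
    """Flattened gradient of the loss over the batch (xb, yb)."""
    model.zero_grad()
    loss_fn(model(xb), yb).backward()
    return torch.cat([p.grad.flatten() for p in model.parameters()])

full_grad = grad_vector(X, y)          # full-batch gradient
batch_size, n_samples = 256, 200
noise_samples = []
for _ in range(n_samples):
    idx = torch.randint(0, X.shape[0], (batch_size,))
    # Stochastic gradient noise: minibatch gradient minus full-batch gradient.
    noise = grad_vector(X[idx], y[idx]) - full_grad
    noise_samples.append(noise[0].item())   # track a single coordinate

stat, p_value = stats.normaltest(noise_samples)
print(f"normality test p-value: {p_value:.3f}")
print("consistent with Gaussian" if p_value > 0.05 else "rejects Gaussian")
```

In practice one would repeat such a test across coordinates, layers, training phases, and batch sizes before drawing conclusions about the noise distribution.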
The current results were presented at the Science meets Engineering of Deep Learning workshop at Neural Information Processing Systems (NeurIPS) 2019.
The paper can be found here.