This is an attempt to understand how stochasticity in an optimization algorithm affects the generalization properties of a neural network. Current work focuses on the distribution of stochastic gradient noise, since it correlates directly with generalization in most empirical settings. There has been recent interest in how the distributional properties of SGD shape the optimization dynamics that lead to good local minima (which have superior generalization properties). Using hypothesis testing, we observe that for batch sizes \(256\) and above, the noise distribution is best described as Gaussian, at least in the early phases of training. This holds across datasets, architectures, and other choices. An interesting direction here is to understand which distributional characteristics of an optimization algorithm are needed for generalization on loss landscapes like those of a general neural network.
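The kind of hypothesis test described above can be sketched as follows. This is a minimal toy example, not the setup from the paper: it uses a linear least-squares loss in place of a real network, projects the minibatch-gradient noise onto a fixed random direction to get scalar samples, and applies a Shapiro-Wilk normality test to those samples. The problem dimensions, batch sizes, and trial counts are illustrative choices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Toy linear-regression problem standing in for a network loss
# (hypothetical setup; the original work uses real architectures/datasets).
n, d = 4096, 10
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)
w = np.zeros(d)  # current iterate at which we measure gradient noise

def grad(idx):
    # Gradient of 0.5 * mean squared error over the rows in `idx`.
    r = X[idx] @ w - y[idx]
    return X[idx].T @ r / len(idx)

full_grad = grad(np.arange(n))

def noise_samples(batch_size, trials=200):
    # Project minibatch-gradient noise onto a fixed random direction
    # so that we obtain scalar samples to test for Gaussianity.
    u = rng.normal(size=d)
    u /= np.linalg.norm(u)
    out = []
    for _ in range(trials):
        idx = rng.choice(n, size=batch_size, replace=False)
        out.append((grad(idx) - full_grad) @ u)
    return np.array(out)

for b in (32, 256):
    stat, p = stats.shapiro(noise_samples(b))
    print(f"batch={b}: Shapiro-Wilk p = {p:.3f}")
```

A large p-value means the test fails to reject the Gaussian hypothesis for that batch size; in this linear toy problem the noise is near-Gaussian by the central limit theorem even for small batches, so the interesting deviations only appear on real network losses.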
The paper can be found here.