Understanding Why Neural Networks Generalize Well Through GSNR of Parameters
Authors: Jinlong Liu, Yunzhi Bai, Guoqing Jiang, Ting Chen, Huayan Wang
ICLR 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In the remainder of this paper we first analyze the relation between GSNR and generalization (Section 2). We then show how the training dynamics lead to large GSNR of model parameters experimentally and analytically in Section 3. We performed the above estimations on MNIST with a simple CNN structure consisting of 2 Conv-ReLU-Max Pooling blocks and 2 fully-connected layers. First, to estimate eq. (24) with M = 10, we randomly sample 10 training sets of size n and a test set of size 10,000. (A sketch of this architecture and the GSNR estimate appears below the table.) |
| Researcher Affiliation | Industry | Jinlong Liu¹, Guo-qing Jiang¹, Yunzhi Bai¹, Ting Chen², and Huayan Wang¹. ¹Ytech, KWAI incorporation ({liujinlong,jiangguoqing,baiyunzhi,wanghuayan}@kuaishou.com); ²Samsung Research China Beijing (SRC-B) (ting11.chen@samsung.com) |
| Pseudocode | No | The paper includes mathematical equations and derivations, but it does not provide any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | No | The paper does not contain any statements or links indicating that open-source code for the methodology described is available. |
| Open Datasets | Yes | We performed the above estimations on MNIST with a simple CNN structure... We also conducted the same experiment on CIFAR10 (Appendix A.2) and a toy dataset (Appendix A.3) and observed the same behavior. |
| Dataset Splits | No | The paper mentions 'training set' and 'test set' and their sizes (e.g., training set $D = \{(x_1, y_1), \ldots, (x_n, y_n)\} \in Z^n$ and test set $D' = \{(x'_1, y'_1), \ldots, (x'_{n'}, y'_{n'})\} \in Z^{n'}$ in Section 2.2; 'The training set and test set sizes are 200 and 10,000, respectively' in Section 3.2), but it does not specify a validation set or a training/validation/test split. |
| Hardware Specification | No | The paper does not specify any details about the hardware (e.g., CPU, GPU models, memory) used to run the experiments. |
| Software Dependencies | No | The paper mentions using 'gradient descent training' and different network structures (CNN, MLP), but it does not specify any particular software libraries, frameworks, or their version numbers (e.g., PyTorch, TensorFlow, scikit-learn, Python version). |
| Experiment Setup | Yes | To cover different conditions, we (1) choose $n \in \{1000, 2000, 4000, 6000, 8000, 10000, 15000\}$, respectively; (2) inject noise by randomly changing the labels with probability $p_{\mathrm{random}} \in \{0.0, 0.1, 0.2, 0.3, 0.5\}$; (3) change the model structure by varying the number of channels in the layers, $ch \in \{6, 8, 10, 12, 14, 16, 18, 20\}$. See Appendix A for more details of the setup. We use gradient descent training (not SGD), with a small learning rate of 0.001. (A sketch of this training setup appears below the table.) |
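
The Research Type row quotes a simple CNN (2 Conv-ReLU-Max Pooling blocks followed by 2 fully-connected layers on MNIST) and an estimate of per-parameter GSNR, defined in the paper as the squared mean of the per-sample gradient divided by its variance. Below is a minimal sketch of both, assuming PyTorch (the paper names no framework); the kernel sizes, hidden width of 64, and the `SimpleCNN` / `gsnr` names are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    """2 Conv-ReLU-MaxPool blocks followed by 2 fully-connected layers (MNIST, 1x28x28 inputs)."""
    def __init__(self, ch: int = 8, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, ch, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(ch, ch, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(ch * 7 * 7, 64), nn.ReLU(),  # hidden width 64 is an assumption
            nn.Linear(64, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

def gsnr(model: nn.Module, loss_fn, xs: torch.Tensor, ys: torch.Tensor) -> torch.Tensor:
    """Estimate GSNR r(theta_j) = mean(g_j)^2 / var(g_j), where g_j is the
    per-sample gradient of the loss w.r.t. parameter j."""
    per_sample_grads = []
    for x, y in zip(xs, ys):
        model.zero_grad()
        loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0)).backward()
        flat = torch.cat([p.grad.flatten() for p in model.parameters()])
        per_sample_grads.append(flat.detach().clone())
    g = torch.stack(per_sample_grads)             # shape: (num_samples, num_params)
    mean, var = g.mean(dim=0), g.var(dim=0, unbiased=False)
    return mean.pow(2) / (var + 1e-12)            # epsilon guards against zero variance
```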
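
The Experiment Setup row describes full-batch gradient descent (not SGD) with a learning rate of 0.001, label noise injected with probability p_random, and a grid over training-set size and channel count. A minimal sketch of that setup follows, again assuming PyTorch; `inject_label_noise`, `train_full_batch`, and the step count of 1000 are illustrative assumptions rather than details reported in the paper.

```python
import torch
import torch.nn as nn

def inject_label_noise(labels: torch.Tensor, p_random: float, num_classes: int = 10) -> torch.Tensor:
    """Randomly change each label to a uniformly drawn class with probability p_random."""
    mask = torch.rand(len(labels)) < p_random
    noisy = labels.clone()
    noisy[mask] = torch.randint(0, num_classes, (int(mask.sum().item()),))
    return noisy

def train_full_batch(model: nn.Module, x: torch.Tensor, y: torch.Tensor,
                     lr: float = 0.001, steps: int = 1000) -> nn.Module:
    """Full-batch gradient descent (not SGD): one update per pass over the whole training set."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)  # plain GD when fed the full batch
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(steps):
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()
    return model

# Hyperparameter grid quoted in the setup row.
TRAIN_SIZES = [1000, 2000, 4000, 6000, 8000, 10000, 15000]
NOISE_PROBS = [0.0, 0.1, 0.2, 0.3, 0.5]
CHANNELS    = [6, 8, 10, 12, 14, 16, 18, 20]
```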