Coherent Gradients: An Approach to Understanding Generalization in Gradient Descent-based Optimization
Author: Satrajit Chatterjee
ICLR 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We support this hypothesis with heuristic arguments and perturbative experiments and outline how this can explain several common empirical observations about Deep Learning. Furthermore, our analysis is not just descriptive, but prescriptive. It suggests a natural modification to gradient descent that can greatly reduce overfitting. |
| Researcher Affiliation | Industry | Satrajit Chatterjee Google AI Mountain View, CA 94043, USA schatter@google.com |
| Pseudocode | No | The paper describes methods and computations in narrative text and mathematical formulas (e.g., the winsorized gradient computation in Section 3) but does not provide structured pseudocode or algorithm blocks. (A hedged sketch of one possible winsorization step follows the table.) |
| Open Source Code | No | The paper does not contain any statements about releasing code for the described methodology or provide a link to an open-source repository. |
| Open Datasets | Yes | For our baseline, we use the standard MNIST dataset of 60,000 training examples and 10,000 test examples. |
| Dataset Splits | No | The paper mentions '60,000 training examples and 10,000 test examples' for MNIST. While Figure 1(b) is labeled 'Validation accuracy', the paper does not explicitly describe a separate validation dataset split (e.g., in terms of percentage or number of samples) distinct from the training and test sets. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used to run the experiments. It only mentions a 'smaller carbon footprint', which implies less demanding hardware but gives no specifics. |
| Software Dependencies | No | The paper describes the learning algorithm (vanilla SGD, cross entropy loss) and network components (ReLUs, softmax) but does not specify software dependencies with version numbers, such as programming languages, deep learning frameworks (e.g., TensorFlow, PyTorch), or specific library versions used for implementation. |
| Experiment Setup | Yes | The network has one hidden layer with 2048 ReLUs and an output layer with a 10-way softmax. We initialize it with Xavier and train using vanilla SGD (i.e., no momentum) using cross entropy loss with a constant learning rate of 0.1 and a minibatch size of 100 for 10^5 steps (i.e., about 170 epochs). In the case of Winsorized SGD, we use a smaller network with 3 hidden layers of 256 ReLUs each, and train for 60,000 steps (i.e., 100 epochs) with a fixed learning rate of 0.1. We train on the baseline dataset and the 4 noisy variants with c ∈ {0, 1, 2, 4, 8}. (A hedged reconstruction of this setup appears below the table.) |
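
Since the paper names no framework (see the Software Dependencies row), the baseline setup quoted in the Experiment Setup row can only be reconstructed approximately. Below is a minimal PyTorch sketch, assuming a flattened 784-dimensional MNIST input and that "Xavier" means uniform Xavier initialization; the training loop itself (minibatch size 100, 10^5 steps) is omitted.

```python
import torch
import torch.nn as nn

# Hypothetical reconstruction of the baseline MNIST model described in the paper:
# one hidden layer of 2048 ReLUs, a 10-way softmax output, Xavier initialization,
# and vanilla SGD (no momentum) with a constant learning rate of 0.1.
model = nn.Sequential(
    nn.Flatten(),            # 28x28 MNIST images -> 784-dim vectors
    nn.Linear(784, 2048),
    nn.ReLU(),
    nn.Linear(2048, 10),     # logits; the softmax is folded into the loss below
)

for layer in model:
    if isinstance(layer, nn.Linear):
        nn.init.xavier_uniform_(layer.weight)  # assumed "Xavier" variant
        nn.init.zeros_(layer.bias)

optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.0)
criterion = nn.CrossEntropyLoss()  # cross entropy over softmax outputs
# Training loop (not shown): minibatch size 100 for 1e5 steps (about 170 epochs).
```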
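
The Pseudocode row notes that Section 3 describes the winsorized gradient computation only in prose and formulas. The sketch below is one reading of that description, not the paper's own code: per-example gradients are winsorized coordinate-wise, with the c most extreme components on each side clipped to the nearest retained value before averaging. The parameter name c is assumed to match the winsorization levels c ∈ {0, 1, 2, 4, 8} quoted in the Experiment Setup row.

```python
import torch

def winsorized_mean_gradient(per_example_grads: torch.Tensor, c: int) -> torch.Tensor:
    """Coordinate-wise winsorized mean of per-example gradients.

    per_example_grads: tensor of shape (batch_size, num_params),
        one flattened gradient per training example.
    c: number of extreme values clipped on each side of every coordinate
       (c = 0 recovers the ordinary minibatch mean).
    """
    batch_size = per_example_grads.shape[0]
    assert 0 <= 2 * c < batch_size, "c must clip fewer than half the batch"
    if c == 0:
        return per_example_grads.mean(dim=0)
    # Sort each coordinate's values across the batch, take the (c+1)-th smallest
    # and (c+1)-th largest as clipping bounds, then average the clipped values.
    sorted_grads, _ = torch.sort(per_example_grads, dim=0)
    lower = sorted_grads[c]        # lower winsorization bound per coordinate
    upper = sorted_grads[-c - 1]   # upper winsorization bound per coordinate
    clipped = torch.min(torch.max(per_example_grads, lower), upper)
    return clipped.mean(dim=0)
```

The winsorized mean would replace the ordinary minibatch-averaged gradient in the SGD update; computing per-example gradients is left out here, since the paper does not say how that was implemented.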