Coherent Gradients: An Approach to Understanding Generalization in Gradient Descent-based Optimization

Authors: Satrajit Chatterjee

ICLR 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We support this hypothesis with heuristic arguments and perturbative experiments and outline how this can explain several common empirical observations about Deep Learning. Furthermore, our analysis is not just descriptive, but prescriptive. It suggests a natural modification to gradient descent that can greatly reduce overfitting.
Researcher Affiliation | Industry | Satrajit Chatterjee, Google AI, Mountain View, CA 94043, USA, schatter@google.com
Pseudocode | No | The paper describes methods and computations in narrative text and mathematical formulas (e.g., for the winsorized gradient computation in Section 3) but does not provide structured pseudocode or algorithm blocks. A hedged sketch of that computation follows the table.
Open Source Code | No | The paper does not contain any statements about releasing code for the described methodology or provide a link to an open-source repository.
Open Datasets | Yes | For our baseline, we use the standard MNIST dataset of 60,000 training examples and 10,000 test examples.
Dataset Splits | No | The paper mentions '60,000 training examples and 10,000 test examples' for MNIST. While Figure 1(b) is labeled 'Validation accuracy', the paper does not explicitly describe a separate validation split (e.g., as a percentage or number of samples) distinct from the training and test sets.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for the experiments. It mentions only a 'smaller carbon footprint', implying less demanding hardware, but gives no specifics.
Software Dependencies | No | The paper describes the learning algorithm (vanilla SGD, cross entropy loss) and network components (ReLUs, softmax) but does not specify software dependencies with version numbers, such as programming languages, deep learning frameworks (e.g., TensorFlow, PyTorch), or specific library versions used for implementation.
Experiment Setup | Yes | The network has one hidden layer with 2048 ReLUs and an output layer with a 10-way softmax. We initialize it with Xavier and train using vanilla SGD (i.e., no momentum) using cross entropy loss with a constant learning rate of 0.1 and a minibatch size of 100 for 10^5 steps (i.e., about 170 epochs). In the case of Winsorized SGD, we use a smaller network with 3 hidden layers of 256 ReLUs each, and train for 60,000 steps (i.e., 100 epochs) with a fixed learning rate of 0.1. We train on the baseline dataset and the 4 noisy variants with c ∈ {0, 1, 2, 4, 8}. A hedged sketch of this baseline setup follows the table.
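
The Pseudocode row notes that the winsorized gradient computation of Section 3 is described only in prose. Below is a minimal NumPy sketch of coordinate-wise winsorization of per-example gradients, assuming the per-example gradients for a minibatch are already available as an array; the function name, array shapes, and exact clipping indices are illustrative assumptions, not the paper's code.

```python
import numpy as np

def winsorized_mean_gradient(per_example_grads: np.ndarray, c: int) -> np.ndarray:
    """Average per-example gradients after winsorizing each coordinate.

    per_example_grads: array of shape (batch_size, num_params), one row per
        example's gradient on the current minibatch.
    c: winsorization level; c = 0 reduces to the ordinary minibatch mean.
    """
    if c == 0:
        return per_example_grads.mean(axis=0)
    # Sort each coordinate's values across the batch (ascending, per column).
    sorted_grads = np.sort(per_example_grads, axis=0)
    lower = sorted_grads[c]         # (c+1)-th smallest value per coordinate
    upper = sorted_grads[-(c + 1)]  # (c+1)-th largest value per coordinate
    # Clip the c most extreme values on each side to these bounds, then average.
    return np.clip(per_example_grads, lower, upper).mean(axis=0)
```

With c = 0 this reduces to the ordinary minibatch average (plain SGD); larger c limits the influence that a few outlier examples can have on any single coordinate of the update, which is the intent of the paper's winsorized SGD.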
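
The Experiment Setup row quotes the baseline configuration (one hidden layer of 2048 ReLUs, 10-way softmax, Xavier initialization, vanilla SGD at a constant learning rate of 0.1, minibatch size 100, cross-entropy loss, 10^5 steps). The sketch below mirrors that configuration; since the paper names no framework, the use of PyTorch and torchvision here is an assumption.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

def xavier_init(module: nn.Module) -> None:
    # Xavier (Glorot) initialization for the linear layers, as quoted above.
    if isinstance(module, nn.Linear):
        nn.init.xavier_uniform_(module.weight)
        nn.init.zeros_(module.bias)

model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 2048),  # single hidden layer of 2048 ReLUs
    nn.ReLU(),
    nn.Linear(2048, 10),       # 10-way output; softmax is folded into the loss
)
model.apply(xavier_init)

train_set = datasets.MNIST("data", train=True, download=True,
                           transform=transforms.ToTensor())
loader = DataLoader(train_set, batch_size=100, shuffle=True)

optimizer = torch.optim.SGD(model.parameters(), lr=0.1)  # vanilla SGD, no momentum
loss_fn = nn.CrossEntropyLoss()  # cross entropy over softmax logits

steps, max_steps = 0, 100_000  # ~170 epochs at 600 minibatches per epoch
while steps < max_steps:
    for images, labels in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(images), labels)
        loss.backward()
        optimizer.step()
        steps += 1
        if steps >= max_steps:
            break
```

The noisy-label variants and the smaller three-hidden-layer network of 256 ReLUs used for the Winsorized SGD experiments would be straightforward modifications of this same skeleton.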