Gradient Descent: The Ultimate Optimizer

Authors: Kartik Chandra, Audrey Xie, Jonathan Ragan-Kelley, Erik Meijer

NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We present experiments validating this for MLPs, CNNs, and RNNs.
Researcher Affiliation | Collaboration | Kartik Chandra (MIT CSAIL, Cambridge, MA; kach@csail.mit.edu); Audrey Xie (MIT CSAIL, Cambridge, MA; ahx@csail.mit.edu); Jonathan Ragan-Kelley (MIT CSAIL, Cambridge, MA; jrk@csail.mit.edu); Erik Meijer (Meta, Inc., Menlo Park, CA; erikm@fb.com). Equal contribution. Work done in part at Meta, Inc. and in part at Stanford University.
Pseudocode | Yes | Below is pseudocode for an SGD optimizer that uses .detach() as we have discussed. The highlighted calls to .detach() correspond to detaching the weights and their gradients: def SGD.__init__(self, alpha): self.alpha = alpha; def SGD.step(w): d_w = w.grad.detach(); w = w.detach() - self.alpha.detach() * d_w. (A runnable sketch of this update appears after the table.)
Open Source Code | Yes | Finally, we provide a simple PyTorch implementation of this algorithm (see people.csail.mit.edu/kach/gradient-descent-the-ultimate-optimizer).
Open Datasets | Yes | We conducted initial experiments on the MNIST dataset (LeCun et al., 1998)... We train a ResNet-20 (He et al., 2016) with and without hyperoptimization on the CIFAR-10 dataset (Krizhevsky, 2012)... We train a character-level RNN ('Char-RNN') on the Tolstoy dataset, as proposed by Karpathy et al. (2015)...
Dataset Splits | No | The paper uses the well-known MNIST, CIFAR-10, and Tolstoy datasets, but it does not explicitly specify training, validation, and test splits (e.g., percentages, sample counts, or citations to standard split methodologies).
Hardware Specification | Yes | Each of these experiments was conducted on a single NVIDIA TITAN Xp GPU.
Software Dependencies | No | The paper mentions 'PyTorch' and references a specific commit hash for an SGD optimizer file ('optim/sgd.py, commit ff94c9d'), but it does not provide explicit version numbers for PyTorch or any other software libraries or dependencies used in the experiments.
Experiment Setup | Yes | We conducted initial experiments on the MNIST dataset... using a neural network with one fully-connected hidden layer of size 128, tanh activations, and a batch size of 256. We trained all networks for 30 epochs... For ResNet-20 on CIFAR-10: optimizer (SGD), step size (0.1), momentum (0.9), and weight decay (10^-4)... Experiments were run for 200 epochs... For Char-RNN on Tolstoy: 2-layer LSTM with 128 hidden nodes... Adam optimizer with α = 2 × 10^-3, run for 50,000 gradient descent steps.
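
Runnable sketch of the detach-based SGD step quoted in the Pseudocode row above. This is a minimal illustration only; the class layout, tensor handling, and the toy usage at the end are assumptions for demonstration, not the authors' released implementation.

import torch

class SGD:
    def __init__(self, alpha):
        # Store the step size as a tensor; in the paper's hyperoptimization
        # setting this hyperparameter can itself receive gradients.
        self.alpha = torch.tensor(alpha)

    def step(self, w):
        # Detaching the gradient, the weights, and the step size cuts the
        # computation graph here, so the update behaves like ordinary SGD
        # and no hypergradients flow through it.
        d_w = w.grad.detach()
        return w.detach() - self.alpha.detach() * d_w

# Toy usage: one update of a scalar parameter toward the minimum of (w - 3)^2.
w = torch.tensor(1.0, requires_grad=True)
loss = (w - 3.0) ** 2
loss.backward()
w = SGD(alpha=0.1).step(w)
print(w)  # tensor(1.4000)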
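
For the MNIST configuration described in the Experiment Setup row (one fully-connected hidden layer of size 128, tanh activations, batch size 256), the following is a hedged sketch of such a network; the module layout and variable names are assumptions, not the authors' code.

import torch
import torch.nn as nn

# MNIST classifier matching the stated architecture: one hidden layer of
# size 128 with tanh activations (input 28x28 = 784 pixels, 10 classes).
mlp = nn.Sequential(
    nn.Flatten(),
    nn.Linear(784, 128),
    nn.Tanh(),
    nn.Linear(128, 10),
)

# Batch size 256, as stated in the setup (random data stands in for MNIST).
batch = torch.randn(256, 1, 28, 28)
logits = mlp(batch)
print(logits.shape)  # torch.Size([256, 10])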