Gradient Descent: The Ultimate Optimizer
Authors: Kartik Chandra, Audrey Xie, Jonathan Ragan-Kelley, Erik Meijer
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present experiments validating this for MLPs, CNNs, and RNNs. |
| Researcher Affiliation | Collaboration | Kartik Chandra MIT CSAIL Cambridge, MA kach@csail.mit.edu Audrey Xie MIT CSAIL Cambridge, MA ahx@csail.mit.edu Jonathan Ragan-Kelley MIT CSAIL Cambridge, MA jrk@csail.mit.edu Erik Meijer Meta, Inc. Menlo Park, CA erikm@fb.com Equal contribution. Work done in part at Meta, Inc. and in part at Stanford University. |
| Pseudocode | Yes | Below is pseudocode for an SGD optimizer that uses .detach() as we have discussed. The highlighted calls to .detach() correspond to detaching the weights and their gradients. `def SGD.__init__(self, alpha): self.alpha = alpha` and `def SGD.step(w): d_w = w.grad.detach(); w = w.detach() - self.alpha.detach() * d_w` (a runnable sketch of this update appears below the table). |
| Open Source Code | Yes | Finally, we provide a simple PyTorch implementation of this algorithm (see people.csail.mit.edu/kach/gradient-descent-the-ultimate-optimizer). |
| Open Datasets | Yes | We conducted initial experiments on the MNIST dataset (LeCun et al., 1998)... We train a ResNet-20 (He et al., 2016) with and without hyperoptimization on the CIFAR-10 dataset (Krizhevsky, 2012)... We train a character-level RNN ('Char-RNN') on the Tolstoy dataset, as proposed by Karpathy et al. (2015)... |
| Dataset Splits | No | The paper mentions using well-known datasets like MNIST, CIFAR-10, and Tolstoy, but it does not explicitly provide details about specific training, validation, and test splits (e.g., percentages, sample counts, or citations to standard split methodologies) within the text. |
| Hardware Specification | Yes | Each of these experiments was conducted on a single NVIDIA TITAN Xp GPU. |
| Software Dependencies | No | The paper mentions 'PyTorch' and references a specific commit hash for an SGD optimizer file ('optim/sgd.py, commit ff94c9d'), but it does not provide explicit version numbers for PyTorch or any other software libraries or dependencies used in the experiments. |
| Experiment Setup | Yes | We conducted initial experiments on the MNIST dataset... using a neural network with one fully-connected hidden layer of size 128, tanh activations, and a batch size of 256. We trained all networks for 30 epochs... For ResNet-20 on CIFAR-10: optimizer (SGD), step size (0.1), momentum (0.9), and weight decay (10⁻⁴)... Experiments were run for 200 epochs... For Char-RNN on Tolstoy: 2-layer LSTM with 128 hidden nodes... Adam optimizer with α = 2 × 10⁻³, run for 50,000 gradient descent steps (a sketch of the MNIST MLP appears below the table). |
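
The pseudocode quoted in the table is condensed from the paper. Below is a minimal runnable sketch of the same .detach()-based update in plain PyTorch; the class structure, the return-style update, and the scalar usage example are our own illustrative choices, not the paper's released implementation.

```python
import torch

class SGD:
    """Minimal sketch of the detaching SGD step quoted in the table."""

    def __init__(self, alpha):
        # alpha is kept as a tensor; in the paper's hyperoptimized
        # variant it can itself require gradients so that a higher-level
        # optimizer can adjust it.
        self.alpha = torch.as_tensor(alpha)

    def step(self, w):
        # Detach the gradient, the old weight, and (here) alpha so that
        # future losses do not backpropagate through earlier steps.
        # As the paper describes, leaving alpha attached instead is what
        # lets a hypergradient flow to the learning rate.
        d_w = w.grad.detach()
        return w.detach() - self.alpha.detach() * d_w


# Illustrative usage on a single scalar weight (not from the paper):
w = torch.tensor(1.0, requires_grad=True)
loss = (w - 3.0) ** 2
loss.backward()
opt = SGD(alpha=0.1)
w_new = opt.step(w)  # tensor(1.4000)
```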
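For the MNIST configuration quoted in the Experiment Setup row, the following is a hedged sketch of the stated architecture (one fully-connected hidden layer of size 128, tanh activations, batch size 256). The output layer size of 10, the use of torchvision for data loading, and the plain ToTensor transform are assumptions rather than details from the paper.

```python
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# MLP described in the Experiment Setup row: one fully-connected
# hidden layer of size 128 with tanh activations; the 10-way output
# layer for the MNIST classes is an assumption.
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 128),
    nn.Tanh(),
    nn.Linear(128, 10),
)

# Batch size 256 as stated in the table (training ran for 30 epochs);
# the data path and transform are illustrative assumptions.
train_loader = DataLoader(
    datasets.MNIST("data", train=True, download=True,
                   transform=transforms.ToTensor()),
    batch_size=256, shuffle=True,
)
```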