Improving Convergence and Generalization Using Parameter Symmetries

Authors: Bo Zhao, Robert M. Gower, Robin Walters, Rose Yu

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically verify that certain sharpness metrics are correlated with generalization (Keskar et al., 2017), although teleporting towards flatter regions has negligible effects on the validation loss. Additionally, we hypothesize that generalization also depends on the curvature of minima. For fully connected networks, we derive an explicit expression for estimating curvatures and show that teleporting towards larger curvatures improves the model's generalizability. Experimentally, teleportation improves the convergence speed for these algorithms. (A minimal sketch of one such sharpness proxy appears after this table.)
Researcher Affiliation | Academia | Bo Zhao (University of California San Diego, bozhao@ucsd.edu); Robert M. Gower (Flatiron Institute, rgower@flatironinstitute.org); Robin Walters (Northeastern University, r.walters@northeastern.edu); Rose Yu (University of California San Diego, roseyu@ucsd.edu)
Pseudocode | Yes | Algorithm 1: Learning to teleport. (An illustrative single-step teleportation sketch appears after this table.)
Open Source Code | Yes | The code used for our experiments is available at: https://github.com/Rose-STL-Lab/Teleportation-Optimization.
Open Datasets | Yes | We verify the correlation between sharpness, curvatures, and validation loss on MNIST (Deng, 2012), Fashion-MNIST (Xiao et al., 2017), and CIFAR-10 (Krizhevsky et al., 2009).
Dataset Splits | No | For all three datasets (MNIST, Fashion-MNIST, and CIFAR-10), we train on 50,000 samples and test on a different set of 10,000 samples. While the paper frequently refers to "validation loss", it does not explicitly state the size or method of creating a validation split.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory) used for running the experiments.
Software Dependencies | No | The paper does not list specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup | Yes | For MNIST and Fashion-MNIST, d = 28^2, h1 = 16, and h2 = 10. For CIFAR-10, d = 32^2 * 3, h1 = 128, and h2 = 32. The learning rate for stochastic gradient descent is 0.01 for MNIST and Fashion-MNIST, and 0.02 for CIFAR-10. We train each model using mini-batches of size 20 for 40 epochs. The learning rates are 10^-4 for AdaGrad, and 5 * 10^-2 for SGD with momentum, RMSProp, and Adam. The learning rate for optimizing the group element in teleportation is 5 * 10^-2, and we perform 10 gradient ascent steps when teleporting using each mini-batch. (A sketch of this configuration appears after this table.)
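
The "Research Type" row cites a correlation between sharpness metrics and generalization. For reference, below is a minimal sketch of one common sharpness proxy, the average loss increase under random parameter perturbations of a fixed radius (in the spirit of Keskar et al., 2017). The function name, radius, and sample count are illustrative choices; this is not necessarily the exact sharpness metric or the curvature expression derived in the paper.

```python
import torch

def sharpness_proxy(model, loss_fn, x, y, radius=0.1, num_samples=20):
    """Average increase in loss under random weight perturbations of norm `radius`.

    A generic sharpness proxy for illustration only; the paper's sharpness and
    curvature quantities may differ.
    """
    with torch.no_grad():
        base = loss_fn(model(x), y).item()
        params = list(model.parameters())
        deltas = []
        for _ in range(num_samples):
            noise = [torch.randn_like(p) for p in params]
            # Rescale the joint perturbation to have Euclidean norm `radius`.
            scale = radius / torch.sqrt(sum((n ** 2).sum() for n in noise))
            for p, n in zip(params, noise):
                p.add_(scale * n)
            deltas.append(loss_fn(model(x), y).item() - base)
            for p, n in zip(params, noise):
                p.sub_(scale * n)  # undo the perturbation
        return sum(deltas) / num_samples
```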
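The "Pseudocode" row points to Algorithm 1 (Learning to teleport). The sketch below shows only a basic teleportation step on a toy two-layer linear model, where the transformation (W1, W2) -> (g W1, W2 g^-1) leaves the loss unchanged: the group element is found by gradient ascent on the squared gradient norm, using the reported group-element learning rate of 5e-2 and 10 ascent steps. The toy dimensions, data, and the parameterization g = I + T are illustrative assumptions; this does not reproduce the full Algorithm 1, which also learns how to teleport.

```python
import torch

# Toy two-layer linear network y = W2 @ W1 @ x. The loss is invariant under
# (W1, W2) -> (g @ W1, W2 @ g^{-1}) for invertible g, so teleportation can
# change the gradient without changing the loss value.
torch.manual_seed(0)
d, h, k, n = 8, 16, 4, 64
X, Y = torch.randn(n, d), torch.randn(n, k)
W1 = torch.randn(h, d) / d ** 0.5
W2 = torch.randn(k, h) / h ** 0.5

def loss_fn(A, B):
    return ((X @ A.T @ B.T - Y) ** 2).mean()

def grad_norm_sq(A, B):
    # Differentiable squared gradient norm of the loss w.r.t. the weights.
    gA, gB = torch.autograd.grad(loss_fn(A, B), (A, B), create_graph=True)
    return (gA ** 2).sum() + (gB ** 2).sum()

# Gradient ascent on the group element (lr 5e-2, 10 steps, as reported).
T = torch.zeros(h, h, requires_grad=True)
for _ in range(10):
    g = torch.eye(h) + T
    obj = grad_norm_sq(g @ W1, W2 @ torch.linalg.inv(g))
    obj.backward()
    with torch.no_grad():
        T += 5e-2 * T.grad
        T.grad.zero_()

with torch.no_grad():
    g = torch.eye(h) + T
    W1_new, W2_new = g @ W1, W2 @ torch.linalg.inv(g)
    # The loss is unchanged (up to numerical error) while the gradient has grown.
    print(loss_fn(W1, W2).item(), loss_fn(W1_new, W2_new).item())
```

In the reported experiments, teleportation is applied per mini-batch and interleaved with ordinary optimizer updates (SGD with momentum, AdaGrad, RMSProp, Adam).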
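The "Experiment Setup" row lists layer widths and hyperparameters; the sketch below assembles the reported MNIST configuration (d = 28^2, h1 = 16, h2 = 10, SGD with learning rate 0.01, batch size 20, 40 epochs) into a training skeleton. The three-layer layout, the LeakyReLU activations, and the torchvision data pipeline are assumptions not stated in the row.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Reported MNIST sizes and hyperparameters; architecture details beyond the
# widths (activation, number of layers) are assumptions for illustration.
d, h1, h2, num_classes = 28 * 28, 16, 10, 10

model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(d, h1), nn.LeakyReLU(),
    nn.Linear(h1, h2), nn.LeakyReLU(),
    nn.Linear(h2, num_classes),
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # 0.02 for CIFAR-10
criterion = nn.CrossEntropyLoss()

# The paper reports training on 50,000 samples; subsetting is omitted here.
train_set = datasets.MNIST("data", train=True, download=True,
                           transform=transforms.ToTensor())
loader = DataLoader(train_set, batch_size=20, shuffle=True)

for epoch in range(40):
    for x, y in loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
```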