Improving Convergence and Generalization Using Parameter Symmetries

Authors: Bo Zhao, Robert M. Gower, Robin Walters, Rose Yu

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically verify that certain sharpness metrics are correlated with generalization (Keskar et al., 2017), although teleporting towards flatter regions has negligible effects on the validation loss. Additionally, we hypothesize that generalization also depends on the curvature of minima. For fully connected networks, we derive an explicit expression for estimating curvatures and show that teleporting towards larger curvatures improves the model's generalizability. Experimentally, teleportation improves the convergence speed for these algorithms. (A minimal sketch of one such sharpness proxy appears after this table.)
Researcher Affiliation | Academia | Bo Zhao (University of California San Diego, bozhao@ucsd.edu); Robert M. Gower (Flatiron Institute, rgower@flatironinstitute.org); Robin Walters (Northeastern University, r.walters@northeastern.edu); Rose Yu (University of California San Diego, roseyu@ucsd.edu)
Pseudocode | Yes | Algorithm 1: Learning to teleport. (An illustrative single-step teleportation sketch appears after this table.)
Open Source Code | Yes | The code used for our experiments is available at: https://github.com/Rose-STL-Lab/Teleportation-Optimization.
Open Datasets | Yes | We verify the correlation between sharpness, curvatures, and validation loss on MNIST (Deng, 2012), Fashion-MNIST (Xiao et al., 2017), and CIFAR-10 (Krizhevsky et al., 2009).
Dataset Splits | No | For all three datasets (MNIST, Fashion-MNIST, and CIFAR-10), we train on 50,000 samples and test on a different set of 10,000 samples. While the paper frequently refers to "validation loss", it does not explicitly state the size or method of creating a validation split.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory) used for running the experiments.
Software Dependencies | No | The paper does not list specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup | Yes | For MNIST and Fashion-MNIST, d = 28^2, h1 = 16, and h2 = 10. For CIFAR-10, d = 32^2 * 3, h1 = 128, and h2 = 32. The learning rate for stochastic gradient descent is 0.01 for MNIST and Fashion-MNIST, and 0.02 for CIFAR-10. We train each model using mini-batches of size 20 for 40 epochs. The learning rates are 10^-4 for AdaGrad, and 5 * 10^-2 for SGD with momentum, RMSProp, and Adam. The learning rate for optimizing the group element in teleportation is 5 * 10^-2, and we perform 10 gradient ascent steps when teleporting using each mini-batch. (A sketch of this configuration appears after this table.)
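
The "Research Type" row cites a correlation between sharpness metrics and generalization. For reference, below is a minimal sketch of one common sharpness proxy, the average loss increase under random parameter perturbations of a fixed radius (in the spirit of Keskar et al., 2017). The function name, radius, and sample count are illustrative choices; this is not necessarily the exact sharpness metric or the curvature expression derived in the paper.

```python
import torch

def sharpness_proxy(model, loss_fn, x, y, radius=0.1, num_samples=20):
    """Average increase in loss under random weight perturbations of norm `radius`.

    A generic sharpness proxy for illustration only; the paper's sharpness and
    curvature quantities may differ.
    """
    with torch.no_grad():
        base = loss_fn(model(x), y).item()
        params = list(model.parameters())
        deltas = []
        for _ in range(num_samples):
            noise = [torch.randn_like(p) for p in params]
            # Rescale the joint perturbation to have Euclidean norm `radius`.
            scale = radius / torch.sqrt(sum((n ** 2).sum() for n in noise))
            for p, n in zip(params, noise):
                p.add_(scale * n)
            deltas.append(loss_fn(model(x), y).item() - base)
            for p, n in zip(params, noise):
                p.sub_(scale * n)  # undo the perturbation
        return sum(deltas) / num_samples
```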
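The "Pseudocode" row points to Algorithm 1 (Learning to teleport). The sketch below shows only a basic teleportation step on a toy two-layer linear model, where the transformation (W1, W2) -> (g W1, W2 g^-1) leaves the loss unchanged: the group element is found by gradient ascent on the squared gradient norm, using the reported group-element learning rate of 5e-2 and 10 ascent steps. The toy dimensions, data, and the parameterization g = I + T are illustrative assumptions; this does not reproduce the full Algorithm 1, which also learns how to teleport.

```python
import torch

# Toy two-layer linear network y = W2 @ W1 @ x. The loss is invariant under
# (W1, W2) -> (g @ W1, W2 @ g^{-1}) for invertible g, so teleportation can
# change the gradient without changing the loss value.
torch.manual_seed(0)
d, h, k, n = 8, 16, 4, 64
X, Y = torch.randn(n, d), torch.randn(n, k)
W1 = torch.randn(h, d) / d ** 0.5
W2 = torch.randn(k, h) / h ** 0.5

def loss_fn(A, B):
    return ((X @ A.T @ B.T - Y) ** 2).mean()

def grad_norm_sq(A, B):
    # Differentiable squared gradient norm of the loss w.r.t. the weights.
    gA, gB = torch.autograd.grad(loss_fn(A, B), (A, B), create_graph=True)
    return (gA ** 2).sum() + (gB ** 2).sum()

# Gradient ascent on the group element (lr 5e-2, 10 steps, as reported).
T = torch.zeros(h, h, requires_grad=True)
for _ in range(10):
    g = torch.eye(h) + T
    obj = grad_norm_sq(g @ W1, W2 @ torch.linalg.inv(g))
    obj.backward()
    with torch.no_grad():
        T += 5e-2 * T.grad
        T.grad.zero_()

with torch.no_grad():
    g = torch.eye(h) + T
    W1_new, W2_new = g @ W1, W2 @ torch.linalg.inv(g)
    # The loss is unchanged (up to numerical error) while the gradient has grown.
    print(loss_fn(W1, W2).item(), loss_fn(W1_new, W2_new).item())
```

In the reported experiments, teleportation is applied per mini-batch and interleaved with ordinary optimizer updates (SGD with momentum, AdaGrad, RMSProp, Adam).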
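The "Experiment Setup" row lists layer widths and hyperparameters; the sketch below assembles the reported MNIST configuration (d = 28^2, h1 = 16, h2 = 10, SGD with learning rate 0.01, batch size 20, 40 epochs) into a training skeleton. The three-layer layout, the LeakyReLU activations, and the torchvision data pipeline are assumptions not stated in the row.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Reported MNIST sizes and hyperparameters; architecture details beyond the
# widths (activation, number of layers) are assumptions for illustration.
d, h1, h2, num_classes = 28 * 28, 16, 10, 10

model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(d, h1), nn.LeakyReLU(),
    nn.Linear(h1, h2), nn.LeakyReLU(),
    nn.Linear(h2, num_classes),
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # 0.02 for CIFAR-10
criterion = nn.CrossEntropyLoss()

# The paper reports training on 50,000 samples; subsetting is omitted here.
train_set = datasets.MNIST("data", train=True, download=True,
                           transform=transforms.ToTensor())
loader = DataLoader(train_set, batch_size=20, shuffle=True)

for epoch in range(40):
    for x, y in loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
```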