Improving Convergence and Generalization Using Parameter Symmetries
Authors: Bo Zhao, Robert M. Gower, Robin Walters, Rose Yu
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically verify that certain sharpness metrics are correlated with generalization (Keskar et al., 2017), although teleporting towards flatter regions has negligible effects on the validation loss. Additionally, we hypothesize that generalization also depends on the curvature of minima. For fully connected networks, we derive an explicit expression for estimating curvatures and show that teleporting towards larger curvatures improves the model's generalizability. Experimentally, teleportation improves the convergence speed for these algorithms. |
| Researcher Affiliation | Academia | Bo Zhao, University of California San Diego (bozhao@ucsd.edu); Robert M. Gower, Flatiron Institute (rgower@flatironinstitute.org); Robin Walters, Northeastern University (r.walters@northeastern.edu); Rose Yu, University of California San Diego (roseyu@ucsd.edu) |
| Pseudocode | Yes | Algorithm 1 Learning to teleport |
| Open Source Code | Yes | The code used for our experiments is available at: https://github.com/Rose-STL-Lab/Teleportation-Optimization. |
| Open Datasets | Yes | We verify the correlation between sharpness, curvatures, and validation loss on MNIST (Deng, 2012), Fashion-MNIST (Xiao et al., 2017), and CIFAR-10 (Krizhevsky et al., 2009). |
| Dataset Splits | No | For all three datasets (MNIST, Fashion-MNIST, and CIFAR-10), we train on 50,000 samples and test on a different set of 10,000 samples. While the paper frequently refers to 'validation loss', it does not explicitly state the size or method of creating a validation split. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper does not list specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | For MNIST and Fashion-MNIST, d = 28^2, h1 = 16, and h2 = 10. For CIFAR-10, d = 32^2 * 3, h1 = 128, and h2 = 32. The learning rate for stochastic gradient descent is 0.01 for MNIST and Fashion-MNIST, and 0.02 for CIFAR-10. We train each model using mini-batches of size 20 for 40 epochs. The learning rates are 10^-4 for AdaGrad, and 5 * 10^-2 for SGD with momentum, RMSProp, and Adam. The learning rate for optimizing the group element in teleportation is 5 * 10^-2, and we perform 10 gradient ascent steps when teleporting using each mini-batch. |
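
The setup quoted in the last row is concrete enough to sketch in code. Below is a minimal sketch, assuming PyTorch (the released repository is Python-based), a three-layer fully connected network, LeakyReLU activations, and cross-entropy loss; none of these choices are fixed by the excerpt. The `teleport` helper is a hypothetical illustration of the quoted teleportation schedule (10 gradient ascent steps on the group element at learning rate 5 * 10^-2 per mini-batch), using a simple two-linear-layer symmetry rather than the paper's full group action for nonlinear networks.

```python
# Minimal sketch of the quoted experiment setup.
# Assumptions (not fixed by the excerpt): PyTorch, a three-layer MLP with
# LeakyReLU activations, cross-entropy loss, and a 0.9 momentum coefficient.
import torch
import torch.nn as nn


def make_mlp(d, h1, h2, num_classes=10):
    # Fully connected network with hidden widths h1 and h2 (assumed shape).
    return nn.Sequential(
        nn.Linear(d, h1), nn.LeakyReLU(),
        nn.Linear(h1, h2), nn.LeakyReLU(),
        nn.Linear(h2, num_classes),
    )


# Input dimensions and SGD learning rates quoted in the setup row.
configs = {
    "MNIST":         dict(d=28 * 28,     h1=16,  h2=10, lr=0.01),
    "Fashion-MNIST": dict(d=28 * 28,     h1=16,  h2=10, lr=0.01),
    "CIFAR-10":      dict(d=32 * 32 * 3, h1=128, h2=32, lr=0.02),
}
cfg = configs["MNIST"]
model = make_mlp(cfg["d"], cfg["h1"], cfg["h2"])
optimizer = torch.optim.SGD(model.parameters(), lr=cfg["lr"])
criterion = nn.CrossEntropyLoss()
batch_size, num_epochs = 20, 40  # "mini-batches of size 20 for 40 epochs"

# Learning rates quoted for the other optimizers in the convergence experiments.
alt_optimizers = {
    "AdaGrad": torch.optim.Adagrad(model.parameters(), lr=1e-4),
    "SGD+momentum": torch.optim.SGD(model.parameters(), lr=5e-2, momentum=0.9),
    "RMSProp": torch.optim.RMSprop(model.parameters(), lr=5e-2),
    "Adam": torch.optim.Adam(model.parameters(), lr=5e-2),
}


def teleport(W1, b1, W2, x, y, steps=10, lr=5e-2):
    """Hypothetical teleportation step: gradient ascent on a group element g that
    acts on two consecutive layers as W1 -> g W1, b1 -> g b1, W2 -> W2 g^{-1}.
    This action is loss-preserving only when no nonlinearity separates the two
    layers; the paper's Algorithm 1 uses the symmetry group of the full nonlinear
    network.  The step count (10) and learning rate (5e-2) follow the quoted setup."""
    g = torch.eye(W1.shape[0], requires_grad=True)
    for _ in range(steps):
        W1_g, b1_g, W2_g = g @ W1, g @ b1, W2 @ torch.inverse(g)
        loss = nn.functional.cross_entropy((x @ W1_g.T + b1_g) @ W2_g.T, y)
        grads = torch.autograd.grad(loss, (W1_g, b1_g, W2_g), create_graph=True)
        grad_norm_sq = sum((gr ** 2).sum() for gr in grads)  # ascent objective
        (dg,) = torch.autograd.grad(grad_norm_sq, g)
        with torch.no_grad():
            g += lr * dg  # move toward a point with larger gradient norm
    with torch.no_grad():
        return g @ W1, g @ b1, W2 @ torch.inverse(g)
```

In the released code, the group element and its action come from the symmetry group of the full nonlinear network, and teleportation is interleaved with the optimizer updates above; the `teleport` sketch here only illustrates the per-mini-batch schedule quoted in the table.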