Identifying Equivalent Training Dynamics

Authors: William Redman, Juan Bello-Rivas, Maria Fonoberova, Ryan Mohr, Yannis Kevrekidis, Igor Mezic

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "To validate our approach, we demonstrate that comparing Koopman eigenvalues can correctly identify a known equivalence between online mirror descent and online gradient descent. We then utilize our approach to: (a) identify non-conjugate training dynamics between shallow and wide fully connected neural networks; (b) characterize the early phase of training dynamics in convolutional neural networks; (c) uncover non-conjugate training dynamics in Transformers that do and do not undergo grokking." (a minimal Koopman-eigenvalue sketch appears after this table)
Researcher Affiliation | Collaboration | William T. Redman (AIMdyn Inc., UC Santa Barbara), Juan Bello-Rivas (Johns Hopkins University), Maria Fonoberova (AIMdyn Inc.), Ryan Mohr (AIMdyn Inc.), Yannis G. Kevrekidis (Johns Hopkins University), Igor Mezic (AIMdyn Inc., UC Santa Barbara)
Pseudocode | Yes | "Algorithm 1: Online Mirror Descent [26]. Input: x(0) ∈ K, R, η, f. For t = 0, ..., T−1: y(t+1) = (∇R)⁻¹(∇R[x(t)] − η∇f[x(t)]); x(t+1) = Π^R_K[y(t+1)]." (a runnable Python sketch of this update appears after the table)
Open Source Code | Yes | "Code implementing our experiments can be found at https://github.com/william-redman/Identifying_Equivalent_Training_Dynamics."
Open Datasets | Yes | "FCNs with only a single hidden layer, trained on MNIST (Appendix C.1). ... LeNet [60], a simple CNN trained on MNIST, and ResNet-20 [61], trained on CIFAR-10 (see Appendix D.1 for details). ... Transformers trained on algorithmic data (e.g., modular addition)..."
Dataset Splits | No | The paper mentions training and testing but does not explicitly specify the train/validation/test splits (percentages or sample counts).
Hardware Specification | Yes | "All experiments were run on a MacBook Air with an Apple M1 chip, 1 CPU, and no GPUs."
Software Dependencies | No | The paper mentions PyTorch and links to external codebases (the ShrinkBench framework, Omnigrok) but does not provide version numbers for these or other software dependencies.
Experiment Setup | Yes | "Table S1: Hyper-parameters used for FCN training in Sec. 4.2: learning rate (η) 0.1, batch size (b) 60, optimizer SGD, epochs 1, activation function ReLU. ... Table S2: Hyper-parameters used for CNN training in Sec. 4.3: learning rate (η) 0.0012, batch size (b) 60, optimizer Adam, epochs 20, activation function ReLU." (a PyTorch sketch of the Table S1 configuration follows)
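
The Research Type row summarizes the paper's core procedure: approximate the Koopman operator of each optimizer's training dynamics and compare the resulting eigenvalue spectra. Below is a minimal sketch of that idea, assuming weight trajectories are logged during training and using standard dynamic mode decomposition (DMD); the function names, rank truncation, and matching-based spectral distance are illustrative stand-ins, not the paper's exact implementation.

import numpy as np
from scipy.optimize import linear_sum_assignment

def dmd_eigenvalues(X, rank=10):
    # X: (n_features, n_timesteps) snapshot matrix of flattened weights
    # logged over training steps. The eigenvalues of the best-fit linear
    # map X1 ~= A X0 (exact DMD) serve as Koopman eigenvalue estimates.
    X0, X1 = X[:, :-1], X[:, 1:]
    U, S, Vh = np.linalg.svd(X0, full_matrices=False)
    r = min(rank, int((S > 1e-10).sum()))
    U_r, S_r, V_r = U[:, :r], S[:r], Vh[:r].conj().T
    A_tilde = U_r.conj().T @ X1 @ V_r @ np.diag(1.0 / S_r)
    return np.linalg.eigvals(A_tilde)

def spectral_distance(lam_a, lam_b):
    # Mean distance under a minimal bipartite matching of the two spectra;
    # an assumed stand-in for the paper's eigenvalue-comparison metric.
    cost = np.abs(lam_a[:, None] - lam_b[None, :])
    rows, cols = linear_sum_assignment(cost)
    return cost[rows, cols].mean()

# Usage sketch: traj_ogd and traj_omd would be (n_weights, T) arrays of
# parameters saved at each training step for the two optimizers; a small
# spectral_distance suggests (but does not by itself prove) conjugate dynamics.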
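
The Pseudocode row quotes Algorithm 1 (Online Mirror Descent). A minimal runnable version is sketched below, assuming the mirror map ∇R, its inverse, and the Bregman projection onto K are supplied as callables; with R(x) = ½||x||² all three reduce to the identity and the update collapses to online gradient descent, which is the known equivalence the paper uses to validate its approach.

import numpy as np

def online_mirror_descent(x0, grad_f, grad_R, grad_R_inv, project, eta, T):
    # Implements the quoted update:
    #   y(t+1) = (grad R)^{-1}( grad R[x(t)] - eta * grad f[x(t)] )
    #   x(t+1) = Bregman projection of y(t+1) onto K
    x = np.asarray(x0, dtype=float)
    history = [x.copy()]
    for t in range(T):
        y = grad_R_inv(grad_R(x) - eta * grad_f(x))  # step in the dual space
        x = project(y)                               # back onto the feasible set K
        history.append(x.copy())
    return np.array(history)

# With R(x) = 0.5 * ||x||^2 the mirror map is the identity, so OMD becomes
# plain (projected) online gradient descent on f.
identity = lambda v: v
traj = online_mirror_descent(
    x0=np.ones(3), grad_f=lambda x: 2.0 * x,   # f(x) = ||x||^2
    grad_R=identity, grad_R_inv=identity, project=identity,
    eta=0.1, T=50)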
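
The Experiment Setup row quotes the Table S1 hyper-parameters for the FCN experiments (SGD, learning rate 0.1, batch size 60, one epoch, ReLU). The PyTorch sketch below wires those values into a single-hidden-layer MNIST setup; the hidden width of 512 and the data root are assumptions made for illustration (the paper compares several widths), not values taken from the paper.

import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

hidden = 512  # assumed width; the paper sweeps shallow vs. wide FCNs
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, hidden),
    nn.ReLU(),                      # activation from Table S1
    nn.Linear(hidden, 10),
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)  # Table S1: SGD, eta = 0.1
loss_fn = nn.CrossEntropyLoss()

train_loader = DataLoader(
    datasets.MNIST("data", train=True, download=True, transform=transforms.ToTensor()),
    batch_size=60, shuffle=True)    # Table S1: b = 60

for epoch in range(1):              # Table S1: a single epoch
    for xb, yb in train_loader:
        optimizer.zero_grad()
        loss_fn(model(xb), yb).backward()
        optimizer.step()
        # Saving model.parameters() here at each step would yield the weight
        # trajectories needed for the Koopman/DMD comparison sketched above.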