Identifying Equivalent Training Dynamics
Authors: William Redman, Juan Bello-Rivas, Maria Fonoberova, Ryan Mohr, Yannis Kevrekidis, Igor Mezic
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To validate our approach, we demonstrate that comparing Koopman eigenvalues can correctly identify a known equivalence between online mirror descent and online gradient descent. We then utilize our approach to: (a) identify non-conjugate training dynamics between shallow and wide fully connected neural networks; (b) characterize the early phase of training dynamics in convolutional neural networks; (c) uncover non-conjugate training dynamics in Transformers that do and do not undergo grokking. (A minimal Koopman-eigenvalue sketch follows this table.) |
| Researcher Affiliation | Collaboration | William T. Redman (AIMdyn Inc., UC Santa Barbara); Juan Bello-Rivas (Johns Hopkins University); Maria Fonoberova (AIMdyn Inc.); Ryan Mohr (AIMdyn Inc.); Yannis G. Kevrekidis (Johns Hopkins University); Igor Mezic (AIMdyn Inc., UC Santa Barbara) |
| Pseudocode | Yes | Algorithm 1 Online Mirror Descent [26]: Input: x(0) ∈ K, R, η, f. For t = 0, ..., T − 1 do: y(t+1) = (∇R)⁻¹(∇R[x(t)] − η ∇f[x(t)]); x(t+1) = Π^R_K[y(t+1)]. (A runnable sketch follows this table.) |
| Open Source Code | Yes | Code implementing our experiments can be found at https://github.com/william-redman/Identifying_Equivalent_Training_Dynamics. |
| Open Datasets | Yes | FCNs with only a single hidden layer, trained on MNIST (Appendix C.1). ... LeNet [60], a simple CNN trained on MNIST, and ResNet-20 [61], trained on CIFAR-10 (see Appendix D.1 for details). ... Transformers trained on algorithmic data (e.g., modular addition)... |
| Dataset Splits | No | The paper mentions training and testing but does not explicitly provide details about the train/validation/test dataset splits with percentages or sample counts. |
| Hardware Specification | Yes | All experiments were run on a MacBook Air with an Apple M1 chip, 1 CPU, and no GPUs. |
| Software Dependencies | No | The paper mentions the use of PyTorch and links to external codebases (the ShrinkBench framework, Omnigrok) but does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | Table S1: Hyper-parameters used for FCN training in Sec. 4.2. Hyper-parameters: Learning rate (η) 0.1, Batch size (b) 60, Optimizer SGD, Epochs 1, Activation function ReLU. ... Table S2: Hyper-parameters used for CNN training in Sec. 4.3. Hyper-parameters: Learning rate (η) 0.0012, Batch size (b) 60, Optimizer Adam, Epochs 20, Activation function ReLU. (A minimal training-setup sketch follows this table.) |
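The paper's central tool is comparing Koopman eigenvalues of different training trajectories to test whether the dynamics are topologically conjugate. The sketch below estimates such eigenvalues from a single trajectory of flattened parameters using standard dynamic mode decomposition (DMD); the function name, the use of plain DMD rather than another Koopman-spectrum estimator, and the `rank` truncation parameter are assumptions for illustration, not the paper's exact pipeline.

```python
import numpy as np

def dmd_eigenvalues(trajectory, rank=None):
    """Estimate Koopman eigenvalues from a training trajectory via DMD.

    `trajectory` has shape (T, d): one row of flattened parameters per
    optimization step. This is a minimal sketch under the assumptions
    stated above, not the authors' implementation.
    """
    X = trajectory[:-1].T          # snapshots x(0), ..., x(T-2), shape (d, T-1)
    Y = trajectory[1:].T           # time-shifted snapshots x(1), ..., x(T-1)
    U, s, Vh = np.linalg.svd(X, full_matrices=False)
    if rank is not None:           # optional truncation to the leading modes
        U, s, Vh = U[:, :rank], s[:rank], Vh[:rank]
    # Reduced-order linear operator approximating the Koopman operator.
    A_tilde = U.conj().T @ Y @ Vh.conj().T @ np.diag(1.0 / s)
    return np.linalg.eigvals(A_tilde)

# Hypothetical usage: if two optimizers induce conjugate training dynamics,
# their estimated Koopman spectra should (approximately) coincide.
# eigs_ogd = dmd_eigenvalues(weights_ogd, rank=10)
# eigs_omd = dmd_eigenvalues(weights_omd, rank=10)
```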
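The Algorithm 1 row above quotes the online mirror descent update. A minimal runnable sketch is given below, assuming the negative-entropy mirror map R(x) = Σ x_i log x_i on the probability simplex K, for which the mirror step and Bregman projection reduce to the exponentiated-gradient rule; the mirror map, loss, and step count are illustrative choices, not taken from the paper.

```python
import numpy as np

def online_mirror_descent(grad_f, x0, eta, T):
    """Online mirror descent with the negative-entropy mirror map on the simplex.

    Implements y(t+1) = (grad R)^{-1}(grad R[x(t)] - eta * grad f[x(t)]) and
    x(t+1) = Pi^R_K[y(t+1)], which for this mirror map is exponentiated gradient.
    """
    x = np.asarray(x0, dtype=float)
    trajectory = [x.copy()]
    for _ in range(T):
        y = x * np.exp(-eta * grad_f(x))   # unconstrained mirror step
        x = y / y.sum()                    # Bregman projection onto the simplex
        trajectory.append(x.copy())
    return np.array(trajectory)

# Example: minimize f(x) = 0.5 * ||x - target||^2 over the simplex.
target = np.array([0.7, 0.2, 0.1])
traj = online_mirror_descent(lambda x: x - target, x0=np.ones(3) / 3, eta=0.1, T=200)
print(traj[-1])  # approaches the target distribution
```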
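The Experiment Setup row lists the Table S1 FCN hyper-parameters (SGD, learning rate 0.1, batch size 60, 1 epoch, ReLU). The sketch below wires those values into a single-hidden-layer PyTorch FCN on MNIST and records the per-step weight trajectory that a Koopman analysis would consume; the hidden width of 128, the data path, and the trajectory-collection detail are assumptions, not values reported in the paper.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Single-hidden-layer FCN with the Table S1 hyper-parameters.
hidden_width = 128  # illustrative; the paper sweeps widths, this one is assumed
model = nn.Sequential(nn.Flatten(),
                      nn.Linear(28 * 28, hidden_width),
                      nn.ReLU(),
                      nn.Linear(hidden_width, 10))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

train_data = datasets.MNIST("./data", train=True, download=True,
                            transform=transforms.ToTensor())
loader = DataLoader(train_data, batch_size=60, shuffle=True)

weight_trajectory = []  # one flattened parameter snapshot per optimization step
for epoch in range(1):  # Table S1: a single epoch of training
    for images, labels in loader:
        optimizer.zero_grad()
        loss_fn(model(images), labels).backward()
        optimizer.step()
        weight_trajectory.append(torch.cat(
            [p.detach().flatten() for p in model.parameters()]))
```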