Simplifying Momentum-based Positive-definite Submanifold Optimization with Applications to Deep Learning
Authors: Wu Lin, Valentin Duruisseaux, Melvin Leok, Frank Nielsen, Mohammad Emtiyaz Khan, Mark Schmidt
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 5. Numerical Results: To validate our proposed updates, we consider several optimization problems. 5.2. Results in Deep Learning: We consider image classification tasks with NN architectures ranging from classical to modern models: VGG-16... We consider three complex datasets: CIFAR-100, Tiny ImageNet-200, and ImageNet-100. |
| Researcher Affiliation | Collaboration | ¹University of British Columbia, Vancouver, Canada; ²University of California San Diego, San Diego, USA; ³Sony Computer Science Laboratories Inc., Tokyo, Japan; ⁴RIKEN Center for Advanced Intelligence Project, Tokyo, Japan; ⁵CIFAR AI Chair, Alberta Machine Intelligence Institute, Alberta, Canada. |
| Pseudocode | Yes | Figure 2. In our update, we denote H_K := Kᵀ µ_AA K, H_C := Cᵀ µ_GG C, κ² := λ·Tr(KᵀK), and c² := λ·Tr(CᵀC), where vec⁻¹(µ) ∈ ℝ^(d×p), C ∈ ℝ^(d×d), K ∈ ℝ^(p×p). Note that we merge factors 1/(2d) and 1/(2p) in Eq. (23) into the updates in m_K and m_C, respectively (see Eq. (86) in Appx. I for a justification). We use the linear truncation of the matrix exponential function. Our update does not require explicit matrix inverses. We can also pre-compute CCᵀ and KKᵀ when T > 1. In KFAC, a damping term λI is introduced to handle the singularity of (CCᵀ)⁻¹. We introduce a similar damping term in κ² and c² (see Appx. I for a derivation) to improve numerical stability. Our update and KFAC include momentum weight α₂ for layer-wise NN weights µ and (L2) weight decay γ. In our update, we also introduce momentum weight α₁ in the SPD preconditioner. Our update can use a larger stepsize β₂ than KFAC. (A hedged code sketch of this inverse-free factor update appears after the table.) |
| Open Source Code | No | The paper describes its methods and experiments but provides no statement about, or link to, open-source code for its methodology. Footnotes 4 and 5 link to dataset repositories, not to the authors' code. |
| Open Datasets | Yes | We consider three complex datasets: CIFAR-100, Tiny ImageNet-200 (footnote 4), and ImageNet-100 (footnote 5). Footnote 4: github.com/tjmoon0104/pytorch-tiny-imagenet. Footnote 5: kaggle.com/datasets/ambityga/imagenet100. |
| Dataset Splits | Yes | Table 7 (Statistics of the Datasets) lists the number of training and test points: CIFAR-100 (50,000 / 10,000), Tiny ImageNet-200 (100,000 / 5,000), and ImageNet-100 (130,000 / 5,000). |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware used (e.g., GPU/CPU models, memory details) for running its experiments. |
| Software Dependencies | No | The paper mentions usage of PyTorch implicitly via a dataset link (github.com/tjmoon0104/pytorch-tiny-imagenet), but it does not specify version numbers for PyTorch or any other software dependencies crucial for replication. |
| Experiment Setup | Yes | The hyperparameter configuration of our update and KFAC can be found in Table 5 in Appx. A. Table 5: Hyperparameter configuration in our update and KFAC. We first choose the damping weight λ based on the performance of KFAC and use the same value in our update. For both methods, we set λ = 0.01, 0.0005, 0.005 in VGG-16, RepVGG-B1G4, and other models, respectively. To reduce the iteration cost of both methods, we update the preconditioner at every T = 60, 25, 10 iterations for RepVGG-B1G4, RegNetZ-500MF, and other models, respectively. The value of the hyperparameter θ is chosen as suggested at https://github.com/alecwangcq/KFAC-Pytorch. Since we do not use pre-training, we consider the first 500 iterations as a warm-up period to update our preconditioner by using a smaller stepsize β₁: we set β₁ = 0.0002 for the first 100 iterations, increase it to β₁ = 0.002 for the next 400 iterations, and finally fix it to β₁ = 0.01 for the remaining iterations. (A sketch of this β₁ warm-up schedule appears after the table.) |
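
The Pseudocode row above quotes the Figure 2 caption of the paper. Below is a minimal NumPy sketch of the kind of inverse-free, momentum-based factor update that caption describes: only H_K := Kᵀ µ_AA K, the damping term κ² := λ·Tr(KᵀK), the preconditioner momentum weight α₁, and the linear truncation of the matrix exponential are taken from the quoted text. The function name `update_factor_K`, the default hyperparameter values, and the exact descent direction (including its scaling) are illustrative assumptions, not the paper's Eq. (23).

```python
import numpy as np

def update_factor_K(K, m_K, mu_AA, lam=0.01, beta1=0.01, alpha1=0.9):
    # Quoted ingredients: H_K := K^T mu_AA K, kappa^2 := lam * Tr(K^T K),
    # momentum weight alpha1 on the SPD preconditioner, and a linearly
    # truncated matrix exponential so no explicit matrix inverse is needed.
    p = K.shape[0]
    I = np.eye(p)
    H_K = K.T @ mu_AA @ K
    kappa2 = lam * np.trace(K.T @ K)
    # Assumed descent direction; the merged 1/(2d) and 1/(2p) factors of
    # the paper's Eq. (23) are not reproduced here.
    direction = 0.5 * (H_K + (kappa2 / p) * I - I)
    m_K = alpha1 * m_K + direction           # momentum on the factor update
    K_new = K @ (I - beta1 * m_K)            # expm(-b*m) approximated by I - b*m
    return K_new, m_K

# Toy usage: one refresh of the activation-side factor of a single layer.
rng = np.random.default_rng(0)
p = 8
K, m_K = np.eye(p), np.zeros((p, p))
acts = rng.standard_normal((p, 64))
mu_AA = acts @ acts.T / 64.0                 # stand-in second-moment matrix
K, m_K = update_factor_K(K, m_K, mu_AA)
S_K = K @ K.T                                # pre-computable when T > 1
```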
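
The Experiment Setup row describes a staged warm-up of the preconditioner stepsize β₁ and a refresh interval T. The short Python sketch below encodes that schedule, assuming the stage boundaries are exactly the quoted iteration counts; the function name `beta1_schedule` and the surrounding loop skeleton are illustrative, not from the paper.

```python
def beta1_schedule(iteration: int) -> float:
    # Quoted warm-up: 0.0002 for the first 100 iterations,
    # 0.002 for the next 400, then 0.01 for the remaining iterations.
    if iteration < 100:
        return 0.0002
    if iteration < 500:
        return 0.002
    return 0.01

# The preconditioner is refreshed only every T iterations
# (T = 60, 25, or 10 depending on the model).
T = 10
for it in range(600):
    beta1 = beta1_schedule(it)
    refresh_preconditioner = (it % T == 0)
    # ... preconditioned layer-wise weight update would go here ...
```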