Simplifying Momentum-based Positive-definite Submanifold Optimization with Applications to Deep Learning
Authors: Wu Lin, Valentin Duruisseaux, Melvin Leok, Frank Nielsen, Mohammad Emtiyaz Khan, Mark Schmidt
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 5. Numerical Results: To validate our proposed updates, we consider several optimization problems. 5.2. Results in Deep Learning: We consider image classification tasks with NN architectures ranging from classical to modern models: VGG-16... We consider three complex datasets: CIFAR-100, Tiny ImageNet-200, and ImageNet-100. |
| Researcher Affiliation | Collaboration | ¹University of British Columbia, Vancouver, Canada; ²University of California San Diego, San Diego, USA; ³Sony Computer Science Laboratories Inc., Tokyo, Japan; ⁴RIKEN Center for Advanced Intelligence Project, Tokyo, Japan; ⁵CIFAR AI Chair, Alberta Machine Intelligence Institute, Alberta, Canada. |
| Pseudocode | Yes | Figure 2. In our update, we denote H_K := Kᵀ µ_AA K, H_C := Cᵀ µ_GG C, κ² := λ·Tr(KᵀK), and c² := λ·Tr(CᵀC), where vec⁻¹(µ) ∈ ℝ^(d×p), C ∈ ℝ^(d×d), K ∈ ℝ^(p×p). Note that we merge factors 1/(2d) and 1/(2p) in Eq. (23) into the updates in m_K and m_C, respectively (see Eq. (86) in Appx. I for a justification). We use the linear truncation of the matrix exponential function. Our update does not require explicit matrix inverses. We can also pre-compute CCᵀ and KKᵀ when T > 1. In KFAC, a damping term λI is introduced to handle the singularity of (CCᵀ)⁻¹. We introduce a similar damping term in κ² and c² (see Appx. I for a derivation) to improve numerical stability. Our update and KFAC include momentum weight α₂ for layer-wise NN weights µ and (L2) weight decay γ. In our update, we also introduce momentum weight α₁ in the SPD preconditioner. Our update can use a larger stepsize β₂ than KFAC. (A hedged code sketch of this inverse-free factor update appears after the table.) |
| Open Source Code | No | The paper describes its methods and experiments but provides no statement about, or link to, open-source code for its methodology. Footnotes 4 and 5 link to dataset repositories, not to the authors' code. |
| Open Datasets | Yes | We consider three complex datasets: CIFAR-100, Tiny ImageNet-200 (footnote 4), and ImageNet-100 (footnote 5). Footnote 4: github.com/tjmoon0104/pytorch-tiny-imagenet. Footnote 5: kaggle.com/datasets/ambityga/imagenet100. |
| Dataset Splits | Yes | Table 7 (Statistics of the Datasets) lists the number of training and test points: CIFAR-100 (50,000 / 10,000), Tiny ImageNet-200 (100,000 / 5,000), and ImageNet-100 (130,000 / 5,000). |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware used (e.g., GPU/CPU models, memory details) for running its experiments. |
| Software Dependencies | No | The paper mentions usage of PyTorch implicitly via a dataset link (github.com/tjmoon0104/pytorch-tiny-imagenet), but it does not specify version numbers for PyTorch or any other software dependencies crucial for replication. |
| Experiment Setup | Yes | The hyperparameter configuration of our update and KFAC can be found in Table 5 in Appx. A. Table 5: Hyperparameter configuration in our update and KFAC. We first choose the damping weight λ based on the performance of KFAC and use the same value in our update. For both methods, we set λ = 0.01, 0.0005, 0.005 in VGG-16, RepVGG-B1G4, and other models, respectively. To reduce the iteration cost of both methods, we update the preconditioner at every T = 60, 25, 10 iterations for RepVGG-B1G4, RegNetZ-500MF, and other models, respectively. The value of the hyperparameter θ is chosen as suggested at https://github.com/alecwangcq/KFAC-Pytorch. Since we do not use pre-training, we consider the first 500 iterations as a warm-up period to update our preconditioner by using a smaller stepsize β₁: we set β₁ = 0.0002 for the first 100 iterations, increase it to β₁ = 0.002 for the next 400 iterations, and finally fix it to β₁ = 0.01 for the remaining iterations. (A sketch of this β₁ warm-up schedule appears after the table.) |
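
The Pseudocode row above quotes the Figure 2 caption of the paper. Below is a minimal NumPy sketch of the kind of inverse-free, momentum-based factor update that caption describes: only H_K := Kᵀ µ_AA K, the damping term κ² := λ·Tr(KᵀK), the preconditioner momentum weight α₁, and the linear truncation of the matrix exponential are taken from the quoted text. The function name `update_factor_K`, the default hyperparameter values, and the exact descent direction (including its scaling) are illustrative assumptions, not the paper's Eq. (23).

```python
import numpy as np

def update_factor_K(K, m_K, mu_AA, lam=0.01, beta1=0.01, alpha1=0.9):
    # Quoted ingredients: H_K := K^T mu_AA K, kappa^2 := lam * Tr(K^T K),
    # momentum weight alpha1 on the SPD preconditioner, and a linearly
    # truncated matrix exponential so no explicit matrix inverse is needed.
    p = K.shape[0]
    I = np.eye(p)
    H_K = K.T @ mu_AA @ K
    kappa2 = lam * np.trace(K.T @ K)
    # Assumed descent direction; the merged 1/(2d) and 1/(2p) factors of
    # the paper's Eq. (23) are not reproduced here.
    direction = 0.5 * (H_K + (kappa2 / p) * I - I)
    m_K = alpha1 * m_K + direction           # momentum on the factor update
    K_new = K @ (I - beta1 * m_K)            # expm(-b*m) approximated by I - b*m
    return K_new, m_K

# Toy usage: one refresh of the activation-side factor of a single layer.
rng = np.random.default_rng(0)
p = 8
K, m_K = np.eye(p), np.zeros((p, p))
acts = rng.standard_normal((p, 64))
mu_AA = acts @ acts.T / 64.0                 # stand-in second-moment matrix
K, m_K = update_factor_K(K, m_K, mu_AA)
S_K = K @ K.T                                # pre-computable when T > 1
```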
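
The Experiment Setup row describes a staged warm-up of the preconditioner stepsize β₁ and a refresh interval T. The short Python sketch below encodes that schedule, assuming the stage boundaries are exactly the quoted iteration counts; the function name `beta1_schedule` and the surrounding loop skeleton are illustrative, not from the paper.

```python
def beta1_schedule(iteration: int) -> float:
    # Quoted warm-up: 0.0002 for the first 100 iterations,
    # 0.002 for the next 400, then 0.01 for the remaining iterations.
    if iteration < 100:
        return 0.0002
    if iteration < 500:
        return 0.002
    return 0.01

# The preconditioner is refreshed only every T iterations
# (T = 60, 25, or 10 depending on the model).
T = 10
for it in range(600):
    beta1 = beta1_schedule(it)
    refresh_preconditioner = (it % T == 0)
    # ... preconditioned layer-wise weight update would go here ...
```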