Catastrophic Fisher Explosion: Early Phase Fisher Matrix Impacts Generalization
Authors: Stanislaw Jastrzebski, Devansh Arpit, Oliver Astrand, Giancarlo B. Kerg, Huan Wang, Caiming Xiong, Richard Socher, Kyunghyun Cho, Krzysztof J. Geras
ICML 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate on image classification tasks that Tr(F) early in training correlates with the final generalization performance across settings with different learning rates or batch sizes. We then show evidence that explicitly regularizing Tr(F), which we call Fisher penalty, recovers generalization degradation due to training with a sub-optimal (small) learning rate, and can significantly improve generalization when training with the optimal learning rate. On the other hand, achieving large Tr(F) early in training, which may occur in practice when using a relatively small learning rate, or due to bad initialization, coincides with poor generalization. We call this phenomenon catastrophic Fisher explosion. Figure 1 illustrates this effect on the Tiny ImageNet dataset (Le & Yang, 2015). (An illustrative sketch of the Fisher-penalty estimator follows the table.) |
| Researcher Affiliation | Collaboration | (1) NYU Langone Medical Center, New York University, USA; (2) Center for Data Science, New York University, USA; (3) Salesforce Research, USA; (4) Université de Montréal, Canada; (5) CIFAR Azrieli Global Scholar. |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any explicit statements about releasing source code or links to a code repository. |
| Open Datasets | Yes | We demonstrate on image classification tasks that Tr(F) early in training correlates with the final generalization performance across settings with different learning rates or batch sizes. [...] Figure 1 illustrates this effect on the Tiny ImageNet dataset (Le & Yang, 2015). [...] We run experiments in two settings: (1) ResNet-18 with Fixup (He et al., 2016; Zhang et al., 2019) trained on the ImageNet dataset (Deng et al., 2009), (2) ResNet-26 initialized as in (Arpit et al., 2019) and trained on the CIFAR-10 and CIFAR-100 datasets (Krizhevsky, 2009). |
| Dataset Splits | Yes | We train each architecture using SGD, with various values of η, S, and random seed. [...] We tune the hyperparameters on the validation set. More specifically, for α we test 10 different values spaced uniformly on a logarithmic scale between 10^-1·v and 10^1·v, with v ∈ R+. For Tiny ImageNet, we evaluate 5 values spaced equally on a logarithmic scale. We include the remaining experimental details in Supplement I.2. |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware used to run its experiments, such as GPU/CPU models or memory specifications. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers (e.g., Python 3.8, PyTorch 1.9, CUDA 11.1) needed to replicate the experiment. |
| Experiment Setup | Yes | We train a Wide ResNet model (depth 44, width 3) (Zagoruyko & Komodakis, 2016) on the Tiny ImageNet dataset with SGD and two different learning rates. [...] For ImageNet, we use learning rates 0.001, 0.01, 0.1, and ε = 3.5. For CIFAR-10, we use learning rates 0.007, 0.01, 0.05, and ε = 1.2. For CIFAR-100, we use learning rates 0.001, 0.005, 0.01, and ε = 3.5. [...] We tune the hyperparameters on the validation set. More specifically, for α we test 10 different values spaced uniformly on a logarithmic scale between 10^-1·v and 10^1·v, with v ∈ R+. For Tiny ImageNet, we evaluate 5 values spaced equally on a logarithmic scale. (A sketch of this α grid follows the table.) |
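
The Fisher penalty described in the Research Type row regularizes Tr(F), the trace of the Fisher Information Matrix, which can be estimated on a mini-batch as the squared gradient norm of the log-likelihood with labels drawn from the model's own predictive distribution. Since the paper releases no code, the following is only a minimal PyTorch sketch of such an estimator; the function name `fisher_trace_penalty` and the one-sampled-label-per-input estimator are assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F  # `F` here is torch.nn.functional, not the Fisher matrix


def fisher_trace_penalty(model, inputs, alpha):
    """Mini-batch estimate of alpha * Tr(F) (hypothetical helper).

    Tr(F) is approximated by the squared norm of the parameter gradient of the
    negative log-likelihood, with labels sampled from the model's own
    predictive distribution (one sample per input).
    """
    logits = model(inputs)
    with torch.no_grad():
        # Sample pseudo-labels from the model's predictive distribution.
        sampled_labels = torch.distributions.Categorical(logits=logits).sample()
    nll = F.cross_entropy(logits, sampled_labels)
    params = [p for p in model.parameters() if p.requires_grad]
    # create_graph=True keeps these gradients in the autograd graph so the
    # squared norm can itself be differentiated (double backpropagation).
    grads = torch.autograd.grad(nll, params, create_graph=True)
    squared_grad_norm = sum(g.pow(2).sum() for g in grads)
    return alpha * squared_grad_norm
```

In a training step the penalty would simply be added to the task loss, e.g. `loss = F.cross_entropy(model(x), y) + fisher_trace_penalty(model, x, alpha)`, with the coefficient α selected on the validation set as described in the Experiment Setup row.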
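
The α search grid quoted in the Dataset Splits and Experiment Setup rows (10 values spaced uniformly on a logarithmic scale between 10^-1·v and 10^1·v, or 5 values for Tiny ImageNet) is straightforward to reproduce. The NumPy sketch below is an illustration only; the helper name `alpha_candidates` is an assumption, and `v` stands for whatever base value the grid is centered on, which the quoted excerpt does not state.

```python
import numpy as np


def alpha_candidates(v, num=10):
    """Return `num` values spaced uniformly on a log scale between 0.1*v and 10*v."""
    return np.logspace(np.log10(0.1 * v), np.log10(10.0 * v), num=num)


# Hypothetical base value: v = 0.01 yields candidates from 1e-3 to 1e-1;
# for Tiny ImageNet the paper evaluates 5 values instead of 10.
print(alpha_candidates(0.01))
```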