Catastrophic Fisher Explosion: Early Phase Fisher Matrix Impacts Generalization
Authors: Stanislaw Jastrzebski, Devansh Arpit, Oliver Astrand, Giancarlo B. Kerg, Huan Wang, Caiming Xiong, Richard Socher, Kyunghyun Cho, Krzysztof J. Geras
ICML 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate on image classification tasks that Tr(F) early in training correlates with the final generalization performance across settings with different learning rates or batch sizes. We then show evidence that explicitly regularizing Tr(F), which we call Fisher penalty, recovers generalization degradation due to training with a sub-optimal (small) learning rate, and can significantly improve generalization when training with the optimal learning rate. On the other hand, achieving large Tr(F) early in training, which may occur in practice when using a relatively small learning rate, or due to bad initialization, coincides with poor generalization. We call this phenomenon catastrophic Fisher explosion. Figure 1 illustrates this effect on the Tiny ImageNet dataset (Le & Yang, 2015). (An illustrative sketch of the Fisher-penalty estimator follows the table.) |
| Researcher Affiliation | Collaboration | (1) NYU Langone Medical Center, New York University, USA; (2) Center for Data Science, New York University, USA; (3) Salesforce Research, USA; (4) Université de Montréal, Canada; (5) CIFAR Azrieli Global Scholar. |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any explicit statements about releasing source code or links to a code repository. |
| Open Datasets | Yes | We demonstrate on image classification tasks that Tr(F) early in training correlates with the final generalization performance across settings with different learning rates or batch sizes. [...] Figure 1 illustrates this effect on the Tiny ImageNet dataset (Le & Yang, 2015). [...] We run experiments in two settings: (1) ResNet-18 with Fixup (He et al., 2016; Zhang et al., 2019) trained on the ImageNet dataset (Deng et al., 2009), (2) ResNet-26 initialized as in (Arpit et al., 2019) and trained on the CIFAR-10 and CIFAR-100 datasets (Krizhevsky, 2009). |
| Dataset Splits | Yes | We train each architecture using SGD, with various values of η, S, and random seed. [...] We tune the hyperparameters on the validation set. More specifically, for α we test 10 different values spaced uniformly on a logarithmic scale between 10^-1·v and 10^1·v, with v ∈ R+. For Tiny ImageNet, we evaluate 5 values spaced equally on a logarithmic scale. We include the remaining experimental details in Supplement I.2. |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware used to run its experiments, such as GPU/CPU models or memory specifications. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers (e.g., Python 3.8, PyTorch 1.9, CUDA 11.1) needed to replicate the experiment. |
| Experiment Setup | Yes | We train a Wide ResNet model (depth 44, width 3) (Zagoruyko & Komodakis, 2016) on the Tiny ImageNet dataset with SGD and two different learning rates. [...] For ImageNet, we use learning rates 0.001, 0.01, 0.1, and ε = 3.5. For CIFAR-10, we use learning rates 0.007, 0.01, 0.05, and ε = 1.2. For CIFAR-100, we use learning rates 0.001, 0.005, 0.01, and ε = 3.5. [...] We tune the hyperparameters on the validation set. More specifically, for α we test 10 different values spaced uniformly on a logarithmic scale between 10^-1·v and 10^1·v, with v ∈ R+. For Tiny ImageNet, we evaluate 5 values spaced equally on a logarithmic scale. (A sketch of this α grid follows the table.) |
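
The Fisher penalty described in the Research Type row regularizes Tr(F), the trace of the Fisher Information Matrix, which can be estimated on a mini-batch as the squared gradient norm of the log-likelihood with labels drawn from the model's own predictive distribution. Since the paper releases no code, the following is only a minimal PyTorch sketch of such an estimator; the function name `fisher_trace_penalty` and the one-sampled-label-per-input estimator are assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F  # `F` here is torch.nn.functional, not the Fisher matrix


def fisher_trace_penalty(model, inputs, alpha):
    """Mini-batch estimate of alpha * Tr(F) (hypothetical helper).

    Tr(F) is approximated by the squared norm of the parameter gradient of the
    negative log-likelihood, with labels sampled from the model's own
    predictive distribution (one sample per input).
    """
    logits = model(inputs)
    with torch.no_grad():
        # Sample pseudo-labels from the model's predictive distribution.
        sampled_labels = torch.distributions.Categorical(logits=logits).sample()
    nll = F.cross_entropy(logits, sampled_labels)
    params = [p for p in model.parameters() if p.requires_grad]
    # create_graph=True keeps these gradients in the autograd graph so the
    # squared norm can itself be differentiated (double backpropagation).
    grads = torch.autograd.grad(nll, params, create_graph=True)
    squared_grad_norm = sum(g.pow(2).sum() for g in grads)
    return alpha * squared_grad_norm
```

In a training step the penalty would simply be added to the task loss, e.g. `loss = F.cross_entropy(model(x), y) + fisher_trace_penalty(model, x, alpha)`, with the coefficient α selected on the validation set as described in the Experiment Setup row.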
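
The α search grid quoted in the Dataset Splits and Experiment Setup rows (10 values spaced uniformly on a logarithmic scale between 10^-1·v and 10^1·v, or 5 values for Tiny ImageNet) is straightforward to reproduce. The NumPy sketch below is an illustration only; the helper name `alpha_candidates` is an assumption, and `v` stands for whatever base value the grid is centered on, which the quoted excerpt does not state.

```python
import numpy as np


def alpha_candidates(v, num=10):
    """Return `num` values spaced uniformly on a log scale between 0.1*v and 10*v."""
    return np.logspace(np.log10(0.1 * v), np.log10(10.0 * v), num=num)


# Hypothetical base value: v = 0.01 yields candidates from 1e-3 to 1e-1;
# for Tiny ImageNet the paper evaluates 5 values instead of 10.
print(alpha_candidates(0.01))
```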