On Monotonic Linear Interpolation of Neural Network Parameters

Authors: James R Lucas, Juhan Bae, Michael R Zhang, Stanislav Fort, Richard Zemel, Roger B Grosse

ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extending this work, we evaluate several hypotheses for this property that, to our knowledge, have not yet been explored. Using tools from differential geometry, we draw connections between the interpolated paths in function space and the monotonicity of the network, providing sufficient conditions for the MLI property under mean squared error. While the MLI property holds under various settings (e.g., network architectures and learning problems), we show in practice that networks violating the MLI property can be produced systematically, by encouraging the weights to move far from initialization. ... To address these questions, we provide an expanded empirical and theoretical study of this phenomenon.
Researcher Affiliation | Academia | James Lucas (1,2), Juhan Bae (1,2), Michael R. Zhang (1,2), Stanislav Fort (3), Richard Zemel (1,2), Roger Grosse (1,2). 1 University of Toronto, 2 Vector Institute, 3 Stanford University. Correspondence to: James Lucas <jlucas@cs.toronto.edu>.
Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide a direct link to open-source code for the described methodology or an explicit statement of its release.
Open Datasets | Yes | For the reconstruction tasks, we trained fully-connected deep autoencoders on MNIST (LeCun et al., 2010). For the classification tasks, we trained networks on MNIST, Fashion-MNIST (Xiao et al., 2017), CIFAR-10, and CIFAR-100 (Krizhevsky et al., 2009). ... We provide a short study on the language modeling setting as well by training LSTM (Hochreiter & Schmidhuber, 1997b) and Transformer (Vaswani et al., 2017) architectures on the WikiText-2 (Merity et al., 2016) dataset. We also experimented with RoBERTa (Liu et al., 2019) on the Esperanto (Conneau et al., 2019) dataset.
Dataset Splits | No | The paper mentions a 'training set' and 'held-out datasets' but does not specify exact training/validation/test splits, percentages, or sample counts needed for reproduction.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, processor types, or memory amounts) used to run its experiments.
Software Dependencies | No | The paper mentions the 'Hugging Face library' but does not provide version numbers for the key software components or libraries required for replication.
Experiment Setup | Yes | For all experiments, unless specified otherwise, we discretize the interval [0, 1] using 50 uniform steps. ... We trained autoencoders with SGD and Adam, with varying learning rates and varying hidden layer sizes. ... Learning rates: 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1.0, 3.0. ... We also varied the distribution over initial parameters and whether or not batch normalization was applied. The results for CIFAR-10 are displayed in Table 2 (CIFAR-100 results are similar, and are presented in Appendix C). The column headers BN and NBN indicate batch normalization and no batch normalization, respectively. The suffixes I and F indicate two alternative initialization schemes: block-identity initialization (Goyal et al., 2017) and Fixup initialization (Zhang et al., 2019b). (A minimal sketch of the interpolation procedure follows the table.)
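The rows above summarize the paper's core measurement: after training, the loss is evaluated along the straight line in weight space between the initial parameters and the trained parameters, with the interpolation coefficient discretized into 50 uniform steps on [0, 1]; the MLI property holds if the loss never increases along that path. Below is a minimal sketch of that procedure, assuming a PyTorch model and a mean-reduced loss function. The helper names (evaluate_linear_path, is_monotonic) and the tolerance argument are illustrative assumptions, not the authors' released code.

```python
import torch

def evaluate_linear_path(model, theta_0, theta_1, loss_fn, data_loader, steps=50):
    """Evaluate the average loss along the line segment between two parameter
    settings: theta_0 (the initialization) and theta_1 (the trained solution).
    theta_0 and theta_1 are lists of tensors aligned with model.parameters()."""
    alphas = torch.linspace(0.0, 1.0, steps)  # 50 uniform steps on [0, 1]
    losses = []
    with torch.no_grad():
        for alpha in alphas:
            # theta(alpha) = (1 - alpha) * theta_0 + alpha * theta_1
            for p, w0, w1 in zip(model.parameters(), theta_0, theta_1):
                p.copy_((1 - alpha) * w0 + alpha * w1)
            # Average the (mean-reduced) loss over the full dataset at this alpha.
            total, count = 0.0, 0
            for inputs, targets in data_loader:
                total += loss_fn(model(inputs), targets).item() * inputs.size(0)
                count += inputs.size(0)
            losses.append(total / count)
    return losses

def is_monotonic(losses, tol=0.0):
    """The MLI property holds on this path if the loss never increases
    (by more than a small numerical tolerance) from one step to the next."""
    return all(later - earlier <= tol for earlier, later in zip(losses, losses[1:]))
```

In this sketch, theta_0 would be a copy of the parameters saved at initialization and theta_1 the parameters after training; the same discretization can be applied to a held-out loss as well. A small positive tol can be used to ignore numerically insignificant increases along the path.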