On Lazy Training in Differentiable Programming

Authors: Lénaïc Chizat, Edouard Oyallon, Francis Bach

NeurIPS 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our numerical experiments bring a critical note, as we observe that the performance of commonly used non-linear deep convolutional neural networks in computer vision degrades when trained in the lazy regime.
Researcher Affiliation | Academia | Lénaïc Chizat, CNRS, Université Paris-Sud, Orsay, France (lenaic.chizat@u-psud.fr); Edouard Oyallon, CentraleSupélec, INRIA, Gif-sur-Yvette, France (edouard.oyallon@centralesupelec.fr); Francis Bach, INRIA, ENS, PSL Research University, Paris, France (francis.bach@inria.fr)
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | "The code to reproduce these experiments is available online" (footnote 7: https://github.com/edouardoyallon/lazy-training-CNN).
Open Datasets | Yes | We consider the VGG-11 model [32], which is a widely used model on CIFAR10.
Dataset Splits | No | The paper mentions 'test loss' and 'test accuracy', suggesting train/test splits for the synthetic and CIFAR-10 datasets, but it does not give specific percentages, sample counts, or citations for predefined training, validation, or test splits; no validation set is mentioned.
Hardware Specification | No | The paper mentions 'a GPU donation from NVIDIA' in the acknowledgments but does not specify the GPU model or any other hardware (CPU, memory, etc.) used for the experiments.
Software Dependencies | No | The paper does not specify any software dependencies with version numbers (e.g., Python 3.8, PyTorch 1.9).
Experiment Setup | Yes | We trained it via mini-batch SGD with a momentum parameter of 0.9... An initial learning rate η_0 is linearly decayed at each epoch, following η_t = η_0 / (1 + βt). The biases are initialized with 0 and all other weights are initialized with normal Xavier initialization [13]... The model h is trained for the square loss multiplied by 1/α^2... with standard data-augmentation, batch-size of 128 [35] and η_0 = 1... The total number of epochs is 70... We choose α = 10^7... a batch-size of 8 and, after cross-validation, η_0 = 0.01, 1.0... We also multiply the initial weights by respectively 1.2 and 1.3 for the ResNet-18 and VGG-11...
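The experiment setup quoted above can be summarized in a short sketch. This is a minimal illustration, not the authors' released code (see the repository linked above for that): the scale alpha = 1e7, the decay constant beta, the initial learning rate eta0, and the torchvision VGG-11 backbone standing in for the paper's CIFAR-adapted VGG-11 are all assumptions chosen to match the quoted description (square loss of the rescaled model multiplied by 1/α^2, SGD with momentum 0.9, Xavier-normal weights with zero biases, and η_t = η_0 / (1 + βt) decayed once per epoch).

```python
# Minimal sketch of the lazy-training recipe described above (not the authors' code).
# Assumed placeholders: alpha, beta, eta0, num_epochs, and torchvision's VGG-11
# standing in for the CIFAR-adapted VGG-11 used in the paper.
import torch
import torch.nn as nn
import torchvision

alpha = 1e7            # lazy-training scale (the paper quotes alpha = 10^7)
eta0, beta = 1.0, 0.1  # initial learning rate and decay constant (beta is an assumed value)
num_epochs = 70        # total number of epochs, as quoted

net = torchvision.models.vgg11(num_classes=10)  # stand-in for the CIFAR-10 VGG-11

# Zero biases and Xavier (normal) initialization for all other weights, as quoted.
for m in net.modules():
    if isinstance(m, (nn.Conv2d, nn.Linear)):
        nn.init.xavier_normal_(m.weight)
        if m.bias is not None:
            nn.init.zeros_(m.bias)

opt = torch.optim.SGD(net.parameters(), lr=eta0, momentum=0.9)

def lazy_loss(outputs, targets_onehot):
    # Square loss of the rescaled model alpha * h, multiplied by 1/alpha^2 so the
    # objective stays finite as alpha grows. The paper's exact formulation may
    # differ (e.g., it may also center by the output at initialization).
    return ((alpha * outputs - targets_onehot) ** 2).mean() / alpha**2

for epoch in range(num_epochs):
    # Learning rate decayed at each epoch: eta_t = eta0 / (1 + beta * t).
    for group in opt.param_groups:
        group["lr"] = eta0 / (1 + beta * epoch)
    # ... one pass over the CIFAR-10 training loader (with standard data
    # augmentation and batch size 128) would go here ...
```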