Generalization on the Unseen, Logic Reasoning and Degree Curriculum

Authors: Emmanuel Abbe, Samy Bengio, Aryo Lotfi, Kevin Rizk

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we present our experimental results on the min-degree bias of neural networks. We have used four architectures for our experiments: a multi-layer perceptron (MLP) with 4 hidden layers, the random features model (Definition 3.5), Transformers (Vaswani et al., 2017), and a 2-layer neural network with mean-field parametrization (Mei et al., 2018). By doing this, we consider a spectrum of models covering lazy regimes, active/feature-learning regimes, and models of practical interest. For the Transformer, the ±1 bits are first encoded using an encoding layer and then passed to the Transformer, while for the rest of the architectures, the binary vectors are directly used as the input. For each experiment, we generate all binary sequences in U^c = {±1}^d \ U for training. We then train the models under the ℓ2 loss. We employ the Adam (Kingma & Ba, 2014) optimizer for the Transformer model and mini-batch SGD for the rest of the architectures. We also use moderate learning rates, as the learning rate can affect the results (refer to Appendix B.2). During training, we evaluate the coefficients of the function learned by the neural network using f̂_NN(T) = E_{x ∼ Unif{±1}^d}[χ_T(x) f_NN(x)] to understand which interpolating solution has been learned by the model. Moreover, each experiment is repeated 10 times and averaged results are reported. For more information on the setup of the experiments, hyperparameter sensitivity analysis, and additional experiments, refer to Appendix B.
Researcher Affiliation | Collaboration | 1 EPFL, 2 Apple. Correspondence to: Aryo Lotfi <aryo.lotfi@epfl.ch>.
Pseudocode | Yes | Algorithm 1: Degree-Curriculum algorithm
Open Source Code | Yes | Code: https://github.com/aryol/GOTU
Open Datasets | No | For each experiment, we generate all binary sequences in U^c = {±1}^d \ U for training. Dimension 15 is used as a large dimension where the training data can be generated explicitly but has otherwise no specific meaning (Appendix B provides other instances). The paper describes how data is generated internally rather than providing access to a pre-existing public dataset.
Dataset Splits | No | The paper describes training on a 'seen domain' and testing on an 'unseen domain', but does not specify a validation set or explicit train/test/validation splits needed for reproduction.
Hardware Specification | Yes | Additionally, the experiments were executed on NVIDIA A100 GPUs and took around 60 GPU hours in total (excluding the selection of hyperparameters).
Software Dependencies | No | The paper mentions using the 'PyTorch framework (Paszke et al., 2019)' but does not specify a version number for PyTorch or any other software dependencies.
Experiment Setup | Yes | For the Transformer, we have used the Adam (Kingma & Ba, 2014) optimizer with batch size 256. For the RF models, we have used mini-batch SGD with a batch size of 256. Also, for the rest of the architectures, SGD with batch size 64 has been used.
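
The Open Datasets row notes that the training data is generated internally: all binary sequences in U^c = {±1}^d \ U at dimension d = 15. The sketch below illustrates that generation in PyTorch; the specific unseen domain (freezing the first coordinate to -1) and the target function are placeholders for illustration, not the paper's actual choices, which are specified in its appendix.

```python
import itertools

import torch

d = 15  # dimension used for the paper's main experiments

# Hypothetical unseen domain, for illustration only: hold out every input
# whose first coordinate is -1. The paper specifies its actual choices of U.
def in_unseen_domain(x):
    return x[0] == -1.0

# Enumerate all 2^d vectors in {+1, -1}^d and keep the complement U^c for training.
all_inputs = list(itertools.product([1.0, -1.0], repeat=d))
train_inputs = torch.tensor([x for x in all_inputs if not in_unseen_domain(x)])
unseen_inputs = torch.tensor([x for x in all_inputs if in_unseen_domain(x)])

# Placeholder target function (not taken from the paper): f(x) = x_1 * x_2 + x_3.
def target(x):
    return x[:, 0] * x[:, 1] + x[:, 2]

train_labels = target(train_inputs)
unseen_labels = target(unseen_inputs)
```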
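The Experiment Setup row reports the optimizers and batch sizes (Adam with batch size 256 for the Transformer, mini-batch SGD with batch size 256 for the RF model, SGD with batch size 64 otherwise). Continuing from the data-generation sketch above, this is a minimal ℓ2-loss training loop for the MLP branch; the hidden width, learning rate, and epoch count are placeholders, not values from the paper.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# MLP with 4 hidden layers, as described in the quoted setup; width is a placeholder.
width = 128
mlp = nn.Sequential(
    nn.Linear(d, width), nn.ReLU(),
    nn.Linear(width, width), nn.ReLU(),
    nn.Linear(width, width), nn.ReLU(),
    nn.Linear(width, width), nn.ReLU(),
    nn.Linear(width, 1),
)

# Reported config: SGD with batch size 64 for the MLP (Adam, batch 256, for the Transformer).
optimizer = torch.optim.SGD(mlp.parameters(), lr=1e-2)  # learning rate is a placeholder
loader = DataLoader(TensorDataset(train_inputs, train_labels), batch_size=64, shuffle=True)

for epoch in range(100):  # placeholder number of epochs
    for xb, yb in loader:
        optimizer.zero_grad()
        loss = ((mlp(xb).squeeze(-1) - yb) ** 2).mean()  # l2 (squared) loss
        loss.backward()
        optimizer.step()
```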
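The Research Type row quotes the coefficient probe f̂_NN(T) = E_{x ∼ Unif{±1}^d}[χ_T(x) f_NN(x)] used to identify which interpolating solution a model has learned. Below is a minimal sketch of that estimate by exact averaging over the full hypercube, which is feasible at d = 15; `mlp` refers to the model trained in the previous sketch.

```python
import itertools

import torch

def fourier_coefficient(model, T, d):
    """Fourier-Walsh coefficient hat{f}_NN(T) = E_{x ~ Unif{+-1}^d}[chi_T(x) f_NN(x)],
    computed by exact averaging over all 2^d inputs."""
    xs = torch.tensor(list(itertools.product([1.0, -1.0], repeat=d)))
    chi_T = xs[:, list(T)].prod(dim=1)  # chi_T(x) = prod_{i in T} x_i (empty T gives 1)
    with torch.no_grad():
        preds = model(xs).squeeze(-1)
    return (chi_T * preds).mean().item()

# e.g. the learned coefficient of the monomial x_1 * x_2 (0-indexed coordinates {0, 1})
print(fourier_coefficient(mlp, T=(0, 1), d=d))
```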
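The Pseudocode row points to Algorithm 1 (the Degree-Curriculum algorithm). The following is a hedged sketch of that idea, assuming the curriculum trains on inputs of increasing Hamming weight (number of -1 coordinates) so that higher-degree monomials are exposed progressively; the stage thresholds, step count, and batch size are illustrative, not the paper's values.

```python
import torch

def hamming_weight(x):
    # number of -1 coordinates per input
    return (x == -1.0).sum(dim=1)

def degree_curriculum(model, optimizer, inputs, labels,
                      thresholds, steps_per_stage=1000, batch_size=64):
    weights = hamming_weight(inputs)
    for w_max in thresholds:                    # e.g. thresholds = range(2, d + 1, 2)
        mask = weights <= w_max                 # current curriculum stage
        stage_x, stage_y = inputs[mask], labels[mask]
        for _ in range(steps_per_stage):
            idx = torch.randint(0, stage_x.shape[0], (batch_size,))
            optimizer.zero_grad()
            loss = ((model(stage_x[idx]).squeeze(-1) - stage_y[idx]) ** 2).mean()
            loss.backward()
            optimizer.step()

# e.g. degree_curriculum(mlp, optimizer, train_inputs, train_labels, thresholds=range(2, d + 1, 2))
```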