Generalization on the Unseen, Logic Reasoning and Degree Curriculum
Authors: Emmanuel Abbe, Samy Bengio, Aryo Lotfi, Kevin Rizk
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we present our experimental results on the min-degree bias of neural networks. We have used four architectures for our experiments: a multi-layer perceptron (MLP) with 4 hidden layers, the random features model (Definition 3.5), Transformers (Vaswani et al., 2017), and a 2-layer neural network with mean-field parametrization (Mei et al., 2018). By doing this, we consider a spectrum of models covering lazy regimes, active/feature learning regimes, and models of practical interest. For the Transformer, the ±1 bits are first encoded using an encoding layer and then passed to the Transformer, while for the rest of the architectures, binary vectors are directly used as the input. For each experiment, we generate all binary sequences in U^c = {±1}^d \ U for training. We then train models under the ℓ2 loss. We employ the Adam (Kingma & Ba, 2014) optimizer for the Transformer model and mini-batch SGD for the rest of the architectures. We also use moderate learning rates, as the learning rate can affect the results (refer to Appendix B.2). During training, we evaluate the coefficients of the function learned by the neural network using f̂_NN(T) = E_{x∼U{±1}^d}[χ_T(x) f_NN(x)] to understand which interpolating solution has been learned by the model. Moreover, each experiment is repeated 10 times and averaged results are reported. For more information on the setup of experiments, hyperparameter sensitivity analysis, and additional experiments refer to Appendix B. (A sketch of this Fourier-coefficient probe is given after the table.) |
| Researcher Affiliation | Collaboration | ¹EPFL, ²Apple. Correspondence to: Aryo Lotfi <aryo.lotfi@epfl.ch>. |
| Pseudocode | Yes | Algorithm 1 Degree-Curriculum algorithm |
| Open Source Code | Yes | Code: https://github.com/aryol/GOTU |
| Open Datasets | No | For each experiment, we generate all binary sequences in U^c = {±1}^d \ U for training. Dimension 15 is used as a large dimension where the training data can be generated explicitly but has otherwise no specific meaning (Appendix B provides other instances). The paper describes how data is generated internally rather than providing access to a pre-existing public dataset. (A data-generation sketch is given after the table.) |
| Dataset Splits | No | The paper describes training on a 'seen domain' and testing on an 'unseen domain', but does not specify a validation set or explicit train/test/validation splits needed for reproduction. |
| Hardware Specification | Yes | Additionally, the experiments were executed on NVIDIA A100 GPUs and the experiments took around 60 GPU hours in total (excluding the selection of hyperparameters). |
| Software Dependencies | No | The paper mentions using 'PyTorch framework (Paszke et al., 2019)' but does not specify a version number for PyTorch or any other software dependencies. |
| Experiment Setup | Yes | For the Transformer, we have used the Adam (Kingma & Ba, 2014) optimizer with batch size 256. For the RF models, we have used mini-batch SGD with a batch size of 256. Also, for the rest of the architectures, SGD with batch size 64 has been used. (A training-loop sketch with these batch sizes is given after the table.) |
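The quoted setup trains on every point of the Boolean cube outside the unseen domain, U^c = {±1}^d \ U. The sketch below assumes, purely for illustration, that U freezes the first coordinate to -1 and that the target is a small sparse Boolean function; both choices are hypothetical and are not necessarily the paper's exact holdout or target.

```python
# Sketch of the seen-domain data generation: enumerate the full Boolean cube {±1}^d
# and keep only the points outside the (assumed) unseen domain U = {x : x[0] = -1}.
import itertools

import torch


def seen_domain(d: int) -> torch.Tensor:
    """All x in {±1}^d outside the assumed unseen domain U = {x : x[0] = -1}."""
    cube = torch.tensor(list(itertools.product((-1.0, 1.0), repeat=d)))
    return cube[cube[:, 0] != -1.0]  # keep {±1}^d \ U


def target_fn(x: torch.Tensor) -> torch.Tensor:
    """Hypothetical sparse target, e.g. f(x) = x0*x1 + x2 (illustration only)."""
    return x[:, 0] * x[:, 1] + x[:, 2]


if __name__ == "__main__":
    d = 15                       # dimension used in the quoted experiments
    x_train = seen_domain(d)     # 2^(d-1) = 16384 training points
    y_train = target_fn(x_train)
    print(x_train.shape, y_train.shape)
```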
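Given such a seen-domain dataset, a minimal training sketch consistent with the quoted details (a 4-hidden-layer MLP, mini-batch SGD with batch size 64, ℓ2 loss) could look as follows. Hidden width, learning rate, and epoch count are assumptions; per the table, Transformers would instead use Adam with batch size 256.

```python
# Minimal training sketch under the quoted setup: 4-hidden-layer MLP, mini-batch SGD
# (batch size 64), squared (l2) loss on the seen domain. Width/lr/epochs are assumed.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset


def make_mlp(d: int, width: int = 512) -> nn.Sequential:
    layers, in_dim = [], d
    for _ in range(4):  # 4 hidden layers, as stated in the quoted setup
        layers += [nn.Linear(in_dim, width), nn.ReLU()]
        in_dim = width
    layers.append(nn.Linear(in_dim, 1))
    return nn.Sequential(*layers)


def train(model: nn.Module, x: torch.Tensor, y: torch.Tensor,
          epochs: int = 200, lr: float = 0.01) -> nn.Module:
    loader = DataLoader(TensorDataset(x, y), batch_size=64, shuffle=True)
    opt = torch.optim.SGD(model.parameters(), lr=lr)  # Adam (batch 256) for Transformers
    loss_fn = nn.MSELoss()                            # l2 loss
    for _ in range(epochs):
        for xb, yb in loader:
            opt.zero_grad()
            loss = loss_fn(model(xb).squeeze(-1), yb)
            loss.backward()
            opt.step()
    return model
```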
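The coefficients f̂_NN(T) = E_{x∼U{±1}^d}[χ_T(x) f_NN(x)] quoted above, where χ_T(x) is the monomial ∏_{i∈T} x_i, can be estimated by Monte Carlo over uniform ±1 inputs. The helper below is a sketch under that reading; the sample size is an assumption, and for small d the expectation can instead be computed exactly over all 2^d points.

```python
# Sketch of the Fourier-coefficient probe: estimate
#   f_NN_hat(T) = E_{x ~ Unif({±1}^d)}[chi_T(x) * f_NN(x)],  chi_T(x) = prod_{i in T} x_i.
import torch
from torch import nn


@torch.no_grad()
def fourier_coefficient(model: nn.Module, T: list[int], d: int,
                        n_samples: int = 100_000) -> float:
    """Monte Carlo estimate of the Fourier coefficient of the learned function on T."""
    x = torch.randint(0, 2, (n_samples, d)).float() * 2 - 1      # uniform on {±1}^d
    chi_T = x[:, T].prod(dim=1) if T else torch.ones(n_samples)  # chi of the empty set is 1
    f_nn = model(x).squeeze(-1)
    return (chi_T * f_nn).mean().item()


# Usage (hypothetical): after training, compare e.g. the coefficient of x0*x1 against
# that of a lower-degree monomial to see which interpolator was learned.
# coef = fourier_coefficient(trained_model, T=[0, 1], d=15)
```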