Should Under-parameterized Student Networks Copy or Average Teacher Weights?
Authors: Berfin Simsek, Amire Bendjeddou, Wulfram Gerstner, Johanni Brea
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we find for the erf activation function that gradient flow converges either to the optimal copy-average critical point or to another point where each student neuron approximately copies a different teacher neuron. Finally, we find similar results for the ReLU activation function, suggesting that the optimal solution of under-parameterized networks has a universal structure. The code to reproduce these findings is available on GitHub, and we refer to Appendix C for details. All experiments in this paper are implemented using the gradient flow package implemented by Brea et al. [16], which is particularly suited to studying gradient flow on the population loss. (The copy-average structure is illustrated in a sketch below the table.) |
| Researcher Affiliation | Academia | Berfin Simsek, NYU, bs3736@nyu.edu; Amire Bendjeddou, EPFL, amire.bendjeddou@epfl.ch; Wulfram Gerstner, EPFL, wulfram.gerstner@epfl.ch; Johanni Brea, EPFL, johanni.brea@epfl.ch |
| Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks. The methods are described in narrative text and mathematical equations. |
| Open Source Code | Yes | The code to reproduce these findings is available on GitHub, and we refer to Appendix C for details. |
| Open Datasets | No | The paper mentions using a "standard d-dimensional Gaussian D = N(0, Id)" as input data for the teacher network, which is a synthetic setup, not a publicly available dataset that can be accessed via a link or citation in the typical sense. |
| Dataset Splits | No | The paper describes a theoretical teacher-student setup in which the student network is trained to approximate the teacher. It does not mention explicit train/validation/test splits in the conventional machine-learning sense; instead, the student is trained on a continuous input distribution. |
| Hardware Specification | No | The paper states, "We use a numerical ODE solver for multi-layer networks [16] to simulate the gradient flow in this paper." However, it does not specify any particular hardware details such as GPU models, CPU types, or memory specifications used for these simulations. |
| Software Dependencies | No | The paper mentions "gradient flow package implemented by Brea et al. [16]" and "Glorot initialization [43]" but does not specify version numbers for these software components or any other libraries. |
| Experiment Setup | Yes | All "solutions", which are the points at which gradient flow converges, have a gradient norm of at most 5 × 10⁻⁸. All experiments in this paper are implemented using the gradient flow package implemented by Brea et al. [16], which is particularly suited to studying gradient flow on the population loss. For erf experiments, all seeds converged to configurations with gradient norm below 5 × 10⁻⁸. For ReLU experiments, a fraction of seeds failed to converge (large gradient norm at the end of training); in Appendix C.3, we report results only for the seeds that succeeded in converging. Weights are initialized as Gaussians with zero mean and standard deviation 0.1 or with Glorot initialization [43]. For each (n, k) pair, we ran 10 or 20 seeds of random initializations. (A hedged sketch of this setup follows the table.) |
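
The "Experiment Setup" row describes gradient flow on the population loss of an erf teacher-student network with standard Gaussian inputs. The paper integrates gradient flow with the ODE-solver package of Brea et al. [16]; the sketch below is only a rough stand-in that runs plain gradient descent on a Monte Carlo sample of the input distribution. The dimensions, sample size, learning rate, and step budget are illustrative assumptions, not values from the paper.

```python
# Minimal teacher-student sketch (assumed sizes): width-k erf teacher, width-n
# student with n < k, inputs x ~ N(0, I_d) as in the paper. Plain gradient
# descent on a Monte Carlo estimate of the population loss stands in for the
# ODE-based gradient flow of Brea et al. [16].
import numpy as np
from scipy.special import erf

rng = np.random.default_rng(0)
d, k, n = 5, 8, 4                 # input dim, teacher width, student width (assumed)
m = 20_000                        # sample size approximating the population loss
X = rng.standard_normal((m, d))   # x ~ N(0, I_d)

# Fixed teacher with unit output weights and random incoming weights (assumption).
W_teacher = rng.standard_normal((k, d))
y = erf(X @ W_teacher.T).sum(axis=1)

def loss_and_grad(W, a):
    """Squared loss of the student sum_i a_i * erf(w_i . x) and its gradients."""
    pre = X @ W.T                                  # (m, n) pre-activations
    act = erf(pre)
    resid = act @ a - y                            # (m,) residuals
    loss = 0.5 * np.mean(resid ** 2)
    d_pre = (resid[:, None] * a[None, :]) * (2.0 / np.sqrt(np.pi)) * np.exp(-pre ** 2)
    return loss, d_pre.T @ X / m, act.T @ resid / m

# Gaussian initialization with standard deviation 0.1, one of the two reported schemes.
W = 0.1 * rng.standard_normal((n, d))
a = 0.1 * rng.standard_normal(n)

lr, tol = 0.1, 5e-8               # tol mirrors the paper's 5e-8 gradient-norm criterion
for step in range(50_000):
    loss, gW, ga = loss_and_grad(W, a)
    gnorm = np.sqrt((gW ** 2).sum() + (ga ** 2).sum())
    if gnorm < tol:               # unlikely to trigger exactly with a finite sample
        break
    W -= lr * gW
    a -= lr * ga
print(f"steps={step}  loss={loss:.3e}  grad norm={gnorm:.3e}")
```

In practice one would repeat such a run over 10 or 20 seeds per (n, k) pair and inspect whether the converged student weights copy individual teacher neurons or average groups of them, as the paper reports.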
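
The "Research Type" row mentions convergence to a "copy-average" critical point, where some student neurons copy individual teacher neurons and the remaining ones represent groups of teacher neurons by an average. The snippet below only illustrates that structure with hypothetical sizes and a naive grouping; the optimal grouping, weight normalization, and output weights in the paper depend on the activation function and are not reproduced here.

```python
# Hypothetical illustration of a copy-average student built from teacher weights.
import numpy as np

rng = np.random.default_rng(1)
d, k, n = 5, 8, 4                          # assumed sizes with n < k
W_teacher = rng.standard_normal((k, d))    # teacher incoming weights, unit output weights

# Copy: the first n - 1 student neurons each take over one teacher neuron.
W_student = W_teacher[: n - 1].copy()
a_student = np.ones(n - 1)

# Average: the last student neuron represents the remaining teacher neurons by their
# mean incoming weight, with its output weight set to the size of the averaged group
# (a simplification; the paper derives the exact optimal weights).
group = W_teacher[n - 1 :]
W_student = np.vstack([W_student, group.mean(axis=0, keepdims=True)])
a_student = np.append(a_student, float(len(group)))

print(W_student.shape, a_student)          # (n, d) incoming weights and n output weights
```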