Penalising the biases in norm regularisation enforces sparsity
Authors: Etienne Boursier, Nicolas Flammarion
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, the significance of bias term regularisation in achieving sparser estimators during neural network training is illustrated on toy examples in Section 6. This section compares, through Figure 3, the estimators obtained with and without counting the bias terms in the regularisation, when training a one-hidden-layer ReLU neural network. |
| Researcher Affiliation | Academia | Etienne Boursier INRIA CELESTE, LMO, Orsay, France etienne.boursier@inria.fr Nicolas Flammarion TML Lab, EPFL, Switzerland nicolas.flammarion@epfl.ch |
| Pseudocode | No | The paper contains mathematical derivations and proofs, but no structured pseudocode or algorithm blocks are present. |
| Open Source Code | Yes | The code is made available at github.com/eboursier/penalising_biases. |
| Open Datasets | No | The paper mentions using 'toy examples' for illustration, but it does not provide specific dataset names, citations, or links for public access. |
| Dataset Splits | No | The paper uses 'toy examples' for illustration and does not provide specific details on dataset splits (e.g., train/validation/test percentages or counts). |
| Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU, CPU models, or memory) used to run the experiments. |
| Software Dependencies | No | The paper discusses training neural networks but does not provide specific software names with version numbers (e.g., PyTorch 1.9, Python 3.8). |
| Experiment Setup | Yes | For this experiment, we train neural networks by minimising the empirical loss, regularised with the ℓ2 norm of the parameters (either with or without the bias terms) with a regularisation factor λ = 10⁻³. Each neural network has m = 200 hidden neurons and all parameters are initialised i.i.d. as centered Gaussian variables of variance 1/√m (similar results are observed for larger initialisation scales). |
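
As a rough illustration of the setup quoted above, the following is a minimal PyTorch sketch, not the authors' released code (see github.com/eboursier/penalising_biases for that). The loss function, optimiser, learning rate and number of steps are assumptions; only the regularisation factor, hidden width and initialisation scale come from the quoted setup.

```python
import torch
import torch.nn as nn

def train(X, y, m=200, lam=1e-3, penalise_biases=True, steps=10_000, lr=1e-2):
    """Train a one-hidden-layer ReLU network with an explicit ell_2 penalty.

    The penalty is applied to all parameters when `penalise_biases` is True,
    and to the weight matrices only otherwise.
    """
    d = X.shape[1]
    model = nn.Sequential(nn.Linear(d, m), nn.ReLU(), nn.Linear(m, 1))

    # i.i.d. centered Gaussian initialisation with variance 1/sqrt(m),
    # i.e. standard deviation m**(-1/4), as described in the setup.
    with torch.no_grad():
        for p in model.parameters():
            p.normal_(mean=0.0, std=m ** -0.25)

    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        # Squared loss is an assumption; the paper's toy examples may differ.
        loss = nn.functional.mse_loss(model(X), y.reshape(-1, 1))
        for name, p in model.named_parameters():
            if penalise_biases or "bias" not in name:
                loss = loss + lam * p.pow(2).sum()
        loss.backward()
        opt.step()
    return model
```

Comparing the networks returned with `penalise_biases=True` and `penalise_biases=False` on the same toy data mirrors the comparison reported in Figure 3 of the paper, where including the bias terms in the penalty yields sparser estimators.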