Does a sparse ReLU network training problem always admit an optimum?

Authors: Quoc-Tung Le, Rémi Gribonval, Elisa Riccietti

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Figure 1 illustrates the behavior of the relative errors on the training and validation sets, and of the sum of the weight-matrix norms, along epochs, using Stochastic Gradient Descent (SGD) with batch size 3000, learning rate 0.1, momentum 0.9, and four different weight decays (the hyperparameter controlling the L2 regularizer) λ ∈ {0, 10⁻⁴, 5·10⁻⁴, 10⁻³}. The case λ = 0 corresponds to the unregularized case. Our training and testing sets each contain P = 10⁵ samples generated independently as xᵢ ~ U([−1, 1]^d) (d = 100) and yᵢ := Axᵢ. We test this algorithm on a one-hidden-layer ReLU network with two 100 × 100 weight matrices.
Researcher Affiliation | Academia | Univ. Lyon, Inria, CNRS, ENS de Lyon, UCB Lyon 1, LIP UMR 5668, F-69342 Lyon, France
Pseudocode | No | The paper describes algorithms verbally (e.g., quantifier elimination, detection algorithm) but does not present them in structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code for reproducible research: "Does a sparse ReLU network training problem always admit an optimum?" Code repository available at https://hal.science/hal-04233925, October 2023.
Open Datasets | No | "Our training and testing sets each contain P = 10⁵ samples generated independently as xᵢ ~ U([−1, 1]^d) (d = 100) and yᵢ := Axᵢ." The paper describes generating its own synthetic dataset and does not provide a link, DOI, or formal citation for a publicly available or open dataset.
Dataset Splits | No | "Our training and testing sets each contain P = 10⁵ samples generated independently as xᵢ ~ U([−1, 1]^d) (d = 100) and yᵢ := Axᵢ." The paper mentions training and testing sets but does not specify a validation set or provide percentages for dataset splits.
Hardware Specification | No | "The authors thank the Blaise Pascal Center (CBP) for the computational means. It uses the SIDUS [27] solution developed by Emmanuel Quemener." The paper mentions general "computational means" but does not provide specific hardware details such as GPU/CPU models, memory, or other specifications used for the experiments.
Software Dependencies | No | "Small toy examples (for example, Example 3.1 with d = 2) can be verified using Z3Prover, software implementing exactly the algorithm in Lemma 3.3." The paper mentions Z3Prover but does not specify a version number for it or for any other software dependency. (A toy Z3 sketch is given after the table.)
Experiment Setup | Yes | Figure 1 illustrates the behavior of the relative errors on the training and validation sets, and of the sum of the weight-matrix norms, along epochs, using Stochastic Gradient Descent (SGD) with batch size 3000, learning rate 0.1, momentum 0.9, and four different weight decays (the hyperparameter controlling the L2 regularizer) λ ∈ {0, 10⁻⁴, 5·10⁻⁴, 10⁻³}. (A hedged PyTorch sketch of this setup is given below.)
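
To make the Experiment Setup row concrete, here is a minimal PyTorch sketch of the Figure 1 training run as quoted above. Only the details reported in the table are taken as given (d = 100, P = 10⁵, yᵢ := Axᵢ, SGD with batch size 3000, learning rate 0.1, momentum 0.9, weight decay λ); the choice of A, the squared-error loss, the absence of biases, the epoch count, and all names are assumptions for illustration, not the authors' released code.

```python
# Hypothetical reconstruction of the Figure 1 setup described in the table.
# Anything not quoted there (A, loss, biases, epoch count) is an assumption.
import torch
import torch.nn as nn

torch.manual_seed(0)

d = 100                       # input/output dimension (quoted: d = 100)
P = 10**5                     # samples per set (quoted: P = 10^5)
A = torch.randn(d, d)         # assumed: the linear target map A is not specified

# Synthetic data: x_i ~ U([-1, 1]^d), y_i := A x_i
X_train = 2 * torch.rand(P, d) - 1
Y_train = X_train @ A.T

# One-hidden-layer ReLU network with two 100 x 100 weight matrices (no bias assumed)
model = nn.Sequential(
    nn.Linear(d, d, bias=False),
    nn.ReLU(),
    nn.Linear(d, d, bias=False),
)

weight_decay = 1e-4           # one value of lambda in {0, 1e-4, 5e-4, 1e-3}
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=weight_decay)
loss_fn = nn.MSELoss()        # assumed squared-error objective

batch_size = 3000
num_epochs = 5                # assumed: the report does not state the epoch count

for epoch in range(num_epochs):
    perm = torch.randperm(P)
    for start in range(0, P, batch_size):
        idx = perm[start:start + batch_size]
        optimizer.zero_grad()
        loss = loss_fn(model(X_train[idx]), Y_train[idx])
        loss.backward()
        optimizer.step()
    # Relative training error, the quantity tracked in Figure 1
    with torch.no_grad():
        rel_err = ((model(X_train) - Y_train).norm() / Y_train.norm()).item()
    print(f"epoch {epoch}: relative training error {rel_err:.4f}")
```

Re-running this loop for each λ in {0, 10⁻⁴, 5·10⁻⁴, 10⁻³} would reproduce the four regularization settings the table describes.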
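The Software Dependencies row refers to verifying small instances with Z3Prover. The snippet below is not Example 3.1 from the paper (its formula is not reproduced in this report); it is only a hypothetical illustration of the kind of closed, quantified real-arithmetic statement Z3 can decide, here one whose infimum (0) is approached but never attained, the phenomenon the paper's title asks about.

```python
# Illustrative z3py check of a closed, quantified statement over the reals.
# The toy formula is an assumption for illustration; it is NOT the formula
# of Example 3.1 from the paper.
from z3 import Real, ForAll, Exists, Implies, And, Solver

x, eps = Real("x"), Real("eps")

# "For every eps > 0 there exists x with 0 < x < eps":
# the value 0 is approached arbitrarily well but never attained.
claim = ForAll([eps], Implies(eps > 0, Exists([x], And(x > 0, x < eps))))

s = Solver()
s.add(claim)
print(s.check())  # prints "sat": the closed formula holds over the reals
```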