ProxSGD: Training Structured Neural Networks under Regularization and Constraints

Authors: Yang Yang, Yaxiong Yuan, Avraam Chatzimichailidis, Ruud JG van Sloun, Lei Lei, Symeon Chatzinotas

ICLR 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Finally, to support the theoretical analysis and demonstrate the flexibility of ProxSGD, we show by extensive numerical tests how ProxSGD can be used to train either sparse or binary neural networks through an adequate selection of the regularization function and constraint set."
Researcher Affiliation | Collaboration | Yang Yang (Fraunhofer ITWM; Fraunhofer Center Machine Learning), yang.yang@itwm.fraunhofer.de; Yaxiong Yuan (University of Luxembourg), yaxiong.yuan@uni.lu; Avraam Chatzimichailidis (Fraunhofer ITWM; TU Kaiserslautern), avraam.chatzimichailidis@itwm.fraunhofer.de; Ruud JG van Sloun (Eindhoven University of Technology), r.j.g.v.sloun@tue.nl; Lei Lei and Symeon Chatzinotas (University of Luxembourg), {lei.lei, symeon.chatzinotas}@uni.lu
Pseudocode | Yes | "Algorithm 1 Proximal-type Stochastic Gradient Descent (ProxSGD) Method" (an illustrative sketch of this update step appears after the table)
Open Source Code | Yes | "The simulations in Sections 3.1 and 3.3 are implemented in TensorFlow and available at https://github.com/optyang/proxsgd. The simulations in Section 3.2 are implemented in PyTorch and available at https://github.com/cc-hpc-itwm/proxsgd."
Open Datasets | Yes | "We first consider the multiclass classification problem on CIFAR-10 dataset (Krizhevsky, 2009)"
Dataset Splits | No | No explicit description of dataset splits (e.g., percentages, sample counts, or methodology for splitting data into train, validation, and test sets) was found. The paper mentions using well-known datasets such as CIFAR-10, CIFAR-100, and MNIST, which have standard splits, but these splits are not detailed or cited within the text.
Hardware Specification | No | No specific hardware details (such as GPU models, CPU types, or memory) used for running the experiments were provided. The paper only states that the simulations were implemented in TensorFlow and PyTorch.
Software Dependencies | No | The paper states that the simulations are implemented in TensorFlow and PyTorch but does not provide version numbers for these or any other software dependencies.
Experiment Setup | Yes | "Following the parameter configurations of ADAM in Kingma & Ba (2015), AMSGrad in Reddi et al. (2018), and ADABound in Luo et al. (2019), we set ρ = 0.1, β = 0.999 and ϵ = 0.001 (see Table 1), which are uniform for all the algorithms and commonly used in practice. Note that we have also activated ℓ1-regularization for these algorithms in the built-in function in TensorFlow/PyTorch, which amounts to adding the subgradient of the ℓ1-norm to the gradient of the loss function. For the proposed ProxSGD, ϵ(t) and ρ(t) decrease over the iterations as follows: ϵ(t) = 0.06/(t + 4)^0.5, ρ(t) = 0.9/(t + 4)^0.5. Recall that the ℓ1-norm in the approximation subproblem naturally leads to the soft-thresholding proximal mapping, see (10). The regularization parameter µ in the soft-thresholding then permits controlling the sparsity of the parameter variable x; in this experiment we set µ = 5 × 10^-5." (The decay schedules and µ are restated in the code snippet after the table.)
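
For a concrete picture of the update that the paper's Algorithm 1 describes, the following is a minimal Python sketch of a generic proximal stochastic-gradient step with an ℓ1 regularizer, assembled only from the quotes above (momentum on the gradient, a soft-thresholding proximal mapping, a decaying step size). It is not a line-for-line reproduction of Algorithm 1; the names `prox_sgd_step` and `soft_threshold` and the scalar curvature `tau` are our own simplifications.

```python
import numpy as np

def soft_threshold(z, thr):
    """Proximal mapping of thr * ||.||_1: element-wise soft-thresholding."""
    return np.sign(z) * np.maximum(np.abs(z) - thr, 0.0)

def prox_sgd_step(x, v, grad, rho_t, eps_t, mu, tau=1.0):
    """One illustrative ProxSGD-style iteration (sketch, not the paper's exact Algorithm 1).

    x     : current parameters (np.ndarray)
    v     : running momentum estimate of the stochastic gradient
    grad  : stochastic gradient evaluated at x
    rho_t : momentum weight at iteration t (decaying)
    eps_t : step size at iteration t (decaying)
    mu    : l1 regularization weight (controls sparsity)
    tau   : quadratic coefficient of the approximation subproblem (scalar here)
    """
    # Momentum: exponentially weighted estimate of the gradient.
    v = (1.0 - rho_t) * v + rho_t * grad
    # With an l1 regularizer, the approximation subproblem has the
    # closed-form soft-thresholding solution around a scaled gradient step.
    x_hat = soft_threshold(x - v / tau, mu / tau)
    # Move toward the subproblem solution with step size eps_t.
    x_new = x + eps_t * (x_hat - x)
    return x_new, v
```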
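
The decay schedules and sparsity parameter quoted in the Experiment Setup row translate directly into code. The snippet below only restates those quoted values and feeds the hypothetical `prox_sgd_step` sketch above; the iteration counter `t` is assumed to start at 0.

```python
# Quoted schedules for the proposed ProxSGD (t is the iteration counter).
def eps_schedule(t):
    return 0.06 / (t + 4) ** 0.5   # step size eps(t)

def rho_schedule(t):
    return 0.9 / (t + 4) ** 0.5    # momentum weight rho(t)

MU = 5e-5  # l1 regularization weight mu used in the soft-thresholding

# First few values of the schedules, to show the decay.
for t in (0, 1, 2):
    print(t, round(eps_schedule(t), 4), round(rho_schedule(t), 4))

# Example usage with the sketch above (x, v, grad assumed given):
# x, v = prox_sgd_step(x, v, grad, rho_schedule(t), eps_schedule(t), MU)
```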