Learning Sparse Neural Networks through L_0 Regularization
Authors: Christos Louizos, Max Welling, Diederik P. Kingma
ICLR 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We perform various experiments to demonstrate the effectiveness of the resulting approach and regularizer. |
| Researcher Affiliation | Collaboration | Christos Louizos (University of Amsterdam; TNO, Intelligent Imaging) c.louizos@uva.nl; Max Welling (University of Amsterdam; CIFAR) m.welling@uva.nl; Diederik P. Kingma (OpenAI) dpkingma@openai.com |
| Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any explicit statement about releasing source code or links to a code repository. |
| Open Datasets | Yes | We validate the effectiveness of our method on two tasks. The first corresponds to the toy classification task of MNIST using a simple multilayer perceptron (MLP) with two hidden layers of size 300 and 100 (LeCun et al., 1998), and a simple convolutional network, the LeNet-5-Caffe. The second corresponds to the more modern task of CIFAR-10 and CIFAR-100 classification using Wide Residual Networks (Zagoruyko & Komodakis, 2016). (See the MLP sketch below the table.) |
| Dataset Splits | No | The paper mentions training and testing but does not explicitly describe a validation set split or how validation was performed. |
| Hardware Specification | No | The paper mentions that the minibatch was 'split between two GPUs' but does not provide any specific details about the GPU models or other hardware specifications. |
| Software Dependencies | No | The paper mentions using 'Adam' optimizer and referring to 'default hyper-parameters' but does not specify any software versions for libraries or frameworks used. |
| Experiment Setup | Yes | For all of our experiments we set γ = −0.1, ζ = 1.1 and, following the recommendations from Maddison et al. (2016), set β = 2/3 for the concrete distributions. We initialized the locations log α by sampling from a normal distribution with a standard deviation of 0.01 and a mean that yields α/(α+1) to be approximately equal to the original dropout rate employed at each of the networks. We used a single sample of the gate z for each minibatch of datapoints during the optimization, even though this can lead to larger variance in the gradients (Kingma et al., 2015). For these experiments we did no further regularization besides the L0 norm and optimization was done with Adam (Kingma & Ba, 2014) using the default hyper-parameters and temporal averaging. For optimization we employed the procedure described in Zagoruyko & Komodakis (2016) with a minibatch of 128 datapoints, which was split between two GPUs, and used a single sample for the gates for each GPU. (See the gate sketch below the table.) |
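
The experiment-setup row quotes the hard concrete gate hyperparameters (γ = −0.1, ζ = 1.1, β = 2/3) and the initialization of the locations log α. The following is a minimal sketch of such a gate under those settings; it assumes PyTorch (the paper names no framework), and the class name `HardConcreteGate` and the `drop_rate` argument are illustrative, not from the paper.

```python
import math
import torch
import torch.nn as nn

class HardConcreteGate(nn.Module):
    """Sketch of a hard concrete gate with the hyperparameters quoted above."""

    def __init__(self, n_gates, drop_rate=0.5, beta=2.0 / 3.0, gamma=-0.1, zeta=1.1):
        super().__init__()
        # log alpha ~ N(mean, 0.01), with the mean chosen so that
        # alpha / (alpha + 1) roughly matches the original dropout rate,
        # as stated in the experiment-setup row.
        mean = math.log(drop_rate) - math.log(1.0 - drop_rate)
        self.log_alpha = nn.Parameter(torch.randn(n_gates) * 0.01 + mean)
        self.beta, self.gamma, self.zeta = beta, gamma, zeta

    def forward(self):
        # A single sample of the gates z per minibatch, as in the quoted setup.
        u = torch.rand_like(self.log_alpha).clamp(1e-6, 1.0 - 1e-6)
        s = torch.sigmoid((torch.log(u) - torch.log(1.0 - u) + self.log_alpha) / self.beta)
        s_bar = s * (self.zeta - self.gamma) + self.gamma   # stretch to (gamma, zeta)
        return torch.clamp(s_bar, 0.0, 1.0)                 # hard-rectify to [0, 1]

    def l0_penalty(self):
        # Expected L0 norm: probability of each gate being nonzero, summed.
        return torch.sigmoid(
            self.log_alpha - self.beta * math.log(-self.gamma / self.zeta)
        ).sum()
```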
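
The dataset row mentions an MLP with two hidden layers of size 300 and 100 for MNIST. The sketch below wires the gate above into such an MLP; the gate placement (one gate per input/hidden unit), the dropout rates, and the L0 weight `lam` are illustrative assumptions, not the paper's exact configuration.

```python
class L0MLP(nn.Module):
    """Sketch of the 784-300-100-10 MNIST MLP with per-unit hard concrete gates."""

    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 300)
        self.fc2 = nn.Linear(300, 100)
        self.fc3 = nn.Linear(100, 10)
        # Dropout-rate choices here are illustrative placeholders.
        self.g_in = HardConcreteGate(784, drop_rate=0.2)
        self.g_h1 = HardConcreteGate(300, drop_rate=0.5)
        self.g_h2 = HardConcreteGate(100, drop_rate=0.5)

    def forward(self, x):
        x = x.view(-1, 784) * self.g_in()
        x = torch.relu(self.fc1(x)) * self.g_h1()
        x = torch.relu(self.fc2(x)) * self.g_h2()
        return self.fc3(x)

    def regularization(self, lam=0.1):
        # lam is an illustrative L0 weight; the paper tunes it per architecture.
        return lam * (self.g_in.l0_penalty() + self.g_h1.l0_penalty() + self.g_h2.l0_penalty())
```

During training, the objective would be the classification loss plus `model.regularization()`, drawing a fresh gate sample once per minibatch as the experiment-setup row describes.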