Learning Sparse Neural Networks through L_0 Regularization
Authors: Christos Louizos, Max Welling, Diederik P. Kingma
ICLR 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We perform various experiments to demonstrate the effectiveness of the resulting approach and regularizer. |
| Researcher Affiliation | Collaboration | Christos Louizos (University of Amsterdam; TNO, Intelligent Imaging) c.louizos@uva.nl; Max Welling (University of Amsterdam; CIFAR) m.welling@uva.nl; Diederik P. Kingma (OpenAI) dpkingma@openai.com |
| Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any explicit statement about releasing source code or links to a code repository. |
| Open Datasets | Yes | We validate the effectiveness of our method on two tasks. The first corresponds to the toy classification task of MNIST using a simple multilayer perceptron (MLP) with two hidden layers of size 300 and 100 (LeCun et al., 1998), and a simple convolutional network, the LeNet-5-Caffe. The second corresponds to the more modern task of CIFAR-10 and CIFAR-100 classification using Wide Residual Networks (Zagoruyko & Komodakis, 2016). (See the MLP sketch below the table.) |
| Dataset Splits | No | The paper mentions training and testing but does not explicitly describe a validation set split or how validation was performed. |
| Hardware Specification | No | The paper mentions that the minibatch was 'split between two GPUs' but does not provide any specific details about the GPU models or other hardware specifications. |
| Software Dependencies | No | The paper mentions using 'Adam' optimizer and referring to 'default hyper-parameters' but does not specify any software versions for libraries or frameworks used. |
| Experiment Setup | Yes | For all of our experiments we set γ = −0.1, ζ = 1.1 and, following the recommendations from Maddison et al. (2016), set β = 2/3 for the concrete distributions. We initialized the locations log α by sampling from a normal distribution with a standard deviation of 0.01 and a mean that yields α/(α+1) to be approximately equal to the original dropout rate employed at each of the networks. We used a single sample of the gate z for each minibatch of datapoints during the optimization, even though this can lead to larger variance in the gradients (Kingma et al., 2015). For these experiments we did no further regularization besides the L0 norm and optimization was done with Adam (Kingma & Ba, 2014) using the default hyper-parameters and temporal averaging. For optimization we employed the procedure described in Zagoruyko & Komodakis (2016) with a minibatch of 128 datapoints, which was split between two GPUs, and used a single sample for the gates for each GPU. (See the gate sketch below the table.) |
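
The experiment-setup row quotes the hard concrete gate hyperparameters (γ = −0.1, ζ = 1.1, β = 2/3) and the initialization of the locations log α. The following is a minimal sketch of such a gate under those settings; it assumes PyTorch (the paper names no framework), and the class name `HardConcreteGate` and the `drop_rate` argument are illustrative, not from the paper.

```python
import math
import torch
import torch.nn as nn

class HardConcreteGate(nn.Module):
    """Sketch of a hard concrete gate with the hyperparameters quoted above."""

    def __init__(self, n_gates, drop_rate=0.5, beta=2.0 / 3.0, gamma=-0.1, zeta=1.1):
        super().__init__()
        # log alpha ~ N(mean, 0.01), with the mean chosen so that
        # alpha / (alpha + 1) roughly matches the original dropout rate,
        # as stated in the experiment-setup row.
        mean = math.log(drop_rate) - math.log(1.0 - drop_rate)
        self.log_alpha = nn.Parameter(torch.randn(n_gates) * 0.01 + mean)
        self.beta, self.gamma, self.zeta = beta, gamma, zeta

    def forward(self):
        # A single sample of the gates z per minibatch, as in the quoted setup.
        u = torch.rand_like(self.log_alpha).clamp(1e-6, 1.0 - 1e-6)
        s = torch.sigmoid((torch.log(u) - torch.log(1.0 - u) + self.log_alpha) / self.beta)
        s_bar = s * (self.zeta - self.gamma) + self.gamma   # stretch to (gamma, zeta)
        return torch.clamp(s_bar, 0.0, 1.0)                 # hard-rectify to [0, 1]

    def l0_penalty(self):
        # Expected L0 norm: probability of each gate being nonzero, summed.
        return torch.sigmoid(
            self.log_alpha - self.beta * math.log(-self.gamma / self.zeta)
        ).sum()
```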
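
The dataset row mentions an MLP with two hidden layers of size 300 and 100 for MNIST. The sketch below wires the gate above into such an MLP; the gate placement (one gate per input/hidden unit), the dropout rates, and the L0 weight `lam` are illustrative assumptions, not the paper's exact configuration.

```python
class L0MLP(nn.Module):
    """Sketch of the 784-300-100-10 MNIST MLP with per-unit hard concrete gates."""

    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 300)
        self.fc2 = nn.Linear(300, 100)
        self.fc3 = nn.Linear(100, 10)
        # Dropout-rate choices here are illustrative placeholders.
        self.g_in = HardConcreteGate(784, drop_rate=0.2)
        self.g_h1 = HardConcreteGate(300, drop_rate=0.5)
        self.g_h2 = HardConcreteGate(100, drop_rate=0.5)

    def forward(self, x):
        x = x.view(-1, 784) * self.g_in()
        x = torch.relu(self.fc1(x)) * self.g_h1()
        x = torch.relu(self.fc2(x)) * self.g_h2()
        return self.fc3(x)

    def regularization(self, lam=0.1):
        # lam is an illustrative L0 weight; the paper tunes it per architecture.
        return lam * (self.g_in.l0_penalty() + self.g_h1.l0_penalty() + self.g_h2.l0_penalty())
```

During training, the objective would be the classification loss plus `model.regularization()`, drawing a fresh gate sample once per minibatch as the experiment-setup row describes.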