Variational Dropout Sparsifies Deep Neural Networks

Authors: Dmitry Molchanov, Arsenii Ashukha, Dmitry Vetrov

ICML 2017

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We perform experiments on classification tasks and use different neural network architectures including architectures with a combination of batch normalization and dropout layers.
Researcher Affiliation | Collaboration | Yandex, Russia; Skolkovo Institute of Science and Technology, Skolkovo Innovation Center, Moscow, Russia; National Research University Higher School of Economics, Moscow, Russia; Moscow Institute of Physics and Technology, Moscow, Russia.
Pseudocode | No | The paper provides mathematical expressions for its calculations (e.g., equations 17 and 18) but does not include a clearly labeled pseudocode block or algorithm steps.
Open Source Code | Yes | Lasagne and PyTorch source code of Sparse Variational Dropout layers is available at https://goo.gl/2D4tFW.
Open Datasets | Yes | We compare our method with other methods of training sparse neural networks on the MNIST dataset using a fully-connected architecture LeNet-300-100 and a convolutional architecture LeNet-5-Caffe. We use CIFAR-10 and CIFAR-100 for evaluation.
Dataset Splits | No | The paper mentions training and testing on datasets such as MNIST and CIFAR but does not explicitly describe a separate validation split, its size, or the methodology used for hyperparameter tuning.
Hardware Specification | No | The paper does not describe the hardware (e.g., specific GPU/CPU models, memory, or cloud instance types) used to run its experiments.
Software Dependencies | No | The paper names Lasagne and PyTorch as frameworks and Adam as the optimizer but does not provide version numbers for any software dependencies.
Experiment Setup | Yes | When we start from a random initialization, we train for 200 epochs and linearly decay the learning rate from 10^-4 to zero. When we start from a pre-trained model, we finetune for 10-30 epochs with learning rate 10^-5. We train all networks using Adam (Kingma & Ba, 2014).
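
The schedule quoted in the Experiment Setup row maps onto a standard optimizer configuration. Below is a minimal, hypothetical PyTorch sketch of the from-scratch setting (Adam, 200 epochs, learning rate decayed linearly from 10^-4 to zero); the stand-in model and random data are placeholders for illustration only and do not use the authors' released Sparse Variational Dropout layers.

```python
# Minimal sketch of the training schedule quoted above, assuming a stand-in
# model and random data; the authors' released Lasagne/PyTorch code is not used.
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

# Stand-in for LeNet-300-100 on flattened 28x28 MNIST images.
model = nn.Sequential(
    nn.Linear(784, 300), nn.ReLU(),
    nn.Linear(300, 100), nn.ReLU(),
    nn.Linear(100, 10),
)

# Random placeholder data so the sketch runs end-to-end; swap in MNIST.
train_loader = DataLoader(
    TensorDataset(torch.randn(256, 784), torch.randint(0, 10, (256,))),
    batch_size=64, shuffle=True,
)

epochs = 200    # from-scratch setting reported in the paper
base_lr = 1e-4  # decayed linearly to zero over training
optimizer = optim.Adam(model.parameters(), lr=base_lr)
scheduler = optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda epoch: 1.0 - epoch / epochs)
criterion = nn.CrossEntropyLoss()

for epoch in range(epochs):
    for x, y in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        # With Sparse Variational Dropout layers, the (approximate) KL
        # regularizer would be added to the loss at this point.
        loss.backward()
        optimizer.step()
    scheduler.step()  # linear decay: factor goes from 1 toward 0 over 200 epochs
```

For the fine-tuning setting described in the same row, the loop would instead start from a pre-trained model and run for 10-30 epochs at a fixed learning rate of 10^-5.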