Variational Dropout Sparsifies Deep Neural Networks
Authors: Dmitry Molchanov, Arsenii Ashukha, Dmitry Vetrov
ICML 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We perform experiments on classification tasks and use different neural network architectures including architectures with a combination of batch normalization and dropout layers. |
| Researcher Affiliation | Collaboration | ¹Yandex, Russia; ²Skolkovo Institute of Science and Technology, Skolkovo Innovation Center, Moscow, Russia; ³National Research University Higher School of Economics, Moscow, Russia; ⁴Moscow Institute of Physics and Technology, Moscow, Russia. |
| Pseudocode | No | The paper provides mathematical expressions for calculations (e.g., equations 17 and 18) but does not include a clearly labeled pseudocode block or algorithm steps. |
| Open Source Code | Yes | Lasagne and PyTorch source code of Sparse Variational Dropout layers is available at https://goo.gl/2D4tFW. |
| Open Datasets | Yes | We compare our method with other methods of training sparse neural networks on the MNIST dataset using a fully-connected architecture LeNet-300-100 and a convolutional architecture LeNet-5-Caffe. We use CIFAR-10 and CIFAR-100 for evaluation. |
| Dataset Splits | No | The paper mentions training and testing on datasets like MNIST and CIFAR, but does not explicitly state the use of a separate validation dataset split or its size/methodology for hyperparameter tuning. |
| Hardware Specification | No | The paper does not explicitly describe the hardware (e.g., specific GPU/CPU models, memory details, or cloud instance types) used to run its experiments. |
| Software Dependencies | No | The paper mentions 'Lasagne and PyTorch' as frameworks and 'Adam' as an optimizer but does not provide specific version numbers for any software dependencies. |
| Experiment Setup | Yes | When we start from a random initialization, we train for 200 epochs and linearly decay the learning rate from 10⁻⁴ to zero. When we start from a pre-trained model, we fine-tune for 10-30 epochs with learning rate 10⁻⁵. We train all networks using Adam (Kingma & Ba, 2014). |
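
The Open Source Code row points to the authors' released Lasagne and PyTorch layers. As a rough illustration only, the sketch below (not the code at https://goo.gl/2D4tFW) shows what such a layer could look like in PyTorch: the additive-noise parameterization with per-weight log α, the local reparameterization trick at training time, the paper's approximation of the KL term, and pruning of weights with log α above 3 at test time. The class name `SparseVDLinear`, the initialization, and the clamping constants are assumptions made for this sketch.

```python
# Minimal sketch of a sparse variational dropout linear layer in PyTorch.
# NOT the authors' released implementation (see https://goo.gl/2D4tFW);
# it only illustrates the parameterization, the local reparameterization
# trick, and the KL approximation described in the paper.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseVDLinear(nn.Module):
    def __init__(self, in_features, out_features, threshold=3.0):
        super().__init__()
        self.theta = nn.Parameter(torch.empty(out_features, in_features))
        self.log_sigma2 = nn.Parameter(torch.full((out_features, in_features), -10.0))
        self.bias = nn.Parameter(torch.zeros(out_features))
        self.threshold = threshold  # weights with log(alpha) > 3 are pruned
        nn.init.kaiming_uniform_(self.theta, a=math.sqrt(5))  # assumed init

    @property
    def log_alpha(self):
        # alpha_ij = sigma_ij^2 / theta_ij^2, clipped for numerical stability
        return torch.clamp(
            self.log_sigma2 - 2.0 * torch.log(torch.abs(self.theta) + 1e-8),
            -10.0, 10.0)

    def forward(self, x):
        if self.training:
            # Local reparameterization: sample pre-activations, not weights.
            mean = F.linear(x, self.theta, self.bias)
            var = F.linear(x * x, torch.exp(self.log_sigma2)) + 1e-8
            return mean + torch.sqrt(var) * torch.randn_like(mean)
        # At test time, use the posterior mean and drop high-alpha weights.
        mask = (self.log_alpha < self.threshold).float()
        return F.linear(x, self.theta * mask, self.bias)

    def kl(self):
        # Paper's approximation of -KL(q || prior):
        # k1*sigmoid(k2 + k3*log_alpha) - 0.5*log(1 + 1/alpha) - k1
        k1, k2, k3 = 0.63576, 1.87320, 1.48695
        la = self.log_alpha
        neg_kl = k1 * torch.sigmoid(k2 + k3 * la) - 0.5 * F.softplus(-la) - k1
        return -neg_kl.sum()
```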
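
To make the quoted experiment setup concrete, here is a hedged PyTorch sketch of the reported schedule (Adam, 200 epochs from a random initialization, learning rate decayed linearly from 10⁻⁴ to zero). The loss composition with a `kl_weight` factor is a simplification for illustration, not the paper's exact stochastic variational lower bound.

```python
# Sketch of the reported optimization schedule; model and train_loader are
# placeholders, and the objective below is a simplified stand-in.
import torch
import torch.nn.functional as F

def train_from_scratch(model, train_loader, epochs=200, base_lr=1e-4, kl_weight=1.0):
    optimizer = torch.optim.Adam(model.parameters(), lr=base_lr)
    # Linear decay of the learning rate from base_lr down to zero.
    scheduler = torch.optim.lr_scheduler.LambdaLR(
        optimizer, lr_lambda=lambda epoch: 1.0 - epoch / epochs)
    for _ in range(epochs):
        for x, y in train_loader:
            optimizer.zero_grad()
            # Data term plus the KL of every sparse variational dropout layer.
            loss = F.cross_entropy(model(x), y, reduction='sum')
            loss = loss + kl_weight * sum(
                m.kl() for m in model.modules() if hasattr(m, 'kl'))
            loss.backward()
            optimizer.step()
        scheduler.step()
    # For a pre-trained model, the paper instead reports fine-tuning for
    # 10-30 epochs at a fixed learning rate of 1e-5.
```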