meProp: Sparsified Back Propagation for Accelerated Deep Learning with Reduced Overfitting
Authors: Xu Sun, Xuancheng Ren, Shuming Ma, Houfeng Wang
ICML 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results demonstrate that we can update only 1–4% of the weights at each back propagation pass. This does not result in a larger number of training iterations. More interestingly, the accuracy of the resulting models is actually improved rather than degraded, and a detailed analysis is given. (A minimal sketch of this top-k sparsified back propagation appears below the table.) |
| Researcher Affiliation | Academia | (1) School of Electronics Engineering and Computer Science, Peking University, China; (2) MOE Key Laboratory of Computational Linguistics, Peking University, China. |
| Pseudocode | No | No section or figure explicitly labeled "Pseudocode" or "Algorithm" was found. |
| Open Source Code | No | The paper states "We have coded two neural network models... on our own." and "We also have an implementation based on the PyTorch framework for GPU-based experiments." but does not provide any link or explicit statement about making the code publicly available. |
| Open Datasets | Yes | Part-of-Speech Tagging (POS-Tag): We use the standard benchmark dataset in prior work (Collins, 2002), which is derived from the Penn Treebank corpus, and use sections 0-18 of the Wall Street Journal (WSJ) for training (38,219 examples), and sections 22-24 for testing (5,462 examples). Transition-based Dependency Parsing (Parsing): Following prior work, we use the English Penn Treebank (PTB) (Marcus et al., 1993) for evaluation. MNIST Image Recognition (MNIST): We use the MNIST handwritten digit dataset (LeCun et al., 1998) for evaluation. |
| Dataset Splits | Yes | We use sections 0-18 of the Wall Street Journal (WSJ) for training (38,219 examples), and sections 22-24 for testing (5,462 examples). We follow the standard split of the corpus and use sections 2-21 as the training set (39,832 sentences, 1,900,056 transition examples), section 22 as the development set (1,700 sentences, 80,234 transition examples) and section 23 as the final test set (2,416 sentences, 113,368 transition examples). MNIST consists of 60,000 28×28 pixel training images and additional 10,000 test examples. Each image contains a single numerical digit (0-9). We select the first 5,000 images of the training images as the development set and the rest as the training set. (The MNIST split is sketched in code below the table.) |
| Hardware Specification | Yes | The experiments on CPU are conducted on a computer with the Intel(R) Xeon(R) 3.0GHz CPU. The experiments on GPU are conducted on an NVIDIA GeForce GTX 1080. |
| Software Dependencies | No | The paper mentions software like "C#" for the framework and the "PyTorch framework" but does not specify version numbers for these software components or any other libraries. |
| Experiment Setup | Yes | We set the dimension of the hidden layers to 500 for all the tasks. ... For the Adam optimization method, we find the default hyper-parameters work well on development sets, which are as follows: the learning rate α = 0.001, and β1 = 0.9, β2 = 0.999, ϵ = 1 × 10⁻⁸. For the AdaGrad learner, the learning rate is set to α = 0.01, 0.01, 0.1 for POS-Tag, Parsing, and MNIST, respectively, and ϵ = 1 × 10⁻⁶. ... we set the mini-batch size to 1 (sentence), 10,000 (transition examples), and 10 (images) for POS-Tag, Parsing, and MNIST, respectively. (These settings are collected in a configuration sketch below the table.) |
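
The technique reported in the Research Type row keeps only the largest-magnitude components of the gradient during back propagation. Below is a minimal PyTorch-style sketch of such a top-k sparsified backward pass for a linear layer; it is an illustration based on the paper's description, not the authors' code, and the class and parameter names are our own.

```python
import torch


class MePropLinear(torch.autograd.Function):
    """Linear layer whose backward pass keeps only the top-k
    gradient components per example (illustrative sketch)."""

    @staticmethod
    def forward(ctx, x, weight, k):
        ctx.save_for_backward(x, weight)
        ctx.k = k
        return x.matmul(weight.t())

    @staticmethod
    def backward(ctx, grad_output):
        x, weight = ctx.saved_tensors
        k = ctx.k
        if 0 < k < grad_output.size(1):
            # Keep the k largest-magnitude entries of the output gradient
            # for each example and zero out the rest.
            _, idx = grad_output.abs().topk(k, dim=1)
            mask = torch.zeros_like(grad_output)
            mask.scatter_(1, idx, 1.0)
            grad_output = grad_output * mask
        grad_x = grad_output.matmul(weight)        # (batch, in_features)
        grad_weight = grad_output.t().matmul(x)    # (out_features, in_features)
        return grad_x, grad_weight, None           # no gradient w.r.t. k
```

With k set to roughly 1–4% of the 500-unit hidden dimension and a mini-batch of size 1, only k rows of the weight gradient are non-zero, which is what makes the sparse update possible; a call would look like `y = MePropLinear.apply(x, weight, 20)`.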
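
For the MNIST split quoted in the Dataset Splits row (first 5,000 training images as the development set, the rest for training), a possible loader looks like the following; torchvision is our assumption and is not mentioned in the paper.

```python
from torch.utils.data import Subset
from torchvision import datasets, transforms

# MNIST: 60,000 training and 10,000 test images; the paper takes the first
# 5,000 training images as the development set and trains on the remainder.
to_tensor = transforms.ToTensor()
full_train = datasets.MNIST("data", train=True, download=True, transform=to_tensor)
dev_set = Subset(full_train, range(5_000))
train_set = Subset(full_train, range(5_000, len(full_train)))
test_set = datasets.MNIST("data", train=False, transform=to_tensor)
```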
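
The optimizer settings in the Experiment Setup row map directly onto standard PyTorch optimizers; the sketch below collects them in one place (the dictionary keys and helper name are ours, not the paper's).

```python
import torch

HIDDEN_SIZE = 500  # hidden layer dimension for all tasks
ADAM_KWARGS = dict(lr=1e-3, betas=(0.9, 0.999), eps=1e-8)      # reported Adam defaults
ADAGRAD_LR = {"pos_tag": 0.01, "parsing": 0.01, "mnist": 0.1}  # reported AdaGrad rates
ADAGRAD_EPS = 1e-6
BATCH_SIZE = {"pos_tag": 1, "parsing": 10_000, "mnist": 10}


def make_optimizer(params, task, method="adam"):
    """Return an optimizer configured with the hyper-parameters quoted above."""
    if method == "adam":
        return torch.optim.Adam(params, **ADAM_KWARGS)
    return torch.optim.Adagrad(params, lr=ADAGRAD_LR[task], eps=ADAGRAD_EPS)
```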