Learning Recurrent Binary/Ternary Weights
Authors: Arash Ardakani, Zhengyun Ji, Sean C. Smithson, Brett H. Meyer, Warren J. Gross
ICLR 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | On the software side, we evaluate the performance (in terms of accuracy) of our method using long short-term memories (LSTMs) and gated recurrent units (GRUs) on various sequential models including sequence classification and language modeling. We demonstrate that our method achieves competitive results on the aforementioned tasks while using binary/ternary weights during the runtime. |
| Researcher Affiliation | Academia | Department of Electrical and Computer Engineering, McGill University, Montreal, Canada |
| Pseudocode | Yes | Algorithm 1: Training with recurrent binary/ternary weights. l is the cross entropy loss function. B/T() specifies the binarization/ternarization function. The batch normalization transform is also denoted by BN(·; φ, γ). L and T are also the number of LSTM layers and time steps, respectively. Data: Full-precision LSTM parameters Wfh, Wih, Woh, Wgh, Wfx, Wix, Wox, Wgx, bf, bi, bo and bg for each layer. Batch normalization parameters for hidden-to-hidden and input-to-hidden states. The classifier parameters Ws and bs. Input data x1, its corresponding targets y for each minibatch. (A hedged sketch of a B/T()-style quantizer appears after the table.) |
| Open Source Code | Yes | The codes for these tasks are available online at https://github.com/arashardakani/Learning-Recurrent-Binary-Ternary-Weights |
| Open Datasets | Yes | For the character-level modeling, the goal is to predict the next character and the performance is evaluated on bits per character (BPC) where lower BPC is desirable. We conduct quantization experiments on Penn Treebank (Marcus et al. (1993)), War & Peace (Karpathy et al. (2015)) and Linux Kernel (Karpathy et al. (2015)) corpora. For Penn Treebank dataset, we use a similar LSTM model configuration and data preparation to Mikolov et al. (2012). For War & Peace and Linux Kernel datasets, we also follow the LSTM model configurations and settings in (Karpathy et al. (2015)). |
| Dataset Splits | Yes | Penn Treebank: Similar to Mikolov et al. (2012), we split the Penn Treebank corpus into 5017k, 393k and 442k training, validation and test characters, respectively. (A small slicing helper illustrating this split is sketched after the table.) |
| Hardware Specification | No | The paper discusses custom hardware implementations and ASIC architectures as a *result* of their method (e.g., 'We implemented our low-power inference engine... in TSMC 65-nm CMOS technology'). However, it does not specify the hardware (e.g., specific CPU, GPU models, or cloud computing resources) used to run the deep learning training and evaluation *experiments* described in Section 5. |
| Software Dependencies | No | The paper mentions using specific training rules like 'ADAM learning rule' and 'Stochastic gradient descent', but it does not provide specific version numbers for any programming languages (e.g., Python), deep learning frameworks (e.g., TensorFlow, PyTorch), or other libraries (e.g., scikit-learn, NumPy) that were used to implement and run the experiments. |
| Experiment Setup | Yes | For Penn Treebank: 'The cross entropy loss is minimized on minibatches of size 64 while using ADAM learning rule. We use a learning rate of 0.002.' For Linux Kernel and War & Peace: 'We use one LSTM layer of size 512 followed by a softmax classifier layer. We use an exponentially decaying learning rate initialized with 0.002. ADAM learning rule is also used as the update rule.' For Text8: 'For this task, we use one LSTM layer of size 2000 and train it on sequences of length 180 with minibatches of size 128. The learning rate of 0.001 is used and the update rule is determined by ADAM.' For word-level Penn Treebank: 'We start the training with a learning rate of 20. We then divide it by 4 every time we see an increase in the validation perplexity value. The model is trained with the word sequence length of 35 and the dropout probability of 0.5, 0.65 and 0.65 for the small, medium and large models, respectively. Stochastic gradient descent is also used to train our model while the gradient norm is clipped at 0.25.' For MNIST: '...ADAM step rule with learning rate of 0.001.' For CNN QA: '...bidirectional LSTM with unit size of 256. We also use minibatches of size 128 and ADAM learning rule. We use an exponentially decaying learning rate initialized with 0.003.' (A minimal training-configuration sketch based on these quoted hyperparameters follows the table.) |
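
The quoted Algorithm 1 quantizes the full-precision recurrent weights with a B/T() function at every forward pass while the gradient update is applied to the full-precision copies. The sketch below is a minimal, hedged illustration of such a quantizer with a straight-through estimator in PyTorch; the function names, the fixed ternary threshold of 0.05, and the layer wrapper are assumptions made for illustration and are not taken from the authors' released code.

```python
import torch

def binarize(w_full):
    # Deterministic binarization via the sign of the full-precision weights.
    # Illustrative B/T()-style rule; the authors' exact scaling may differ.
    return torch.sign(w_full)

def ternarize(w_full, threshold=0.05):
    # Ternarization to {-1, 0, +1}; the threshold value here is a placeholder.
    w_tern = torch.zeros_like(w_full)
    w_tern[w_full > threshold] = 1.0
    w_tern[w_full < -threshold] = -1.0
    return w_tern

class QuantizedLinear(torch.nn.Linear):
    # Linear projection that uses quantized weights in the forward pass but
    # routes gradients to the full-precision weights (straight-through estimator).
    def forward(self, x):
        w_q = binarize(self.weight)  # swap in ternarize() for ternary weights
        w_ste = self.weight + (w_q - self.weight).detach()
        return torch.nn.functional.linear(x, w_ste, self.bias)
```

Used in place of the input-to-hidden and hidden-to-hidden projections of an LSTM cell, such a layer reproduces the pattern the pseudocode describes: binary/ternary weights at run time, full-precision weights only during the update.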
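
The Penn Treebank character split quoted above (5017k / 393k / 442k training, validation and test characters) can be reproduced by simple slicing. The helper below is only a sketch; its name and the assumption that the corpus is a single character string are not from the authors' code.

```python
def split_ptb_chars(corpus: str):
    # Split a character corpus into train/valid/test partitions using the
    # sizes quoted for Penn Treebank: 5017k / 393k / 442k characters.
    n_train, n_valid = 5_017_000, 393_000
    train = corpus[:n_train]
    valid = corpus[n_train:n_train + n_valid]
    test = corpus[n_train + n_valid:]
    return train, valid, test
```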
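
The experiment-setup quotes fix the optimizer, minibatch size and learning rate per task. The sketch below wires the character-level Penn Treebank values (cross entropy, minibatches of 64, ADAM, learning rate 0.002) into a plain PyTorch training step; it deliberately omits the paper's weight quantization and batch normalization, and the vocabulary and hidden sizes are placeholders rather than the authors' configuration.

```python
import torch

vocab_size = 50        # placeholder, not the paper's value
hidden_size = 1000     # placeholder, not the paper's value
batch_size = 64        # quoted minibatch size
learning_rate = 0.002  # quoted ADAM learning rate

model = torch.nn.LSTM(input_size=vocab_size, hidden_size=hidden_size, batch_first=True)
classifier = torch.nn.Linear(hidden_size, vocab_size)
optimizer = torch.optim.Adam(
    list(model.parameters()) + list(classifier.parameters()), lr=learning_rate
)
criterion = torch.nn.CrossEntropyLoss()

def train_step(x_onehot, targets):
    # One minibatch update: minimize cross entropy with ADAM, as quoted.
    # x_onehot: (batch, time, vocab_size); targets: (batch, time) integer ids.
    optimizer.zero_grad()
    outputs, _ = model(x_onehot)           # (batch, time, hidden_size)
    logits = classifier(outputs)           # (batch, time, vocab_size)
    loss = criterion(logits.reshape(-1, vocab_size), targets.reshape(-1))
    loss.backward()
    optimizer.step()
    return loss.item()
```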