Learning Recurrent Binary/Ternary Weights
Authors: Arash Ardakani, Zhengyun Ji, Sean C. Smithson, Brett H. Meyer, Warren J. Gross
ICLR 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | On the software side, we evaluate the performance (in terms of accuracy) of our method using long short-term memories (LSTMs) and gated recurrent units (GRUs) on various sequential models including sequence classification and language modeling. We demonstrate that our method achieves competitive results on the aforementioned tasks while using binary/ternary weights during the runtime. |
| Researcher Affiliation | Academia | Department of Electrical and Computer Engineering, McGill University, Montreal, Canada |
| Pseudocode | Yes | Algorithm 1: Training with recurrent binary/ternary weights. l is the cross entropy loss function. B/T() specifies the binarization/ternarization function. The batch normalization transform is also denoted by BN(·; φ, γ). L and T are also the number of LSTM layers and time steps, respectively. Data: Full-precision LSTM parameters Wfh, Wih, Woh, Wgh, Wfx, Wix, Wox, Wgx, bf, bi, bo and bg for each layer. Batch normalization parameters for hidden-to-hidden and input-to-hidden states. The classifier parameters Ws and bs. Input data x1, its corresponding targets y for each minibatch. (A hedged sketch of a B/T()-style quantizer appears after the table.) |
| Open Source Code | Yes | The codes for these tasks are available online at https://github.com/arashardakani/Learning-Recurrent-Binary-Ternary-Weights |
| Open Datasets | Yes | For the character-level modeling, the goal is to predict the next character and the performance is evaluated on bits per character (BPC) where lower BPC is desirable. We conduct quantization experiments on Penn Treebank (Marcus et al. (1993)), War & Peace (Karpathy et al. (2015)) and Linux Kernel (Karpathy et al. (2015)) corpora. For Penn Treebank dataset, we use a similar LSTM model configuration and data preparation to Mikolov et al. (2012). For War & Peace and Linux Kernel datasets, we also follow the LSTM model configurations and settings in (Karpathy et al. (2015)). |
| Dataset Splits | Yes | Penn Treebank: Similar to Mikolov et al. (2012), we split the Penn Treebank corpus into 5017k, 393k and 442k training, validation and test characters, respectively. (A small slicing helper illustrating this split is sketched after the table.) |
| Hardware Specification | No | The paper discusses custom hardware implementations and ASIC architectures as a *result* of their method (e.g., 'We implemented our low-power inference engine... in TSMC 65-nm CMOS technology'). However, it does not specify the hardware (e.g., specific CPU, GPU models, or cloud computing resources) used to run the deep learning training and evaluation *experiments* described in Section 5. |
| Software Dependencies | No | The paper mentions using specific training rules like 'ADAM learning rule' and 'Stochastic gradient descent', but it does not provide specific version numbers for any programming languages (e.g., Python), deep learning frameworks (e.g., TensorFlow, PyTorch), or other libraries (e.g., scikit-learn, NumPy) that were used to implement and run the experiments. |
| Experiment Setup | Yes | For Penn Treebank: 'The cross entropy loss is minimized on minibatches of size 64 while using ADAM learning rule. We use a learning rate of 0.002.' For Linux Kernel and War & Peace: 'We use one LSTM layer of size 512 followed by a softmax classifier layer. We use an exponentially decaying learning rate initialized with 0.002. ADAM learning rule is also used as the update rule.' For Text8: 'For this task, we use one LSTM layer of size 2000 and train it on sequences of length 180 with minibatches of size 128. The learning rate of 0.001 is used and the update rule is determined by ADAM.' For word-level Penn Treebank: 'We start the training with a learning rate of 20. We then divide it by 4 every time we see an increase in the validation perplexity value. The model is trained with the word sequence length of 35 and the dropout probability of 0.5, 0.65 and 0.65 for the small, medium and large models, respectively. Stochastic gradient descent is also used to train our model while the gradient norm is clipped at 0.25.' For MNIST: '...ADAM step rule with learning rate of 0.001.' For CNN QA: '...bidirectional LSTM with unit size of 256. We also use minibatches of size 128 and ADAM learning rule. We use an exponentially decaying learning rate initialized with 0.003.' (A minimal training-configuration sketch based on these quoted hyperparameters follows the table.) |
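
The quoted Algorithm 1 quantizes the full-precision recurrent weights with a B/T() function at every forward pass while the gradient update is applied to the full-precision copies. The sketch below is a minimal, hedged illustration of such a quantizer with a straight-through estimator in PyTorch; the function names, the fixed ternary threshold of 0.05, and the layer wrapper are assumptions made for illustration and are not taken from the authors' released code.

```python
import torch

def binarize(w_full):
    # Deterministic binarization via the sign of the full-precision weights.
    # Illustrative B/T()-style rule; the authors' exact scaling may differ.
    return torch.sign(w_full)

def ternarize(w_full, threshold=0.05):
    # Ternarization to {-1, 0, +1}; the threshold value here is a placeholder.
    w_tern = torch.zeros_like(w_full)
    w_tern[w_full > threshold] = 1.0
    w_tern[w_full < -threshold] = -1.0
    return w_tern

class QuantizedLinear(torch.nn.Linear):
    # Linear projection that uses quantized weights in the forward pass but
    # routes gradients to the full-precision weights (straight-through estimator).
    def forward(self, x):
        w_q = binarize(self.weight)  # swap in ternarize() for ternary weights
        w_ste = self.weight + (w_q - self.weight).detach()
        return torch.nn.functional.linear(x, w_ste, self.bias)
```

Used in place of the input-to-hidden and hidden-to-hidden projections of an LSTM cell, such a layer reproduces the pattern the pseudocode describes: binary/ternary weights at run time, full-precision weights only during the update.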
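
The Penn Treebank character split quoted above (5017k / 393k / 442k training, validation and test characters) can be reproduced by simple slicing. The helper below is only a sketch; its name and the assumption that the corpus is a single character string are not from the authors' code.

```python
def split_ptb_chars(corpus: str):
    # Split a character corpus into train/valid/test partitions using the
    # sizes quoted for Penn Treebank: 5017k / 393k / 442k characters.
    n_train, n_valid = 5_017_000, 393_000
    train = corpus[:n_train]
    valid = corpus[n_train:n_train + n_valid]
    test = corpus[n_train + n_valid:]
    return train, valid, test
```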
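
The experiment-setup quotes fix the optimizer, minibatch size and learning rate per task. The sketch below wires the character-level Penn Treebank values (cross entropy, minibatches of 64, ADAM, learning rate 0.002) into a plain PyTorch training step; it deliberately omits the paper's weight quantization and batch normalization, and the vocabulary and hidden sizes are placeholders rather than the authors' configuration.

```python
import torch

vocab_size = 50        # placeholder, not the paper's value
hidden_size = 1000     # placeholder, not the paper's value
batch_size = 64        # quoted minibatch size
learning_rate = 0.002  # quoted ADAM learning rate

model = torch.nn.LSTM(input_size=vocab_size, hidden_size=hidden_size, batch_first=True)
classifier = torch.nn.Linear(hidden_size, vocab_size)
optimizer = torch.optim.Adam(
    list(model.parameters()) + list(classifier.parameters()), lr=learning_rate
)
criterion = torch.nn.CrossEntropyLoss()

def train_step(x_onehot, targets):
    # One minibatch update: minimize cross entropy with ADAM, as quoted.
    # x_onehot: (batch, time, vocab_size); targets: (batch, time) integer ids.
    optimizer.zero_grad()
    outputs, _ = model(x_onehot)           # (batch, time, hidden_size)
    logits = classifier(outputs)           # (batch, time, vocab_size)
    loss = criterion(logits.reshape(-1, vocab_size), targets.reshape(-1))
    loss.backward()
    optimizer.step()
    return loss.item()
```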