Precision Gating: Improving Neural Network Efficiency with Dynamic Dual-Precision Activations

Authors: Yichi Zhang, Ritchie Zhao, Weizhe Hua, Nayun Xu, G. Edward Suh, Zhiru Zhang

ICLR 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments indicate that PG achieves excellent results on CNNs, including statically compressed mobile-friendly networks such as ShuffleNet. Compared to the state-of-the-art prediction-based quantization schemes, PG achieves the same or higher accuracy with 2.4× less compute on ImageNet. PG furthermore applies to RNNs. Compared to 8-bit uniform quantization, PG obtains a 1.2% improvement in perplexity per word with a 2.7× computational cost reduction on an LSTM on the Penn Treebank dataset.
Researcher Affiliation | Academia | Yichi Zhang (Cornell University, yz2499@cornell.edu); Ritchie Zhao (Cornell University, rz252@cornell.edu); Weizhe Hua (Cornell University, wh399@cornell.edu); Nayun Xu (Cornell Tech, nx38@cornell.edu); G. Edward Suh (Cornell University, edward.suh@cornell.edu); Zhiru Zhang (Cornell University, zhiruz@cornell.edu)
Pseudocode | No | No pseudocode or algorithm blocks found. The paper describes the method using equations and explanatory text.
Open Source Code | No | The paper states only that "We will release the source code on the authors' website."
Open Datasets | Yes | We evaluate PG using ResNet-20 (He et al., 2016a) and ShiftNet-20 (Wu et al., 2018a) on CIFAR-10 (Krizhevsky & Hinton, 2009), and ShuffleNet V2 (Ma et al., 2018) on ImageNet (Deng et al., 2009). We also test an LSTM model (Hochreiter & Schmidhuber, 1997) on the Penn Treebank (PTB) (Marcus et al., 1993) corpus.
Dataset Splits | No | The paper gives training details but no explicit train/validation/test splits. For example, "On CIFAR-10, the batch size is 128, and the models are trained for 200 epochs." does not specify the splits.
Hardware Specification | Yes | All experiments are conducted on TensorFlow (Abadi et al., 2016) with NVIDIA GeForce 1080 Ti GPUs. One of the Titan Xp GPUs used for this research was donated by the NVIDIA Corporation. Intel Xeon Silver 4114 CPU (2.20 GHz).
Software Dependencies | No | All experiments are conducted on TensorFlow (Abadi et al., 2016) with NVIDIA GeForce 1080 Ti GPUs. We implement the SDDMM kernel in Python leveraging the high-performance JIT compiler Numba (Lam et al., 2015). No version numbers for TensorFlow or Numba are specified.
Experiment Setup | Yes | On CIFAR-10, the batch size is 128, and the models are trained for 200 epochs. The initial learning rate is 0.1 and decays by a factor of 0.1 at epochs 100, 150, and 200. On ImageNet, the batch size is 512 and the models are trained for 120 epochs. The learning rate decays linearly from an initial value of 0.5 to 0. The number of hidden units in the LSTM cell is set to 300, and the number of layers is set to 1. The full bitwidth B... The prediction bitwidth Bhb... The penalty factor σ... The gating target δ... The coefficient α in the backward pass...
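The two learning-rate schedules quoted in the Experiment Setup row (step decay on CIFAR-10, linear decay on ImageNet) can be sketched as plain schedule functions. This is a minimal illustration of the stated hyperparameters only; the function names are ours, not from the paper, and the paper does not specify how the schedules were implemented.

```python
def cifar10_lr(epoch, base_lr=0.1, milestones=(100, 150, 200), gamma=0.1):
    """Step decay: start at 0.1, multiply by 0.1 at epochs 100, 150, 200."""
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr *= gamma
    return lr

def imagenet_lr(epoch, base_lr=0.5, total_epochs=120):
    """Linear decay from 0.5 at epoch 0 down to 0 at the final epoch."""
    return base_lr * (1.0 - epoch / total_epochs)
```

For example, `cifar10_lr(99)` still returns the initial 0.1, while after the first milestone the rate drops to 0.01; `imagenet_lr(60)` returns 0.25, halfway along the linear ramp.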