Alternating Multi-bit Quantization for Recurrent Neural Networks

Authors: Chen Xu, Jianqiang Yao, Zhouchen Lin, Wenwu Ou, Yuanbin Cao, Zhirong Wang, Hongbin Zha

ICLR 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We test the quantization for two well-known RNNs, i.e., long short term memory (LSTM) and gated recurrent unit (GRU), on the language models. Compared with the full-precision counterpart, by 2-bit quantization we can achieve 16× memory saving and 6× real inference acceleration on CPUs, with only a reasonable loss in the accuracy. By 3-bit quantization, we can achieve almost no loss in the accuracy or even surpass the original model, with 10.5× memory saving and 3× real inference acceleration. Both results beat the existing quantization works with large margins.
Researcher Affiliation | Collaboration | Chen Xu (1), Jianqiang Yao (2), Zhouchen Lin (1,3), Wenwu Ou (2), Yuanbin Cao (4), Zhirong Wang (2), Hongbin Zha (1,3). (1) Key Laboratory of Machine Perception (MOE), School of EECS, Peking University, China; (2) Search Algorithm Team, Alibaba Group, China; (3) Cooperative Medianet Innovation Center, Shanghai Jiao Tong University, China; (4) AI-LAB, Alibaba Group, China. Emails: xuen@pku.edu.cn, tianduo@taobao.com, zlin@pku.edu.cn, santong.oww@taobao.com, lingzun.cyb@alibaba-inc.com, qingfeng@taobao.com, zha@cis.pku.edu.cn
Pseudocode | Yes | Algorithm 1: Binary Search Tree (BST) to determine the optimal code (see the quantization sketch after this table)
Open Source Code | No | The paper does not provide any explicit statements about open-source code availability or links to a code repository for the methodology described.
Open Datasets | Yes | We first conduct experiments on the Penn Tree Bank (PTB) corpus (Marcus et al., 1993), using the standard preprocessed splits with a 10K size vocabulary (Mikolov, 2012). The PTB dataset contains 929K training tokens, 73K validation tokens, and 82K test tokens.
Dataset Splits | Yes | The PTB dataset contains 929K training tokens, 73K validation tokens, and 82K test tokens.
Hardware Specification | Yes | We test it on an Intel Xeon E5-2682 v4 @ 2.50 GHz CPU.
Software Dependencies | No | The paper mentions using specific CPU instructions like _mm256_xor_ps and _popcnt64, and the Intel Math Kernel Library (MKL), but it does not provide specific version numbers for these or other software dependencies like programming languages, frameworks, or libraries that would be necessary for full reproducibility. (The xor/popcount arithmetic is illustrated after this table.)
Experiment Setup | Yes | The initial learning rate is set to 20. Every epoch we evaluate on the validation dataset and record the best value. When the validation error exceeds the best record, we decrease the learning rate by a factor of 1.2. Training is terminated once the learning rate is less than 0.001 or the maximum number of epochs, i.e., 80, is reached. The gradient norm is clipped in the range [-0.25, 0.25]. We unroll the network for 30 time steps and regularize it with standard dropout (probability of dropping out units equal to 0.5) (Zaremba et al., 2014). For simplicity of notation, we denote the methods using uniform, balanced, greedy, refined greedy, and our alternating quantization as Uniform, Balanced, Greedy, Refined, and Alternating, respectively. We train with a batch size of 20. (The learning-rate schedule is sketched after this table.)
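
The "Research Type" and "Pseudocode" rows above describe k-bit weight quantization and an Algorithm 1 that finds the optimal code with a binary search tree. The following NumPy sketch is an illustration under our own assumptions, not the authors' code: it greedily initializes the decomposition w ≈ Σ_i α_i b_i with b_i ∈ {−1, +1}^n, then alternates between a least-squares update of the scales and a nearest-representable-value update of the codes, with np.searchsorted (itself a binary search) standing in for the paper's BST.

```python
import itertools
import numpy as np

def greedy_init(w, k=2):
    """Greedy k-bit initialization: fit w ~ B @ alphas with B in {-1,+1}^(n,k)."""
    residual = w.astype(np.float64).copy()
    alphas, cols = [], []
    for _ in range(k):
        b = np.where(residual >= 0, 1.0, -1.0)   # sign of the current residual
        alpha = np.abs(residual).mean()          # optimal scale for that sign pattern
        alphas.append(alpha)
        cols.append(b)
        residual -= alpha * b                    # quantize what remains
    return np.array(alphas), np.stack(cols, axis=1)

def alternating_refine(w, alphas, B, n_iters=5):
    """Alternate between scale and code updates (illustrative only)."""
    k = len(alphas)
    for _ in range(n_iters):
        # Codes fixed: the scales are a least-squares solution.
        alphas, *_ = np.linalg.lstsq(B, w, rcond=None)
        # Scales fixed: snap each weight to the closest of the 2^k representable values.
        signs = np.array(list(itertools.product([-1.0, 1.0], repeat=k)))
        values = signs @ alphas
        order = np.argsort(values)
        values, signs = values[order], signs[order]
        idx = np.clip(np.searchsorted(values, w), 1, len(values) - 1)
        pick = np.where(np.abs(w - values[idx - 1]) <= np.abs(w - values[idx]),
                        idx - 1, idx)
        B = signs[pick]
    return alphas, B

w = np.random.randn(4096)
alphas, B = greedy_init(w, k=2)
alphas, B = alternating_refine(w, alphas, B)
print("relative error:", np.linalg.norm(w - B @ alphas) / np.linalg.norm(w))
# Storing k bits per weight plus a few scales instead of a 32-bit float is what
# yields the roughly 16x (k=2) and 10.5x (k=3) memory savings quoted above.
```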
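
The "Software Dependencies" row mentions the _mm256_xor_ps and _popcnt64 intrinsics used for the fast binarized kernels. As a plain-Python illustration of the arithmetic those intrinsics implement (an assumption about the kernel, not code from the paper), the dot product of two {−1, +1} vectors of length n packed into bitmasks equals n − 2·popcount(x XOR y):

```python
def pack_bits(signs):
    """Pack a {-1,+1} sequence into an integer bitmask, one bit per entry."""
    mask = 0
    for i, s in enumerate(signs):
        if s > 0:
            mask |= 1 << i
    return mask

def binary_dot(x_bits, y_bits, n):
    """Dot product of two packed {-1,+1} vectors via XOR and popcount."""
    return n - 2 * bin(x_bits ^ y_bits).count("1")

x = [1, -1, -1, 1, 1, -1, 1, 1]
y = [1, 1, -1, -1, 1, -1, -1, 1]
assert binary_dot(pack_bits(x), pack_bits(y), len(x)) == sum(a * b for a, b in zip(x, y))
```

A multi-bit matrix product can then be reduced to a weighted sum of such binary dot products, one per pair of bit-planes, which is what the vectorized xor/popcount instructions accelerate on the CPU.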
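
The "Experiment Setup" row fully specifies the learning-rate schedule, so it can be written out as a short framework-agnostic sketch; train_one_epoch and evaluate below are hypothetical placeholders rather than functions from the paper:

```python
def train_one_epoch(lr, batch_size=20, bptt_steps=30, dropout=0.5, clip=0.25):
    """Placeholder for one pass over the training tokens with the quoted settings."""
    pass

def evaluate():
    """Placeholder returning the validation perplexity."""
    return 100.0

lr, best_val = 20.0, float("inf")
for epoch in range(80):                 # at most 80 epochs
    train_one_epoch(lr=lr)
    val = evaluate()
    if val < best_val:
        best_val = val                  # record the best validation result
    else:
        lr /= 1.2                       # decay when validation error exceeds the best
    if lr < 0.001:
        break                           # terminate once the learning rate is too small
```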