S$^3$: Sign-Sparse-Shift Reparametrization for Effective Training of Low-bit Shift Networks
Authors: Xinlin Li, Bang Liu, Yaoliang Yu, Wulong Liu, Chunjing Xu, Vahid Partovi Nia
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our approach on the ImageNet dataset. While all previous methods require at least 5-bit weight representation to achieve the same performance as the full-precision neural networks on large datasets such as ImageNet, our experimental results show that our proposed method surpasses all previous methods and pushes this boundary further to 3 bits. |
| Researcher Affiliation | Collaboration | Xinlin Li1, Bang Liu2, Yaoliang Yu3, Wulong Liu1, Chunjing Xu1, and Vahid Partovi Nia1. 1Noah's Ark Lab, Huawei Technologies. 2Department of Computer Science and Operations Research (DIRO), University of Montreal. 3Cheriton School of Computer Science, University of Waterloo. |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. Figure 3 illustrates a process but is not a formal pseudocode block. |
| Open Source Code | No | Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [No] We will provide the code upon publication. |
| Open Datasets | Yes | We evaluate our proposed method on ILSVRC2012 [7] dataset with different bit-widths to demonstrate the effectiveness and robustness of our method. We use ResNet-18 and ResNet-50 as our backbone with the same data augmentation and pre-processing strategy proposed in [13]. |
| Dataset Splits | Yes | We evaluate our proposed method on ILSVRC2012 [7] dataset... We use ResNet-18 and ResNet-50 as our backbone with the same data augmentation and pre-processing strategy proposed in [13]. All models are converged and reach a reasonable validation accuracy on CIFAR10 (> 91%). |
| Hardware Specification | No | Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [No], but we gave standard training details in section 5. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers. |
| Experiment Setup | Yes | We train the networks for 200 epochs utilizing the cosine learning rate, and the initial learning rate is 1e-3. The networks are optimized with the SGD optimizer, and the momentum and weight decay are set to 0.9 and 1e-4 respectively. The hyper-parameter α of the dense weight regularizer is set to 1e-5 without using a decay scheduler. (A hedged configuration sketch follows the table.) |
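
The Experiment Setup row fully specifies the optimizer and schedule, so the reported hyper-parameters can be expressed as a minimal PyTorch sketch. This is not the authors' released code: the backbone is a stock ResNet-18 stand-in (the S$^3$ sign-sparse-shift layers are not reproduced), the `train_loader` is a dummy placeholder for ILSVRC2012, and `dense_weight_penalty` is a hypothetical L2 stand-in since the regularizer's exact form is not given in this summary.

```python
# Hedged sketch of the reported training setup: SGD (momentum 0.9, weight decay 1e-4),
# initial LR 1e-3 with a cosine schedule over 200 epochs, regularizer weight alpha = 1e-5.
import torch
import torch.nn as nn
import torchvision

EPOCHS = 200
INIT_LR = 1e-3
MOMENTUM = 0.9
WEIGHT_DECAY = 1e-4
ALPHA = 1e-5  # dense weight regularizer coefficient (no decay scheduler, per the paper)

# Stand-in backbone; the paper's S^3 reparametrized ResNet-18 is not shown here.
model = torchvision.models.resnet18(num_classes=1000)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=INIT_LR,
                            momentum=MOMENTUM, weight_decay=WEIGHT_DECAY)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=EPOCHS)

def dense_weight_penalty(m: nn.Module) -> torch.Tensor:
    # Hypothetical placeholder: the paper's dense weight regularizer is not
    # specified in this summary, so a simple L2 penalty is used for illustration.
    return sum(w.pow(2).sum() for w in m.parameters())

# Dummy batches standing in for an ILSVRC2012 DataLoader with the augmentation of [13].
train_loader = [(torch.randn(8, 3, 224, 224), torch.randint(0, 1000, (8,)))
                for _ in range(2)]

for epoch in range(EPOCHS):
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels) + ALPHA * dense_weight_penalty(model)
        loss.backward()
        optimizer.step()
    scheduler.step()  # cosine annealing of the learning rate once per epoch
```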