Understanding Straight-Through Estimator in Training Activation Quantized Neural Nets
Authors: Penghang Yin, Jiancheng Lyu, Shuai Zhang, Stanley Osher, Yingyong Qi, Jack Xin
ICLR 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Moreover, we show that a poor choice of STE leads to instability of the training algorithm near certain local minima, which is verified with CIFAR-10 experiments. |
| Researcher Affiliation | Collaboration | Department of Mathematics, University of California, Los Angeles (yph@ucla.edu, sjo@math.ucla.edu); Department of Mathematics, University of California, Irvine (jianchel@uci.edu, jxin@math.uci.edu); Qualcomm AI Research, San Diego ({shuazhan,yingyong}@qti.qualcomm.com) |
| Pseudocode | Yes | Algorithm 1: Coarse gradient descent for learning two-linear-layer CNN with STE µ. (A hedged sketch of the STE mechanism follows this table.) |
| Open Source Code | No | The paper does not provide any explicit statements about releasing source code or links to a code repository. |
| Open Datasets | Yes | In this section, we compare the performances of the identity, ReLU and clipped ReLU STEs on MNIST (LeCun et al., 1998) and CIFAR-10 (Krizhevsky, 2009) benchmarks for 2-bit or 4-bit quantized activations. |
| Dataset Splits | Yes | The experimental results are summarized in Table 1, where we record both the training losses and validation accuracies. ... The schedule of the learning rate is specified in Table 2 in the appendix. |
| Hardware Specification | No | The paper mentions training on LeNet-5, VGG-11, and ResNet-20 models, but does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions using 'stochastic (coarse) gradient descent with momentum = 0.9' and a 'modified batch normalization layer', but does not specify software names with version numbers (e.g., PyTorch, TensorFlow, or specific library versions). |
| Experiment Setup | Yes | The optimizer we use is the stochastic (coarse) gradient descent with momentum = 0.9 for all experiments. We train 50 epochs for LeNet-5... and 200 epochs for VGG-11 and ResNet-20... The schedule of the learning rate is specified in Table 2 in the appendix. (A minimal optimizer sketch follows this table.) |
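To make the comparison rows above concrete, here is a minimal sketch of activation quantization with the three STE backward proxies the paper compares (identity, ReLU, clipped ReLU). The paper does not name a framework (see the Software Dependencies row), so PyTorch, the scaling of activations to [0, 1], and all names below are assumptions for illustration, not the authors' implementation.

```python
import torch

class QuantizeSTE(torch.autograd.Function):
    """Uniform activation quantizer; the backward pass uses a chosen STE proxy."""

    @staticmethod
    def forward(ctx, x, bits, ste):
        ctx.save_for_backward(x)
        ctx.ste = ste
        levels = 2 ** bits - 1
        # Quantize clamped activations to 2^bits uniform levels on [0, 1]
        # (the [0, 1] range is an illustrative assumption).
        return torch.round(torch.clamp(x, 0.0, 1.0) * levels) / levels

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        if ctx.ste == "identity":
            grad_in = grad_out                                   # pass gradient through
        elif ctx.ste == "relu":
            grad_in = grad_out * (x > 0).to(grad_out.dtype)      # ReLU derivative proxy
        else:  # "clipped_relu": derivative of min(max(x, 0), 1)
            grad_in = grad_out * ((x > 0) & (x < 1)).to(grad_out.dtype)
        # No gradients for the non-tensor arguments `bits` and `ste`.
        return grad_in, None, None
```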
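A hypothetical usage snippet for the reported setup: the "coarse gradient descent" of Algorithm 1 is ordinary SGD with momentum 0.9 applied to the coarse gradients the STE produces. The learning rate here is a placeholder; the actual schedule is given in Table 2 of the paper's appendix and is not reproduced here.

```python
x = torch.randn(8, 16, requires_grad=True)
y = QuantizeSTE.apply(x, 2, "clipped_relu")       # 2-bit quantized activations
y.sum().backward()                                # coarse gradient via the STE proxy
opt = torch.optim.SGD([x], lr=0.1, momentum=0.9)  # lr is illustrative only
opt.step()
```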