Towards Cheaper Inference in Deep Networks with Lower Bit-Width Accumulators
Authors: Yaniv Blumenfeld, Itay Hubara, Daniel Soudry
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | For our first set of experiments, we aim to check the effect low-bit accumulators have on residual neural networks (He et al., 2016). The results of this experiment are presented in Tab. 2. Table 3: Top-1 Accuracy results: Fine-tuning ResNets with low-bit accumulators and FP8 weights and activations for ImageNet classification. Results are compared with similar models utilizing LBAs in the literature. |
| Researcher Affiliation | Collaboration | Yaniv Blumenfeld, Technion, Israel (yanivblm6@gmail.com); Itay Hubara, Intel-Habana Labs, Israel (itayhubara@gmail.com); Daniel Soudry, Technion, Israel (daniel.soudry@gmail.com) |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper mentions using libraries such as qtorch and the transformers library, but does not state that its own code for the described methodology is open-source or publicly available. (A hedged sketch of emulating low-bit accumulation with qtorch appears after the table.) |
| Open Datasets | Yes | For our first set of experiments, we aim to check the effect low-bit accumulators have on residual neural networks (He et al., 2016). To assess the capability of LBA language models, our next set of experiments will focus on the common BERT (Devlin et al., 2018) architecture, and the SQuAD (Question-Answering) task. Table 6: Training a fully-connected NN with 8-bit (M4E3) accumulators for MNIST classification. Our tests were run over the oscar: unshuffled-original-af dataset, with a single tokenizer we trained over the same dataset (vocabulary size of 1000). |
| Dataset Splits | Yes | After loading the networks with pre-trained weights, we proceed to train the network for 5 epochs, using the Adam optimizer with a learning rate of η₀ = 10⁻⁶ and a cosine scheduler, so that η₅ = 10⁻⁸. Then, we enable underflow events and run fine-tuning again for a single epoch, using a reduced learning rate of η_UF = 10⁻⁷. We used 100 epochs per experiment, which was usually much more than needed for convergence or divergence. For the SQuAD fine-tuning experiment [...] we applied early stopping once the model performance reached its peak (usually after 3-5 epochs). We used the available Huggingface infrastructure (Wolf et al., 2019) to train/evaluate the model, with the Adam optimizer, an initial learning rate of 10⁻³, a drop-on-plateau scheduler (evaluating every 250 steps, γ = 0.1), and a global minibatch size of 64. (Hedged sketches of these schedules appear after the table.) |
| Hardware Specification | Yes | Each of the ImageNet experiments was performed on a single server, containing 8 NVIDIA GPUs (RTX 2080 Ti, RTX A6000). For the SQuAD fine-tuning experiment, we use 8 NVIDIA GPUs (RTX 2080 Ti, RTX A6000). For each experiment with the MNIST setting, we used a single RTX 2080 Ti GPU with a minibatch size of 16. Each of the Masked Language Modelling (MLM) experiments was performed on a single server, containing 8 NVIDIA GPUs (RTX 2080 Ti, RTX A6000, or A100). |
| Software Dependencies | No | The paper mentions software libraries like qtorch (Zhang et al., 2019) and the transformers library (Wolf et al., 2019), but does not provide specific version numbers for these or other software components. |
| Experiment Setup | Yes | We used a total mini-batch size of 256, equally divided across the 8 workers. For optimization, we used the Adam optimizer, with the hyperparameters β = (0.9, 0.999), ϵ = 10⁻⁸, λ = 10⁻⁴. Dropout was not used. As mentioned in the main text, we used cosine scheduling, the parameters of which depend on the phase in which it was used. The batch size was configured to be 8. For each experiment with the MNIST setting, we used a single RTX 2080 Ti GPU with a minibatch size of 16. Our neural network consisted of 4 fully connected layers (with LBA) and ReLU activations, with all hidden layers being 1024 neurons wide. Outside of the accumulator, all data types were full precision. Dropout wasn't used (although it was shown to benefit the results slightly), and no data augmentation was used during training. For optimization, we used the Adam optimizer, with an initial learning rate of 10⁻³, the hyper-parameters β = (0.9, 0.999), ϵ = 10⁻⁸, λ = 0.0, and a StepLR scheduler (γ = 0.95). We used 100 epochs per experiment, which was usually much more than needed for convergence or divergence. [...] with the Adam optimizer, an initial learning rate of 10⁻³, a drop-on-plateau scheduler (evaluating every 250 steps, γ = 0.1), and a global minibatch size of 64. (Hedged configuration sketches follow the table.) |
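
The ImageNet fine-tuning schedule quoted under Dataset Splits (5 epochs of Adam at η₀ = 10⁻⁶ annealed by a cosine schedule toward η₅ = 10⁻⁸, then a single underflow-enabled epoch at η_UF = 10⁻⁷, with λ = 10⁻⁴ and a global minibatch of 256) maps onto a standard PyTorch optimizer/scheduler configuration. The sketch below is an assumption-laden reconstruction, not the authors' code: a stock torchvision ResNet stands in for the FP8/LBA-emulated model, and the data pipeline and the FP8/accumulator emulation are omitted.

```python
# Minimal sketch of the quoted fine-tuning schedule, assuming a standard
# torchvision ResNet stands in for the paper's FP8/LBA-emulated model.
import torch
from torchvision.models import resnet18

model = resnet18(weights="IMAGENET1K_V1")           # pre-trained weights
optimizer = torch.optim.Adam(model.parameters(), lr=1e-6,
                             betas=(0.9, 0.999), eps=1e-8, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=5,
                                                       eta_min=1e-8)

for epoch in range(5):
    # ... one training epoch over ImageNet, global minibatch 256 ...
    scheduler.step()

# Second phase: underflow events enabled in the emulated accumulator,
# one additional epoch at a fixed, reduced learning rate.
for group in optimizer.param_groups:
    group["lr"] = 1e-7
# ... one fine-tuning epoch ...
```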
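
The paper emulates low-bit accumulators with the qtorch library but releases no code, so the following is only a plausible sketch of how such emulation could look: a matrix product whose partial sums are re-quantized to a low-bit floating-point format after every chunk of input features. The chunk size, the rounding mode, and the exp=3/man=4 (M4E3) format are assumptions taken from the quoted MNIST table caption, and chunked re-quantization only approximates what a hardware accumulator would do.

```python
# Hedged sketch: emulating a low-bit-accumulator (LBA) matmul with QPyTorch.
# This is NOT the authors' released code; chunk size, rounding mode, and the
# M4E3 accumulator format (3 exponent / 4 mantissa bits) are assumptions.
import torch
from qtorch.quant import float_quantize


def lba_matmul(x, w, exp=3, man=4, chunk=32, rounding="nearest"):
    """Matrix product whose partial sums are re-quantized to a low-bit
    float format after every `chunk` input features, approximating an
    8-bit (sign + M4E3) accumulator."""
    acc = torch.zeros(x.shape[0], w.shape[1], device=x.device)
    for start in range(0, x.shape[1], chunk):
        part = x[:, start:start + chunk] @ w[start:start + chunk, :]
        acc = float_quantize(acc + part, exp=exp, man=man, rounding=rounding)
    return acc


if __name__ == "__main__":
    x = torch.randn(16, 1024)    # minibatch of 16, as in the quoted MNIST setting
    w = torch.randn(1024, 1024)
    print(lba_matmul(x, w).shape)  # torch.Size([16, 1024])
```

qtorch's `float_quantize` performs the element-wise cast; accumulating in full precision inside each chunk and casting at chunk boundaries is a common software approximation, not a cycle-accurate model of an 8-bit accumulator.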
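
The Huggingface-based configuration quoted above (Adam, initial learning rate 10⁻³, drop-on-plateau scheduler with γ = 0.1 evaluated every 250 steps, global minibatch size 64) can be expressed with stock transformers and PyTorch components. The checkpoint name, the `evaluate()` helper, and the commented loop are hypothetical placeholders; the paper's LBA-emulated BERT is not public.

```python
# Hedged sketch of the quoted fine-tuning configuration; the checkpoint name
# and evaluate() are placeholders, and LBA emulation is omitted.
import torch
from transformers import AutoModelForQuestionAnswering, AutoTokenizer

model = AutoModelForQuestionAnswering.from_pretrained("bert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# "Drop-on-plateau": multiply the learning rate by 0.1 when the monitored
# metric stops improving; evaluation is triggered every 250 steps.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.1)

# for step, batch in enumerate(train_loader):      # global minibatch size 64
#     loss = model(**batch).loss
#     loss.backward(); optimizer.step(); optimizer.zero_grad()
#     if (step + 1) % 250 == 0:
#         scheduler.step(evaluate(model))          # evaluate() is hypothetical
```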
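
The MNIST setup quoted under Experiment Setup (4 fully connected layers with 1024-wide hidden layers, ReLU activations, Adam with lr 10⁻³, β = (0.9, 0.999), ϵ = 10⁻⁸, λ = 0.0, a StepLR scheduler with γ = 0.95, and minibatch size 16 with no data augmentation) assembles directly in PyTorch. In this sketch plain `nn.Linear` layers stand in for the paper's low-bit-accumulator layers (the `lba_matmul` sketch above could be substituted), and the StepLR `step_size` of one epoch is an assumption.

```python
# Hedged sketch of the quoted MNIST configuration; nn.Linear stands in for
# the paper's LBA layers, and step_size=1 is an assumption.
import torch
from torch import nn
from torchvision import datasets, transforms

model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, 10),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3,
                             betas=(0.9, 0.999), eps=1e-8, weight_decay=0.0)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.95)

train_loader = torch.utils.data.DataLoader(
    datasets.MNIST("data", train=True, download=True,
                   transform=transforms.ToTensor()),
    batch_size=16, shuffle=True)     # minibatch size 16, no data augmentation
```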