Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Training with Mixed-Precision Floating-Point Assignments
Authors: Wonyeol Lee, Rahul Sharma, Alex Aiken
TMLR 2023 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our technique on image classification tasks by training convolutional networks on CIFAR-10, CIFAR-100, and ImageNet. Our method typically provides > 2× memory reduction over a baseline precision assignment while preserving training accuracy, and gives further reductions by trading off accuracy. |
| Researcher Affiliation | Collaboration | Wonyeol Lee EMAIL Stanford University, USA Rahul Sharma EMAIL Microsoft Research, India Alex Aiken EMAIL Stanford University, USA |
| Pseudocode | Yes | Algorithm 1: Computing π with precision demotion. Input: (f1, …, fn), (fn+1, …, fm), C, r. /* Tensor grouping */ k = 1; Tk = ∅; for i = 1 to m do Tk = Tk ∪ {vi, dvi}; if k ≤ n then { Tk = Tk ∪ {θi, dθi} }; if fi is GEMM then { k = k + 1; Tk = ∅ } end. /* Precision demotion */ (T′1, …, T′k) = sort(T1, …, Tk) by decreasing size; π(t) = C(t, hi) for all t ∈ TS; for j = 1 to k do if ratio_lo(π) ≥ r then { break }; π(t) = C(t, lo) for all t ∈ T′j end; return π |
| Open Source Code | No | We have implemented our precision assignment technique using PyTorch (Paszke et al., 2019). Given a model and loss network, and a dataset, our implementation takes as parameters a precision-candidate assignment C and a lower bound r on the low-precision ratio; it then automatically assigns precisions to tensors (appearing in training) according to our technique and uses those assigned precisions in gradient computations. |
| Open Datasets | Yes | As benchmarks for our experiments, we use the image classification task and three datasets for the task: CIFAR-10 and CIFAR-100 (Krizhevsky, 2009), and ImageNet (Russakovsky et al., 2015). |
| Dataset Splits | Yes | We train all models in a standard way: we apply dynamic loss scaling (a standard technique used in low-precision floating-point training; see §4.2 for details) except for 32-bit training, and use standard settings (e.g., learning rate); see Appendix B for details. |
| Hardware Specification | Yes | All experiments were performed on NVIDIA V100 GPUs; total compute time for all experiments was 1,081 GPU days. |
| Software Dependencies | No | We have implemented our precision assignment technique using PyTorch (Paszke et al., 2019). ... We implement the rounding functions based on the QPyTorch library (Zhang et al., 2019), but a few extensions are required, e.g., to support exponent bias and signal overflows for dynamic loss scaling. |
| Experiment Setup | Yes | Four models on CIFAR-10 and CIFAR-100: We train the four models with a standard setup (kuangliu, 2021). In particular, we run the (non-Nesterov) SGD optimizer for 200 epochs with minibatch size of 128 (over 1 GPU), learning rate of 0.1, momentum of 0.9, weight decay of 5 * 10^-4, and the cosine annealing scheduler for learning rate. For dynamic loss scaling, we use initial scale of 2^16, growth factor of 2, back-off factor of 0.5, and growth interval of 1 epoch, as suggested in PyTorch (PyTorch, 2022a). ShuffleNet-v2 on ImageNet: We train the model with the default setup given in PyTorch's GitHub repository (PyTorch, 2022c), except that we use larger minibatch size and learning rate as in (Goyal et al., 2017; Kalamkar et al., 2019; Krizhevsky, 2014; PyTorch, 2022d) to reduce the wall-clock time of training. In particular, we run the (non-Nesterov) SGD optimizer for 90 epochs with minibatch size of 1024 (over 16 GPUs), learning rate of 0.4, momentum of 0.9, weight decay of 10^-4, and the cosine annealing scheduler for learning rate. For dynamic loss scale, we use initial scale of 2^16, growth factor of 2, back-off factor of 0.5, and growth interval of 0.5 epoch, as suggested in PyTorch (PyTorch, 2022a). |
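
The grouping and demotion steps quoted in the Pseudocode row can be illustrated with a minimal Python sketch. It is not the authors' implementation (the table records that no code was released), and several details are assumptions made for illustration: the `Op` record, the tensor names, measuring group size and the low-precision ratio by element count, and including parameter tensors whenever an operator has parameters (standing in for the algorithm's check that the operator belongs to the model rather than the loss network).

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Tensor = Tuple[str, int]  # (hypothetical tensor name, number of elements)


@dataclass(frozen=True)
class Op:
    """Hypothetical description of one operator f_i."""
    is_gemm: bool
    act_size: int        # elements in v_i (and in its gradient dv_i)
    param_size: int = 0  # elements in theta_i (0 if the operator has no parameters)


def group_tensors(ops: List[Op]) -> List[List[Tensor]]:
    """Tensor grouping: start a new group after every GEMM operator."""
    groups: List[List[Tensor]] = [[]]
    for i, op in enumerate(ops, start=1):
        groups[-1] += [(f"v{i}", op.act_size), (f"dv{i}", op.act_size)]
        if op.param_size > 0:  # assumption: only model operators carry parameters
            groups[-1] += [(f"theta{i}", op.param_size),
                           (f"dtheta{i}", op.param_size)]
        if op.is_gemm:
            groups.append([])
    return [g for g in groups if g]  # drop a trailing empty group


def group_size(group: List[Tensor]) -> int:
    return sum(size for _, size in group)


def assign_precisions(ops: List[Op],
                      C: Callable[[str, str], str],
                      r: float) -> Dict[str, str]:
    """Precision demotion: demote the largest groups until the low ratio reaches r."""
    groups = sorted(group_tensors(ops), key=group_size, reverse=True)
    total = sum(group_size(g) for g in groups)
    pi = {name: C(name, "hi") for g in groups for name, _ in g}
    low = 0  # elements currently assigned the low-precision candidate
    for g in groups:
        if total == 0 or low / total >= r:  # assumed size-weighted ratio_lo
            break
        for name, _ in g:
            pi[name] = C(name, "lo")
        low += group_size(g)
    return pi


if __name__ == "__main__":
    # Toy network: GEMM, non-GEMM, GEMM, non-GEMM (sizes are made up).
    toy = [Op(True, 100, 50), Op(False, 100), Op(True, 40, 20), Op(False, 10)]
    print(assign_precisions(toy, lambda t, level: level, r=0.4))
```

Here `C` is passed as a plain callable mapping a tensor name and a level (`"hi"` or `"lo"`) to a format label; the identity-style lambda in the usage lines simply returns the level so the resulting assignment is easy to inspect.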