Unit Scaling: Out-of-the-Box Low-Precision Training
Authors: Charlie Blake, Douglas Orr, Carlo Luschi
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the efficacy of unit scaling across a range of models and optimisers. We further show that existing models can be adapted to be unit-scaled, training BERT_LARGE in FP16 and then FP8 with no degradation in accuracy. |
| Researcher Affiliation | Industry | Graphcore Research, United Kingdom. Correspondence to: Charlie Blake <charlieb@graphcore.ai>, Douglas Orr <douglaso@graphcore.ai>. |
| Pseudocode | Yes | Figure 3. PyTorch examples. Left: Scaled projection op, which implicitly constrains β_X. Center vs Right: Unscaled vs scaled Transformer FFN layers. (A hedged sketch of these ops is given after the table.) |
| Open Source Code | Yes | The code used in these experiments can be found at https://github.com/graphcore-research/unit-scaling-demo, alongside a separate notebook implementing a unit-scaled nanoGPT model. We recommend this resource for those looking to understand unit scaling through a simple example implementation. For those interested in using unit scaling in their own models, we also provide a PyTorch library: https://graphcore-research.github.io/unit-scaling. |
| Open Datasets | Yes | We perform small-scale experiments on WikiText-103 raw character language modelling (Merity et al., 2017). We use the standard BERT masked language model pretraining objective over English Wikipedia articles, and demonstrate downstream performance on SQuAD v1.1 and SQuAD v2.0 (Rajpurkar et al., 2016; 2018). |
| Dataset Splits | No | The paper does not explicitly provide the exact percentages or sample counts for training, validation, and test splits for the datasets used. While it mentions 'validation bits per character' and 'test BPC', the specific partitioning methodology is not detailed. |
| Hardware Specification | Yes | Models were trained on IPU hardware (Jia et al., 2019), using either Bow Pod16 or IPU-POD16 Classic machines. On each machine we distribute training across 16 IPUs, using 4-way model parallelism and 4-way pipeline parallelism, with gradient accumulation across pipeline stages. |
| Software Dependencies | No | The paper mentions software such as PyTorch, JAX, and TensorFlow, but does not give the specific versions used in its experiments. It cites PyTorch (2023), but this is a general bibliographic reference rather than a version specification for the authors' own setup. |
| Experiment Setup | Yes | Table A.3. Character language modelling hyperparameters. Table A.5. BERT pre-training hyperparameters. |
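
The Pseudocode row above refers to the paper's Figure 3, which gives PyTorch examples of a scaled projection op and of unscaled vs unit-scaled Transformer FFN layers. The following is a minimal sketch, not the authors' code, of how such a scaled projection can work: a custom autograd function applies separate scaling factors to the output and to each gradient, with the output and grad-input scales tied together (the constraint on β_X mentioned in the figure caption) via their geometric mean. The names `ScaledMatmul` and `UnitScaledProjection` are illustrative and do not come from the paper or its library.

```python
import torch
from torch import nn


class ScaledMatmul(torch.autograd.Function):
    """Y = (X @ W.T) * alpha, with grad_X scaled by beta_x and grad_W by beta_w."""

    @staticmethod
    def forward(ctx, X, W, alpha, beta_x, beta_w):
        ctx.save_for_backward(X, W)
        ctx.beta_x, ctx.beta_w = beta_x, beta_w
        return (X @ W.T) * alpha

    @staticmethod
    def backward(ctx, grad_Y):
        X, W = ctx.saved_tensors
        grad_X = (grad_Y @ W) * ctx.beta_x    # gradient w.r.t. the input
        grad_W = (grad_Y.T @ X) * ctx.beta_w  # gradient w.r.t. the weight
        return grad_X, grad_W, None, None, None


class UnitScaledProjection(nn.Module):
    """Bias-free projection of a (batch, fan_in) input to (batch, fan_out).

    Weights are initialised with unit variance; the scaling that standard
    layers bake into their initialisation is instead applied to the op, so
    output, grad-input and grad-weight all start at roughly unit variance.
    """

    def __init__(self, fan_in, fan_out):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(fan_out, fan_in))

    def forward(self, X):
        fan_out, fan_in = self.weight.shape
        batch = X.shape[0]
        # Ideal scales are fan_in**-0.5 (output), fan_out**-0.5 (grad-input)
        # and batch**-0.5 (grad-weight). Output and grad-input share the
        # activation path, so their scales are tied to the geometric mean.
        alpha = beta_x = (fan_in * fan_out) ** -0.25
        beta_w = batch ** -0.5
        return ScaledMatmul.apply(X, self.weight, alpha, beta_x, beta_w)


# Quick check: output and gradients should all have standard deviation of order 1.
x = torch.randn(1024, 256, requires_grad=True)
proj = UnitScaledProjection(256, 512)
y = proj(x)
y.backward(torch.randn_like(y))
print(y.std(), x.grad.std(), proj.weight.grad.std())
```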
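
For the Center-vs-Right comparison in the same figure (unscaled vs unit-scaled FFN), a correspondingly simplified sketch follows. It reuses `UnitScaledProjection` from the block above; LayerNorm, the residual add and bias terms of a full Transformer FFN are omitted, and the GELU rescaling constant is estimated empirically here rather than taken from the paper.

```python
import torch
import torch.nn.functional as F
from torch import nn

# Rescaling factor that restores roughly unit variance after GELU, estimated
# on a unit-normal sample. With equal forward and backward scales, a plain
# multiplication by a constant is enough (no custom autograd function needed).
_gen = torch.Generator().manual_seed(0)
_GELU_SCALE = 1.0 / F.gelu(torch.randn(2**16, generator=_gen)).std().item()


class UnscaledFFN(nn.Module):
    """Standard FFN: variance is controlled only through weight initialisation."""

    def __init__(self, d_model, d_ffn):
        super().__init__()
        self.linear_1 = nn.Linear(d_model, d_ffn, bias=False)
        self.linear_2 = nn.Linear(d_ffn, d_model, bias=False)

    def forward(self, x):
        return self.linear_2(F.gelu(self.linear_1(x)))


class UnitScaledFFN(nn.Module):
    """Unit-scaled FFN: unit-variance weights plus per-op scaling factors."""

    def __init__(self, d_model, d_ffn):
        super().__init__()
        self.linear_1 = UnitScaledProjection(d_model, d_ffn)
        self.linear_2 = UnitScaledProjection(d_ffn, d_model)

    def forward(self, x):
        x = _GELU_SCALE * F.gelu(self.linear_1(x))
        return self.linear_2(x)


# At initialisation the unscaled FFN's output std sits well below 1, while the
# unit-scaled FFN's stays close to 1, which is what keeps activations and
# gradients inside the representable range of FP16/FP8 formats.
x = torch.randn(1024, 256)
print(UnscaledFFN(256, 1024)(x).std(), UnitScaledFFN(256, 1024)(x).std())
```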