Unit Scaling: Out-of-the-Box Low-Precision Training
Authors: Charlie Blake, Douglas Orr, Carlo Luschi
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the efficacy of unit scaling across a range of models and optimisers. We further show that existing models can be adapted to be unit-scaled, training BERT_LARGE in FP16 and then FP8 with no degradation in accuracy. |
| Researcher Affiliation | Industry | Graphcore Research, United Kingdom. Correspondence to: Charlie Blake <charlieb@graphcore.ai>, Douglas Orr <douglaso@graphcore.ai>. |
| Pseudocode | Yes | Figure 3. PyTorch examples. Left: Scaled projection op, which implicitly constrains β_X. Center vs Right: Unscaled vs scaled Transformer FFN layers. (A hedged sketch of these ops is given after the table.) |
| Open Source Code | Yes | The code used in these experiments can be found at https://github.com/graphcore-research/unit-scaling-demo, alongside a separate notebook implementing a unit-scaled nanoGPT model. We recommend this resource for those looking to understand unit scaling through a simple example implementation. For those interested in using unit scaling in their own models, we also provide a PyTorch library: https://graphcore-research.github.io/unit-scaling. |
| Open Datasets | Yes | We perform small-scale experiments on WikiText-103 raw character language modelling (Merity et al., 2017). We use the standard BERT masked language model pretraining objective over English Wikipedia articles, and demonstrate downstream performance on SQuAD v1.1 and SQuAD v2.0 (Rajpurkar et al., 2016; 2018). |
| Dataset Splits | No | The paper does not explicitly provide the exact percentages or sample counts for training, validation, and test splits for the datasets used. While it mentions 'validation bits per character' and 'test BPC', the specific partitioning methodology is not detailed. |
| Hardware Specification | Yes | Models were trained on IPU hardware (Jia et al., 2019), using either Bow Pod16 or IPU-POD16 Classic machines. On each machine we distribute training across 16 IPUs, using 4-way model parallelism and 4-way pipeline parallelism, with gradient accumulation across pipeline stages. |
| Software Dependencies | No | The paper mentions software such as PyTorch, JAX, and TensorFlow, but does not give the specific versions used in its experiments. It cites PyTorch (2023), but this is a general bibliographic reference rather than a version specification for the authors' own setup. |
| Experiment Setup | Yes | Table A.3. Character language modelling hyperparameters. Table A.5. BERT pre-training hyperparameters. |
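
The Pseudocode row above refers to the paper's Figure 3, which gives PyTorch examples of a scaled projection op and of unscaled vs unit-scaled Transformer FFN layers. The following is a minimal sketch, not the authors' code, of how such a scaled projection can work: a custom autograd function applies separate scaling factors to the output and to each gradient, with the output and grad-input scales tied together (the constraint on β_X mentioned in the figure caption) via their geometric mean. The names `ScaledMatmul` and `UnitScaledProjection` are illustrative and do not come from the paper or its library.

```python
import torch
from torch import nn


class ScaledMatmul(torch.autograd.Function):
    """Y = (X @ W.T) * alpha, with grad_X scaled by beta_x and grad_W by beta_w."""

    @staticmethod
    def forward(ctx, X, W, alpha, beta_x, beta_w):
        ctx.save_for_backward(X, W)
        ctx.beta_x, ctx.beta_w = beta_x, beta_w
        return (X @ W.T) * alpha

    @staticmethod
    def backward(ctx, grad_Y):
        X, W = ctx.saved_tensors
        grad_X = (grad_Y @ W) * ctx.beta_x    # gradient w.r.t. the input
        grad_W = (grad_Y.T @ X) * ctx.beta_w  # gradient w.r.t. the weight
        return grad_X, grad_W, None, None, None


class UnitScaledProjection(nn.Module):
    """Bias-free projection of a (batch, fan_in) input to (batch, fan_out).

    Weights are initialised with unit variance; the scaling that standard
    layers bake into their initialisation is instead applied to the op, so
    output, grad-input and grad-weight all start at roughly unit variance.
    """

    def __init__(self, fan_in, fan_out):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(fan_out, fan_in))

    def forward(self, X):
        fan_out, fan_in = self.weight.shape
        batch = X.shape[0]
        # Ideal scales are fan_in**-0.5 (output), fan_out**-0.5 (grad-input)
        # and batch**-0.5 (grad-weight). Output and grad-input share the
        # activation path, so their scales are tied to the geometric mean.
        alpha = beta_x = (fan_in * fan_out) ** -0.25
        beta_w = batch ** -0.5
        return ScaledMatmul.apply(X, self.weight, alpha, beta_x, beta_w)


# Quick check: output and gradients should all have standard deviation of order 1.
x = torch.randn(1024, 256, requires_grad=True)
proj = UnitScaledProjection(256, 512)
y = proj(x)
y.backward(torch.randn_like(y))
print(y.std(), x.grad.std(), proj.weight.grad.std())
```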
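
For the Center-vs-Right comparison in the same figure (unscaled vs unit-scaled FFN), a correspondingly simplified sketch follows. It reuses `UnitScaledProjection` from the block above; LayerNorm, the residual add and bias terms of a full Transformer FFN are omitted, and the GELU rescaling constant is estimated empirically here rather than taken from the paper.

```python
import torch
import torch.nn.functional as F
from torch import nn

# Rescaling factor that restores roughly unit variance after GELU, estimated
# on a unit-normal sample. With equal forward and backward scales, a plain
# multiplication by a constant is enough (no custom autograd function needed).
_gen = torch.Generator().manual_seed(0)
_GELU_SCALE = 1.0 / F.gelu(torch.randn(2**16, generator=_gen)).std().item()


class UnscaledFFN(nn.Module):
    """Standard FFN: variance is controlled only through weight initialisation."""

    def __init__(self, d_model, d_ffn):
        super().__init__()
        self.linear_1 = nn.Linear(d_model, d_ffn, bias=False)
        self.linear_2 = nn.Linear(d_ffn, d_model, bias=False)

    def forward(self, x):
        return self.linear_2(F.gelu(self.linear_1(x)))


class UnitScaledFFN(nn.Module):
    """Unit-scaled FFN: unit-variance weights plus per-op scaling factors."""

    def __init__(self, d_model, d_ffn):
        super().__init__()
        self.linear_1 = UnitScaledProjection(d_model, d_ffn)
        self.linear_2 = UnitScaledProjection(d_ffn, d_model)

    def forward(self, x):
        x = _GELU_SCALE * F.gelu(self.linear_1(x))
        return self.linear_2(x)


# At initialisation the unscaled FFN's output std sits well below 1, while the
# unit-scaled FFN's stays close to 1, which is what keeps activations and
# gradients inside the representable range of FP16/FP8 formats.
x = torch.randn(1024, 256)
print(UnscaledFFN(256, 1024)(x).std(), UnitScaledFFN(256, 1024)(x).std())
```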