Data Efficient Neural Scaling Law via Model Reusing

Authors: Peihao Wang, Rameswar Panda, Zhangyang Wang

ICML 2023

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our empirical study shows that model reusing can effectively reproduce the power law under the data scarcity regime. |
| Researcher Affiliation | Collaboration | (1) Department of Electrical and Computer Engineering, University of Texas at Austin, TX, United States; (2) MIT-IBM Watson Lab, MA, United States. |
| Pseudocode | No | The paper describes algorithms and methods in prose and equations but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | We release our code at: https://github.com/VITA-Group/Data-Efficient-Scaling. |
| Open Datasets | Yes | For BERT, we adopt the implementation provided by Tan & Bansal (2020), and choose English Wikipedia (Merity et al., 2016) as the training dataset. For ViT, we utilize the implementation provided by Touvron et al. (2021a). The ImageNet1k (Deng et al., 2009) dataset is chosen as our training data collection. |
| Dataset Splits | No | The paper describes the training datasets (English Wikipedia, ImageNet1k) and mentions evaluating on a "test set" or "test split" but does not explicitly provide details for a validation split or a complete train/validation/test split for reproducibility. |
| Hardware Specification | No | The paper mentions using "computational resources on the AiMOS Supercomputer" but does not provide specific details such as GPU/CPU models, memory, or other hardware specifications used for experiments. |
| Software Dependencies | No | The paper mentions using implementations from other works (Tan & Bansal, 2020; Touvron et al., 2021a) but does not provide specific software dependencies with version numbers (e.g., PyTorch version, Python version). |
| Experiment Setup | Yes | For BERT, the batch size is 256, and learning rate is set to 2e-4, while for RoBERTa, the used batch size is 1024 and learning rate is 8e-4. |
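The Research Type row refers to reproducing a power law in the data-scarce regime. As a generic illustration of how such a scaling law is typically extracted from measurements, the sketch below fits a saturating power law L(D) = a * D^(-b) + c to (dataset size, loss) pairs with scipy. The data points are synthetic placeholders, not results from the paper, and the functional form is the common saturating power law rather than the paper's exact parameterization.

```python
# Generic power-law fit, L(D) = a * D**(-b) + c; synthetic points only.
import numpy as np
from scipy.optimize import curve_fit

def power_law(d, a, b, c):
    """Saturating power law commonly used for loss-vs-data scaling curves."""
    return a * np.power(d, -b) + c

# Synthetic (dataset size, loss) pairs used purely to exercise the fit.
sizes = np.array([1e5, 3e5, 1e6, 3e6, 1e7])
losses = power_law(sizes, a=50.0, b=0.3, c=2.0) \
         + np.random.default_rng(0).normal(0.0, 0.02, sizes.shape)

# Fit the three parameters; b is the scaling-law slope on a log-log plot.
(a, b, c), _ = curve_fit(power_law, sizes, losses, p0=[10.0, 0.5, 1.0])
print(f"fitted exponent b = {b:.3f}")
```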
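The Open Datasets row quotes English Wikipedia (Merity et al., 2016) and ImageNet1k (Deng et al., 2009) as the training corpora. A minimal loading sketch is shown below, assuming the Hugging Face `datasets` library for WikiText-103 (the public corpus released by Merity et al., 2016) and `torchvision` for ImageNet-1k; the paper itself relies on the implementations of Tan & Bansal (2020) and Touvron et al. (2021a), so these loaders are illustrative, not the authors' pipeline, and the ImageNet path is a hypothetical placeholder.

```python
# Illustrative dataset loading only; not the authors' data pipeline.
from datasets import load_dataset                      # Hugging Face datasets
from torchvision import datasets as tv_datasets, transforms

# WikiText-103 is the publicly released corpus from Merity et al. (2016);
# the paper refers to its training text as "English Wikipedia".
wikitext_train = load_dataset("wikitext", "wikitext-103-raw-v1", split="train")

# ImageNet-1k must be obtained separately; torchvision only reads a local copy.
imagenet_train = tv_datasets.ImageNet(
    root="/path/to/imagenet",                          # hypothetical local path
    split="train",
    transform=transforms.Compose([
        transforms.RandomResizedCrop(224),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
    ]),
)
```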
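The Experiment Setup row reports only batch sizes and learning rates. The sketch below wires those values into a model and optimizer, assuming Hugging Face `transformers` model classes and AdamW; everything beyond the two batch sizes and two learning rates (model configuration, weight decay, warmup, schedules, masking details) is not specified in the quote and is a placeholder.

```python
# Only the batch sizes and learning rates come from the paper's quoted setup;
# the model classes and optimizer choice are assumptions for illustration.
import torch
from transformers import BertConfig, BertForMaskedLM, RobertaConfig, RobertaForMaskedLM

REPORTED_SETUP = {
    "bert":    {"batch_size": 256,  "learning_rate": 2e-4},
    "roberta": {"batch_size": 1024, "learning_rate": 8e-4},
}

def build(name: str):
    """Instantiate a randomly initialized masked-LM model and an AdamW optimizer
    using the learning rate reported for that model."""
    if name == "bert":
        model = BertForMaskedLM(BertConfig())
    elif name == "roberta":
        model = RobertaForMaskedLM(RobertaConfig())
    else:
        raise ValueError(f"unknown model: {name}")
    optimizer = torch.optim.AdamW(model.parameters(),
                                  lr=REPORTED_SETUP[name]["learning_rate"])
    return model, optimizer

model, optimizer = build("bert")  # batch_size=256 would be applied on the DataLoader
```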