Data Efficient Neural Scaling Law via Model Reusing
Authors: Peihao Wang, Rameswar Panda, Zhangyang Wang
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our empirical study shows that model reusing can effectively reproduce the power law under the data scarcity regime. |
| Researcher Affiliation | Collaboration | ¹Department of Electrical and Computer Engineering, University of Texas at Austin, TX, United States; ²MIT-IBM Watson Lab, MA, United States. |
| Pseudocode | No | The paper describes algorithms and methods in prose and equations but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | We release our code at: https://github.com/VITA-Group/Data-Efficient-Scaling. |
| Open Datasets | Yes | For BERT, we adopt the implementation provided by Tan & Bansal (2020), and choose English Wikipedia (Merity et al., 2016) as the training dataset. For ViT, we utilize the implementation provided by Touvron et al. (2021a). The ImageNet1k (Deng et al., 2009) dataset is chosen as our training data collection. |
| Dataset Splits | No | The paper describes the training datasets (English Wikipedia, ImageNet1k) and mentions evaluating on a "test set" or "test split" but does not explicitly provide details for a validation split or a complete train/validation/test split for reproducibility. |
| Hardware Specification | No | The paper mentions using "computational resources on the AiMOS Supercomputer" but does not provide specific details such as GPU/CPU models, memory, or other hardware specifications used for experiments. |
| Software Dependencies | No | The paper mentions using implementations from other works (Tan & Bansal, 2020; Touvron et al., 2021a) but does not provide specific software dependencies with version numbers (e.g., PyTorch version, Python version). |
| Experiment Setup | Yes | For BERT, the batch size is 256 and the learning rate is set to 2e-4, while for RoBERTa, the batch size is 1024 and the learning rate is 8e-4. |
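
The hyperparameters quoted in the Experiment Setup row can be collected into a minimal configuration sketch. Only the batch sizes and peak learning rates below come from the paper's text; the optimizer and schedule fields are placeholder assumptions and would need to be confirmed against the released code.

```python
# Minimal sketch of the pretraining hyperparameters quoted above.
# Only batch_size and learning_rate are taken from the paper's text;
# optimizer and lr_schedule are assumed placeholders, not stated in this row.
CONFIGS = {
    "bert": {
        "batch_size": 256,
        "learning_rate": 2e-4,
        "optimizer": "AdamW",                       # assumption
        "lr_schedule": "linear_warmup_then_decay",  # assumption
    },
    "roberta": {
        "batch_size": 1024,
        "learning_rate": 8e-4,
        "optimizer": "AdamW",                       # assumption
        "lr_schedule": "linear_warmup_then_decay",  # assumption
    },
}


def get_config(model_name: str) -> dict:
    """Return the quoted hyperparameters for a given model family."""
    return CONFIGS[model_name.lower()]


if __name__ == "__main__":
    print(get_config("BERT"))     # batch_size=256, learning_rate=2e-4
    print(get_config("RoBERTa"))  # batch_size=1024, learning_rate=8e-4
```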