On the Inductive Bias of Stacking Towards Improving Reasoning

Authors: Nikunj Saunshi, Stefani Karp, Shankar Krishnan, Sobhan Miryoosefi, Sashank Jakkam Reddi, Sanjiv Kumar

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this work, we examine this fundamental aspect of gradual stacking, going beyond its efficiency benefits. We propose a variant of gradual stacking called MIDAS that can speed up language model training by up to 40%. Furthermore, we discover an intriguing phenomenon: MIDAS is not only training-efficient but surprisingly also has an inductive bias towards improving downstream tasks, especially tasks that require reasoning abilities like reading comprehension and math problems, despite having similar or slightly worse perplexity compared to baseline training.
Researcher Affiliation | Industry | Nikunj Saunshi (Google Research, nsaunshi@google.com); Stefani Karp (Google Research, stefanik@google.com); Shankar Krishnan (Google Research, skrishnan@google.com); Sobhan Miryoosefi (Google Research, miryoosefi@google.com); Sashank J. Reddi (Google Research, sashank@google.com); Sanjiv Kumar (Google Research, sanjivk@google.com)
Pseudocode | Yes | Algorithm 1: MIDAS
Open Source Code | No | We do not provide access to the data and code, but the data and models come from prior works; where differences between our work and prior work appear, we highlight them (Section 3 and Section A).
Open Datasets | Yes | We use a mixture of C4 (57%) [Raffel et al., 2020], Wikipedia (17%), Github (17%), Arxiv (9%); the proportions are motivated by the dataset used for Llama pretraining [Touvron et al., 2023].
Dataset Splits | Yes | Figure 5 reports the validation accuracy on Depth 1 and Depth 2 after fine-tuning on this mixture (tasks Depth 1 (FT) and Depth 2 (FT)).
Hardware Specification | No | We do not report these details in the submission but can include them in a final version.
Software Dependencies | No | All experiments use the Adafactor optimizer [Shazeer and Stern, 2018] and a sequence length of 1280.
Experiment Setup | Yes | We train a 24-layer decoder-only model with 1.5B parameters using the UL2 objective [Tay et al., 2022] on a mixture of C4, Wikipedia, Arxiv and Github. All models are trained on the same 500B tokens in the same order, using the same batch size (refer to Appendix A.1 for more details on the training setup). For the 1B and 2B models, we use a cosine learning rate schedule with a peak learning rate of 0.01 that decays to 0.001 at the end, and a batch size of 512. For the 8B model, we use a peak learning rate of 0.001 that decays to 0.0001, and a batch size of 1024. All experiments use the Adafactor optimizer [Shazeer and Stern, 2018] and a sequence length of 1280.
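
The quoted setup fully specifies the 1B/2B optimizer and learning-rate schedule. As a point of reference, below is a minimal sketch of that configuration in JAX/Optax; the framework, the Optax calls, and the step count (derived from 500B tokens at batch size 512 and sequence length 1280) are assumptions, not details taken from the paper.

import optax  # assumed dependency; the paper does not name its training framework

BATCH_SIZE = 512   # quoted batch size for the 1B and 2B models
SEQ_LEN = 1280     # quoted sequence length
# Roughly 763k steps to cover 500B tokens at this batch/sequence size (assumed derivation).
TOTAL_STEPS = 500_000_000_000 // (BATCH_SIZE * SEQ_LEN)

# Cosine schedule: peak learning rate 0.01 decaying to 0.001 (alpha = 0.001 / 0.01 = 0.1).
lr_schedule = optax.cosine_decay_schedule(
    init_value=0.01,
    decay_steps=TOTAL_STEPS,
    alpha=0.1,
)

# Adafactor optimizer [Shazeer and Stern, 2018] driven by the cosine schedule.
optimizer = optax.adafactor(learning_rate=lr_schedule)

For the quoted 8B configuration, the same sketch would use init_value=0.001 (decaying to 0.0001) and a batch size of 1024.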