On the Inductive Bias of Stacking Towards Improving Reasoning

Authors: Nikunj Saunshi, Stefani Karp, Shankar Krishnan, Sobhan Miryoosefi, Sashank Jakkam Reddi, Sanjiv Kumar

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this work, we examine this fundamental aspect of gradual stacking, going beyond its efficiency benefits. We propose a variant of gradual stacking called MIDAS that can speed up language model training by up to 40%. Furthermore, we discover an intriguing phenomenon: MIDAS is not only training-efficient but surprisingly also has an inductive bias towards improving downstream tasks, especially tasks that require reasoning abilities like reading comprehension and math problems, despite having similar or slightly worse perplexity compared to baseline training.
Researcher Affiliation | Industry | Nikunj Saunshi (Google Research, nsaunshi@google.com); Stefani Karp (Google Research, stefanik@google.com); Shankar Krishnan (Google Research, skrishnan@google.com); Sobhan Miryoosefi (Google Research, miryoosefi@google.com); Sashank J. Reddi (Google Research, sashank@google.com); Sanjiv Kumar (Google Research, sanjivk@google.com)
Pseudocode | Yes | Algorithm 1: MIDAS
Open Source Code | No | We do not provide access to the data and code, but the data and models come from prior works; where differences between our work and prior work appear, we highlight them (Section 3 and Section A).
Open Datasets | Yes | We use a mixture of C4 (57%) [Raffel et al., 2020], Wikipedia (17%), Github (17%), Arxiv (9%); the proportions are motivated by the dataset used for Llama pretraining [Touvron et al., 2023].
Dataset Splits | Yes | Figure 5 reports the validation accuracy on Depth 1 and Depth 2 after fine-tuning on this mixture (tasks Depth 1 (FT) and Depth 2 (FT)).
Hardware Specification | No | We do not report these details in the submission but can include them in a final version.
Software Dependencies | No | All experiments use the Adafactor optimizer [Shazeer and Stern, 2018] and a sequence length of 1280.
Experiment Setup | Yes | We train a 24-layer decoder-only model with 1.5B parameters using the UL2 objective [Tay et al., 2022] on a mixture of C4, Wikipedia, Arxiv and Github. All models are trained on the same 500B tokens in the same order, using the same batch size (refer to Appendix A.1 for more details on the training setup). For the 1B and 2B models, we use a cosine learning rate schedule with a peak learning rate of 0.01 that decays to 0.001 at the end, and a batch size of 512. For the 8B model, we use a peak learning rate of 0.001 that decays to 0.0001, and a batch size of 1024. All experiments use the Adafactor optimizer [Shazeer and Stern, 2018] and a sequence length of 1280.
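
The quoted setup fully specifies the 1B/2B optimizer and learning-rate schedule. As a point of reference, below is a minimal sketch of that configuration in JAX/Optax; the framework, the Optax calls, and the step count (derived from 500B tokens at batch size 512 and sequence length 1280) are assumptions, not details taken from the paper.

import optax  # assumed dependency; the paper does not name its training framework

BATCH_SIZE = 512   # quoted batch size for the 1B and 2B models
SEQ_LEN = 1280     # quoted sequence length
# Roughly 763k steps to cover 500B tokens at this batch/sequence size (assumed derivation).
TOTAL_STEPS = 500_000_000_000 // (BATCH_SIZE * SEQ_LEN)

# Cosine schedule: peak learning rate 0.01 decaying to 0.001 (alpha = 0.001 / 0.01 = 0.1).
lr_schedule = optax.cosine_decay_schedule(
    init_value=0.01,
    decay_steps=TOTAL_STEPS,
    alpha=0.1,
)

# Adafactor optimizer [Shazeer and Stern, 2018] driven by the cosine schedule.
optimizer = optax.adafactor(learning_rate=lr_schedule)

For the quoted 8B configuration, the same sketch would use init_value=0.001 (decaying to 0.0001) and a batch size of 1024.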