Specializing Smaller Language Models towards Multi-Step Reasoning

Authors: Yao Fu, Hao Peng, Litu Ou, Ashish Sabharwal, Tushar Khot

ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct model specialization on two model families: the raw pretrained checkpoints and their instruction-tuned checkpoints (recall that the instruction-tuned checkpoints are generally more capable than the raw pretrained checkpoints, Fig. 1A). Specifically, we consider the raw pretrained T5 Base (250M)/Large (760M)/XL (3B)/XXL (11B) and the instruction-tuned FlanT5s. In Sec. 4.1, we validate our main hypothesis that large models can perform well on a wide range of tasks while a smaller model's ability can be moved from generic abilities to a specialized target ability. Specifically, we show model specialization can indeed improve CoT math performance for FlanT5-Base/Large/XL/XXL, while paying the price of generic abilities, i.e., losing all CoT abilities on BigBench Hard and a large portion of answer-only (AO) abilities.
Researcher Affiliation | Academia | 1: University of Edinburgh, 2: Allen Institute for AI. Correspondence to: Yao Fu <yao.fu@ed.ac.uk>, Tushar Khot <tushark@allenai.org>.
Pseudocode | No | The information is insufficient. The paper provides a mathematical recursion for dynamic programming but no explicitly labeled 'Pseudocode' or 'Algorithm' block, nor structured steps in a code-like format.
Open Source Code | Yes | Our code is at https://github.com/FranxYao/FlanT5-CoT-Specialization
Open Datasets | Yes | We use GSM8K (Cobbe et al., 2021) as our seed dataset... We test the model's out-of-distribution performance on MultiArith, ASDiv, and SVAMP (Wei et al., 2022b).
Dataset Splits | Yes | None of the datasets has official train-dev-test splits, so we randomly sample 500 instances as the validation set from the training set, and use the remaining instances (800 for GSM8K, 400 for MultiArith, 18K for ASDiv, 500 for SVAMP) as the test set. (A minimal split sketch follows the table.)
Hardware Specification | No | The information is insufficient. The paper mentions general 'compute' needs and model sizes but does not specify the particular hardware (e.g., GPU/CPU models, memory) used for running the experiments.
Software Dependencies | No | The information is insufficient. The paper mentions models like FlanT5 and GPT-3.5 and APIs, but it does not list specific software dependencies with version numbers (e.g., Python, PyTorch/TensorFlow versions, or other libraries).
Experiment Setup | Yes | Given a training question corpus, we use code-davinci-002 to generate 40 new CoT solutions, then take the ones that lead to the correct answers as our training data. ...we further consider three additional data formats: (1) in-context answer-only (Fig. 1 B1), where we do not use the CoT data... (2) in-context chain-of-thought (Fig. 1 B2)... (3) zero-shot answer-only... In terms of training objectives, in the distillation literature there are typically two types of distillation approaches: (1) sample matching... (2) distribution matching... The objective of the experiments is to see to what extent we can lift up the scaling curve of smaller models' math CoT performance and what the price of it is. We conduct model specialization on two model families: the raw pretrained checkpoints and their instruction-tuned checkpoints... Smaller models need to see the data more times than larger models (A2 has 3 epochs and A1 has 2). (A CoT data-generation sketch follows the table.)
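
The Dataset Splits row describes holding out 500 randomly sampled instances as a validation set. Below is a minimal sketch of that split, assuming the training instances are already loaded into a Python list; the function and argument names are hypothetical, not the authors' code.

```python
# Minimal split sketch, assuming `train_instances` is a list of loaded examples.
import random

def make_splits(train_instances, n_valid=500, seed=0):
    """Hold out n_valid randomly sampled instances as a validation set."""
    rng = random.Random(seed)
    shuffled = list(train_instances)   # copy so the caller's order is untouched
    rng.shuffle(shuffled)
    valid = shuffled[:n_valid]         # 500 sampled validation instances
    rest = shuffled[n_valid:]          # remaining instances, used as the test set above
    return valid, rest
```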
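
The Experiment Setup row quotes the paper's data-generation step: sample 40 candidate CoT solutions per training question from the teacher and keep only those that reach the correct answer. The sketch below illustrates that filtering loop under stated assumptions; `sample_cot_solutions` and `extract_final_answer` are hypothetical stand-ins for the teacher API call (e.g., code-davinci-002) and answer parsing, not the authors' implementation.

```python
# Hedged sketch of the CoT data-generation step: rejection-filter teacher samples
# so that only solutions reaching the gold answer become training data.

def build_specialization_data(questions, gold_answers,
                              sample_cot_solutions, extract_final_answer,
                              n_samples=40):
    """Return (question, CoT solution) pairs whose reasoning reaches the gold answer."""
    training_pairs = []
    for question, gold in zip(questions, gold_answers):
        # Query the teacher model for n_samples candidate chain-of-thought solutions.
        candidates = sample_cot_solutions(question, n=n_samples)
        for solution in candidates:
            # Keep a candidate only if its parsed final answer matches the gold answer.
            if extract_final_answer(solution) == gold:
                training_pairs.append((question, solution))
    return training_pairs
```

Because the helpers are passed in as arguments, the sketch stays independent of any particular API client; only the keep-if-correct filtering logic described in the paper is fixed.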