Can Language Models Learn to Skip Steps?
Authors: Tengxiao Liu, Qipeng Guo, Xiangkun Hu, Cheng Jiayang, Yue Zhang, Xipeng Qiu, Zheng Zhang
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical results indicate that models can develop the step skipping ability under our guidance. Moreover, after fine-tuning on expanded datasets that include both complete and skipped reasoning sequences, the models can not only resolve tasks with increased efficiency without sacrificing accuracy, but also exhibit comparable and even enhanced generalization capabilities in out-of-domain scenarios. |
| Researcher Affiliation | Collaboration | Authors: Tengxiao Liu, Qipeng Guo, Xiangkun Hu, Cheng Jiayang, Yue Zhang, Xipeng Qiu, Zheng Zhang. Affiliations: Fudan University, UC Santa Barbara, Shanghai AI Laboratory, Westlake University, Amazon AWS AI. |
| Pseudocode | No | The paper describes its framework and methods textually and through figures, but it does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it present structured steps in a code-like format. |
| Open Source Code | Yes | Code and data are publicly available at: https://github.com/tengxiaoliu/LM_skip. |
| Open Datasets | Yes | We design three tasks to investigate the model's step skipping behavior (Figure 3). In each task, the intermediate steps needed to solve these problems are explicitly detailed and well-defined, facilitating a clear analysis of the model's predictions. ... For training and in-domain test data, we only consider additions involving numbers with up to 3 digits. ... We additionally consider long-form symbolic directional reasoning, which poses a challenge for direct solution and necessitates continuous reasoning steps to arrive at the answer. This task provides an initial direction and a list of turning actions. The desired answer is the final facing direction. For the training and in-domain test sets, we consider questions that contain 10 actions. (A toy generator for this directional task is sketched after the table.) |
| Dataset Splits | No | The paper specifies 'training' and 'in-domain test' splits, along with 'OOD-easy' and 'OOD-hard' test sets, but it does not mention or detail a separate validation split for hyperparameter tuning or early stopping. |
| Hardware Specification | Yes | All experiments are conducted on eight V100 GPUs each with 32GB memory. |
| Software Dependencies | Yes | For all our experiments, we use Llama 2 (7B parameters) [43] and phi-3-mini (3.8B parameters, with context length of 4K) [1] as our base model. We train the model using a learning rate of 5e-6 for 2 epochs with the AdamW optimizer [29]. ... For the Analog of Algebra task... we utilize the SymPy [33] library. (A SymPy-based equivalence check is sketched after the table.) |
| Experiment Setup | Yes | For all our experiments, we use Llama 2 (7B parameters) [43] and phi-3-mini (3.8B parameters, with context length of 4K) [1] as our base model. We train the model using a learning rate of 5e-6 for 2 epochs with the AdamW optimizer [29]. During inference, we employ greedy decoding. We run our experiments with three different random seeds and report the average and standard deviation. (A training-setup sketch follows the table.) |
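To make the directional reasoning task concrete, below is a minimal toy generator in the spirit of the description quoted above: given an initial facing direction and a list of turning actions, the answer is the final direction. The action vocabulary, question phrasing, and field names are assumptions for illustration; the actual data format is defined in the released repository (https://github.com/tengxiaoliu/LM_skip).

```python
import random

# Directions listed in clockwise order so that a right turn is +1 step.
DIRECTIONS = ["north", "east", "south", "west"]
# Assumed action vocabulary; the released data defines the real one.
TURNS = {"turn left": -1, "turn right": 1, "turn around": 2}

def solve(initial, actions):
    """Apply each turning action in order and return the final direction."""
    idx = DIRECTIONS.index(initial)
    for action in actions:
        idx = (idx + TURNS[action]) % 4
    return DIRECTIONS[idx]

def make_example(num_actions=10, seed=None):
    """Sample one question; the in-domain setting uses 10 actions."""
    rng = random.Random(seed)
    initial = rng.choice(DIRECTIONS)
    actions = [rng.choice(list(TURNS)) for _ in range(num_actions)]
    question = (f"You are facing {initial}. "
                + " ".join(a.capitalize() + "." for a in actions)
                + " What direction are you facing now?")
    return {"question": question, "answer": solve(initial, actions)}

if __name__ == "__main__":
    print(make_example(num_actions=10, seed=0))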
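The paper states only that SymPy is used for the Analog of Algebra task. One plausible use, sketched below, is checking whether a model's predicted expression is symbolically equivalent to the reference answer; the function name and evaluation logic here are assumptions, not the authors' code.

```python
import sympy

def expressions_match(pred: str, gold: str) -> bool:
    """Return True if the two expressions simplify to the same form."""
    try:
        # Equivalence check: the difference of equivalent expressions
        # simplifies to zero.
        diff = sympy.simplify(sympy.sympify(pred) - sympy.sympify(gold))
        return diff == 0
    except (sympy.SympifyError, TypeError):
        # Unparseable predictions count as wrong.
        return False

assert expressions_match("(x + 1)**2", "x**2 + 2*x + 1")
assert not expressions_match("x + 1", "x - 1")
```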
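The reported hyperparameters (Llama 2 7B or phi-3-mini base model, AdamW, learning rate 5e-6, 2 epochs, greedy decoding at inference, three random seeds) map naturally onto a Hugging Face Transformers fine-tuning loop. The sketch below is a minimal assumed reconstruction: the toy dataset, default batch size, and output path are placeholders, and the authors' released code should be treated as authoritative.

```python
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

model_name = "meta-llama/Llama-2-7b-hf"  # or "microsoft/Phi-3-mini-4k-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Llama 2 defines no pad token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Toy stand-in for the expanded datasets mixing complete and skipped
# reasoning sequences; the real data is released with the paper's code.
texts = ["Q: 2 + 3 = ? A: 2 + 3 = 5. The answer is 5."]
dataset = Dataset.from_dict({"text": texts}).map(
    lambda ex: tokenizer(ex["text"], truncation=True),
    remove_columns=["text"])

args = TrainingArguments(
    output_dir="lm_skip_ckpt",   # hypothetical output path
    learning_rate=5e-6,          # reported learning rate
    num_train_epochs=2,          # reported number of epochs
    optim="adamw_torch",         # AdamW, as reported
    seed=0,                      # the paper averages over three seeds
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()

# Greedy decoding at inference, matching the reported setting.
inputs = tokenizer("Q: 2 + 3 = ? A:", return_tensors="pt")
out = model.generate(**inputs, do_sample=False, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```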