PAL: Program-aided Language Models

Authors: Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, Graham Neubig

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate this synergy between a neural LLM and a symbolic interpreter across 13 mathematical, symbolic, and algorithmic reasoning tasks from BIG-Bench Hard and others. In all these natural language reasoning tasks, generating code using an LLM and reasoning using a Python interpreter leads to more accurate results than much larger models. For example, PAL using CODEX achieves state-of-the-art few-shot accuracy on GSM8K, surpassing PaLM-540B which uses chain-of-thought by absolute 15% top-1.
Researcher Affiliation | Collaboration | Luyu Gao*1, Aman Madaan*1, Shuyan Zhou*1, Uri Alon1, Pengfei Liu1,2, Yiming Yang1, Jamie Callan1, Graham Neubig1,2. {luyug,amadaan,shuyanzh,ualon,pliu3,yiming,callan,gneubig}@cs.cmu.edu. *Equal contribution. 1Language Technologies Institute, Carnegie Mellon University, USA. 2Inspired Cognition, USA.
Pseudocode | Yes | Figure 1: A diagram illustrating PAL: Given a mathematical reasoning question, Chain-of-thought (left) generates intermediate reasoning steps of free-form text. In contrast, Program-aided Language models (PAL, right) generate intermediate steps and Python code.
Open Source Code | Yes | Code and data at http://reasonwithpal.com.
Open Datasets | Yes | We experiment with three broad classes of reasoning tasks: (1) mathematical problems (§4.1) from a wide range of datasets including GSM8K (Cobbe et al., 2021), SVAMP (Patel et al., 2021), ASDIV (Miao et al., 2020), and MAWPS (Koncel-Kedziorski et al., 2016); (2) symbolic reasoning (§4.2) from BIG-Bench Hard (Suzgun et al., 2022); and (3) algorithmic problems (§4.3) from BIG-Bench Hard as well. Details of all datasets are shown in Appendix I.
Dataset Splits | No | Few-shot prompting leverages the strength of large language models to solve a task with a set of k examples that are provided as part of the test-time input (Brown et al., 2020; Liu et al., 2021; Chowdhery et al., 2022), where k is usually a number in the low single digits. These input-output examples {(x_i, y_i)}_{i=1}^{k} are concatenated in a prompt p ≡ ⟨x_1 · y_1⟩ ∥ ⟨x_2 · y_2⟩ ∥ … ∥ ⟨x_k · y_k⟩. During inference, a test instance x_test is appended to the prompt, and p ∥ x_test is passed to the model, which attempts to complete p ∥ x_test and thereby generate an answer y_test. Note that such few-shot prompting does not modify the underlying LLM. (A prompt-construction sketch is given after this table.)
Hardware Specification | No | The paper states, 'Unless stated otherwise, we used CODEX (code-davinci-002) as our backend LLM for both PAL, DIRECT, and COT.' However, it does not provide any specific details about the hardware (e.g., GPU models, CPU models, memory) used to run the experiments.
Software Dependencies | No | The paper mentions using 'a standard Python interpreter' and 'CODEX (code-davinci-002)' as the backend LLM. However, it does not specify concrete version numbers for Python or any other key software libraries or dependencies used in the experiments.
Experiment Setup | Yes | We performed greedy decoding from the language model using a temperature of 0. (A generate-and-execute sketch of this setup follows below.)
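
To make the few-shot prompting setup quoted under Dataset Splits concrete, the following minimal Python sketch builds p ≡ ⟨x_1 · y_1⟩ ∥ … ∥ ⟨x_k · y_k⟩ followed by x_test from k exemplars. The build_prompt helper, the "Q:/A:" layout, and the exemplar text are illustrative assumptions, not the paper's released prompts.

```python
# Minimal sketch of few-shot prompt construction: concatenate k exemplars, then
# append the test instance. Helper name, layout, and exemplar are illustrative.

def build_prompt(exemplars, x_test):
    """Concatenate k input-output exemplars and append the test instance."""
    parts = [f"Q: {x}\nA: {y}" for x, y in exemplars]
    parts.append(f"Q: {x_test}\nA:")  # the LLM is asked to complete this final slot
    return "\n\n".join(parts)

exemplars = [
    ("Olivia has $23. She bought five bagels for $3 each. How much money does she have left?",
     "Olivia spent 5 * 3 = 15 dollars, so 23 - 15 = 8 dollars are left. The answer is 8."),
]
print(build_prompt(exemplars, "Roger has 5 tennis balls. He buys 2 more cans of 3 balls each. How many tennis balls does he have now?"))
```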
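Similarly, the PAL pipeline itself (an LLM writes Python as its intermediate reasoning, and a standard Python interpreter computes the final answer) can be sketched as below. The generate callable stands in for whatever completion client backs the experiment (the paper used CODEX, code-davinci-002, with greedy decoding at temperature 0); the solution() convention and the error handling are assumptions for illustration, not the paper's exact harness.

```python
# Hedged sketch of PAL's generate-then-execute step. `generate` is a stand-in for
# any LLM completion client (the paper's backend was code-davinci-002); the
# convention that the generated program defines solution() is an assumption here.

def run_pal(question, few_shot_prompt, generate):
    """Ask the LLM for a Python program, then let the interpreter compute the answer."""
    prompt = f"{few_shot_prompt}\n\nQ: {question}\n\n# solution in Python:\n"
    code = generate(prompt, temperature=0.0)  # greedy decoding (temperature 0), as reported
    namespace = {}
    try:
        exec(code, namespace)                 # the reasoning steps run as real Python
        return namespace["solution"]()        # read off the interpreter-computed answer
    except Exception:
        return None                           # a malformed program yields no answer
```

The paper's main results correspond to the single greedy sample shown here, with the final numeric answer taken from the executed program rather than from free-form text.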