PAL: Program-aided Language Models
Authors: Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, Graham Neubig
ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate this synergy between a neural LLM and a symbolic interpreter across 13 mathematical, symbolic, and algorithmic reasoning tasks from BIG-Bench Hard and others. In all these natural language reasoning tasks, generating code using an LLM and reasoning using a Python interpreter leads to more accurate results than much larger models. For example, PAL using CODEX achieves state-of-the-art few-shot accuracy on GSM8K, surpassing PaLM-540B which uses chain-of-thought by absolute 15% top-1. |
| Researcher Affiliation | Collaboration | Luyu Gao * 1 Aman Madaan * 1 Shuyan Zhou * 1 Uri Alon 1 Pengfei Liu 1 2 Yiming Yang 1 Jamie Callan 1 Graham Neubig 1 2 {luyug,amadaan,shuyanzh,ualon,pliu3,yiming,callan,gneubig}@cs.cmu.edu *Equal contribution 1Language Technologies Institute, Carnegie Mellon University, USA 2Inspired Cognition, USA. |
| Pseudocode | Yes | Figure 1: A diagram illustrating PAL: Given a mathematical reasoning question, Chain-of-thought (left) generates intermediate reasoning steps of free-form text. In contrast, Program-aided Language models (PAL, right) generate intermediate steps and Python code. |
| Open Source Code | Yes | Code and data at http://reasonwithpal.com. |
| Open Datasets | Yes | We experiment with three broad classes of reasoning tasks: (1) mathematical problems (§4.1) from a wide range of datasets including GSM8K (Cobbe et al., 2021), SVAMP (Patel et al., 2021), ASDIV (Miao et al., 2020), and MAWPS (Koncel-Kedziorski et al., 2016); (2) symbolic reasoning (§4.2) from BIG-Bench Hard (Suzgun et al., 2022); (3) and algorithmic problems (§4.3) from BIG-Bench Hard as well. Details of all datasets are shown in Appendix I. |
| Dataset Splits | No | Few-shot prompting leverages the strength of large-language models to solve a task with a set of k examples that are provided as part of the test-time input (Brown et al., 2020; Liu et al., 2021; Chowdhery et al., 2022), where k is usually a number in the low single digits. These input-output examples $\{(x_i, y_i)\}_{i=1}^{k}$ are concatenated in a prompt $p \equiv \langle x_1 \cdot y_1 \rangle \Vert \langle x_2 \cdot y_2 \rangle \Vert \dots \Vert \langle x_k \cdot y_k \rangle$. During inference, a test instance $x_{\text{test}}$ is appended to the prompt, and $p \Vert x_{\text{test}}$ is passed to the model, which attempts to complete $p \Vert x_{\text{test}}$ and thereby generate an answer $y_{\text{test}}$. Note that such few-shot prompting does not modify the underlying LLM. |
| Hardware Specification | No | The paper states, 'Unless stated otherwise, we used CODEX (code-davinci-002) as our backend LLM for both PAL, DIRECT, and COT.' However, it does not provide any specific details about the hardware (e.g., GPU models, CPU models, memory) used to run the experiments. |
| Software Dependencies | No | The paper mentions using 'a standard Python interpreter' and 'CODEX (code-davinci-002)' as the backend LLM. However, it does not specify concrete version numbers for Python or any other key software libraries or dependencies used in the experiments. |
| Experiment Setup | Yes | We performed greedy decoding from the language model using a temperature of 0. |
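To make the "Pseudocode" row concrete, the following is a minimal sketch of the kind of program PAL generates: the intermediate reasoning steps are written as Python statements rather than free-form text, and a standard Python interpreter runs them to produce the answer. The question, variable names, and comments below are illustrative assumptions, not the authors' actual prompt exemplars.

```python
# Hypothetical PAL-style generated program for a simple math word problem:
# "Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
#  Each can has 3 tennis balls. How many tennis balls does he have now?"
tennis_balls = 5                      # balls Roger starts with
bought_balls = 2 * 3                  # 2 cans times 3 balls per can
answer = tennis_balls + bought_balls  # the interpreter, not the LLM, does the arithmetic
print(answer)                         # prints 11
```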
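The "Dataset Splits" row quotes the paper's description of few-shot prompting: concatenate k input-output exemplars and append the test instance. Below is a minimal sketch of that construction; the `build_prompt` helper and the Q:/A: delimiters are assumptions for illustration, not the paper's exact prompt format.

```python
def build_prompt(examples, x_test):
    """Concatenate k (x_i, y_i) exemplars and append the test instance,
    mirroring p = <x_1 . y_1> || ... || <x_k . y_k> followed by x_test."""
    parts = [f"Q: {x}\nA: {y}" for x, y in examples]
    parts.append(f"Q: {x_test}\nA:")  # the model completes this to produce y_test
    return "\n\n".join(parts)

# Usage with k = 2 hypothetical exemplars:
prompt = build_prompt(
    [("2 + 2 = ?", "4"), ("3 * 5 = ?", "15")],
    "7 - 4 = ?",
)
```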
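The "Experiment Setup" row reports greedy decoding at temperature 0, with CODEX (code-davinci-002) as the backend LLM per the quotes above and a standard Python interpreter executing the generated program. The sketch below shows one way those pieces could fit together, assuming the legacy OpenAI completions SDK; everything other than temperature=0 and the model name (max_tokens, the stop sequence, the `answer` variable convention) is an illustrative assumption, and exec-ing model output should be sandboxed in practice.

```python
import openai

def pal_answer(prompt):
    # Greedy decoding at temperature 0, as stated in the experiment setup.
    resp = openai.Completion.create(
        model="code-davinci-002",  # CODEX backend named in the paper
        prompt=prompt,
        temperature=0,
        max_tokens=256,            # assumed generation budget
        stop=["Q:"],               # assumed delimiter between exemplars
    )
    program = resp["choices"][0]["text"]
    scope = {}
    exec(program, scope)           # offload the computation to the Python interpreter
    return scope.get("answer")     # assumed convention: program stores its result in `answer`
```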