Hypothesis Search: Inductive Reasoning with Language Models
Authors: Ruocheng Wang, Eric Zelikman, Gabriel Poesia, Yewen Pu, Nick Haber, Noah Goodman
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct experiments on four inductive reasoning datasets: the Abstraction and Reasoning Corpus (ARC), the one-dimensional variant of ARC (1D-ARC), the Syntax-Guided Synthesis (SyGuS) dataset, and the List Functions dataset. Our results indicate that explicit hypothesis formation substantially improves performance over the direct prompting (ICL) approach. Ablation studies suggest both levels of abstraction, natural-language hypothesis generation and programmatic hypothesis representations, are beneficial to performing inductive reasoning tasks. |
| Researcher Affiliation | Collaboration | Ruocheng Wang¹, Eric Zelikman¹, Gabriel Poesia¹, Yewen Pu², Nick Haber¹, Noah D. Goodman¹; ¹Stanford University, ²Autodesk Research |
| Pseudocode | Yes | The pseudocode for this stage is presented in Algorithm 1. |
| Open Source Code | No | The paper does not contain any explicit statements about releasing source code for the methodology or provide a link to a code repository. |
| Open Datasets | Yes | We conduct experiments on four inductive reasoning datasets: the Abstraction and Reasoning Corpus (ARC), the one-dimensional variant of ARC (1D-ARC), the Syntax-Guided Synthesis (SyGuS) dataset, and the List Functions dataset... The Abstraction and Reasoning Corpus (ARC), proposed by Chollet (2019)... 1D-ARC is a one-dimensional adaptation of the original ARC dataset proposed in (Xu et al., 2023b)... The SyGuS dataset in the BUSTLE paper contains 89 tasks (Odena et al., 2020)... The List Functions dataset proposed in Rule et al. (2020). |
| Dataset Splits | No | The paper describes validating generated programs against training examples and uses the term "validation" for that process, but it does not specify a separate validation *dataset split* (e.g., an 80/10/10 split) of the overall datasets for model tuning or hyperparameter selection. |
| Hardware Specification | No | The paper mentions using "GPT-4" and "GPT-3.5" models, which implies cloud-based inference, but it does not specify any particular hardware (e.g., CPU, GPU models, memory) used for running the experiments or training. |
| Software Dependencies | No | The paper names the model versions "gpt-4-0613" and "gpt-3.5-turbo-0301", and NumPy is implied by code snippets, but it does not provide version numbers for general software dependencies such as Python, PyTorch, or other libraries. |
| Experiment Setup | Yes | For hypothesis generation, the prompts are shown in Figure A.8, Figure A.10 and Figure A.11. We set the temperature to be 1.0 and the maximum number of tokens in response to be 200. For program generation and execution feedback, we use a temperature of 0.7 and set the maximum number of tokens to be 1000... This is followed by 3 rounds of execution feedback. |
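The Experiment Setup row pins down the sampling parameters (temperature 1.0 and 200 tokens for hypothesis generation; temperature 0.7, 1000 tokens, and 3 rounds of execution feedback for program generation) but, since no source code is released, not the surrounding harness. The sketch below shows how those settings could be reproduced with the OpenAI Python client. It is a minimal reconstruction of the pipeline described in the paper's Algorithm 1, not the authors' implementation: the helpers `build_hypothesis_prompt`, `build_program_prompt`, and `run_on_training_examples` are hypothetical stand-ins, and the choice of the current `openai` client is an assumption made for readability.

```python
# Hedged reproduction sketch: natural-language hypothesis generation followed by
# program generation with execution feedback, using the sampling parameters
# quoted in the Experiment Setup row. Prompt builders and the program executor
# are hypothetical placeholders, not code from the paper.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def generate_hypotheses(task_examples, n=8):
    """Sample candidate natural-language hypotheses (temperature 1.0, 200 tokens)."""
    prompt = build_hypothesis_prompt(task_examples)  # hypothetical helper
    response = client.chat.completions.create(
        model="gpt-4-0613",              # model version named in the paper
        messages=[{"role": "user", "content": prompt}],
        temperature=1.0,
        max_tokens=200,
        n=n,
    )
    return [choice.message.content for choice in response.choices]


def generate_program(hypothesis, task_examples, feedback_rounds=3):
    """Implement a hypothesis as a program; retry with execution feedback on failure."""
    messages = [{
        "role": "user",
        "content": build_program_prompt(hypothesis, task_examples),  # hypothetical helper
    }]
    for _ in range(feedback_rounds + 1):  # initial attempt plus 3 feedback rounds
        response = client.chat.completions.create(
            model="gpt-4-0613",
            messages=messages,
            temperature=0.7,
            max_tokens=1000,
        )
        program = response.choices[0].message.content
        ok, error_report = run_on_training_examples(program, task_examples)  # hypothetical executor
        if ok:
            return program
        # Append the failed attempt and its execution errors, then ask for a revision.
        messages.append({"role": "assistant", "content": program})
        messages.append({"role": "user", "content": error_report})
    return None  # no program passed all training examples
```

Any client exposing `temperature`, `max_tokens`, and `n` would serve equally well; only the parameter values themselves are taken from the paper as quoted above.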