Hypothesis Search: Inductive Reasoning with Language Models
Authors: Ruocheng Wang, Eric Zelikman, Gabriel Poesia, Yewen Pu, Nick Haber, Noah Goodman
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct experiments on four inductive reasoning datasets: the Abstraction and Reasoning Corpus (ARC), the one-dimensional variant of ARC (1D-ARC), the Syntax-Guided Synthesis (SyGuS) dataset, and the List Functions dataset. Our results indicate that explicit hypothesis formation substantially improves performance over the direct prompting (ICL) approach. Ablation studies suggest both levels of abstraction, natural-language hypothesis generation and programmatic hypothesis representations, are beneficial to performing inductive reasoning tasks. |
| Researcher Affiliation | Collaboration | Ruocheng Wang¹, Eric Zelikman¹, Gabriel Poesia¹, Yewen Pu², Nick Haber¹, Noah D. Goodman¹; ¹Stanford University, ²Autodesk Research |
| Pseudocode | Yes | The pseudocode for this stage is presented in Algorithm 1. |
| Open Source Code | No | The paper does not contain any explicit statements about releasing source code for the methodology or provide a link to a code repository. |
| Open Datasets | Yes | We conduct experiments on four inductive reasoning datasets: the Abstraction and Reasoning Corpus (ARC), the one-dimensional variant of ARC (1D-ARC), the Syntax-Guided Synthesis (SyGuS) dataset, and the List Functions dataset... The Abstraction and Reasoning Corpus (ARC), proposed by Chollet (2019)... 1D-ARC is a one-dimensional adaptation of the original ARC dataset proposed in (Xu et al., 2023b)... The SyGuS dataset in the BUSTLE paper contains 89 tasks (Odena et al., 2020)... The List Functions dataset proposed in Rule et al. (2020). |
| Dataset Splits | No | The paper describes validating generated programs against training examples and uses the term "validation" for that process, but it does not specify a separate validation *dataset split* (e.g., an 80/10/10 split) of the overall datasets for model tuning or hyperparameter selection. |
| Hardware Specification | No | The paper mentions using "GPT-4" and "GPT-3.5" models, which implies cloud-based inference, but it does not specify any particular hardware (e.g., CPU, GPU models, memory) used for running the experiments or training. |
| Software Dependencies | No | The paper names the model versions "gpt-4-0613" and "gpt-3.5-turbo-0301", and NumPy is implied by code snippets, but it does not provide version numbers for general software dependencies such as Python, PyTorch, or other libraries. |
| Experiment Setup | Yes | For hypothesis generation, the prompts are shown in Figure A.8, Figure A.10 and Figure A.11. We set the temperature to be 1.0 and the maximum number of tokens in response to be 200. For program generation and execution feedback, we use a temperature of 0.7 and set the maximum number of tokens to be 1000... This is followed by 3 rounds of execution feedback. |
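The Experiment Setup row pins down the sampling parameters (temperature 1.0 and 200 tokens for hypothesis generation; temperature 0.7, 1000 tokens, and 3 rounds of execution feedback for program generation) but, since no source code is released, not the surrounding harness. The sketch below shows how those settings could be reproduced with the OpenAI Python client. It is a minimal reconstruction of the pipeline described in the paper's Algorithm 1, not the authors' implementation: the helpers `build_hypothesis_prompt`, `build_program_prompt`, and `run_on_training_examples` are hypothetical stand-ins, and the choice of the current `openai` client is an assumption made for readability.

```python
# Hedged reproduction sketch: natural-language hypothesis generation followed by
# program generation with execution feedback, using the sampling parameters
# quoted in the Experiment Setup row. Prompt builders and the program executor
# are hypothetical placeholders, not code from the paper.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def generate_hypotheses(task_examples, n=8):
    """Sample candidate natural-language hypotheses (temperature 1.0, 200 tokens)."""
    prompt = build_hypothesis_prompt(task_examples)  # hypothetical helper
    response = client.chat.completions.create(
        model="gpt-4-0613",              # model version named in the paper
        messages=[{"role": "user", "content": prompt}],
        temperature=1.0,
        max_tokens=200,
        n=n,
    )
    return [choice.message.content for choice in response.choices]


def generate_program(hypothesis, task_examples, feedback_rounds=3):
    """Implement a hypothesis as a program; retry with execution feedback on failure."""
    messages = [{
        "role": "user",
        "content": build_program_prompt(hypothesis, task_examples),  # hypothetical helper
    }]
    for _ in range(feedback_rounds + 1):  # initial attempt plus 3 feedback rounds
        response = client.chat.completions.create(
            model="gpt-4-0613",
            messages=messages,
            temperature=0.7,
            max_tokens=1000,
        )
        program = response.choices[0].message.content
        ok, error_report = run_on_training_examples(program, task_examples)  # hypothetical executor
        if ok:
            return program
        # Append the failed attempt and its execution errors, then ask for a revision.
        messages.append({"role": "assistant", "content": program})
        messages.append({"role": "user", "content": error_report})
    return None  # no program passed all training examples
```

Any client exposing `temperature`, `max_tokens`, and `n` would serve equally well; only the parameter values themselves are taken from the paper as quoted above.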