Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

HypoBootstrap: A Bootstrapping Framework for Inductive Reasoning

Authors: Si Chen, Yifei Li, Richong Zhang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Empirical studies on four inductive reasoning scenarios of different natures, involving causal induction, concept learning, grammar learning, and abstract reasoning, demonstrate that HypoBootstrap significantly outperforms existing methods. We conduct experiments on four inductive reasoning scenarios with varying nature: causal induction, concept learning, grammar learning, and abstract reasoning. Empirical results demonstrate significant improvement of HypoBootstrap compared with previous works and verify that both the bootstrap generation and bootstrap confirmation are effective.
Researcher Affiliation	Academia	Si Chen, Yifei Li, Richong Zhang. SKLCCSE, Beihang University, Beijing, China. EMAIL, EMAIL, EMAIL
Pseudocode	Yes	The pseudo-code is given in Appendix B.1. As depicted in Figure 2, the framework generates the object hypothesis, relational hypothesis, and functional hypothesis in a bootstrapping manner, with confirmations embedded into the generation procedure. Algorithm 1: HypoBootstrap
Open Source Code	Yes	Code is available at https://github.com/chensi99/HypoBootstrap.
Open Datasets	Yes	We use four inductive reasoning datasets with varying natures: causal induction, concept learning, grammar learning, and abstract reasoning. See Qiu et al. [20] and our codebase for examples and more details. The Abstract Causal REasoning dataset (ACRE) [27] is a diagnostic benchmark for causal induction. List Functions dataset [21] is initially designed for psychological investigation of concept learning. MiniSCAN [12] requires the ability of sequence-to-sequence learning. The Abstract Reasoning Corpus (ARC) [1] is an advanced benchmark for measuring general fluid intelligence. MiniARC [10] is a small-scale version of ARC, where inputs and outputs are 5x5 visual grids.
Dataset Splits	No	Evaluation is conducted on a hold-out set of unobserved evidence, separated from the observed evidence used to infer rules. The hold-out evidence ensures that E itself would not be an acceptable rule and a good inductive reasoner should have inferred the underlying mapping behind the observed evidence.
Hardware Specification	No	Since we mainly use public API services, the computer resources cannot be determined. Instead, we have discussed the token consumption in Section 6.
Software Dependencies	No	All experiments use GPT-4 [18]. We also include results on DeepSeek-V3 [3] in Appendix C.1. The agents' prompts used for GPT-4 and List Functions dataset are shown in Table 5.
Experiment Setup	Yes	All experiments use GPT-4 [18]. We also include results on DeepSeek-V3 [3] in Appendix C.1. For a fair comparison, we ensure that the number of functional hypotheses generated across different methods remains the same, i.e., N = 1 in HR and K = T in MoC. This also ensures that the frequency of unit testing in training remains the same across all methods. In addition, the LLM's decoding temperature in HypoBootstrap is set to 0, i.e., greedy decoding. Our experimental setup follows HR [20], detailed in this subsection.
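
The hold-out evaluation described in the Dataset Splits row can be illustrated with a minimal Python sketch. This is not the authors' code: the toy list-reversal task, the example data, and the names `holdout_accuracy`, `memorized_rule`, and `induced_rule` are all assumptions made for illustration. It only shows why a rule that memorizes the observed evidence E fails on held-out evidence, while a rule that captures the underlying mapping passes.

```python
from typing import Callable, List, Tuple

# An example is an (input, output) pair; tuples are used so inputs are hashable.
Example = Tuple[Tuple[int, ...], Tuple[int, ...]]
Rule = Callable[[Tuple[int, ...]], Tuple[int, ...]]

# Hypothetical toy task: the hidden mapping reverses the input list.
observed: List[Example] = [((1, 2, 3), (3, 2, 1)), ((4, 5), (5, 4))]
held_out: List[Example] = [((7, 8, 9), (9, 8, 7)), ((0, 1), (1, 0))]


def holdout_accuracy(rule: Rule, examples: List[Example]) -> float:
    """Fraction of examples on which the rule reproduces the expected output."""
    if not examples:
        return 0.0
    correct = 0
    for x, y in examples:
        try:
            if rule(x) == y:
                correct += 1
        except Exception:
            # A rule that crashes on an input counts as wrong for that input.
            pass
    return correct / len(examples)


def memorized_rule(x: Tuple[int, ...]) -> Tuple[int, ...]:
    """Degenerate 'rule' that is just the observed evidence E as a lookup table."""
    return dict(observed)[x]  # raises KeyError on unseen inputs


def induced_rule(x: Tuple[int, ...]) -> Tuple[int, ...]:
    """A rule that captures the underlying mapping (assumed here: reversal)."""
    return tuple(reversed(x))


if __name__ == "__main__":
    print("memorized, observed:", holdout_accuracy(memorized_rule, observed))
    print("memorized, held-out:", holdout_accuracy(memorized_rule, held_out))
    print("induced,   held-out:", holdout_accuracy(induced_rule, held_out))
```

Both rules fit the observed evidence perfectly, so only the held-out score separates them, which is the point of the separation the row describes.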