Large Language Models as Analogical Reasoners
Authors: Michihiro Yasunaga, Xinyun Chen, Yujia Li, Panupong Pasupat, Jure Leskovec, Percy Liang, Ed H. Chi, Denny Zhou
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate the proposed approach in various reasoning-intensive tasks, including mathematical problem solving in GSM8K and MATH, code generation in Codeforces, and other reasoning tasks in BIG-Bench. |
| Researcher Affiliation | Collaboration | Michihiro Yasunaga¹², Xinyun Chen¹, Yujia Li¹, Panupong Pasupat¹, Jure Leskovec², Percy Liang², Ed H. Chi¹, Denny Zhou¹ (¹ Google DeepMind, ² Stanford University) |
| Pseudocode | Yes | Here is the pseudocode for the prefix product algorithm: `prefix = 1; for i in range(n): prefix = prefix * arr[i]` (a runnable version of this snippet is sketched below the table) |
| Open Source Code | No | The paper does not contain any explicit statements or links indicating that the source code for their proposed method is open-source or publicly available. |
| Open Datasets | Yes | We use popular benchmarks, GSM8K (Cobbe et al., 2021), comprising elementary math word problems, and MATH (Hendrycks et al., 2021b), consisting of advanced math problems from high school math competitions. ... other reasoning tasks in BIG-Bench (Srivastava et al., 2022). |
| Dataset Splits | No | The paper mentions evaluating on 'test problems' and using 'train set' for retrieving exemplars for baselines, but it does not provide specific details on the train/validation/test dataset splits (e.g., percentages, sample counts) for its own experiments. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU models, CPU types, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions using Python3 and LLMs like GPT-3.5-turbo, GPT-4, and PaLM2, but it does not specify version numbers for Python or any specific libraries/dependencies used for running the experiments. |
| Experiment Setup | Yes | "For each problem, we obtain an output from LLMs using a temperature of 0, and report the accuracy." (Section 5); "For each problem, we sample 10 outputs from LLMs, using a temperature of 0.7." (Section 5); "The number of exemplars to generate (K): Through experimentation, we have found that generating K = 3 to 5 exemplars works the best (more details in 6.5)." (Section 4.1); "We let LLMs self-generate K = 5 exemplars for GSM8K and K = 3 exemplars for MATH and BIG-Bench tasks. For Codeforces, we self-generate both knowledge and K = 3 exemplars." (Section 5.3). A configuration sketch appears below the table. |
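
The pseudocode quoted in the Pseudocode row is flattened into a single line in the excerpt. For reference, here is a minimal runnable Python sketch of that snippet; the function name `running_product` and the example input are illustrative, not from the paper.

```python
def running_product(arr):
    """Accumulate prefix = prefix * arr[i], as in the quoted pseudocode.

    As written in the excerpt, only the final accumulated product is kept;
    collecting the intermediate values of `prefix` would instead yield the
    list of prefix products.
    """
    prefix = 1
    for value in arr:
        prefix = prefix * value
    return prefix


if __name__ == "__main__":
    print(running_product([2, 3, 4]))  # prints 24
```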
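The Experiment Setup row reports two decoding configurations (a single greedy output at temperature 0 for accuracy, and 10 sampled outputs at temperature 0.7) together with the number of self-generated exemplars K. Below is a minimal sketch of how those settings might be wired together, assuming a hypothetical `query_llm(prompt, temperature, n)` helper and a paraphrased analogical-prompting instruction; neither is the paper's exact code or prompt.

```python
K = 3  # self-generated exemplars; the paper reports K = 3 to 5 works best

# Paraphrased analogical-prompting instruction (not the paper's exact prompt).
PROMPT_TEMPLATE = (
    "Problem: {problem}\n"
    "Recall {k} relevant and distinct example problems, solve each of them, "
    "and then solve the problem above."
)


def solve_greedy(problem, query_llm):
    """Accuracy runs: one output per problem at temperature 0."""
    prompt = PROMPT_TEMPLATE.format(problem=problem, k=K)
    return query_llm(prompt, temperature=0.0, n=1)[0]


def solve_sampled(problem, query_llm):
    """Sampling runs: 10 outputs per problem at temperature 0.7."""
    prompt = PROMPT_TEMPLATE.format(problem=problem, k=K)
    return query_llm(prompt, temperature=0.7, n=10)
```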