Large Language Models as Analogical Reasoners
Authors: Michihiro Yasunaga, Xinyun Chen, Yujia Li, Panupong Pasupat, Jure Leskovec, Percy Liang, Ed H. Chi, Denny Zhou
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate the proposed approach in various reasoning-intensive tasks, including mathematical problem solving in GSM8K and MATH, code generation in Codeforces, and other reasoning tasks in BIG-Bench. |
| Researcher Affiliation | Collaboration | Michihiro Yasunaga¹², Xinyun Chen¹, Yujia Li¹, Panupong Pasupat¹, Jure Leskovec², Percy Liang², Ed H. Chi¹, Denny Zhou¹ (¹ Google DeepMind, ² Stanford University) |
| Pseudocode | Yes | Here is the pseudocode for the prefix product algorithm: `prefix = 1; for i in range(n): prefix = prefix * arr[i]` (a runnable version of this snippet is sketched below the table) |
| Open Source Code | No | The paper does not contain any explicit statements or links indicating that the source code for their proposed method is open-source or publicly available. |
| Open Datasets | Yes | We use popular benchmarks, GSM8K (Cobbe et al., 2021), comprising elementary math word problems, and MATH (Hendrycks et al., 2021b), consisting of advanced math problems from high school math competitions. ... other reasoning tasks in BIG-Bench (Srivastava et al., 2022). |
| Dataset Splits | No | The paper mentions evaluating on 'test problems' and using 'train set' for retrieving exemplars for baselines, but it does not provide specific details on the train/validation/test dataset splits (e.g., percentages, sample counts) for its own experiments. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU models, CPU types, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions using Python3 and LLMs like GPT-3.5-turbo, GPT-4, and PaLM2, but it does not specify version numbers for Python or any specific libraries/dependencies used for running the experiments. |
| Experiment Setup | Yes | "For each problem, we obtain an output from LLMs using a temperature of 0, and report the accuracy." (Section 5); "For each problem, we sample 10 outputs from LLMs, using a temperature of 0.7." (Section 5); "The number of exemplars to generate (K): Through experimentation, we have found that generating K = 3 to 5 exemplars works the best (more details in 6.5)." (Section 4.1); "We let LLMs self-generate K = 5 exemplars for GSM8K and K = 3 exemplars for MATH and BIG-Bench tasks. For Codeforces, we self-generate both knowledge and K = 3 exemplars." (Section 5.3). A configuration sketch appears below the table. |
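
The pseudocode quoted in the Pseudocode row is flattened into a single line in the excerpt. For reference, here is a minimal runnable Python sketch of that snippet; the function name `running_product` and the example input are illustrative, not from the paper.

```python
def running_product(arr):
    """Accumulate prefix = prefix * arr[i], as in the quoted pseudocode.

    As written in the excerpt, only the final accumulated product is kept;
    collecting the intermediate values of `prefix` would instead yield the
    list of prefix products.
    """
    prefix = 1
    for value in arr:
        prefix = prefix * value
    return prefix


if __name__ == "__main__":
    print(running_product([2, 3, 4]))  # prints 24
```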
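The Experiment Setup row reports two decoding configurations (a single greedy output at temperature 0 for accuracy, and 10 sampled outputs at temperature 0.7) together with the number of self-generated exemplars K. Below is a minimal sketch of how those settings might be wired together, assuming a hypothetical `query_llm(prompt, temperature, n)` helper and a paraphrased analogical-prompting instruction; neither is the paper's exact code or prompt.

```python
K = 3  # self-generated exemplars; the paper reports K = 3 to 5 works best

# Paraphrased analogical-prompting instruction (not the paper's exact prompt).
PROMPT_TEMPLATE = (
    "Problem: {problem}\n"
    "Recall {k} relevant and distinct example problems, solve each of them, "
    "and then solve the problem above."
)


def solve_greedy(problem, query_llm):
    """Accuracy runs: one output per problem at temperature 0."""
    prompt = PROMPT_TEMPLATE.format(problem=problem, k=K)
    return query_llm(prompt, temperature=0.0, n=1)[0]


def solve_sampled(problem, query_llm):
    """Sampling runs: 10 outputs per problem at temperature 0.7."""
    prompt = PROMPT_TEMPLATE.format(problem=problem, k=K)
    return query_llm(prompt, temperature=0.7, n=10)
```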