reproducibilityindex.ai

Buffer of Thoughts: Thought-Augmented Reasoning with Large Language Models

Authors: Ling Yang, Zhaochen Yu, Tianjun Zhang, Shiyi Cao, Minkai Xu, Wentao Zhang, Joseph E. Gonzalez, Bin Cui

NeurIPS 2024

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We conduct extensive experiments on 10 challenging reasoning-intensive tasks, and achieve significant performance improvements over previous SOTA methods: 11% on Game of 24, 20% on Geometric Shapes and 51% on Checkmate-in-One.
Researcher Affiliation	Academia	1Peking University, 2UC Berkeley, 3Stanford University
Pseudocode	Yes	Listing 1: Python template def perform_operation(a, b, operation): # Define the operation logic (e.g., addition, subtraction, etc.). pass def evaluate_sequence(sequence, operations): # Apply operations to the sequence and check if the result meets the criteria. pass def generate_combinations(elements, operations): # Generate all possible combinations of elements and operations. pass def format_solution(sequence, operations): # Format the sequence and operations into a human-readable string. pass def find_solution(input_elements, target_result): # Data Input Handling # Validate and preprocess input data if necessary. # Core Algorithm Logic for sequence in permutations(input_elements): for operation_combination in generate_combinations(sequence, operations): try: if evaluate_sequence(sequence, operation_combination) == target_result: # Data Output Formatting return format_solution(sequence, operation_combination) except Exception as e: # Error Handling # Handle specific exceptions that may occur during evaluation. continue # If no solution is found after all iterations, return a default message. # return No solution found message return # Example usage: input_elements = [1, 7, 10, 3] target_result = 24 print(find_solution(input_elements, target_result))
Open Source Code	Yes	Our project is available at https://github.com/YangLing0818/buffer-of-thought-llm
Open Datasets	Yes	Datasets and Tasks To evaluate the efficacy of our proposed Buffer of Thoughts and compare with previous methods, we consider a diverse set of tasks and datasets that require varying degrees of mathematical and algorithmic reasoning, domain-specific knowledge, and literary creativity: (a). The Game of 24 from ToT [14], where the objective is to form an arithmetic expression that equals 24 using each of four given numbers exactly once; (b). Three BIG-Bench Hard (BBH) [35] tasks: Geometric Shapes, Multi-Step Arithmetic Two, and Word Sorting; (c). Three reasoning tasks directly obtained from the BIG-Bench suite [50]: Checkmate-in-One, Penguins, where the task is to answer questions about penguins' attributes based on a given table and additional natural language information, and Date Understanding, a task that involves inferring dates from natural language descriptions, performing arithmetic operations on dates, and utilizing global knowledge such as the number of days in February; (d). Python Programming Puzzles (P3) [51, 52], a collection of challenging programming puzzles written in Python with varying difficulty levels; (e). Multilingual Grade School Math (MGSM) [33], a multilingual version of the GSM8K dataset [53] featuring translations of a subset of examples into ten typologically diverse languages, including Bengali, Japanese, and Swahili; (f). Shakespearean Sonnet Writing from meta-prompting [15], a novel task where the goal is to write a sonnet following the strict rhyme scheme "ABAB CDCD EFEF GG" and incorporating three provided words verbatim.
Dataset Splits	No	We randomly sample 1000 examples from various benchmarks as a test subset and evaluate different methods on this subset.
Hardware Specification	Yes	We also use Llama3-8B and Llama3-70B in our analysis part on NVIDIA A100-PCIE-40GB GPU.
Software Dependencies	No	The paper mentions general programming concepts and libraries like `itertools` and `chess` in its pseudocode but does not specify software versions (e.g., Python version, library versions) for reproducibility.
Experiment Setup	Yes	For the fair comparisons with previous methods, we use GPT-4 as the base model of our BoT, including the main experiment and the ablation study. We also use Llama3-8B and Llama3-70B in our analysis part on NVIDIA A100-PCIE-40GB GPU. We set a threshold δ (0.5–0.7 is recommended) to determine whether the current task is new. Prompt for Template Distillation (Appendix B.2), Prompt for Instantiated Reasoning (Appendix B.3).
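
The Python template quoted in the Pseudocode row leaves every helper as a stub. The following is a minimal sketch of how it might be instantiated for the Game of 24, not the authors' implementation. It rests on several assumptions that are not in the quoted listing: expressions are evaluated strictly left to right (so groupings such as (a op b) op (c op d) are never searched), arithmetic is done exactly with fractions.Fraction, the operator set is held in a hypothetical OPERATIONS constant, and only ZeroDivisionError is caught.

```python
from fractions import Fraction
from itertools import permutations, product

# Assumption: the quoted template never defines `operations`; this constant
# is a hypothetical stand-in for it.
OPERATIONS = ("+", "-", "*", "/")


def perform_operation(a, b, operation):
    """Apply one binary arithmetic operation to a and b."""
    if operation == "+":
        return a + b
    if operation == "-":
        return a - b
    if operation == "*":
        return a * b
    if operation == "/":
        return a / b  # raises ZeroDivisionError when b == 0
    raise ValueError(f"unknown operation: {operation!r}")


def evaluate_sequence(sequence, operations):
    """Evaluate ((a op1 b) op2 c) op3 d strictly left to right (a simplification)."""
    result = Fraction(sequence[0])
    for value, operation in zip(sequence[1:], operations):
        result = perform_operation(result, Fraction(value), operation)
    return result


def generate_combinations(elements, operations):
    """Yield every way to place one operator between adjacent elements."""
    return product(operations, repeat=len(elements) - 1)


def format_solution(sequence, operations):
    """Render the left-to-right evaluation as a fully parenthesized string."""
    expression = str(sequence[0])
    for value, operation in zip(sequence[1:], operations):
        expression = f"({expression} {operation} {value})"
    return expression


def find_solution(input_elements, target_result):
    """Brute-force search over orderings and operator choices."""
    for sequence in permutations(input_elements):
        for operation_combination in generate_combinations(sequence, OPERATIONS):
            try:
                if evaluate_sequence(sequence, operation_combination) == target_result:
                    return format_solution(sequence, operation_combination)
            except ZeroDivisionError:
                # Skip candidates that divide by zero.
                continue
    return "No solution found"


if __name__ == "__main__":
    print(find_solution([1, 7, 10, 3], 24))
```

Because the search is restricted to left-to-right evaluation, a "No solution found" result from this sketch does not prove that a puzzle is unsolvable. A fuller version would also enumerate the other parenthesizations.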