Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.
Buffer of Thoughts: Thought-Augmented Reasoning with Large Language Models
Authors: Ling Yang, Zhaochen Yu, Tianjun Zhang, Shiyi Cao, Minkai Xu, Wentao Zhang, Joseph E. Gonzalez, Bin CUI
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct extensive experiments on 10 challenging reasoning-intensive tasks, and achieve significant performance improvements over previous SOTA methods: 11% on Game of 24, 20% on Geometric Shapes and 51% on Checkmate-in-One. |
| Researcher Affiliation | Academia | 1Peking University, 2UC Berkeley, 3Stanford University |
| Pseudocode | Yes | Listing 1: Python template def perform_operation(a, b, operation): # Define the operation logic (e.g., addition, subtraction, etc.). pass def evaluate_sequence(sequence, operations): # Apply operations to the sequence and check if the result meets the criteria. pass def generate_combinations(elements, operations): # Generate all possible combinations of elements and operations. pass def format_solution(sequence, operations): # Format the sequence and operations into a human-readable string. pass def find_solution(input_elements, target_result): # Data Input Handling # Validate and preprocess input data if necessary. # Core Algorithm Logic for sequence in permutations(input_elements): for operation_combination in generate_combinations(sequence, operations): try: if evaluate_sequence(sequence, operation_combination) == target_result: # Data Output Formatting return format_solution(sequence, operation_combination) except Exception as e: # Error Handling # Handle specific exceptions that may occur during evaluation. continue # If no solution is found after all iterations, return a default message. # return No solution found message return # Example usage: input_elements = [1, 7, 10, 3] target_result = 24 print(find_solution(input_elements, target_result)) |
| Open Source Code | Yes | Our project is available at https://github.com/YangLing0818/buffer-of-thought-llm |
| Open Datasets | Yes | Datasets and Tasks To evaluate the efficacy of our proposed Buffer of Thoughts and compare with previous methods, we consider a diverse set of tasks and datasets that require varying degrees of mathematical and algorithmic reasoning, domain-specific knowledge, and literary creativity: (a). The Game of 24 from ToT [14], where the objective is to form an arithmetic expression that equals 24 using each of four given numbers exactly once; (b). Three BIG-Bench Hard (BBH) [35] tasks: Geometric Shapes, Multi-Step Arithmetic Two, and Word Sorting; (c). Three reasoning tasks directly obtained from the BIG-Bench suite [50]: Checkmate-in-One, Penguins, where the task is to answer questions about penguins' attributes based on a given table and additional natural language information, and Date Understanding, a task that involves inferring dates from natural language descriptions, performing arithmetic operations on dates, and utilizing global knowledge such as the number of days in February; (d). Python Programming Puzzles (P3) [51, 52], a collection of challenging programming puzzles written in Python with varying difficulty levels; (e). Multilingual Grade School Math (MGSM) [33], a multilingual version of the GSM8K dataset [53] featuring translations of a subset of examples into ten typologically diverse languages, including Bengali, Japanese, and Swahili; (f). Shakespearean Sonnet Writing from meta-prompting [15], a novel task where the goal is to write a sonnet following the strict rhyme scheme "ABAB CDCD EFEF GG" and incorporating three provided words verbatim. |
| Dataset Splits | No | We randomly sample 1000 examples from various benchmarks as a test subset and evaluate different methods on this subset. |
| Hardware Specification | Yes | We also use Llama3-8B and Llama3-70B in our analysis part on NVIDIA A100-PCIE-40GB GPU. |
| Software Dependencies | No | The paper mentions general programming concepts and libraries like `itertools` and `chess` in its pseudocode but does not specify software versions (e.g., Python version, library versions) for reproducibility. |
| Experiment Setup | Yes | For the fair comparisons with previous methods, we use GPT-4 as the base model of our BoT, including the main experiment and the ablation study. We also use Llama3-8B and Llama3-70B in our analysis part on NVIDIA A100-PCIE-40GB GPU. We set a threshold δ (0.5–0.7 is recommended) to determine whether the current task is new. Prompt for Template Distillation (Appendix B.2), Prompt for Instantiated Reasoning (Appendix B.3). |
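
For readers who want to try the template quoted in the Pseudocode row, the following is a minimal runnable sketch of one way it could be filled in for the Game of 24. It is not the authors' implementation. The function names come from the quoted listing; everything else is an assumption made for illustration: the `OPERATIONS` tuple, the use of `Fraction` for exact division, strictly left-to-right evaluation, and returning `None` when nothing is found. Because evaluation is left-to-right only, groupings such as `(a + b) * (c + d)` are not searched, so this sketch can miss solutions that a full solver would find.

```python
from fractions import Fraction
from itertools import permutations, product

# Assumed operator set for the Game of 24 (not specified in the quoted template).
OPERATIONS = ("+", "-", "*", "/")


def perform_operation(a, b, operation):
    """Apply one binary arithmetic operation to a and b."""
    if operation == "+":
        return a + b
    if operation == "-":
        return a - b
    if operation == "*":
        return a * b
    if operation == "/":
        return a / b  # raises ZeroDivisionError when b == 0
    raise ValueError(f"unknown operation: {operation}")


def evaluate_sequence(sequence, operations):
    """Fold the operations over the sequence strictly left to right."""
    result = Fraction(sequence[0])
    for value, operation in zip(sequence[1:], operations):
        result = perform_operation(result, Fraction(value), operation)
    return result


def generate_combinations(elements, operations):
    """Yield every way to place one operator between adjacent elements."""
    return product(operations, repeat=len(elements) - 1)


def format_solution(sequence, operations):
    """Render the left-to-right evaluation as a fully parenthesized string."""
    expression = str(sequence[0])
    for value, operation in zip(sequence[1:], operations):
        expression = f"({expression} {operation} {value})"
    return expression


def find_solution(input_elements, target_result):
    """Return the first expression that hits the target, or None."""
    for sequence in permutations(input_elements):
        for operation_combination in generate_combinations(sequence, OPERATIONS):
            try:
                if evaluate_sequence(sequence, operation_combination) == target_result:
                    return format_solution(sequence, operation_combination)
            except ZeroDivisionError:
                # Skip operator placements that divide by zero.
                continue
    return None


# Example usage, mirroring the inputs in the quoted template.
print(find_solution([1, 7, 10, 3], 24))
```

For the example inputs, a left-to-right solution exists (for instance, 10 × 3 − 7 + 1 = 24), so the call returns a parenthesized expression rather than `None`. Which expression is printed depends on the enumeration order.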