Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

THOUGHT PROPAGATION: AN ANALOGICAL APPROACH TO COMPLEX REASONING WITH LARGE LANGUAGE MODELS

Authors: Junchi Yu, Ran He, Zhitao Ying

ICLR 2024 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Experiments across three challenging tasks demonstrate TP enjoys a substantial improvement over the baselines by an average of 12% absolute increase in finding the optimal solutions in Shortest-path Reasoning, 13% improvement of human preference in Creative Writing, and 15% enhancement in the task completion rate of LLM-Agent Planning.
Researcher Affiliation Academia Junchi Yu & Ran He, MAIS & CRIPAC, Institute of Automation, Chinese Academy of Sciences; University of Chinese Academy of Sciences, Beijing, China, EMAIL, EMAIL. Rex Ying, Department of Computer Science, Yale University, New Haven, USA, EMAIL.
Pseudocode Yes
  Init Path = [0]
  While not reach Node 8 and not exceed max steps:
      Current_node = Path[-1]
      Next_node_set = LLM_Neighbor_search(Current_node)
      Best_next_node = LLM_Evaluate(Next_node_set)
      Path.append(Best_next_node)
  print(Path)
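The loop in the pseudocode row can be sketched as runnable Python. The LLM calls (`LLM_Neighbor_search`, `LLM_Evaluate`) are stubbed here with a hypothetical example graph and a greedy pick-the-goal heuristic; these stubs are assumptions for illustration, not the paper's actual prompts.

```python
# Minimal runnable sketch of the shortest-path reasoning loop.
# The graph, goal node, and step budget below are illustrative
# assumptions; in the paper these decisions are made by LLM calls.

GRAPH = {0: [1, 2], 1: [3], 2: [3, 4], 3: [8], 4: [8]}  # hypothetical graph
GOAL, MAX_STEPS = 8, 10

def llm_neighbor_search(node):
    """Stub for the LLM call that lists neighbors of the current node."""
    return GRAPH.get(node, [])

def llm_evaluate(candidates):
    """Stub for the LLM call that picks the most promising next node.
    Greedy: take the goal if it is a candidate, else the first option."""
    return GOAL if GOAL in candidates else candidates[0]

def find_path():
    path = [0]
    while path[-1] != GOAL and len(path) <= MAX_STEPS:
        candidates = llm_neighbor_search(path[-1])
        if not candidates:  # dead end: stop early
            break
        path.append(llm_evaluate(candidates))
    return path

print(find_path())  # -> [0, 1, 3, 8] on the stub graph
```

On the stub graph the greedy evaluator reaches the goal in three hops; with a real LLM backend the two stubbed functions would instead issue neighbor-search and evaluation prompts.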
Open Source Code Yes Code is available on https://github.com/Samyu0304/thought-propagation.
Open Datasets Yes We use ALFWorld (Shridhar et al., 2021) game suite to instantiate the LLM-Agent Planning task with 134 environments.
Dataset Splits No The paper mentions '0-shot, 1-shot, and 5-shot prompting settings' and '100 test instances' or '134 unseen environments for evaluation', but does not provide specific train/validation/test dataset splits or cross-validation details for reproducibility.
Hardware Specification No The paper mentions using LLM backends such as PaLM 2, GPT-3.5, and GPT-4, but does not provide specific hardware details like GPU models, CPU types, or memory used for running the experiments.
Software Dependencies No The paper mentions 'Python' for graph generation and various LLM backends (GPT-3.5, GPT-4, PaLM 2), but does not provide specific version numbers for software libraries, frameworks, or dependencies used in the experiments.
Experiment Setup No The paper describes prompting settings (0-shot, 1-shot, 5-shot) and LLM models used, but does not provide specific hyperparameter values (e.g., learning rate, batch size) or detailed system-level training configurations.