Thought of Search: Planning with Language Models Through The Lens of Efficiency
Authors: Michael Katz, Harsha Kokel, Kavitha Srinivas, Shirin Sohrabi Araghi
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we alleviate this gap. We analyse these properties of using LLMs for planning and highlight that recent trends abandon both soundness and completeness for the sake of inefficiency. We propose a significantly more efficient approach that can, at the same time, maintain both soundness and completeness. We exemplify on four representative search problems, comparing to the LLM-based solutions from the literature that attempt to solve these problems. We show that by using LLMs to produce the code for the search components we can solve the entire datasets with 100% accuracy with only a few calls to the LLM. In contrast, the compared approaches require hundreds of thousands of calls and achieve significantly lower accuracy. |
| Researcher Affiliation | Industry | Michael Katz, IBM T. J. Watson Research Center, Yorktown Heights, NY 10598, michael.katz1@ibm.com; Harsha Kokel, IBM Almaden Research Center, San Jose, CA 95120, harsha.kokel@ibm.com; Kavitha Srinivas, IBM T. J. Watson Research Center, Yorktown Heights, NY 10598, kavitha.srinivas@ibm.com; Shirin Sohrabi, IBM T. J. Watson Research Center, Yorktown Heights, NY 10598, ssohrab@us.ibm.com |
| Pseudocode | Yes | The paper provides Python implementations of the search algorithms, e.g. `def bfs(state, successor_states, is_goal): ...`; the full BFS listing is reproduced in runnable form below the table. |
| Open Source Code | Yes | The code obtained from GPT-4 in all 5 correspondences is provided in the appendix. |
| Open Datasets | Yes | For each of the 5 tested successor functions, BFS found a solution in the 1361 cases where a solution exists and reported that no solution exists in the single unsolvable case, a 100% success rate. The total time to solve all 1362 problems varies over these 5 cases from 1.92s to 6.83s in our naive BFS implementation, hinting that some successor functions can be more efficient than others. This is comparable to a single LLM evaluation time, which is 7s for GPT-4 Chat [15]. Note that the generated successor functions are generic enough to solve the generalized version of the 24 Game, Countdown [5], with only minimal adaptation to the goal test. (An illustrative sketch of such a successor function appears below the table.) |
| Dataset Splits | No | The paper does not specify distinct training, validation, or test splits for the datasets. It evaluates its approach on the 'entire suite' or 'entire collection' of instances/questions from existing datasets, using them as the full evaluation set without further partitioning for typical training/validation purposes. |
| Hardware Specification | Yes | The search with the obtained code was run locally, on a machine with a 2.3 GHz 8-Core Intel Core i9 CPU; no GPUs were used. |
| Software Dependencies | No | The paper mentions using Python and the GPT-4 model, but does not provide specific version numbers for Python, its libraries, or any other software components used for running the experiments. It only refers to 'GPT-4 model [14], in a chat mode'. |
| Experiment Setup | No | The paper states that it uses "standard implementations" of BFS and DFS algorithms, but does not specify any particular hyperparameters, initialization details, or system-level training settings typically found in experimental setups for machine learning models. The parameters listed in Table 1 are for the *other* approaches being compared in the discussion section, not for their own proposed approach's experimental setup. |
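
For readability, the paper's BFS listing quoted in the Pseudocode row is reproduced below in runnable form. The `_str` and `reconstruct_plan` helpers are not part of the extracted snippet; the versions here are plausible reconstructions, assuming `_str` produces a canonical dictionary key for a state and `Closed` maps each key to its parent state. They are not the authors' exact code.

```python
def _str(state):
    # Assumed helper: canonical dictionary key for a state.
    return str(state)

def reconstruct_plan(state, closed):
    # Assumed helper: follow parent pointers back to the initial
    # state (whose recorded parent is None) and reverse the path.
    plan = [state]
    parent = closed[_str(state)]
    while parent is not None:
        plan.append(parent)
        parent = closed[_str(parent)]
    plan.reverse()
    return plan

def bfs(state, successor_states, is_goal):
    expanded = 0   # count of states whose successors were generated
    generated = 0  # count of successor states produced
    Q = [(state, None)]  # FIFO queue of (state, parent) pairs
    Closed = dict()      # state key -> parent state
    while len(Q) > 0:
        s, parent = Q.pop(0)  # take the front of the queue
        c = _str(s)
        if c in Closed:       # skip states already expanded
            continue
        Closed[c] = parent
        if is_goal(s):
            return reconstruct_plan(s, Closed), expanded, generated
        expanded += 1
        for t in successor_states(s):
            Q.append((t, s))
            generated += 1
    return None, expanded, generated
```

`Q.pop(0)` on a Python list is linear time; `collections.deque.popleft()` would be the idiomatic choice, but the list-based queue mirrors the naive implementation the paper describes.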
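
The Open Datasets row notes that the GPT-4-generated successor functions for the 24 Game generalize to Countdown; the actual generated code is in the paper's appendix. As an illustration only, the sketch below shows what a successor function and goal test for the 24 Game could look like when plugged into the `bfs` above; the state encoding (a sorted tuple of exact `Fraction` values) is an assumption of this sketch, not the paper's.

```python
from fractions import Fraction

def successor_states(state):
    # state: sorted tuple of remaining numbers, e.g. (4, 9, 10, 13).
    # Each successor replaces two numbers with the result of +, -, *, or /.
    succs = []
    n = len(state)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            rest = [state[k] for k in range(n) if k not in (i, j)]
            a, b = state[i], state[j]
            results = [a + b, a - b, a * b]
            if b != 0:
                results.append(a / b)  # exact rational division
            for r in results:
                succs.append(tuple(sorted(rest + [r])))
    return succs

def is_goal(state):
    # Solved when a single value equal to 24 remains.
    return len(state) == 1 and state[0] == 24

# Example: 4, 9, 10, 13 is solvable, e.g. (13 - 9) * (10 - 4) = 24.
start = tuple(sorted(Fraction(x) for x in (4, 9, 10, 13)))
plan, expanded, generated = bfs(start, successor_states, is_goal)
```

Using exact rationals avoids floating-point equality issues in the goal test, and the sorted-tuple encoding makes `_str` a canonical key, so the `Closed` dictionary in `bfs` prunes duplicate multisets of numbers.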