Thought of Search: Planning with Language Models Through The Lens of Efficiency
Authors: Michael Katz, Harsha Kokel, Kavitha Srinivas, Shirin Sohrabi Araghi
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we alleviate this gap. We analyse these properties of using LLMs for planning and highlight that recent trends abandon both soundness and completeness for the sake of inefficiency. We propose a significantly more efficient approach that can, at the same time, maintain both soundness and completeness. We exemplify on four representative search problems, comparing to the LLM-based solutions from the literature that attempt to solve these problems. We show that by using LLMs to produce the code for the search components we can solve the entire datasets with 100% accuracy with only a few calls to the LLM. In contrast, the compared approaches require hundreds of thousands of calls and achieve significantly lower accuracy. |
| Researcher Affiliation | Industry | Michael Katz, IBM T. J. Watson Research Center, Yorktown Heights, NY 10598, michael.katz1@ibm.com; Harsha Kokel, IBM Almaden Research Center, San Jose, CA 95120, harsha.kokel@ibm.com; Kavitha Srinivas, IBM T. J. Watson Research Center, Yorktown Heights, NY 10598, kavitha.srinivas@ibm.com; Shirin Sohrabi, IBM T. J. Watson Research Center, Yorktown Heights, NY 10598, ssohrab@us.ibm.com |
| Pseudocode | Yes | The paper provides Python implementations of the search algorithms, e.g. `def bfs(state, successor_states, is_goal): ...`; the full BFS listing is reproduced in runnable form below the table. |
| Open Source Code | Yes | The code obtained from GPT-4 in all 5 correspondences is provided in the appendix. |
| Open Datasets | Yes | For each of the 5 tested successor functions, BFS found a solution in the 1361 cases where a solution exists and reported that no solution exists in the single unsolvable case, a 100% success rate. The total time to solve all 1362 problems varies over these 5 cases from 1.92s to 6.83s in our naive BFS implementation, hinting that some successor functions can be more efficient than others. This is comparable to a single LLM evaluation time, which is 7s for GPT-4 Chat [15]. Note that the generated successor functions are generic enough to solve the generalized version of the 24 Game, Countdown [5], with only minimal adaptation to the goal test. (An illustrative sketch of such a successor function appears below the table.) |
| Dataset Splits | No | The paper does not specify distinct training, validation, or test splits for the datasets. It evaluates its approach on the 'entire suite' or 'entire collection' of instances/questions from existing datasets, using them as the full evaluation set without further partitioning for typical training/validation purposes. |
| Hardware Specification | Yes | The search with the obtained code was run locally, on a machine with a 2.3 GHz 8-Core Intel Core i9 CPU; no GPUs were used. |
| Software Dependencies | No | The paper mentions using Python and the GPT-4 model, but does not provide specific version numbers for Python, its libraries, or any other software components used for running the experiments. It only refers to 'GPT-4 model [14], in a chat mode'. |
| Experiment Setup | No | The paper states that it uses "standard implementations" of BFS and DFS algorithms, but does not specify any particular hyperparameters, initialization details, or system-level training settings typically found in experimental setups for machine learning models. The parameters listed in Table 1 are for the *other* approaches being compared in the discussion section, not for their own proposed approach's experimental setup. |
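
For readability, the paper's BFS listing quoted in the Pseudocode row is reproduced below in runnable form. The `_str` and `reconstruct_plan` helpers are not part of the extracted snippet; the versions here are plausible reconstructions, assuming `_str` produces a canonical dictionary key for a state and `Closed` maps each key to its parent state. They are not the authors' exact code.

```python
def _str(state):
    # Assumed helper: canonical dictionary key for a state.
    return str(state)

def reconstruct_plan(state, closed):
    # Assumed helper: follow parent pointers back to the initial
    # state (whose recorded parent is None) and reverse the path.
    plan = [state]
    parent = closed[_str(state)]
    while parent is not None:
        plan.append(parent)
        parent = closed[_str(parent)]
    plan.reverse()
    return plan

def bfs(state, successor_states, is_goal):
    expanded = 0   # count of states whose successors were generated
    generated = 0  # count of successor states produced
    Q = [(state, None)]  # FIFO queue of (state, parent) pairs
    Closed = dict()      # state key -> parent state
    while len(Q) > 0:
        s, parent = Q.pop(0)  # take the front of the queue
        c = _str(s)
        if c in Closed:       # skip states already expanded
            continue
        Closed[c] = parent
        if is_goal(s):
            return reconstruct_plan(s, Closed), expanded, generated
        expanded += 1
        for t in successor_states(s):
            Q.append((t, s))
            generated += 1
    return None, expanded, generated
```

`Q.pop(0)` on a Python list is linear time; `collections.deque.popleft()` would be the idiomatic choice, but the list-based queue mirrors the naive implementation the paper describes.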
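
The Open Datasets row notes that the GPT-4-generated successor functions for the 24 Game generalize to Countdown; the actual generated code is in the paper's appendix. As an illustration only, the sketch below shows what a successor function and goal test for the 24 Game could look like when plugged into the `bfs` above; the state encoding (a sorted tuple of exact `Fraction` values) is an assumption of this sketch, not the paper's.

```python
from fractions import Fraction

def successor_states(state):
    # state: sorted tuple of remaining numbers, e.g. (4, 9, 10, 13).
    # Each successor replaces two numbers with the result of +, -, *, or /.
    succs = []
    n = len(state)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            rest = [state[k] for k in range(n) if k not in (i, j)]
            a, b = state[i], state[j]
            results = [a + b, a - b, a * b]
            if b != 0:
                results.append(a / b)  # exact rational division
            for r in results:
                succs.append(tuple(sorted(rest + [r])))
    return succs

def is_goal(state):
    # Solved when a single value equal to 24 remains.
    return len(state) == 1 and state[0] == 24

# Example: 4, 9, 10, 13 is solvable, e.g. (13 - 9) * (10 - 4) = 24.
start = tuple(sorted(Fraction(x) for x in (4, 9, 10, 13)))
plan, expanded, generated = bfs(start, successor_states, is_goal)
```

Using exact rationals avoids floating-point equality issues in the goal test, and the sorted-tuple encoding makes `_str` a canonical key, so the `Closed` dictionary in `bfs` prunes duplicate multisets of numbers.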