TravelPlanner: A Benchmark for Real-World Planning with Language Agents

Authors: Jian Xie, Kai Zhang, Jiangjie Chen, Tinghui Zhu, Renze Lou, Yuandong Tian, Yanghua Xiao, Yu Su

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Comprehensive evaluations show that the current language agents are not yet capable of handling such complex planning tasks; even GPT-4 only achieves a success rate of 0.6%. We comprehensively evaluate five LLMs, such as GPT-4 (OpenAI, 2023), Gemini (Gemini Team et al., 2023), and Mixtral (Jiang et al., 2024), and four planning strategies, such as ReAct (Yao et al., 2022) and Reflexion (Shinn et al., 2023), on their capability of delivering complete plans and following constraints. The main findings are as follows: State-of-the-art LLMs cannot handle complex planning tasks like those in TravelPlanner.
Researcher Affiliation | Collaboration | School of Computer Science, Fudan University; The Ohio State University; The Pennsylvania State University; Meta AI.
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | All the resources are available on the project website. We have released our evaluation scripts to foster innovation and aid the development of new methods.
Open Datasets | Yes | Flight Search: For Flight Search, we source original data from the Kaggle Flight Status Prediction dataset. Distance Matrix: We utilize the Google Distance Matrix API. Restaurant Search: Our restaurant data is sourced from the Kaggle Zomato Restaurants Dataset. Attraction Search: For Attraction Search, we employ the Google Places API. Accommodation Search: Our accommodation data is obtained from the Kaggle Airbnb Open Data dataset. (A hedged sketch of querying the Distance Matrix API appears after this table.)
Dataset Splits | Yes | The dataset is divided into training, validation, and test sets. The training set includes 5 queries per group with human-annotated plans (45 pairs in total), the validation set includes 20 queries per group (180 in total), and the test set includes 1,000 queries. Detailed distributions are shown in Table A.1. (The split arithmetic is restated in a short check after this table.)
Hardware Specification | No | The paper evaluates LLMs such as GPT-3.5-Turbo, GPT-4-Turbo, Gemini Pro, Mistral-7B-32K, and Mixtral-8x7B-MoE, but it does not specify the hardware (e.g., GPU models, CPU types, memory) used to run the experiments.
Software Dependencies | No | The paper mentions using specific LLMs such as GPT-3.5-Turbo, GPT-4-Turbo, Gemini Pro, Mistral-7B-32K, and Mixtral-8x7B-MoE, and adopts their official instruction formats. However, it does not provide version numbers for any underlying programming languages, libraries, or frameworks (e.g., Python, PyTorch, TensorFlow).
Experiment Setup | Yes | In the two-stage mode, we use the ReAct (Yao et al., 2022) framework for information collection, which is recognized for its effective iteration with tools (Zhuang et al., 2023), while varying the foundation LLMs. The agents are required to give the plan directly based on the information collected by themselves, without employing any other planning strategies. All experiments are conducted in a zero-shot setting. We evaluate four representative strategies: Direct, ZS-CoT (Wei et al., 2022), ReAct (Yao et al., 2022), and Reflexion (Shinn et al., 2023). Detailed instructions for each strategy are available in Appendix B.3. (A minimal ReAct-style loop is sketched after this table.)
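
The Open Datasets row states that the benchmark's city-to-city driving data comes from the Google Distance Matrix API. The sketch below shows how one origin-destination pair could be queried against that public REST endpoint; the endpoint and request parameters are the standard Distance Matrix API, while the helper name `fetch_distance`, the example city pair, and the `GOOGLE_MAPS_API_KEY` environment variable are illustrative assumptions, not the paper's actual collection script.

```python
import os
import requests

# Public Google Distance Matrix REST endpoint (the paper reports using this API).
DISTANCE_MATRIX_URL = "https://maps.googleapis.com/maps/api/distancematrix/json"

def fetch_distance(origin: str, destination: str) -> dict:
    """Return distance and duration text for one origin-destination pair."""
    resp = requests.get(
        DISTANCE_MATRIX_URL,
        params={
            "origins": origin,
            "destinations": destination,
            "units": "imperial",
            "key": os.environ["GOOGLE_MAPS_API_KEY"],  # assumed env variable
        },
        timeout=30,
    )
    resp.raise_for_status()
    element = resp.json()["rows"][0]["elements"][0]
    return {
        "origin": origin,
        "destination": destination,
        "distance": element["distance"]["text"],
        "duration": element["duration"]["text"],
    }

if __name__ == "__main__":
    print(fetch_distance("Columbus, OH", "Pittsburgh, PA"))
```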
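The split sizes quoted in the Dataset Splits row imply nine query groups (45 training pairs at 5 per group, 180 validation queries at 20 per group) plus 1,000 test queries. The short check below only restates that arithmetic; the variable names are illustrative.

```python
# Sanity check of the split sizes reported in the paper (Table A.1):
# 45 / 5 = 180 / 20 = 9 query groups; the test set holds 1,000 queries.
TRAIN_PER_GROUP, TRAIN_TOTAL = 5, 45
VAL_PER_GROUP, VAL_TOTAL = 20, 180
TEST_TOTAL = 1000

num_groups = TRAIN_TOTAL // TRAIN_PER_GROUP
assert num_groups == VAL_TOTAL // VAL_PER_GROUP == 9
print(f"{num_groups} groups, {TRAIN_TOTAL + VAL_TOTAL + TEST_TOTAL} queries overall")
```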
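The Experiment Setup row describes a two-stage mode in which a ReAct-style agent iterates with tools to collect information and then writes the plan directly from what it gathered. The sketch below shows the general thought/action/observation loop such a setup implies; the `llm` callable, the stub tool registry, and the `Action: Tool[args]` parsing syntax are assumptions for illustration and do not reproduce the paper's actual prompts or tool implementations.

```python
import re
from typing import Callable, Dict

# Stub tool registry; the paper's agents iterate with search tools over the
# released TravelPlanner databases.  These stubs only illustrate the interface.
TOOLS: Dict[str, Callable[[str], str]] = {
    "FlightSearch": lambda args: f"(stub) flights matching {args}",
    "AttractionSearch": lambda args: f"(stub) attractions in {args}",
}

def react_collect(query: str, llm: Callable[[str], str], max_steps: int = 30) -> str:
    """Minimal ReAct-style loop: think, call a tool, observe, repeat, then
    ask the same model to write the plan from the collected information."""
    transcript = f"Query: {query}\n"
    for _ in range(max_steps):
        step = llm(transcript + "Thought:")
        transcript += f"Thought: {step}\n"
        match = re.search(r"Action:\s*(\w+)\[(.*?)\]", step)
        if match is None:          # no further tool call: collection is done
            break
        tool, args = match.group(1), match.group(2)
        handler = TOOLS.get(tool, lambda a: f"Unknown tool '{tool}'")
        transcript += f"Observation: {handler(args)}\n"
    # Two-stage mode: the plan is produced directly from the collected notes,
    # without any additional planning strategy.
    return llm(transcript + "Now write the final travel plan:")
```

In the paper's separate sole-planning evaluation, the four strategies named in the row (Direct, ZS-CoT, ReAct, Reflexion) vary only the planning step, so a loop like the one above would be replaced or augmented accordingly.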