Graph-enhanced Large Language Models in Asynchronous Plan Reasoning

Authors: Fangru Lin, Emanuele La Malfa, Valentin Hofmann, Elle Michelle Yang, Anthony G. Cohn, Janet B. Pierrehumbert

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Here, we present the first large-scale study investigating this question. We use AsyncHow to evaluate GPT-3.5-turbo (GPT-3.5), GPT-4 (OpenAI, 2023), Cohere Command, LLaMA-2-70B-chat (Touvron et al., 2023), and Mistral-7B-Instruct (v0.2; Jiang et al., 2023) on asynchronous planning.
Researcher Affiliation | Academia | University of Oxford; Alan Turing Institute; Allen Institute for AI; LMU Munich; University of Leeds.
Pseudocode | No | The paper does not include any figures, blocks, or sections labeled "Pseudocode" or "Algorithm".
Open Source Code | Yes | Our code and data are available at https://github.com/fangru-lin/graph-llm-asynchow-plan.
Open Datasets | Yes | To enable a large-scale evaluation of LLMs, we automatically generate a new benchmark, Asynchronous WikiHow (AsyncHow), with 1.6K high-quality instances for real-life tasks. The dataset used in this paper can be found at https://github.com/fangru-lin/graph-llm-asynchow-plan.
Dataset Splits | No | The paper describes different prompting regimes (e.g., k-shot) and the sampling of instances for specific experiments, but it does not provide explicit train/validation/test dataset splits with percentages or sample counts for the AsyncHow benchmark or the other datasets used for evaluation. The k in k-shot refers to in-context examples, not a validation set.
Hardware Specification | Yes | We use 2 V100 GPUs and 1 A100 GPU for Mistral-7B-Instruct inference, with do_sample=False, temperature=0, max_new_tokens=4096, and torch.manual_seed(2024).
Software Dependencies | No | The paper mentions using the "Azure OpenAI API", "Cohere API", and "Huggingface Inference API", and implies the use of PyTorch ("torch.manual_seed(2024)"). However, it does not provide specific version numbers for these APIs or for any other software libraries used, which is required for a reproducible description of software dependencies.
Experiment Setup | Yes | All experiments are performed from December 2023 to May 2024. For data generation, we use the Azure OpenAI API and set temperature=1 for both GPT-3.5-turbo and GPT-4. During the experiments (i.e., the inference stage), we use the Azure OpenAI API and set temperature=0 for GPT models to enable as much reproducibility as possible. We use the Cohere API to query the Command model and also set temperature=0. We use the Huggingface Inference API to query LLaMA-2-70B-chat and set do_sample=False, max_new_tokens=4096, and seed=0. We use 2 V100 GPUs and 1 A100 GPU for Mistral-7B-Instruct inference, with do_sample=False, temperature=0, max_new_tokens=4096, and torch.manual_seed(2024). (Hedged sketches of these inference settings follow below.)
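
Since the rows above pin down the decoding settings exactly, a short illustration may help readers reproduce the hosted-API queries. The following is a minimal sketch, assuming current Python SDKs for Azure OpenAI, Cohere, and the Hugging Face Inference API; the paper does not report SDK versions, and the endpoint, API version, deployment name, and prompt below are placeholders.

```python
# Hedged sketch of deterministic querying through the three hosted APIs
# named in the paper. Keys, endpoints, and deployment names are placeholders.
import os

import cohere
from huggingface_hub import InferenceClient
from openai import AzureOpenAI

PROMPT = "..."  # an AsyncHow planning question would go here

# Azure OpenAI (GPT-3.5-turbo / GPT-4): temperature=0 at inference time.
azure = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",  # assumed; the paper does not state a version
)
gpt_reply = azure.chat.completions.create(
    model="gpt-4",  # Azure deployment name (placeholder)
    temperature=0,
    messages=[{"role": "user", "content": PROMPT}],
).choices[0].message.content

# Cohere Command: temperature=0 as reported.
co = cohere.Client(os.environ["COHERE_API_KEY"])
command_reply = co.chat(message=PROMPT, temperature=0).text

# Hugging Face Inference API for LLaMA-2-70B-chat: greedy decoding,
# max_new_tokens=4096, and seed=0 as reported.
hf = InferenceClient(model="meta-llama/Llama-2-70b-chat-hf",
                     token=os.environ["HF_TOKEN"])
llama_reply = hf.text_generation(
    PROMPT, do_sample=False, max_new_tokens=4096, seed=0
)
```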
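
For the local Mistral-7B-Instruct run, the paper confirms only PyTorch (via torch.manual_seed(2024)) and the decoding flags quoted above. Below is a minimal sketch, assuming Hugging Face transformers as the inference stack; that choice, like the prompt, is an assumption rather than something the paper states.

```python
# Hedged sketch of local greedy inference for Mistral-7B-Instruct-v0.2.
# transformers is an assumption; the paper confirms only PyTorch.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

torch.manual_seed(2024)  # seed reported in the paper

model_id = "mistralai/Mistral-7B-Instruct-v0.2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# device_map="auto" shards the model across available GPUs
# (the paper reports 2 V100s and 1 A100).
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

prompt = "..."  # an AsyncHow planning question would go here
input_ids = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}], return_tensors="pt"
).to(model.device)

# do_sample=False selects greedy decoding, which is what the reported
# temperature=0 determinism amounts to; max_new_tokens=4096 as reported.
output = model.generate(input_ids, do_sample=False, max_new_tokens=4096)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

With do_sample=False, the temperature setting is inactive in transformers; the manual seed would matter mainly if sampling were enabled, but setting it matches the paper's reported configuration.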