Graph-enhanced Large Language Models in Asynchronous Plan Reasoning
Authors: Fangru Lin, Emanuele La Malfa, Valentin Hofmann, Elle Michelle Yang, Anthony G. Cohn, Janet B. Pierrehumbert
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Here, we present the first large-scale study investigating this question. We use AsyncHow to evaluate GPT-3.5-turbo (GPT-3.5), GPT-4 (OpenAI, 2023), Cohere Command, LLaMA-2-70B-chat (Touvron et al., 2023), and Mistral-7B-Instruct (v0.2; Jiang et al., 2023) on asynchronous planning. |
| Researcher Affiliation | Academia | ¹University of Oxford, ²Alan Turing Institute, ³Allen Institute for AI, ⁴LMU Munich, ⁵University of Leeds. |
| Pseudocode | No | The paper does not include any figures, blocks, or sections labeled "Pseudocode" or "Algorithm". |
| Open Source Code | Yes | Our code and data are available at https://github.com/fangru-lin/graph-llm-asynchow-plan. |
| Open Datasets | Yes | To enable a large-scale evaluation of LLMs, we automatically generate a new benchmark, Asynchronous WikiHow (AsyncHow), with 1.6K high-quality instances for real-life tasks. The dataset used in this paper can be found at https://github.com/fangru-lin/graph-llm-asynchow-plan. |
| Dataset Splits | No | The paper describes different prompting regimes (e.g., k-shot) and samples instances for specific experiments, but it does not provide explicit train/validation/test dataset splits with percentages or sample counts for the AsyncHow benchmark or the other datasets used for evaluation. The k in k-shot refers to in-context examples, not a validation set. |
| Hardware Specification | Yes | We use 2 V100 GPUs and 1 A100 GPU for Mistral-7B-instruct inference, with do_sample=False, temperature=0, max_new_tokens=4096 and torch.manual_seed=2024. |
| Software Dependencies | No | The paper mentions using the "Azure OpenAI API", "Cohere API", and "Huggingface Inference API", and implies the use of PyTorch ("torch.manual_seed=2024"). However, it does not provide specific version numbers for these APIs or for any other software libraries used, which a reproducible description of software dependencies requires. |
| Experiment Setup | Yes | All experiments are performed from December 2023 to May 2024. For data generation, we use the Azure OpenAI API and set temperature=1 for both GPT-35-turbo and GPT-4. During the experiment (i.e. inference stage), we use the Azure OpenAI API and set temperature=0 for GPT models to enable as much reproducibility as possible. We use the Cohere API to query the Command model and also set temperature=0. We use the Huggingface Inference API to query LLaMA-70B-Chat and set do_sample=False, max_new_tokens=4096, and seed=0. We use 2 V100 GPUs and 1 A100 GPU for Mistral-7B-instruct inference, with do_sample=False, temperature=0, max_new_tokens=4096 and torch.manual_seed=2024. |
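The decoding settings quoted above for Mistral-7B-Instruct can be collected into a reusable configuration. The sketch below is an assumption about how such a setup might look with Hugging Face `transformers`; the model loading and prompting code (shown only in comments) is not taken from the paper, and the model identifier `mistralai/Mistral-7B-Instruct-v0.2` is inferred from the "v0.2" citation in the report.

```python
# Greedy-decoding configuration reported for Mistral-7B-Instruct inference:
# do_sample=False, temperature=0, max_new_tokens=4096, torch.manual_seed(2024).
GENERATION_KWARGS = {
    "do_sample": False,       # deterministic greedy decoding
    "temperature": 0.0,       # reported alongside do_sample=False
    "max_new_tokens": 4096,   # upper bound on generated tokens
}
SEED = 2024  # passed to torch.manual_seed before generation

# Hypothetical usage (not from the paper), assuming the transformers library:
# import torch
# from transformers import AutoModelForCausalLM, AutoTokenizer
# torch.manual_seed(SEED)
# tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
# model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
# inputs = tok(prompt, return_tensors="pt")
# out = model.generate(**inputs, **GENERATION_KWARGS)
```

With `do_sample=False`, generation is greedy, so the temperature setting is effectively redundant; the seed then mainly guards against nondeterminism elsewhere in the stack.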