Reinforcing LLM Agents via Policy Optimization with Action Decomposition

Authors: Muning Wen, Ziyu Wan, Jun Wang, Weinan Zhang, Ying Wen

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We validate POAD across diverse testbeds, with results affirming the advantages of our approach and the correctness of our theoretical analysis. We justify our claims by evaluating POAD in both classical sequential decision-making environments with limited action spaces, i.e., Overcooked and VirtualHome [5], and a self-constructed data science coding environment featuring an unrestricted action space, i.e., DataSciCoding; results verify POAD's advantages in performance and efficiency over baseline methods, highlighting the significance of BAD. Moreover, we empirically demonstrate that language agents trained with POAD exhibit excellent generalization ability across unseen tasks, without compromising the inherent functionalities of language models. Finally, ablation studies confirm the correctness of our theoretical insights.
Researcher Affiliation | Collaboration | Muning Wen¹, Ziyu Wan¹, Jun Wang², Weinan Zhang¹, Ying Wen¹; ¹Shanghai Jiao Tong University, ²University College London
Pseudocode | Yes | To capture more details about POAD, we provide pseudo-code in Appendix D: Algorithm 1, Policy Optimization with Action Decomposition (an illustrative token-decomposition sketch follows this table).
Open Source Code | Yes | The source code can be accessed directly at https://github.com/morning9393/ADRL.
Open Datasets | Yes | We justify our claims by evaluating POAD in both classical sequential decision-making environments with limited action spaces, i.e., Overcooked and VirtualHome [5], and a self-constructed data science coding environment featuring an unrestricted action space, i.e., DataSciCoding. We develop DataSciCoding to automate data science coding tasks with an unrestricted action space, currently adopting 3 Kaggle datasets and 3 OpenML datasets [52], with details in Appendix E.1.
Dataset Splits | Yes | Adopting the same evaluation metrics as CAAFE [54], for each dataset and code, we evaluate 5 repetitions, each with a random 50%/50% train-test split [55], and record the average ROC AUC score across these splits (an illustrative split-and-score sketch follows this table).
Hardware Specification | Yes | We deploy LLaMA2-7B [49] for Overcooked and VirtualHome, and CodeLLaMA-7B [50] for DataSciCoding, fine-tuned with Low-Rank Adaptation (LoRA) [51] on 1 Nvidia A100 GPU. Table 5: Average wall-time for POAD training on each dataset with 1 Nvidia A100 (an illustrative LoRA setup sketch follows this table).
Software Dependencies | No | The paper mentions using specific LLM models (LLaMA2-7B, CodeLLaMA-7B) and a fine-tuning method (LoRA), as well as the 'scikit-learn module' for DataSciCoding tasks, but it does not specify version numbers for these software components or for any other libraries such as Python, PyTorch, or TensorFlow.
Experiment Setup | Yes | Appendix K: Hyper-Parameter Settings of Experiments. Table 9: Hyper-parameter candidates for grid search in the Overcooked, VirtualHome, and DataSciCoding environments (and subsequent Tables 10-17 detailing specific hyper-parameters for each method and task); an illustrative grid-search sketch follows this table.
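
The Pseudocode row refers to Algorithm 1 (Policy Optimization with Action Decomposition) in Appendix D of the paper. For reference only, the minimal sketch below shows the general idea of decomposing one language action into its tokens and optimizing per-token log-probabilities with a policy-gradient loss; the tiny toy policy, the scalar advantage, and the update rule are illustrative assumptions and do not reproduce the authors' Algorithm 1 or their BAD credit assignment.

```python
# Toy sketch: treat one language action as a sequence of tokens and apply a
# policy-gradient update over per-token log-probabilities (not Algorithm 1).
import torch
import torch.nn as nn

vocab_size, hidden = 32, 64

class TinyPolicy(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, vocab_size)

    def forward(self, tokens):                       # tokens: (B, T)
        h, _ = self.rnn(self.embed(tokens))
        return self.head(h)                          # logits: (B, T, vocab)

policy = TinyPolicy()
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

prompt = torch.randint(0, vocab_size, (1, 4))        # observation tokens
action = torch.randint(0, vocab_size, (1, 6))        # one action = 6 tokens

# Logits at the positions that predict each action token.
logits = policy(torch.cat([prompt, action], dim=1))[:, prompt.size(1) - 1:-1]
logp = torch.log_softmax(logits, dim=-1).gather(
    -1, action.unsqueeze(-1)).squeeze(-1)            # per-token log-probs (1, 6)

# Token-level credit: here a single scalar advantage is broadcast over tokens;
# POAD's contribution is a principled per-token credit assignment, which this
# toy does not reproduce.
advantage = torch.tensor([[0.7]])
loss = -(advantage * logp).sum(dim=-1).mean()

optimizer.zero_grad()
loss.backward()
optimizer.step()
```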
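For the Dataset Splits row, a minimal sketch of the stated protocol (5 repetitions, each with a random 50%/50% train-test split, averaging ROC AUC) is shown below, assuming scikit-learn; the synthetic dataset and RandomForestClassifier are placeholders, not the paper's generated feature code or downstream model.

```python
# Sketch of the evaluation protocol: 5 random 50/50 splits, average ROC AUC.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

scores = []
for rep in range(5):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.5, random_state=rep)       # random 50%/50% split
    clf = RandomForestClassifier(random_state=rep).fit(X_tr, y_tr)
    scores.append(roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))

print(f"mean ROC AUC over 5 splits: {np.mean(scores):.4f}")
```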
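For the Hardware Specification row, one plausible way to set up single-GPU LoRA fine-tuning is sketched below with HuggingFace transformers and peft; the model identifier, device placement, and LoRA hyper-parameters (r, lora_alpha, target_modules) are assumptions, not values reported by the paper.

```python
# Sketch: attach LoRA adapters to a causal LM for single-GPU fine-tuning.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",        # gated weights; identifier is assumed
    device_map="cuda:0",               # single-GPU placement (e.g. one A100)
)
lora_cfg = LoraConfig(
    r=16,                              # assumed rank
    lora_alpha=32,                     # assumed scaling
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()     # only the low-rank adapters are trainable
```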
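For the Experiment Setup row, a minimal sketch of a hyper-parameter grid search over candidate values is shown below; the candidate values and the train_and_evaluate stub are hypothetical placeholders, not the candidates listed in Tables 9-17.

```python
# Sketch of a grid search over hyper-parameter candidates.
from itertools import product

candidates = {
    "learning_rate": [1e-5, 5e-5, 1e-4],   # assumed candidate values
    "batch_size": [8, 16],
    "kl_coef": [0.01, 0.05],
}

def train_and_evaluate(cfg):
    # Placeholder: run one training job with `cfg`, return a validation score.
    return 0.0

best_cfg, best_score = None, float("-inf")
for values in product(*candidates.values()):
    cfg = dict(zip(candidates.keys(), values))
    score = train_and_evaluate(cfg)
    if score > best_score:
        best_cfg, best_score = cfg, score

print("best configuration:", best_cfg)
```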