Policy Learning from Tutorial Books via Understanding, Rehearsing and Introspecting

Authors: Xiong-Hui Chen, Ziyan Wang, Yali Du, Shengyi Jiang, Meng Fang, Yang Yu, Jun Wang

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "In the experiment, URI's policy achieves a minimum of 44% net winning rate against GPT-based agents without any real data. In the much more complex football game, URI's policy beat the built-in AIs with a 37% winning rate, while GPT-based agents can only achieve a 6% winning rate."
Researcher Affiliation | Academia | Xiong-Hui Chen (1,+), Ziyan Wang (2), Yali Du (2), Shengyi Jiang (5), Meng Fang (4), Yang Yu (1), Jun Wang (3). (1) National Key Laboratory for Novel Software Technology, Nanjing University, China & School of Artificial Intelligence, Nanjing University, China; (2) Cooperative AI Lab, Department of Informatics, King's College London; (3) AI Centre, Department of Computer Science, University College London; (4) University of Liverpool; (5) The University of Hong Kong
Pseudocode | Yes | "Algorithm 4: URI (Understanding, Rehearsing, and Introspecting)" (a hypothetical skeleton of the three named stages appears after the table)
Open Source Code | No | "The project page: plfb-football.github.io. We commit to open-source the code that can reproduce all the experiment results after the paper is published. This paper does not release new assets."
Open Datasets | Yes | "For football, we collect the textbook dataset from the open-source book dataset RedPajama-1T [57], focusing on titles and abstracts related to football or soccer. After filtering, we obtain a curated set of ninety books closely aligned with the domain." Reference: [57] Together Computer. RedPajama: an open source recipe to reproduce the LLaMA training dataset, 2023. URL: https://github.com/togethercomputer/RedPajama-Data (a filtering sketch appears after the table)
Dataset Splits | No | No explicit training/validation/test splits (percentages or sample counts) are reported. The paper mentions sampling initial states and imagining transitions for policy distillation, but no separate validation set.
Hardware Specification | Yes | "All experiments were conducted on a high-performance computing (HPC) system featuring 128 Intel Xeon processors running at 2.2 GHz, 5 TB of memory, an Nvidia A100 PCIE-40G GPU, and two Nvidia A30 GPUs."
Software Dependencies | No | "We implement CIQL based on the open source codes of CQL in d3rlpy [62]... Since this step requires a strong understanding of the code, we use GPT-4 instead of GPT-3.5 as the LLM implementation." No specific versions for d3rlpy or other software dependencies are provided. (a d3rlpy configuration sketch appears after the table)
Experiment Setup | Yes | Table 5: URI hyperparameters (a config-object sketch appears after the table):
- pieces of knowledge per aggregation (N_agg): 4
- retrieved segments (N_ret): 15
- learning rate (λ): 0.0001
- weight of transition (η_T): 0.5
- weight of reward penalties (η_R): 0.5
- rollout horizon (H): 10
- weight of conservative loss (α): 60
- number of ensembles (N_ens): 20
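
The Pseudocode row names Algorithm 4's three stages, but the report excerpt does not reproduce its steps. The skeleton below is a hypothetical reading of the stage names only, not the paper's Algorithm 4; every function and interface here (extract_knowledge, imagine_rollout, and so on) is an illustrative assumption.

```python
# Hypothetical skeleton of the three URI stages (Understanding,
# Rehearsing, Introspecting). This is NOT the paper's Algorithm 4;
# all interfaces below are assumptions for illustration.

def understand(books, llm):
    """Understanding: extract decision-relevant knowledge from tutorial books."""
    return [llm.extract_knowledge(book) for book in books]

def rehearse(knowledge, llm, n_rollouts):
    """Rehearsing: imagine transitions/rollouts grounded in that knowledge."""
    return [llm.imagine_rollout(knowledge) for _ in range(n_rollouts)]

def introspect(policy, rollouts):
    """Introspecting: refine the policy against the imagined experience."""
    policy.update(rollouts)
    return policy
```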
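The Open Datasets row describes filtering the RedPajama-1T book subset down to roughly ninety football/soccer books. Below is a minimal sketch of that kind of keyword filter, assuming the Hugging Face `datasets` package and the public `togethercomputer/RedPajama-Data-1T` "book" configuration; the keyword list and the "inspect the opening of each book" heuristic are assumptions, not the paper's actual filtering criteria.

```python
# Keyword filter over the streamed RedPajama-1T book subset.
# Assumes `pip install datasets`; the keywords and the title/abstract
# heuristic are illustrative, not the paper's exact procedure.
from datasets import load_dataset

KEYWORDS = ("football", "soccer")  # assumed keyword list

stream = load_dataset(
    "togethercomputer/RedPajama-Data-1T",
    "book",
    split="train",
    streaming=True,
    trust_remote_code=True,
)

football_books = []
for record in stream:
    head = record["text"][:2000].lower()  # cheap proxy for title/abstract
    if any(kw in head for kw in KEYWORDS):
        football_books.append(record["text"])
    if len(football_books) >= 90:  # the paper reports ninety books
        break
```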
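The Software Dependencies row says CIQL builds on d3rlpy's CQL without pinning a version. As a hedged illustration only, the snippet below configures plain CQL (not the paper's CIQL variant) under the d3rlpy >= 2.x API, mapping Table 5's learning rate 1e-4 and conservative-loss weight α = 60 onto d3rlpy's parameters; `my_dataset` is a placeholder.

```python
# Plain CQL via d3rlpy (>= 2.x API assumed) -- not the paper's CIQL,
# only the base implementation the paper says it builds on.
from d3rlpy.algos import CQLConfig

cql = CQLConfig(
    actor_learning_rate=1e-4,    # Table 5: learning rate (lambda)
    critic_learning_rate=1e-4,   # assumed to share the same rate
    conservative_weight=60.0,    # Table 5: conservative loss weight (alpha)
).create(device=False)           # CPU; pass "cuda:0" for a GPU

# `my_dataset` would be an MDPDataset of imagined transitions:
# cql.fit(my_dataset, n_steps=500_000)
```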
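Finally, Table 5's settings transcribe naturally into a config object. The field names below are illustrative; the values come straight from the table.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class URIHyperparams:
    """Table 5 values; field names are illustrative."""
    n_agg: int = 4       # pieces of knowledge per aggregation (N_agg)
    n_ret: int = 15      # retrieved segments (N_ret)
    lr: float = 1e-4     # learning rate (lambda)
    eta_t: float = 0.5   # weight of transition (eta_T)
    eta_r: float = 0.5   # weight of reward penalties (eta_R)
    horizon: int = 10    # rollout horizon (H)
    alpha: float = 60.0  # weight of conservative loss (alpha)
    n_ens: int = 20      # number of ensembles (N_ens)
```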